Contributing guide
Development environment
This section provides a brief overview of the minimal developer environment. Because tsumugi
is a complex multi-language project, this part is split into subsections covering the JVM server component, the protobuf and buf component, and the Python client component.
Makefile
A portion of useful commands has been moved to the Makefile
. To run make
commands, one needs to download and install the make utility. The make utility can be obtained from the GNU Make website. For some operating systems, make
is also available through package managers.
JVM
For managing Java, Scala, and Maven, it is strongly recommended to use tools like SDKMan. With SDKMan, all the necessary components can be easily installed using just a few shell commands.
- Java 11; OpenJDK archieve, installation manual;
- Apache Maven; Installation manual;
- Scala 2.12.X; Installation manual;
Installation with SDKMan
:
Protobuf
Java classes are generated from the protobuf messages during the build via maven plugin and there is no reason to store them inside the repository. Python classes are generated from the protobuf messages manually. To generat or update Python classes run the following:
Python
It is strongly recommend to use tools like pyenv to manage Python versions.
- Python; with
pyenv
:pyenv install 3.10
; - Poetry; Installation manual;
Run the following command to create a python venv:
poetry env use %path-to-python3.10% # On POSIX systems it should be like ~/.pyenv/versions/3.10.14/bin/python
poetry install --with dev
Code style and naming conventions
Linters
Tsumugi uses scalafmt
for server part and ruff
for Python client.
To apply formatting rules to scala part run the following command:
To apply python formatting rules run the following command:
Python tips
All the APIs built on top of ptorobuf classes are done in the following way:
The protobuf message:
message KLLParameters {
int32 sketch_size = 1;
double shrinking_factor = 2;
int32 number_of_buckets = 3;
}
The Python API:
@dataclass
class KLLParameters:
"""Parameters for KLLSketch."""
sketch_size: int
shrinking_factor: float
number_of_buckets: int
def _to_proto(self) -> proto.KLLSketch.KLLParameters:
return proto.KLLSketch.KLLParameters(
sketch_size=self.sketch_size,
shrinking_factor=self.shrinking_factor,
number_of_buckets=self.number_of_buckets,
)
To maintain a consistent style across all Python code, the following rules should be followed:
- All classes that wrap code generated from protobuf messages should be implemented as dataclasses, unless there is a compelling reason not to do so.
- Each of these classes should have a private method
_to_proto(self) -> "protobuf class"
that converts the dataclass to the proto-serializable class.
Running examples or testing clients
To simplify testing and development, there is a script that builds a server plugin, downloads and unpacks the Spark distribution, and runs the Spark Connect Server with all the necessary configurations. To run it, use make run_spark_server
. After that, the server will be available at sc://localhost:15002
.
Examples and notebooks
Examples and Notebooks are part of the documentation and are placed in docs/notebooks
folder.