We welcome contributions to Sycamore.
To avoid duplicate or unnecessary work, please follow this process to make contributions:
- Find or create an issue in GitHub. Leave a note on an existing issue to let the community know you are working on it, or create a new issue for a feature you would like to add. If there are significant design decisions to cover, please start a conversation in the issue to get feedback early.
- Create a fork of Sycamore and start development in a feature branch. You can find information about how to set up Sycamore for development below. More information about development patterns in Github can be found here.
- Make sure linting rules and all existing unit tests pass. Consider whether your changes need new tests, and make sure those pass as well.
- Larger features may need new documentation. When in doubt, start a discussion on the associated issue.
- Create a pull request against the
main
branch of this repository. Pull requests are lightweight, and we encourage you to create a draft PR early to get feedback.
Sycamore is a new project with big ambitions, so there are a lot of ways to contribute:
- Bug reports. We love bug reports! If you find something broken, please open an issue using the Bug report template and provide as much detail as possible about what happened and how we can reproduce it.
- Documentation improvements. We want it to be as easy as possible to get started with sycamore. If you spot something wrong or something that could be clearer in the docs, please open an issue, or better yet contribute a fix yourself!
- New sources and file formats. Sycamore can be used to process all types of unstructured data. If there is a format or source that we're missing, consider adding it and contributing back.
- New models and LLMs. Similarly, we want to support a wide variety of machine learning models and LLMs for different purposes, such as segmentation, embedding, and entity extraction.
- New transformations. Transformations are the core of Sycamore. If you have an idea for a useful transformation, create an issue and get started!
These are just a few ideas. For more, check out our list of issues.
Sycamore targets Python 3.9+ and runs primarily on Mac and Linux. The following will help you set up your environment.
We use poetry to manage Python dependencies in Sycamore. You can install poetry using the instructions here. Once you have poetry installed, you can install all dependencies by running
poetry install --all-extras
In addition, some pdf processing methods require Poppler. You can install this with the OS-native package manager of your choice. For example, the command for Homebrew on Mac OS is
brew install poppler
We use ruff
to lint sycamore and black
to automatically format our code. You can run these tools from the root of the repository using
poetry run ruff check .
poetry run black .
Note that black
will reformat code in place. Make sure to include these changes in your commits.
We use type annotations in Sycamore and use mypy
to perform static type-checking. You can run this using
poetry run mypy --install-types .
Currently mypy
will fail with the --strict
flag, but it should pass cleanly without.
Our tests are implemented using PyTest. You can run unit and integration tests as follows:
# Unit Tests
poetry run pytest lib/sycamore/sycamore/tests/unit/
# Integration Tests
# Warning: as of 2024-05-17 the integration tests are currently broken. We are working on fixing them.
# poetry run pytest lib/sycamore/sycamore/tests/integration/
# All Tests
poetry run pytest
Note that the integration tests currently require the OPENAI_API_KEY
environment variable to be set to a valid OpenAI key.
Sycamore includes a pre-commit configuration that will automatically run linting, formatting, and type-checking, as well the unit tests. You can perform a one-time run of these checks using
poetry run pre-commit run --all-files
or install it to run on git commits using
poetry run pre-commit install
Similar checks run in our CI environment on new pull requests and check-ins. New features must pass the mandatory checks before they will be merged.
Release branches will be cut from the main branch following semantic versioning. The release process is currently manual, though we plan to automate it over time.