Seqr Loading Pipeline


This repository contains the pipelines and infrastructure for loading genomic data from VCF into ClickHouse to support queries from the seqr application.


📁 Repository Structure

api/

Contains the interface layer to the seqr application.

  • api/model.py defines the pydantic models for the REST interface.
  • api/app.py defines the aiohttp web server that handles data-loading requests.
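
A minimal sketch, assuming pydantic v2 and aiohttp, of how a pydantic request model can back an aiohttp handler. The model fields and route name here are illustrative, not the repository's actual interface:

from aiohttp import web
from pydantic import BaseModel, ValidationError


class LoadingPipelineRequest(BaseModel):
    # hypothetical request model; the real models live in api/model.py
    callset_path: str
    projects_to_run: list[str]


async def enqueue(request: web.Request) -> web.Response:
    try:
        body = LoadingPipelineRequest.model_validate(await request.json())
    except ValidationError as e:
        raise web.HTTPBadRequest(text=e.json())
    # The real handler hands the validated request off to the pipeline worker;
    # here we simply acknowledge it.
    return web.json_response(body.model_dump(), status=202)


app = web.Application()
app.add_routes([web.post('/enqueue', enqueue)])

if __name__ == '__main__':
    web.run_app(app, port=8080)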

bin/

Scripts or command-line utilities used for setup or task execution.

  • bin/pipeline_worker.py — manages asynchronous jobs requested by seqr.
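
A hedged sketch, assuming a simple asyncio queue, of the kind of worker loop such a module might run. The queue mechanism and names are illustrative, not the repository's implementation:

import asyncio


async def worker(queue: asyncio.Queue) -> None:
    # Pull load requests off the queue and process them one at a time.
    while True:
        request = await queue.get()
        try:
            print(f'launching pipeline for {request}')
            await asyncio.sleep(1)  # placeholder for actually running the pipeline
        finally:
            queue.task_done()


async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    await queue.put({'callset_path': '/tmp/my_callset.vcf'})
    task = asyncio.create_task(worker(queue))
    await queue.join()
    task.cancel()


if __name__ == '__main__':
    asyncio.run(main())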

deploy/

Dockerfiles for the loading pipeline itself and any annotation utilities. Kubernetes manifests are managed separately in seqr-helm.

lib/

Core logic and shared libraries.

  • annotations defines hail logic to re-format and standardize fields.
  • methods wraps hail-defined genomics methods for QC.
  • misc contains standalone utility modules.
    • misc/clickhouse hosts the logic that manages parquet ingestion into ClickHouse.
  • core defines key constants/enums/config.
  • reference_datasets manages parsing of raw reference sources into hail tables.
  • tasks specifies the Luigi-defined pipeline. Note that Luigi pipelines are defined by their requirements, so the pipeline is, effectively, defined in reverse (see the sketch after this list).
    • WriteSuccessFile is the last task; its requires() method runs the pipeline either locally or on scalable compute.
    • WriteImportedCallset is the first task, importing a VCF into a Hail MatrixTable, an "imported callset".
  • test holds a few utilities used by the tests, which are dispersed throughout the rest of the repository.
  • paths.py defines paths for all intermediate and output files of the pipeline.
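
As a minimal sketch of the "defined in reverse" idea, the terminal task lists its upstream dependency via requires() and Luigi walks the graph backwards from it. The task bodies below are hypothetical stand-ins; only the requires()/output() structure mirrors the pipeline:

import luigi


class WriteImportedCallset(luigi.Task):
    # Stand-in for the first task: in the real pipeline this imports a VCF
    # into a Hail MatrixTable.
    callset_path = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget(f'{self.callset_path}.mt')

    def run(self):
        with self.output().open('w') as f:
            f.write('imported callset placeholder')


class WriteSuccessFile(luigi.Task):
    # Stand-in for the terminal task; asking Luigi to build it pulls in the
    # whole upstream graph via requires().
    callset_path = luigi.Parameter()

    def requires(self):
        return WriteImportedCallset(callset_path=self.callset_path)

    def output(self):
        return luigi.LocalTarget(f'{self.callset_path}._SUCCESS')

    def run(self):
        with self.output().open('w') as f:
            f.write('')


if __name__ == '__main__':
    luigi.build([WriteSuccessFile(callset_path='/tmp/my_callset.vcf')], local_scheduler=True)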

var/

Static configuration and test files.


⚙️ Setup for Local Development

The production pipeline runs with Python 3.11.

Clone the repository and install the Python requirements:

git clone https://github.com/broadinstitute/seqr-loading-pipelines.git
cd seqr-loading-pipelines
pip install -r requirements.txt
pip install -r requirements-dev.txt

Install and start ClickHouse with the provided test configuration:

curl https://clickhouse.com/ | sh
./clickhouse server --config-file=./seqr-loading-pipelines/v03_pipeline/var/clickhouse_config/test-clickhouse.xml

Run an Individual Test

nosetests v03_pipeline/lib/misc/math_test.py

Formatting and Linting

ruff format .
ruff check .

🚪 Schema Entrypoints

  • The expected fields and types are defined in dataset_type.py as the col_fields, entry_fields, and row_fields properties (see the sketch after this list). Examples of the SNV_INDEL/MITO/SV/GCNV callset schemas may be found in the tests.
  • The VEP schema is defined in JSON within the vep*.json config files, then parsed into hail in lib/annotations/vep.py.
  • Examples of exported parquets may be found in lib/tasks/exports/*_parquet_test.py.
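
A hedged illustration of the general shape those properties take: mappings from expected callset field names to hail types. The class body and field choices below are examples only, not the actual dataset_type.py definitions:

import hail as hl


class DatasetType:
    @property
    def entry_fields(self) -> dict:
        # genotype-level fields expected in an SNV_INDEL-style callset
        return {'GT': hl.tcall, 'AD': hl.tarray(hl.tint32), 'GQ': hl.tint32}

    @property
    def row_fields(self) -> dict:
        # variant-level fields
        return {'rsid': hl.tstr, 'filters': hl.tset(hl.tstr)}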

🚶‍♂️ ClickHouse Loader Walkthrough

  • The ClickHouse loader follows the pattern established in the Making a Large Data Load Resilient blog post; a sketch of the flow follows this list.
    • Rows are first loaded into a staging database that copies the production TABLEs and MATERIALIZED VIEWS.
    • After all entries are inserted, we validate the inserted row count and finalize the per-project allele frequency aggregation.
    • Partitions are atomically moved from the staging environment to production.
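
A hedged sketch of that staging-then-promote flow, assuming the clickhouse-connect client. The database, table, and partition names are illustrative rather than the pipeline's actual schema:

import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost')

# Rows have already been inserted into staging.entries, a copy of the
# production table. Validate the row count before promoting anything.
staged_count = client.command('SELECT count() FROM staging.entries')
expected_count = 12_345  # row count of the exported parquet, known up front
if staged_count != expected_count:
    raise ValueError(f'expected {expected_count} rows, found {staged_count}')

# Move the partition atomically from staging into production.
client.command(
    "ALTER TABLE staging.entries MOVE PARTITION 'R0001' TO TABLE production.entries",
)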
