This repository contains pipelines and infrastructure for loading genomic data from VCF -> ClickHouse to support queries by the seqr application.
### `api/`
Contains the interface layer to the seqr application.
- `api/model.py` defines pydantic models for the REST interface.
- `api/app.py` specifies an `aiohttp` webserver that handles load data requests.
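A hedged sketch of the shape of this layer, with hypothetical model and route names; the real models and endpoints in `api/model.py` and `api/app.py` differ:

```python
# Hypothetical sketch of the api layer's shape; the actual pydantic
# models and routes in this repository are not reproduced here.
from aiohttp import web
from pydantic import BaseModel


class LoadRequest(BaseModel):  # placeholder model name
    callset_path: str
    sample_type: str


async def enqueue_load(request: web.Request) -> web.Response:
    # Validate the JSON body against the pydantic model before queueing.
    body = LoadRequest.model_validate(await request.json())
    return web.json_response({'accepted': body.callset_path})


app = web.Application()
app.add_routes([web.post('/load', enqueue_load)])  # placeholder route

if __name__ == '__main__':
    web.run_app(app, port=8080)
```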
### `bin/`
Scripts or command-line utilities used for setup or task execution.
- `bin/pipeline_worker.py` manages asynchronous jobs requested by seqr.
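A minimal sketch of what such a worker loop might look like, with a hypothetical job source; the real worker's job discovery, dispatch, and error handling differ:

```python
# Hypothetical worker loop; the real bin/pipeline_worker.py manages
# asynchronous jobs requested by seqr, which are not modeled here.
import asyncio


async def fetch_next_job() -> dict | None:
    """Placeholder for however jobs requested by seqr are discovered."""
    return None


async def main() -> None:
    while True:
        job = await fetch_next_job()
        if job is None:
            await asyncio.sleep(5)  # idle poll interval
            continue
        # Placeholder: hand the job off to the pipeline runner.
        print(f'running job {job}')


if __name__ == '__main__':
    asyncio.run(main())
```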
### `docker/`
Dockerfiles for the loading pipeline itself & any annotation utilities. Kubernetes manifests are managed separately in seqr-helm.
### `lib/`
Core logic and shared libraries.
- `annotations` defines hail logic to re-format and standardize fields.
- `methods` wraps hail-defined genomics methods for QC.
- `misc` contains small, self-contained utility modules.
- `misc/clickhouse` hosts the logic that manages the parquet ingestion into ClickHouse itself.
- `core` defines key constants, enums, and config.
- `reference_datasets` manages parsing of raw reference sources into hail tables.
- `tasks` specifies the Luigi-defined pipeline. Note that Luigi pipelines are defined by their requirements, so the pipeline is defined, effectively, in reverse: `WriteSuccessFile` is the last task, defining a `requires()` method that runs the pipeline either locally or on scalable compute, while `WriteImportedCallset` is the first task, importing a VCF into a Hail MatrixTable, an "imported callset". A minimal sketch of this pattern follows the list.
- `test` holds a few utilities used by the tests, which are dispersed throughout the rest of the repository.
- `paths.py` defines paths for all intermediate and output files of the pipeline.
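To make the reverse definition concrete, here is a hedged, minimal sketch of the Luigi pattern; the real tasks in `tasks` carry more parameters, intermediate tasks, and Hail I/O:

```python
# Simplified illustration of a requirements-defined Luigi pipeline;
# task bodies and targets here are placeholders, not the repo's logic.
import luigi


class WriteImportedCallset(luigi.Task):
    """First task to actually run: imports a VCF into a Hail MatrixTable."""

    callset_path = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget(f'{self.callset_path}.imported')

    def run(self):
        # Stand-in for hl.import_vcf(self.callset_path).write(...)
        with self.output().open('w') as f:
            f.write('imported callset placeholder')


class WriteSuccessFile(luigi.Task):
    """Last task: scheduling starts here, and requires() pulls in
    everything upstream, so the pipeline reads "in reverse"."""

    callset_path = luigi.Parameter()

    def requires(self):
        return WriteImportedCallset(callset_path=self.callset_path)

    def output(self):
        return luigi.LocalTarget('_SUCCESS')

    def run(self):
        with self.output().open('w') as f:
            f.write('')


if __name__ == '__main__':
    luigi.build(
        [WriteSuccessFile(callset_path='callset.vcf')],
        local_scheduler=True,
    )
```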
### `var/`
Static configuration and test files.
The production pipeline runs with Python 3.11.
```bash
git clone https://github.com/broadinstitute/seqr-loading-pipelines.git
cd seqr-loading-pipelines
pip install -r requirements.txt
pip install -r requirements-dev.txt
```
Install & start ClickHouse with the provided test configuration:

```bash
curl https://clickhouse.com/ | sh
./clickhouse server --config-file=./seqr-loading-pipelines/v03_pipeline/var/clickhouse_config/test-clickhouse.xml
```
Run a single test module:

```bash
nosetests v03_pipeline/lib/misc/math_test.py
```
Format and lint with `ruff`:

```bash
ruff format .
ruff check .
```
- The expected fields and types are defined in `dataset_type.py` as the `col_fields`, `entry_fields`, and `row_fields` properties. Examples of the SNV_INDEL/MITO/SV/GCNV callset schemas may be found in the tests.
- The VEP schema is defined in JSON within the `vep*.json` config files, then parsed into hail in `lib/annotations/vep.py`.
- Examples of exported parquets may be found in `lib/tasks/exports/*_parquet_test.py`.
- The ClickHouse loader follows the pattern established in the "Making a Large Data Load Resilient" blog (a sketch of the flow follows this list):
  - Rows are first loaded into a `staging` database that copies the production `TABLE`s and `MATERIALIZED VIEW`s.
  - After all `entries` are inserted, we validate the inserted row count and finalize the per-project allele frequency aggregation.
  - Partitions are atomically moved from the `staging` environment to production.
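A hedged sketch of that staging-to-production flow, assuming the `clickhouse-connect` client and using placeholder table, file, and partition names; the actual loader in `misc/clickhouse` handles more tables and validation steps:

```python
# Hypothetical illustration of the staging -> production pattern above,
# not the repository's actual loader: names and counts are placeholders.
import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost')

# 1. Insert rows into a staging copy of the production table.
client.command(
    "INSERT INTO staging.entries SELECT * FROM file('entries.parquet', Parquet)"
)

# 2. Validate the inserted row count before promoting anything.
staged = client.command('SELECT count() FROM staging.entries')
expected = 12345  # known row count from the exported parquet
if staged != expected:
    raise ValueError(f'expected {expected} rows, found {staged}')

# 3. Atomically move each partition from staging to production.
client.command(
    "ALTER TABLE staging.entries MOVE PARTITION 'p1' TO TABLE prod.entries"
)
```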