This repository contains the pipelines and infrastructure for loading genomic data from VCF into ClickHouse to support queries by the seqr application.
`api/`: Contains the interface layer to the seqr application.
`api/model.py` defines pydantic models for the REST interface. `api/app.py` specifies an `aiohttp` webserver that handles load data requests.
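As a rough illustration of how those two modules fit together, here is a minimal sketch assuming pydantic v2; the route, model fields, and handler below are hypothetical, not the actual API:

```python
# Hypothetical sketch only: the real models and routes live in api/model.py
# and api/app.py. The field names, route, and payload shape are illustrative.
from aiohttp import web
from pydantic import BaseModel


class LoadRequest(BaseModel):  # hypothetical request model
    callset_path: str
    dataset_type: str


async def handle_load(request: web.Request) -> web.Response:
    # Validate the JSON body against the pydantic model (pydantic v2 API).
    payload = LoadRequest.model_validate(await request.json())
    # ... enqueue the load for the pipeline worker ...
    return web.json_response({'accepted': payload.callset_path})


app = web.Application()
app.add_routes([web.post('/load', handle_load)])
```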
`bin/`: Scripts and command-line utilities used for setup or task execution.
`bin/pipeline_worker.py` manages asynchronous jobs requested by seqr.
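A heavily simplified, hypothetical shape of such a worker; the real script's queueing and dispatch logic will differ:

```python
# Hypothetical worker loop; how seqr actually enqueues requests and how the
# pipeline is dispatched are not shown in this README.
import time


def fetch_next_request():
    # Hypothetical: pull the next pending load request, or None if idle.
    return None


def run_pipeline(request) -> None:
    ...  # hypothetical: kick off the Luigi pipeline for this request


def main() -> None:
    while True:
        request = fetch_next_request()
        if request is not None:
            run_pipeline(request)
        else:
            time.sleep(5)  # illustrative idle poll interval


if __name__ == '__main__':
    main()
```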
`docker/`: Dockerfiles for the loading pipeline itself and any annotation utilities. Kubernetes manifests are managed separately in seqr-helm.
`lib/`: Core logic and shared libraries.
- `annotations` defines Hail logic to re-format and standardize fields.
- `methods` wraps Hail-defined genomics methods for QC.
- `misc` contains single modules with defined utilities. `misc/clickhouse` hosts the logic that manages the parquet ingestion into ClickHouse itself.
- `core` defines key constants, enums, and config.
- `reference_datasets` manages parsing of raw reference sources into Hail tables.
- `tasks` specifies the Luigi-defined pipeline. Note that Luigi pipelines are defined by their requirements, so the pipeline is defined, effectively, in reverse: `WriteSuccessFile` is the last task, defining a `requires()` method that runs the pipeline either locally or on scalable compute, while `WriteImportedCallset` is the first task, importing a VCF into a Hail MatrixTable, an "imported callset" (a minimal sketch follows this list).
- `test` holds a few utilities used by the tests, which are dispersed throughout the rest of the repository.
- `paths.py` defines paths for all intermediate and output files of the pipeline.
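To make the "defined in reverse" note concrete, here is a minimal, hypothetical two-task Luigi pipeline in that shape (the real tasks in `lib/tasks` take many more parameters and run Hail logic):

```python
import luigi


class WriteImportedCallset(luigi.Task):
    """First task: import a VCF into a Hail MatrixTable (body elided)."""
    callset_path = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget(f'{self.callset_path}.mt')  # illustrative target

    def run(self):
        ...  # the real task imports the VCF with Hail and writes the table


class WriteSuccessFile(luigi.Task):
    """Last task: Luigi walks backwards from here through requires()."""
    callset_path = luigi.Parameter()

    def requires(self):
        # Declaring upstream work here is what "defines the pipeline in reverse".
        return WriteImportedCallset(callset_path=self.callset_path)

    def output(self):
        return luigi.LocalTarget('_SUCCESS')

    def run(self):
        with self.output().open('w') as f:
            f.write('')
```

Scheduling `WriteSuccessFile` causes Luigi to resolve `requires()` backwards and run `WriteImportedCallset` first.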
`var/`: Static configuration and test files.
The production pipeline runs with Python 3.11. Clone the repository and install the dependencies:

```bash
git clone https://github.com/broadinstitute/seqr-loading-pipelines.git
cd seqr-loading-pipelines
pip install -r requirements.txt
pip install -r requirements-dev.txt
```

Install and start ClickHouse with the provided test configuration:
```bash
curl https://clickhouse.com/ | sh
./clickhouse server --config-file=./seqr-loading-pipelines/v03_pipeline/var/clickhouse_config/test-clickhouse.xml
```

Run a single test:

```bash
nosetests v03_pipeline/lib/misc/math_test.py
```

Format and lint:

```bash
ruff format .
ruff check .
```
- The expected fields and types are defined in `dataset_type.py` as the `col_fields`, `entry_fields`, and `row_fields` properties. Examples of the SNV_INDEL/MITO/SV/GCNV callset schemas may be found in the tests.
- The VEP schema is defined in JSON within the `vep*.json` config files, then parsed into Hail in `lib/annotations/vep.py`.
- Examples of exported parquets may be found in `lib/tasks/exports/*_parquet_test.py`; a quick way to inspect one is sketched below.
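For eyeballing an export against the declared schema, something like the following works (assuming `pyarrow` is available; the file path is illustrative, not a real pipeline output):

```python
# Inspect an exported parquet's schema and row count.
import pyarrow.parquet as pq

table = pq.read_table('/tmp/entries.parquet')  # hypothetical export location
print(table.schema)    # compare against dataset_type.py's *_fields properties
print(table.num_rows)
```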
- The ClickHouse loader follows the pattern established in the "Making a Large Data Load Resilient" blog post:
  - Rows are first loaded into a `staging` database that copies the production `TABLE`s and `MATERIALIZED VIEW`s.
  - After all `entries` are inserted, we validate the inserted row count and finalize the per-project allele frequency aggregation.
  - Partitions are atomically moved from the `staging` environment to production (a minimal sketch of the cutover follows).
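A minimal sketch of that cutover, assuming the `clickhouse_connect` client; the database, table, and partition names are illustrative, and the real logic lives in `lib/misc/clickhouse`:

```python
# Hypothetical staging -> production promotion; names are illustrative.
import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost')

expected_row_count = 1_000_000  # hypothetical count known from the loaded parquet

# Validate the inserted row count in staging before promoting anything.
staged = client.command('SELECT count() FROM staging.entries')
assert int(staged) == expected_row_count, 'staging load is incomplete'

# Atomically move the partition from the staging table to production.
client.command(
    "ALTER TABLE staging.entries MOVE PARTITION 'project_a' TO TABLE production.entries"
)
```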