Seqr Loading Pipeline


This repository contains the pipelines and infrastructure for loading genomic data from VCF into ClickHouse to support queries from the seqr application.


📁 Repository Structure

api/

Contains the interface layer to the seqr application.

  • api/model.py defines the pydantic models for the REST interface.
  • api/app.py defines the aiohttp web server that handles data-loading requests.
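
A minimal sketch, assuming pydantic v2 and aiohttp, of how a pydantic request model can back an aiohttp handler. The model fields and route name here are illustrative, not the repository's actual interface:

from aiohttp import web
from pydantic import BaseModel, ValidationError


class LoadingPipelineRequest(BaseModel):
    # hypothetical request model; the real models live in api/model.py
    callset_path: str
    projects_to_run: list[str]


async def enqueue(request: web.Request) -> web.Response:
    try:
        body = LoadingPipelineRequest.model_validate(await request.json())
    except ValidationError as e:
        raise web.HTTPBadRequest(text=e.json())
    # The real handler hands the validated request off to the pipeline worker;
    # here we simply acknowledge it.
    return web.json_response(body.model_dump(), status=202)


app = web.Application()
app.add_routes([web.post('/enqueue', enqueue)])

if __name__ == '__main__':
    web.run_app(app, port=8080)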

bin/

Scripts or command-line utilities used for setup or task execution.

  • bin/pipeline_worker.py — manages asynchronous jobs requested by seqr.
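
A hedged sketch, assuming a simple asyncio queue, of the kind of worker loop such a module might run. The queue mechanism and names are illustrative, not the repository's implementation:

import asyncio


async def worker(queue: asyncio.Queue) -> None:
    # Pull load requests off the queue and process them one at a time.
    while True:
        request = await queue.get()
        try:
            print(f'launching pipeline for {request}')
            await asyncio.sleep(1)  # placeholder for actually running the pipeline
        finally:
            queue.task_done()


async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    await queue.put({'callset_path': '/tmp/my_callset.vcf'})
    task = asyncio.create_task(worker(queue))
    await queue.join()
    task.cancel()


if __name__ == '__main__':
    asyncio.run(main())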

deploy/

Dockerfiles for the loading pipeline itself and any annotation utilities. Kubernetes manifests are managed separately in seqr-helm.

lib/

Core logic and shared libraries.

  • annotations defines hail logic to re-format and standardize fields.
  • methods wraps hail-defined genomics methods for QC.
  • misc contains standalone utility modules.
    • misc/clickhouse hosts the logic that manages parquet ingestion into ClickHouse.
  • core defines key constants/enums/config.
  • reference_datasets manages parsing of raw reference sources into hail tables.
  • tasks specifies the Luigi-defined pipeline. Note that Luigi pipelines are defined by their requirements, so the pipeline is, effectively, defined in reverse (see the sketch after this list).
    • WriteSuccessFile is the last task; its requires() method runs the pipeline either locally or on scalable compute.
    • WriteImportedCallset is the first task, importing a VCF into a Hail MatrixTable, an "imported callset".
  • test holds a few utilities used by the tests, which are dispersed throughout the rest of the repository.
  • paths.py defines paths for all intermediate and output files of the pipeline.
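
As a minimal sketch of the "defined in reverse" idea, the terminal task lists its upstream dependency via requires() and Luigi walks the graph backwards from it. The task bodies below are hypothetical stand-ins; only the requires()/output() structure mirrors the pipeline:

import luigi


class WriteImportedCallset(luigi.Task):
    # Stand-in for the first task: in the real pipeline this imports a VCF
    # into a Hail MatrixTable.
    callset_path = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget(f'{self.callset_path}.mt')

    def run(self):
        with self.output().open('w') as f:
            f.write('imported callset placeholder')


class WriteSuccessFile(luigi.Task):
    # Stand-in for the terminal task; asking Luigi to build it pulls in the
    # whole upstream graph via requires().
    callset_path = luigi.Parameter()

    def requires(self):
        return WriteImportedCallset(callset_path=self.callset_path)

    def output(self):
        return luigi.LocalTarget(f'{self.callset_path}._SUCCESS')

    def run(self):
        with self.output().open('w') as f:
            f.write('')


if __name__ == '__main__':
    luigi.build([WriteSuccessFile(callset_path='/tmp/my_callset.vcf')], local_scheduler=True)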

var/

Static configuration and test files.


⚙️ Setup for Local Development

The production pipeline runs with Python 3.11.

Clone the repository and install the Python requirements:

git clone https://github.com/broadinstitute/seqr-loading-pipelines.git
cd seqr-loading-pipelines
pip install -r requirements.txt
pip install -r requirements-dev.txt

Install and start ClickHouse with the provided test configuration:

curl https://clickhouse.com/ | sh
./clickhouse server --config-file=./seqr-loading-pipelines/v03_pipeline/var/clickhouse_config/test-clickhouse.xml

Run an Individual Test

nosetests v03_pipeline/lib/misc/math_test.py

Formatting and Linting

ruff format .
ruff check .

🚪 Schema Entrypoints

  • The expected fields and types are defined in dataset_type.py as the col_fields, entry_fields, and row_fields properties (see the sketch after this list). Examples of the SNV_INDEL/MITO/SV/GCNV callset schemas may be found in the tests.
  • The VEP schema is defined in JSON within the vep*.json config files, then parsed into hail in lib/annotations/vep.py.
  • Examples of exported parquets may be found in lib/tasks/exports/*_parquet_test.py.
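
A hedged illustration of the general shape those properties take: mappings from expected callset field names to hail types. The class body and field choices below are examples only, not the actual dataset_type.py definitions:

import hail as hl


class DatasetType:
    @property
    def entry_fields(self) -> dict:
        # genotype-level fields expected in an SNV_INDEL-style callset
        return {'GT': hl.tcall, 'AD': hl.tarray(hl.tint32), 'GQ': hl.tint32}

    @property
    def row_fields(self) -> dict:
        # variant-level fields
        return {'rsid': hl.tstr, 'filters': hl.tset(hl.tstr)}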

🚶‍♂️ ClickHouse Loader Walkthrough

  • The ClickHouse loader follows the pattern established in the Making a Large Data Load Resilient blog post; a sketch of the flow follows this list.
    • Rows are first loaded into a staging database that copies the production TABLEs and MATERIALIZED VIEWS.
    • After all entries are inserted, we validate the inserted row count and finalize the per-project allele frequency aggregation.
    • Partitions are atomically moved from the staging environment to production.
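
A hedged sketch of that staging-then-promote flow, assuming the clickhouse-connect client. The database, table, and partition names are illustrative rather than the pipeline's actual schema:

import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost')

# Rows have already been inserted into staging.entries, a copy of the
# production table. Validate the row count before promoting anything.
staged_count = client.command('SELECT count() FROM staging.entries')
expected_count = 12_345  # row count of the exported parquet, known up front
if staged_count != expected_count:
    raise ValueError(f'expected {expected_count} rows, found {staged_count}')

# Move the partition atomically from staging into production.
client.command(
    "ALTER TABLE staging.entries MOVE PARTITION 'R0001' TO TABLE production.entries",
)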
