GitHub - Ferlab-Ste-Justine/etl-cqdg-portal

ETL (Clinical && Genomic)

ETL that extracts clinical and genomic data for the CQDG data portal.

Run the ETL via Airflow

The Clinical and Genomic ETLs can now be started via DAGs in Airflow. When the ETL finishes, a status is reported to the cqdg-airflow channel. Each step can also be run separately, for either clinical or genomic data.
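
As a rough sketch of what starting one of these DAGs programmatically could look like, the snippet below builds (but does not send) a trigger request. It assumes an Airflow 2.x deployment with the stable REST API enabled, where a run is created with `POST /api/v1/dags/{dag_id}/dagRuns` and the input JSON is passed under `conf`. The base URL and the helper name `build_trigger_request` are placeholders, not part of this repository, and authentication is deployment-specific and left out.

```python
import json
import urllib.request


def build_trigger_request(base_url: str, dag_id: str, conf: dict) -> urllib.request.Request:
    """Build (without sending) a request that would trigger one DAG run.

    Assumes the Airflow 2.x stable REST API; base_url is a placeholder.
    """
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    body = json.dumps({"conf": conf}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_trigger_request(
        "http://localhost:8080",  # hypothetical Airflow webserver address
        "etl",
        {"release_id": "14", "study_codes": "study1,study2", "es_port": "9200", "project": "cqdg"},
    )
    print(request.get_method(), request.full_url)
    # Sending would be urllib.request.urlopen(request), once auth headers are added.
```

The same input JSON can also be pasted into the Airflow UI when triggering a DAG with a configuration.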

Clinical ETL

DAG etl (Runs all steps for clinical ETL: etl-fhir-import => etl-import => etl-prepare-index => etl-index => etl-publish)
DAG etl-fhir-import (Runs step fhavro-export for clinical ETL)
DAG etl-import (Runs step import-task for clinical ETL)
DAG etl-prepare-index (Runs step prepare-index for clinical ETL)
DAG etl-index (Runs step index-task for clinical ETL)
DAG etl-publish (Runs step publish-task for clinical ETL)

Example Input JSON for etl DAG:

{
  "es_port":"9200",
  "project":"cqdg",
  "release_id":"14",
  "study_codes":"study1,study2,study3,T-DEE"
}

release_id (required) - Release ID passed to the Spark jobs
study_codes (required) - Comma-separated list of study codes to run the portal ETL against
es_port (required) - Port of the Elasticsearch cluster, should be 9200
project - Project name, should be cqdg
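
Before triggering the etl DAG, it can help to check the input locally. The following is a minimal sketch, not code from this repository: `check_clinical_conf` and `REQUIRED_KEYS` are hypothetical names, and the only rules applied are the ones stated above (three required keys, `study_codes` given as one comma-separated string).

```python
REQUIRED_KEYS = ("release_id", "study_codes", "es_port")


def check_clinical_conf(conf: dict) -> list:
    """Return the study codes after checking that required keys are present."""
    missing = [key for key in REQUIRED_KEYS if not conf.get(key)]
    if missing:
        raise ValueError(f"missing required parameter(s): {', '.join(missing)}")
    # study_codes is a single comma-separated string in the example input.
    codes = [code.strip() for code in conf["study_codes"].split(",") if code.strip()]
    if not codes:
        raise ValueError("study_codes must contain at least one study code")
    return codes


conf = {
    "es_port": "9200",
    "project": "cqdg",
    "release_id": "14",
    "study_codes": "study1,study2,study3,T-DEE",
}
print(check_clinical_conf(conf))
```

An input without `release_id`, `study_codes`, or `es_port` raises a `ValueError` that names the missing keys.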

Genomic ETL

DAG etl-variant (Runs all steps for genomic ETL: enrich-specimen => etl-normalize-variants => etl-enrich-variants => etl-prepare-index-variant => etl-index)
DAG enrich-specimen (Runs step enrich-specimen for genomic ETL, enriching the current specimen tables)
DAG etl-normalize-variants (Runs step normalize for snv and for consequences)
DAG etl-enrich-variants (Runs step enrich for variants and for consequences)
DAG etl-prepare-index-variant (Runs step prepare index for variant_centric, gene_centric, variant_suggestions, gene_suggestions)
DAG etl-index (Runs step index-task for genomic ETL)
DAG etl-publish-variant (Runs step publish variant_centric, gene_centric, variant_suggestions, gene_suggestions)

Example Input JSON for the Genomic ETL DAGs:

{
  "study_code": "XYZ",
  "owner": "person1",
  "dateset_batches": [
    {
      "batches": [
        "annotated_vcf1",
        "annotated_vcf2"
      ],
      "dataset": "dataset_dataset1"
    },
    {
      "batches": [
        "annotated_vcf"
      ],
      "dataset": "dataset_dataset2"
    }
  ],
  "release_id": 1,
  "project": "cqdg",
  "es_port": 9200
}

release_id (required) - Release ID passed to the Spark jobs
study_code (required) - Code of the study to process
dateset_batches (required) - List of datasets, each with the batches to run for it
es_port (required) - Port of the Elasticsearch cluster, should be 9200
project - Project name, should be cqdg
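
The nested `dateset_batches` structure is easy to get wrong by hand. The sketch below assembles an input of the same shape as the genomic example from a plain mapping of dataset name to batch names. `build_variant_conf` is a hypothetical helper, not part of this repository, and the key is kept exactly as spelled in this README.

```python
import json


def build_variant_conf(study_code, release_id, batches_by_dataset,
                       owner=None, es_port=9200, project="cqdg"):
    """Assemble a dict shaped like the genomic example input."""
    if not batches_by_dataset:
        raise ValueError("at least one dataset with batches is required")
    conf = {
        "study_code": study_code,
        # Key spelled as in this README ("dateset_batches").
        "dateset_batches": [
            {"batches": list(batches), "dataset": dataset}
            for dataset, batches in batches_by_dataset.items()
        ],
        "release_id": release_id,
        "project": project,
        "es_port": es_port,
    }
    if owner is not None:
        conf["owner"] = owner
    return conf


example = build_variant_conf(
    "XYZ",
    1,
    {
        "dataset_dataset1": ["annotated_vcf1", "annotated_vcf2"],
        "dataset_dataset2": ["annotated_vcf"],
    },
    owner="person1",
)
print(json.dumps(example, indent=2))
```

Printing the result as JSON gives text that can be compared against the example input before it is submitted.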
