Repository of scripts used to convert dataservice metadata and DRC processed data into cbio portal loading format.

kids-first/kf-cbioportal-etl

Outline on ETL for converting data from CAVATICA and Data Warehouse to PedcBioPortal format

In general, we create upload packages by converting our data and metadata to satisfy the requirements outlined here. See below for special cases such as publications or collaborative efforts.

Overview

usage: cbio-etl [-h] {import,update} ...

CBio ETL Command Line Tool

positional arguments:
  {import,update}
    import         Run import workflow (Steps 1, 2, 4, 5, 6)
    update         Run update workflow (Steps 1, 2, 3, 4, 5, 6)

options:
  -h, --help       show this help message and exit
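
The help text above has the shape that Python's argparse produces for a parser with two subcommands. As a rough sketch only (this is not the tool's actual source), such a parser could be declared as below. The function name `build_parser`, the `dest="command"` attribute, and the choice to accept `--token` only on `update` are assumptions; the flag names themselves are taken from the usage examples in this README, and which of them are mandatory is not stated there, so none are marked required here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the help output shown above."""
    parser = argparse.ArgumentParser(
        prog="cbio-etl", description="CBio ETL Command Line Tool"
    )
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Help strings copied from the README's help output.
    imp = subparsers.add_parser(
        "import", help="Run import workflow (Steps 1, 2, 4, 5, 6)"
    )
    upd = subparsers.add_parser(
        "update", help="Run update workflow (Steps 1, 2, 3, 4, 5, 6)"
    )

    # Flags that appear in both usage examples of this README.
    for sub in (imp, upd):
        sub.add_argument("--db-ini", help="Path to db.ini credentials file")
        sub.add_argument("--study", help="Study name, e.g. pbta_pnoc")
        sub.add_argument("--sbg-profile", help="Seven Bridges credentials profile")
        sub.add_argument("--dgd-status", help="DGD status, e.g. kf")

    # Assumption: the access token is only needed by the update workflow.
    upd.add_argument("--token", help="Path to cbioportal_data_access_token.txt")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```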

The steps in cbio-etl import are outlined as follows:

  1. Generate config JSON
  2. Get study metadata
  3. Get files from manifest
  4. Check downloaded files
  5. Build genomic file package

(Figure: Pipeline Flowchart)

Required credentials files

  • Copy the db.ini template to /path/to/db.ini and replace placeholders with your credentials.
  • Copy the sevenbridges.ini template to ~/.sevenbridges/credentials and replace placeholders with your credentials.
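
Before running the tool, it can help to confirm that no template placeholder was left in a credentials file. The following is a minimal sketch, assuming the files are INI-formatted and that an unfilled placeholder is either empty or wrapped in angle brackets; the helper name `unfilled_entries` and the section and key names in the example are hypothetical and not taken from the actual templates.

```python
import configparser


def unfilled_entries(ini_text: str) -> list[tuple[str, str]]:
    """Return (section, key) pairs whose value still looks like a placeholder.

    Assumption: a placeholder is an empty value or one wrapped in <...>.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    missing = []
    for section in parser.sections():
        for key, value in parser.items(section):
            stripped = value.strip()
            if not stripped or (stripped.startswith("<") and stripped.endswith(">")):
                missing.append((section, key))
    return missing


if __name__ == "__main__":
    # Hypothetical file content; the real template's sections may differ.
    example = "[postgresql]\nhost = db.example.org\nuser = <username>\npassword =\n"
    for section, key in unfilled_entries(example):
        print(f"[{section}] {key} still needs a value")
```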

Required for running cbio-etl update

  • Download a reusable access token for PedcBioPortal (cbioportal_data_access_token.txt) from here.

Local Installation

Software Prerequisites

cBio load package v5.4.10 is used in this tool. Refer to INSTALL.md and setup.py for more details.

Installation Steps

Standard references and configs are stored in a separate git repo. Be sure to grab the desired release before running pip install, as demonstrated below.

Run on the Mgmt-Console-Dev-chopd3bprod@684194535433 EC2 instance:

git clone https://github.com/kids-first/kf-cbioportal-etl.git
curl -L https://github.com/kids-first/kf-cbioportal-etl-refs/archive/refs/tags/<desired version>.tar.gz | tar -xz -C </path/to/kf-cbioportal-etl/cbioportal_etl/> --wildcards '*REFS/*' --strip-components=1
curl -L https://github.com/kids-first/kf-cbioportal-etl-refs/archive/refs/tags/<desired version>.tar.gz | tar -xz -C </path/to/kf-cbioportal-etl/cbioportal_etl/> --wildcards '*STUDY_CONFIGS/*' --strip-components=1
pip install /path/to/kf-cbioportal-etl/
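
The two curl | tar commands above pull only the REFS and STUDY_CONFIGS directories out of the release tarball and drop the archive's top-level folder. To make that filtering explicit, here is a minimal Python sketch of roughly the same selection using the standard-library tarfile module. The function name `extract_subdirs` is hypothetical, and the sketch assumes both directories sit directly under the archive's top-level folder, which is narrower than the `*REFS/*` wildcard used by tar.

```python
import tarfile
from pathlib import Path, PurePosixPath


def extract_subdirs(archive, dest, wanted=("REFS", "STUDY_CONFIGS")):
    """Extract regular files under the wanted directories into dest.

    `archive` is a seekable binary file object holding a .tar.gz.
    The first path component is dropped, like --strip-components=1.
    """
    dest = Path(dest)
    written = []
    with tarfile.open(fileobj=archive, mode="r:gz") as tar:
        for member in tar.getmembers():
            parts = PurePosixPath(member.name).parts[1:]  # strip top-level folder
            if not member.isfile() or not parts or parts[0] not in wanted:
                continue
            if ".." in parts:  # refuse paths that would escape dest
                continue
            target = dest.joinpath(*parts)
            target.parent.mkdir(parents=True, exist_ok=True)
            with tar.extractfile(member) as src:
                target.write_bytes(src.read())
            written.append(target)
    return written
```

A downloaded release tarball could be passed in as `io.BytesIO(downloaded_bytes)`, with `dest` pointing at the cbioportal_etl directory of the cloned repository.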

Usage

cbio-etl import \
    --db-ini /path/to/db.ini \
    --study pbta_pnoc \
    --sbg-profile default \
    --dgd-status kf 

Docker Installation

We've created a Docker image that allows you to dynamically choose which version of cbio-etl to use at runtime.

Installation Steps

docker pull pgc-images.sbgenomics.com/d3b-bixu/cbio-etl:v2.4.1

Runtime and Usage

docker run --rm -it \
    -v /path/to/db.ini:/credentials/db.ini \
    -v /path/to/cbioportal_data_access_token.txt:/credentials/cbioportal_data_access_token.txt \
    -v /path/to/.sevenbridges/credentials:/root/.sevenbridges/credentials \
    -v /path/to/output_dir:/output \
    cbio-etl-runtime-env /bin/bash
    
# If you want to change the refs used, run the code snippet below, otherwise skip
curl -L https://github.com/kids-first/kf-cbioportal-etl-refs/archive/refs/tags/<desired version>.tar.gz | tar -xz -C </path/to/kf-cbioportal-etl/cbioportal_etl/> --wildcards '*REFS/*' --strip-components=1
curl -L https://github.com/kids-first/kf-cbioportal-etl-refs/archive/refs/tags/<desired version>.tar.gz | tar -xz -C </path/to/kf-cbioportal-etl/cbioportal_etl/> --wildcards '*STUDY_CONFIGS/*' --strip-components=1
pip install /path/to/kf-cbioportal-etl/
# Run cbio-etl
cd /output/
cbio-etl update \
    --db-ini /credentials/db.ini \
    --token /credentials/cbioportal_data_access_token.txt \
    --study pbta_pnoc \
    --sbg-profile default \
    --dgd-status kf
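
For scripted runs, the volume mounts in the docker run command above can be assembled programmatically. The sketch below is illustrative only: the helper name `docker_run_args` is hypothetical, the container-side paths and the image name are copied from the command above, and the host paths are whatever you supply.

```python
from pathlib import Path


def docker_run_args(db_ini, token, sbg_credentials, output_dir,
                    image="cbio-etl-runtime-env"):
    """Build the argv list for the interactive docker run shown above."""
    # Host path -> container path, as in the README's docker run example.
    mounts = [
        (db_ini, "/credentials/db.ini"),
        (token, "/credentials/cbioportal_data_access_token.txt"),
        (sbg_credentials, "/root/.sevenbridges/credentials"),
        (output_dir, "/output"),
    ]
    args = ["docker", "run", "--rm", "-it"]
    for host, container in mounts:
        # Expand a leading ~ so paths like ~/.sevenbridges/credentials work.
        args += ["-v", f"{Path(host).expanduser()}:{container}"]
    return args + [image, "/bin/bash"]


if __name__ == "__main__":
    print(" ".join(docker_run_args(
        "/path/to/db.ini",
        "/path/to/cbioportal_data_access_token.txt",
        "/path/to/.sevenbridges/credentials",
        "/path/to/output_dir",
    )))
```

The returned list can be handed to `subprocess.run` as-is, which avoids shell quoting issues with paths that contain spaces.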

Collaborative and Publication Workflows

These are highly specialized cases in which all or most of the data come from a third party and therefore require specific case-by-case protocols.

OpenPedCan

See OpenPedCan README

OpenPBTA

See OpenPBTA README
