Outline on ETL for converting data from CAVATICA and Data Warehouse to PedcBioPortal format

In general, we create upload packages by converting our data and metadata to satisfy the requirements outlined here. See below for special cases such as publications or collaborative efforts.

Overview

usage: cbio-etl [-h] {import,update} ...

CBio ETL Command Line Tool

positional arguments:
  {import,update}
    import         Run import workflow (Steps 1, 2, 4, 5, 6)
    update         Run update workflow (Steps 1, 2, 3, 4, 5, 6)

options:
  -h, --help       show this help message and exit

Use cbio-etl import when importing a new or whole study. Read workflow details here.
Use cbio-etl update when making changes to an existing study (incremental updates). Read workflow details here.

The steps in cbio-etl import are outlined as follows:

Generate config JSON
Get study metadata
Get files from manifest
Check downloaded files
Build genomic file package
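
The help text above gives the step numbers each workflow runs. As a minimal sketch, the comparison below shows which step numbers differ between the two workflows. The dictionary and function names are hypothetical and are not part of cbio-etl; only the step numbers come from the help output.

```python
# Step numbers per workflow, copied from the `cbio-etl -h` output above.
# WORKFLOW_STEPS and steps_only_in are illustrative names, not cbio-etl code.
WORKFLOW_STEPS = {
    "import": (1, 2, 4, 5, 6),
    "update": (1, 2, 3, 4, 5, 6),
}


def steps_only_in(workflow: str, other: str) -> list[int]:
    """Return the step numbers that `workflow` runs but `other` does not."""
    return sorted(set(WORKFLOW_STEPS[workflow]) - set(WORKFLOW_STEPS[other]))


if __name__ == "__main__":
    # Steps that update runs in addition to import.
    print(steps_only_in("update", "import"))
```

Under this reading, update runs every import step plus step 3.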

Required credentials files

Copy the db.ini template to /path/to/db.ini and replace placeholders with your credentials.
Copy the sevenbridges.ini template to ~/.sevenbridges/credentials and replace placeholders with your credentials.
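
Before running the tool, it can help to confirm that the credentials files exist and contain the profile you plan to use. The sketch below assumes both files are INI files with one section per profile, which is the usual layout for ~/.sevenbridges/credentials but is not confirmed by this README. The function name has_profile is hypothetical.

```python
import configparser
from pathlib import Path


def has_profile(ini_path: str, profile: str) -> bool:
    """Return True if ini_path exists and contains a [profile] section.

    Assumes an INI layout with one section per profile (an assumption,
    not something taken from the credential templates).
    """
    path = Path(ini_path).expanduser()
    if not path.is_file():
        return False
    # Disable interpolation so tokens containing '%' are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)
    return parser.has_section(profile)


if __name__ == "__main__":
    # "default" matches the --sbg-profile value used in the examples in this README.
    print(has_profile("~/.sevenbridges/credentials", "default"))
```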

Required for running `cbio-etl update`

Download a reusable PedcBioPortal access token, saved as cbioportal_data_access_token.txt, from here.

Local Installation

Software Prerequisites

python3 v3.10+
saml2aws - directions to use
Access to AWS Infra PedCBioPortal Import repo for server loading
Access to the Postgres D3b Warehouse database at d3b-warehouse-aurora-prd.d3b.io. You need at least read access to tables in the bix_workflows schema
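
The two locally checkable prerequisites above (Python 3.10+ and saml2aws) can be verified with a short script. This is a minimal sketch using only the standard library; the function names are hypothetical, and repository and database access still need to be confirmed separately.

```python
import shutil
import sys

# Minimum Python version stated in the prerequisites above.
MIN_PYTHON = (3, 10)


def python_ok(version=None) -> bool:
    """Return True if the given (or the running) Python version is at least 3.10."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= MIN_PYTHON


def tool_on_path(name: str) -> bool:
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


if __name__ == "__main__":
    print("python >= 3.10:", python_ok())
    print("saml2aws on PATH:", tool_on_path("saml2aws"))
```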

cBio load package v5.4.10 is used in this tool. Refer to INSTALL.md and setup.py for more details.

Installation Steps

Standard references and configs are stored in a separate git repo. Be sure to grab the desired release before running pip install, as demonstrated below. Run on the Mgmt-Console-Dev-chopd3bprod@684194535433 EC2 instance.

git clone https://github.com/kids-first/kf-cbioportal-etl.git
curl -L https://github.com/kids-first/kf-cbioportal-etl-refs/archive/refs/tags/<desired version>.tar.gz | tar -xz -C </path/to/kf-cbioportal-etl/cbioportal_etl/> --wildcards '*REFS/*' --strip-components=1
curl -L https://github.com/kids-first/kf-cbioportal-etl-refs/archive/refs/tags/<desired version>.tar.gz | tar -xz -C </path/to/kf-cbioportal-etl/cbioportal_etl/> --wildcards '*STUDY_CONFIGS/*' --strip-components=1
pip install /path/to/kf-cbioportal-etl/
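
The two curl commands above combine --wildcards with --strip-components=1 so that only the REFS/ and STUDY_CONFIGS/ directories land inside cbioportal_etl/, without the archive's top-level folder. The sketch below approximates how tar maps an archive member name to an output path under those options. It is an illustration of the path handling only, not a replacement for tar; the member names in the example are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Optional


def extracted_path(member: str, pattern: str, strip: int = 1) -> Optional[str]:
    """Approximate the output path tar would use for an archive member.

    Returns None when the member does not match the wildcard pattern or has
    too few path components left after stripping.
    """
    # As with tar's default wildcard matching, '*' here may also match '/'.
    if not fnmatchcase(member, pattern):
        return None
    parts = PurePosixPath(member).parts
    if len(parts) <= strip:
        return None
    # Drop the leading `strip` components, e.g. the archive's top-level folder.
    return str(PurePosixPath(*parts[strip:]))


if __name__ == "__main__":
    # Hypothetical member names; real archives will differ.
    for name in (
        "kf-cbioportal-etl-refs-x.y.z/REFS/example.txt",
        "kf-cbioportal-etl-refs-x.y.z/README.md",
    ):
        print(name, "->", extracted_path(name, "*REFS/*"))
```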

Usage

cbio-etl import \
    --db-ini /path/to/db.ini \
    --study pbta_pnoc \
    --sbg-profile default \
    --dgd-status kf
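
If the import is launched from a script rather than typed by hand, the same command can be assembled as an argument list. This is a minimal sketch that uses only the flags shown in the example above; build_import_cmd is a hypothetical helper, not part of cbio-etl, and the actual run is left commented out.

```python
import shlex


def build_import_cmd(db_ini: str, study: str, sbg_profile: str, dgd_status: str) -> list[str]:
    """Assemble the `cbio-etl import` argument list from the usage example above."""
    return [
        "cbio-etl", "import",
        "--db-ini", db_ini,
        "--study", study,
        "--sbg-profile", sbg_profile,
        "--dgd-status", dgd_status,
    ]


if __name__ == "__main__":
    cmd = build_import_cmd("/path/to/db.ini", "pbta_pnoc", "default", "kf")
    # Print the command in shell-quoted form for review before running it.
    print(shlex.join(cmd))
    # To execute it, uncomment the next two lines:
    # import subprocess
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a path contains spaces.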

Docker Installation

We've created a Docker image that allows you to dynamically choose which version of cbio-etl to use at runtime.

Installation Steps

docker pull pgc-images.sbgenomics.com/d3b-bixu/cbio-etl:v2.4.1

Runtime and Usage

# The image name must match the image pulled in the installation step above
docker run --rm -it \
    -v /path/to/db.ini:/credentials/db.ini \
    -v /path/to/cbioportal_data_access_token.txt:/credentials/cbioportal_data_access_token.txt \
    -v /path/to/.sevenbridges/credentials:/root/.sevenbridges/credentials \
    -v /path/to/output_dir:/output \
    pgc-images.sbgenomics.com/d3b-bixu/cbio-etl:v2.4.1 /bin/bash

# If you want to change the refs used, run the code snippet below, otherwise skip
curl -L https://github.com/kids-first/kf-cbioportal-etl-refs/archive/refs/tags/<desired version>.tar.gz | tar -xz -C </path/to/kf-cbioportal-etl/cbioportal_etl/> --wildcards '*REFS/*' --strip-components=1
curl -L https://github.com/kids-first/kf-cbioportal-etl-refs/archive/refs/tags/<desired version>.tar.gz | tar -xz -C </path/to/kf-cbioportal-etl/cbioportal_etl/> --wildcards '*STUDY_CONFIGS/*' --strip-components=1
pip install /path/to/kf-cbioportal-etl/
# Run cbio-etl
cd /output/
cbio-etl update \
    --db-ini /credentials/db.ini \
    --token /credentials/cbioportal_data_access_token.txt \
    --study pbta_pnoc \
    --sbg-profile default \
    --dgd-status kf

Collaborative and Publication Workflows

These are highly specialized cases in which all or most of the data come from a third party and therefore require specific case-by-case protocols.

OpenPedCan

See OpenPedCan README

OpenPBTA

See OpenPBTA README
