In general, we create upload packages by converting our data and metadata to satisfy the requirements outlined here. See below for special cases, like publications or collaborative efforts.
```
usage: cbio-etl [-h] {import,update} ...

CBio ETL Command Line Tool

positional arguments:
  {import,update}
    import         Run import workflow (Steps 1, 2, 4, 5, 6)
    update         Run update workflow (Steps 1, 2, 3, 4, 5, 6)

options:
  -h, --help       show this help message and exit
```
- Use `cbio-etl import` if importing a new/whole study. Read workflow details here
- Use `cbio-etl update` if making changes to an existing study (incremental updates). Read workflow details here
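Since the CLI is argparse-based, each subcommand should also accept `-h` (an assumption consistent with the top-level help above), which is a quick way to review the full flag list before committing to a run:

```bash
# Print the accepted flags for each workflow (assumes per-subcommand -h, standard for argparse CLIs)
cbio-etl import -h
cbio-etl update -h
```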
The steps in `cbio-etl import` are outlined as follows:

- Generate config JSON (see the sketch after this list)
- Get study metadata
- Get files from manifest
- Check downloaded files
- Build genomic file package
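For orientation only, here is a hypothetical sketch of what the Step 1 config JSON might pair together; every key below is an illustrative placeholder, not the tool's actual schema. The authoritative templates live in the `STUDY_CONFIGS` directory fetched during installation below.

```json
{
  "_note": "illustrative keys only; see the STUDY_CONFIGS templates for the real schema",
  "study": "pbta_pnoc",
  "metadata": { "display_name": "<study display name>" },
  "files": { "manifest": "<path/to/manifest>" }
}
```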
- Copy the `db.ini` template to `/path/to/db.ini` and replace placeholders with your credentials (example layout sketched after this list).
- Copy the `sevenbridges.ini` template to `~/.sevenbridges/credentials` and replace placeholders with your credentials.
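For orientation, the credentials file follows the standard layout used by the `sevenbridges-python` library; the endpoint below assumes Cavatica, so substitute your platform's API URL if it differs:

```ini
# ~/.sevenbridges/credentials -- standard sevenbridges-python layout
[default]
api_endpoint = https://cavatica-api.sbgenomics.com/v2
auth_token = <your developer token>
```

The `db.ini` sketch below uses assumed section and key names; the template in the repo is authoritative:

```ini
# /path/to/db.ini -- hypothetical keys; copy the actual template and fill in its placeholders
[postgres]
host = d3b-warehouse-aurora-prd.d3b.io
user = <warehouse username>
password = <warehouse password>
```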
- Download a reusable access token for PedcBioPortal, `cbioportal_data_access_token.txt`, from here.
You will also need:

- `python3` v3.10+
- `saml2aws` - directions to use
- Access to AWS Infra PedCBioPortal Import repo for server loading
- Access to the `postgres` D3b Warehouse database at `d3b-warehouse-aurora-prd.d3b.io`. You need at least read access to tables in the `bix_workflows` schema (a quick connectivity check is sketched below).
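To sanity-check warehouse access, something like the following should work; only the host and schema name come from this guide, while the database name and username are placeholders:

```bash
# Hypothetical connectivity check against the D3b Warehouse
psql -h d3b-warehouse-aurora-prd.d3b.io -U <your-username> -d <warehouse-db> \
  -c "SELECT table_name FROM information_schema.tables WHERE table_schema = 'bix_workflows' LIMIT 5;"
```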
cBio load package v5.4.10 is used in this tool. Refer to `INSTALL.md` and `setup.py` for more details.
Standard references and configs are stored in a separate git repo. Be sure to grab the desired release before running `pip install`, as demonstrated below.
Run on the `Mgmt-Console-Dev-chopd3bprod@684194535433` EC2 instance:
```bash
git clone https://github.com/kids-first/kf-cbioportal-etl.git
# Pull REFS and STUDY_CONFIGS from the desired release of the refs repo,
# replacing <desired version> and the extraction path with your own
curl -L https://github.com/kids-first/kf-cbioportal-etl-refs/archive/refs/tags/<desired version>.tar.gz | tar -xz -C </path/to/kf-cbioportal-etl/cbioportal_etl/> --wildcards '*REFS/*' --strip-components=1
curl -L https://github.com/kids-first/kf-cbioportal-etl-refs/archive/refs/tags/<desired version>.tar.gz | tar -xz -C </path/to/kf-cbioportal-etl/cbioportal_etl/> --wildcards '*STUDY_CONFIGS/*' --strip-components=1
pip install /path/to/kf-cbioportal-etl/
```
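A quick way to confirm the install succeeded is to print the top-level help shown at the start of this section:

```bash
# Should print the cbio-etl usage block shown above
cbio-etl -h
```

An example import invocation follows.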
```bash
cbio-etl import \
  --db-ini /path/to/db.ini \
  --study pbta_pnoc \
  --sbg-profile default \
  --dgd-status kf
```
We've created a Docker image that allows you to dynamically choose which version of cbio-etl to use at runtime.
```bash
docker pull pgc-images.sbgenomics.com/d3b-bixu/cbio-etl:v2.4.1
```
```bash
docker run --rm -it \
  -v /path/to/db.ini:/credentials/db.ini \
  -v /path/to/cbioportal_data_access_token.txt:/credentials/cbioportal_data_access_token.txt \
  -v /path/to/.sevenbridges/credentials:/root/.sevenbridges/credentials \
  -v /path/to/output_dir:/output \
  cbio-etl-runtime-env /bin/bash
```

Then, inside the container:

```bash
# If you want to change the refs used, run the two curl commands below; otherwise skip them
curl -L https://github.com/kids-first/kf-cbioportal-etl-refs/archive/refs/tags/<desired version>.tar.gz | tar -xz -C </path/to/kf-cbioportal-etl/cbioportal_etl/> --wildcards '*REFS/*' --strip-components=1
curl -L https://github.com/kids-first/kf-cbioportal-etl-refs/archive/refs/tags/<desired version>.tar.gz | tar -xz -C </path/to/kf-cbioportal-etl/cbioportal_etl/> --wildcards '*STUDY_CONFIGS/*' --strip-components=1
pip install /path/to/kf-cbioportal-etl/

# Run cbio-etl
cd /output/
cbio-etl update \
  --db-ini /credentials/db.ini \
  --token /credentials/cbioportal_data_access_token.txt \
  --study pbta_pnoc \
  --sbg-profile default \
  --dgd-status kf
```
These are highly specialized cases in which all or most of the data come from a third party and therefore require specific, case-by-case protocols.
See the OpenPBTA README