macsGESTALT lineage tracing analysis workflow

macsGESTALT is an inducible lineage recorder, enabling simultaneous capture of lineages and transcriptomes from single cells (A) Genetic components of macsGESTALT. (B) Clone-level information is stored in static barcodes, while subclonal phylogenetic information is dynamically encoded into evolving barcodes via insertions and deletions (indels, blue and red bars) induced by administration of doxycycline. (C) Two example clones from a population with n clones, each with a random number of integrated barcodes. Evolving barcode edits are encoded and inherited as cells divide. (D) Generation of a macsGESTALT barcoded population of cells and experimental workflow. (E) macsGESTALT analysis workflow. First, clonal lineage is reconstructed, followed by subclonal reconstruction, and phylogeny building between subclones.

Paper and data access

We introduce macsGESTALT and apply it to study cancer metastasis in our preprint
Raw fastq data from this study can be downloaded from GEO: GSE173958
All intermediate and processed data files can be downloaded from our Mendeley Data repository
For recreating our study or for testing with example data, we recommended downloading the Mendeley Data repository
There, all scripts and output and input data files are available and can be run within a coherent directory structure
Lineage trees from the above paper can be explored interactively with our tree browser

Analysis overview

This readme covers how macsGESTALT single cell lineage tracing data is processed and analyzed from start to finish. It also contains links throughout to where you can find our raw, intermediate, and processed data from the pancreatic cancer metastasis paper. There are 5 main steps to analysis:

Barcode alignment & indel calling
Barcode processing & identifying clones based on static barcodes
Identifying subclones based on evolving barcodes
Intraclonal phylogeny building
Build master edgelist & visualization

Step 1: Alignment & indel calling

This step collapses barcode sequencing reads by UMI, aligns to a reference barcode sequence, and calls indels at each target site.

Inputs

Input files for this step are barcode sequencing demultiplexed fastq files
The barcode fastq files from our study are available through GEO: GSE173958

Scripts

This step is performed with docker and described in more detail here
The specific docker container for macsGESTALT can be downloaded with: docker pull aaronmck/single_cell_gestalt:macsGESTALT
You will need to transfer the following files (provided) as well as a tearsheet (sample info) file (not provided) to the docker container:
- macsGESTALT barcode files: ref_v82.fa, ref_v82.fa.cutSites, ref_v82.fa.primers
- Run script: run_crispr_pipeline_v82.sh
You can then run the alignment and indel calling pipeline in docker as outlined in the link above

Outputs

This pipeline produces a series of files for each sample, but we only use the .stats files for downstream processing
The .stats files produced by our study are available through GEO: GSE173958 or through Mendeley Data

Step 2: Barcode processing & identifying clones based on static barcodes

This step consists of an R Notebook that provides a more detailed walkthrough of each granular processing step inline with code chunks. It consists of 2 main sections each consisting of a series of smaller steps:

Filter aligned barcode sequence data to find real cells and static barcodes
Discover clones and classify cells into clones via static barcode content

Inputs

.stats files for each sample produced in the step above
A list of real singlets with which to perform initial filtering
- Either from transcriptional analysis, for our data this is: whitelist_cancer_cells.txt
- If you're using your own data and you don't have matched transcriptional data, you can simply use 10x's v3 cellranger 3' mRNA whitelist instead

Scripts

R Notebook: clonal-processing.Rmd

Outputs

The clonal-processing.nb.html file for each mouse contains summary plots and running summary tables for different steps of the workflow
precluster_filtered.txt is a large file containing all aligned data remaining after the organizational and QC steps of the "Section 1" portion
final_singlets.txt is a large file containing all info from alignment pipeline for final remaining singlets after all steps
- this includes all aligned transcripts as individual rows and full barcode sequences
final_classifications.txt concisely stores cell type (i.e. singlet, doublet, unmatched, etc) and cloneID if a cell is matched and a singlet
final_editing_data.txt is used for phylogeny building scripts downstream
Only final_classifications.txt and final_editing_data.txt will be used downstream
Corresponding outputs from our study are available through Mendeley Data

Step 3: Identifying subclones based on evolving barcodes

This step consists of another R Notebook, which describes processing in detail inline with code. Briefly, as each cell can contain multiple barcode integrants, these distinct barcodes (defined by their distinct 10bp static identifiers) are merged into a 'barcode-of-barcodes', while also accounting for any missing barcodes that were not recovered in that cell. Cells with identical 'barcode-of-barcodes' are referred to as a 'subclone' as they are indistinguishably closely related. "Barcode-of-barcodes" files are constructed for each clone seperately (as each clone is defined by a different set of static identifiers), which are used downstream for phylogeny building. Note that as our data used pancreatic cancer cells, which are chromosomally unstable, some barcode integrants were duplicated, which impairs downstream reconstruction. Hence, we can't perform subclonal reconstruction on some clones. These are filtered out temporarily at this step and do not have corresponding "barcode-of-barcodes" files.

Inputs

final_editing_data.txt

Scripts

R Notebook: make-alleles-files.Rmd

Outputs

A "barcode-of-barcodes" file for each clone is created and named as, clone_XX_for_tree.txt, and stored in a new directory named /TreeUtils/clone_hmids
Corresponding outputs from our study are available through Mendeley Data

Step 4: Intraclonal phylogeny building

This step uses evolving barcode data from the "barcode-of-barcodes" subclone files to build a seperate tree for each clone, using TreeUtils, which itself uses Camin-Sokal maximum parsimony as implemented in PHYLIP Mix. All of the data files and code used in this section can be downloaded from the TreeUtils folder from Mendeley Data.

Inputs

The TreeUtils/clone_hmids directory containing clone_XX_for_tree.txt files

Scripts

TreeUtils-assembly-1.3.jar performs phylogeny building (don't call this directly)
build_trees.sh runs TreeUtils for every clone in the /TreeUtils/clone_hmids directory automatically

Outputs

All outputs are stored in the /TreeUtils/clone_trees directory
Outputs include json: clone_XX.json and newick: clone_XX.json.newick files for each processed clone
Corresponding outputs from our study are available through Mendeley Data

Step 5: Build master edgelist & visualization

This step consists of another R Notebook make-edgelist.Rmd, which takes all individual newick clone trees produced in the previous step and collates them into one master tree. This step also adds back cells from clones that did not have a corresponding phylogeny (e.g. due to chromosomal instability and barcode integrant duplication). These trees are merged and stored as an "edgelist". Edgelists can be easily converted to other useful data structures in R, such as a graph using the R package igraph. This can then be readily visualized as a circle packing plot or tree-based plots using ggraph.

Inputs

The /TreeUtils/clone_trees directory containing clone_XX_for_tree.txt files
final_classifications.txt for clones which do not have corresponding phylogenies
Additionally, trees can be annotated with single-cell transcriptional information, example scRNA-seq cell metadata (from a Monocle 3 cds) from our metastasis datasets is available through Mendeley Data, files named cds_colData.txt

Scripts

R Notebook: make-edgelist.Rmd performs all collation steps and includes a visualization vignette

Outputs

full_edgelist.txt
Example visualization can be viewed in: make-edgelist.nb.html
Corresponding outputs from our study are available through Mendeley Data

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
M1		M1
M2		M2
alignment-files		alignment-files
github-figs		github-figs
README.md		README.md
build_trees.sh		build_trees.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

macsGESTALT lineage tracing analysis workflow

Paper and data access

Analysis overview

Step 1: Alignment & indel calling

Inputs

Scripts

Outputs

Step 2: Barcode processing & identifying clones based on static barcodes

Inputs

Scripts

Outputs

Step 3: Identifying subclones based on evolving barcodes

Inputs

Scripts

Outputs

Step 4: Intraclonal phylogeny building

Inputs

Scripts

Outputs

Step 5: Build master edgelist & visualization

Inputs

Scripts

Outputs

About

Releases

Packages

Languages

ksimeono/macsGESTALT

Folders and files

Latest commit

History

Repository files navigation

macsGESTALT lineage tracing analysis workflow

Paper and data access

Analysis overview

Step 1: Alignment & indel calling

Inputs

Scripts

Outputs

Step 2: Barcode processing & identifying clones based on static barcodes

Inputs

Scripts

Outputs

Step 3: Identifying subclones based on evolving barcodes

Inputs

Scripts

Outputs

Step 4: Intraclonal phylogeny building

Inputs

Scripts

Outputs

Step 5: Build master edgelist & visualization

Inputs

Scripts

Outputs

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages