Transcription factor stoichiometry, motif affinity and syntax regulate single-cell chromatin dynamics during fibroblast reprogramming to pluripotency
This repository contains code for the analysis performed in the paper "Transcription factor stoichiometry, motif affinity and syntax regulate single-cell chromatin dynamics during fibroblast reprogramming to pluripotency" by Nair, Ameen et al. Please refer to the details below for finding the path to the relevant analysis.
- As this code has evolved over the course of the project, the figure directories in `src/figures_factory` do not correspond exactly to the figures in the manuscript. Please refer to the index below to find the appropriate notebooks.
- The cluster IDs used in the repository are different from those in the paper (unless referred to as "`new_cluster`"). Please see `src/figures_factory/configs/cluster.tsv` for the conversion from old cluster IDs to the ones used in the paper.
- The xOSK (extreme OSK) and hOSK (high OSK) states were previously known as hOSK (high OSK) and low OSK states respectively.
- The peak sets are labeled 1-20 in the same order as in the paper.
Below, you will find the notebooks listed approximately in the order in which they are run, with their descriptions and commits.
- The figures can mostly be generated in any order.
- Figures 3, 4, 5, 6 rely on ChromBPNet models.
- The code is a mix of Python and R notebooks. When possible, the `sessionInfo()` is available in the R notebook at the end.
- The analysis directories can be found at `src/analysis`
- Note that SnapATAC requires py2. We used a slightly modified version of the original repo with minor bugs fixed.
- The figure sections also include code for relevant supplementary figures.
- `20200424_ArchR/MakeArrow.ipynb`: Takes fragment files from Chromap as input, makes Arrow files as output. Does basic QC with plots, computes TSS enrichment and doublet scores. [commit]
- `20200424_ArchR/DoubletAnalysis.ipynb`: Projects doublets on an old UMAP without doublet removal. Determines a threshold by hand tuning. With a final set of thresholds, writes a set of barcodes for each sample that pass QC. [commit]
- `20200122_snapATAC/snapATAC.ipynb`: Initial scATAC-seq run on 5kb bins. Provides a first UMAP layout and initial clusters, which are manually adjusted. Also outputs fragment files per cluster. TagAligns are created from fragment files, after removing chrM reads. Bulk ATAC-seq pipeline is run on individual clusters to obtain peak calls and signal tracks. [commit] [NB: some parts of the notebook were rerun after the version whose outputs were captured]
  - Peak resolution script [commit]: this is the peak set used for downstream analyses
- `20200206_pmat_snapATAC/Pmat.ipynb`: Takes peak by cell matrix, filters out cells with few reads in peaks, recomputes UMAP and writes final peak by cell matrix, peaks, UMAP coordinates and clusters. [commit]
- `20200307_fine_clustering/AssistedFineClustering.ipynb`: Creates peak sets by k-means clustering. Semi-automated step, as a basic division into 4 categories based on fibroblast/iPSC peaks is enforced. Manual adjustments to clusters were also performed later to merge similar clusters and rearrange them. [commit]
- `20200828_RNA_Seurat/QC.ipynb`: Quality control filters based on mitochondrial fraction, number of unique genes, total UMIs and also OSKM fraction. [commit]
- `20200828_RNA_Seurat/Seurat.ipynb`: Dimensionality reduction and UMAP for scRNA-seq. Proceeds by first subsampling equal cells from each day, performing scaling and PCA, and then projecting remaining cells using the PCA loadings, followed by a UMAP. [commit]
- `20200828_RNA_Seurat/scATAC_integrate.ipynb`: Performs label transfer from ATAC to RNA to get the cluster of each RNA cell [commit]
- `20200828_RNA_Seurat/scATAC_integrate_CCA.ipynb`: Performs RNA → ATAC transfer and gets CCA reduction. Default Seurat imputation does not seem to perform well. Instead, performed Harmony on CCA reduction to better align ATAC and RNA cells. [commit]
- `20200925_Peak2Gene/ArchRIntegrate.ipynb`: Peak-gene linking, ATAC→RNA closest cell mapping [commit]
- Session files
Quantification relies on the fact that Sendai transcripts end at stop codon whereas endogenous transcripts contain 3' UTR.
- `20211106_sendai_vs_endogenous`: Steps documented in README. For each gene, filter BAM to obtain reads that count towards the gene x count matrix. Use Day 2 pseudo-bulk as all exogenous (Sendai) and iPSC pseudo-bulk as all endogenous. Use 5' end of reads to get read distributions (endogenous vs Sendai). Treat each barcode as being drawn from a mixture model of the two distributions. Used an EM algorithm to estimate per barcode fraction of exogenous transcripts. [commit] [quantification]
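The mixture-model step can be illustrated with a minimal sketch. It assumes that each barcode's reads have already been binned by 5' position into per-bin counts, and that the two reference distributions are given as per-bin probabilities (`p_exo` from the Day 2 pseudo-bulk, `p_endo` from the iPSC pseudo-bulk). The function name, the binning and the convergence settings are hypothetical and are not taken from the repository.

```python
def estimate_exogenous_fraction(counts, p_exo, p_endo, n_iter=200, tol=1e-9):
    """EM estimate of the exogenous (Sendai) read fraction for one barcode.

    counts : per-bin read counts for the barcode (hypothetical binning)
    p_exo  : per-bin probabilities under the all-exogenous reference
    p_endo : per-bin probabilities under the all-endogenous reference
    """
    total = sum(counts)
    if total == 0:
        return None  # no reads for this gene in this barcode

    pi = 0.5  # initial guess for the exogenous fraction
    for _ in range(n_iter):
        # E-step: expected number of reads assigned to the exogenous component
        expected_exo = 0.0
        for c, pe, pn in zip(counts, p_exo, p_endo):
            denom = pi * pe + (1.0 - pi) * pn
            if c == 0 or denom == 0.0:
                continue
            expected_exo += c * (pi * pe / denom)
        # M-step: the new mixing fraction is the expected exogenous share
        new_pi = expected_exo / total
        converged = abs(new_pi - pi) < tol
        pi = new_pi
        if converged:
            break
    return pi


# Toy example: counts drawn exactly from a 30% exogenous / 70% endogenous mix
p_exo = [0.7, 0.2, 0.1, 0.0]
p_endo = [0.1, 0.2, 0.3, 0.4]
fraction = estimate_exogenous_fraction([280, 200, 240, 280], p_exo, p_endo)
```

Only the mixing fraction is updated here because both component distributions are treated as fixed references, which matches the description above of using pseudo-bulks as pure exogenous and pure endogenous profiles.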
- `20220603_Multiome_ArchR/MakeArrow.ipynb`: Takes fragment files from Chromap as input, makes Arrow files as output. Does basic QC with plots, computes TSS enrichment. ArchR doublet scores NOT used. [commit] [QC]
- `20220603_Multiome_AMULET`: AMULET v1.1 used for doublet detection from ATAC. An additional step was performed by looking at the plot of sorted normalised number of overlaps from the Overlap.txt output and using a manual knee-point detection strategy (see `ProcessAMULET.ipynb`). This gave ~10% more doublets that make intuitive sense. [commit]
- `20220603_Multiome_ArchR/RunArchR.ipynb`: Initial ArchR run on ATAC part of multiome, with and without doublets. Also added scATAC D2 cells to inspect batch effect. Wrote viable barcodes out. [commit] [Multiome ATAC barcodes]
- `20220606_Multiome_RNA_Seurat/QC.ipynb`: Quality control filters based on mitochondrial fraction, number of unique genes, total UMIs and also OSKM fraction. [commit]
- `20220609_Multiome_SnapATAC/snapATAC.ipynb`: D2 scATAC + Multiome ATAC-seq run on 5kb bins. Filtered D2 scATAC to only those cells in top 5 clusters by count (for D2). Provides an initial UMAP layout. Ran Harmony to remove batch effects between scATAC and Multiome ATAC. [commit]
- `20220606_Multiome_RNA_Seurat/Seurat.ipynb`: Dimensionality reduction and UMAP for scRNA-seq. Doublet detection using DoubletFinder as for scRNA-seq. Some cells with OSKM expression clustered together in RNA-seq in one corner; most of them did not pass ATAC QC. Used OSKM% < 0.5 as a filter since this is single nucleus and we shouldn't expect much/any Sendai transcripts. Wrote final set of barcodes after intersecting with ATAC viable barcodes. [commit] [barcodes]
- `20220609_Multiome_SnapATAC/snapATAC_make_pmat.ipynb`: Made peaks x barcode matrix for D2 scATAC + Multiome ATAC-seq data, using previous set of peaks. [NB: this requires the `add-pmat` step in snapTools] [commit]
- `20220611_Multiome_Label_Transfer/ATAC_Multiome_label_transfer.ipynb`: Transferred labels from scATAC → Multiome using Seurat FindTransferAnchors on Harmony embeddings. [commit]
- `UMAP_stats.ipynb`: UMAP + sample x cluster breakdown plots [commit]
- `FRiP.ipynb`: For every ATAC cell, compute fraction of reads in fibroblast peaks and in iPSC and naive/primed ESC peaks [commit]
- `GeneSetExpr.ipynb`: Averaged per scRNA cell expression z-score for 84 fibroblast genes and 20 pluripotency associated genes [commit]. Original lists of genes. The subset of these genes that was present in the counts matrix was used (originally 88 and 22).
- `GenomeShots.ipynb`: Gene browser shots of specific gene loci along with expression for every cluster [commit]
- `GeneScores.ipynb`: ATAC derived gene scores for every single cell. Uses ArchR's computed gene scores and smooths them [commit]
- `ExprPlots.ipynb`: scRNA expression plots. Added endogenous/exogenous OSKM plots. Also OAS[1-3,L], DPPA2/3/5, DNMT3L for extended fig. [commit]
- `TerminalExpressionMatrix.ipynb`: For a few "terminal" states, plot the single-cell matrix of cell state specific gene expression. The gene sets were picked manually from the above clustering. An equal number of barcodes was sampled for each cell state. [commit]
- Trajectory:
  - `analysis/20200217_trajectory/Paga.ipynb`: PAGA analysis for scATAC cells. Uses the top 10 diffusion map components. Computes pseudotime twice: once using fibroblast cells as root and once using xOSK cells as root. [commit] [outputs]
  - `PAGA_connectivities.ipynb`: Plot the PAGA graph [commit]
  - `TrajectoryArrows.ipynb`: Plot trajectory arrows for main trajectories [commit]
Supplement
- `Fig1/Supp.ipynb`: CCA plots, transfer scores histogram and transfer scores correlation [commit]
- `20230612_OSKM_expr_field/OSKVectorField.ipynb`: Vector field based on Sendai expression of OSKM. Suggests cells "fall off" the primary reprogramming trajectory into a partially reprogrammed state. Discretised UMAP into a grid, and computed "speed" based on the difference in total Sendai expression of adjacent bins. Removed regions with low Sendai expression. [commit]
  - This has been superseded by `20230612_OSKM_expr_field/OSKVectorField_knn.ipynb`. We first find the 14 nearest neighbors of a cell (excluding itself). For each cell and each neighbor, we look at the difference vector between their coordinates in UMAP space. We normalize this and weight it by the difference in expression between the two points. Finally we compute an average direction for each cell. Beyond this, we discretised the UMAP and followed steps similar to `OSKVectorField.ipynb`. [commit]
- `PseudotimePlots.ipynb`: Plot pseudotime on UMAPs [commit]
[pdfs]
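The nearest-neighbor vector field described for `OSKVectorField_knn.ipynb` can be sketched as follows. This is a minimal pure-Python illustration, not the notebook's code: the function name is hypothetical, the neighbor search is brute force, and the sign convention (neighbor expression minus cell expression, so arrows point towards higher expression) is an assumption, since the description does not state the direction.

```python
import math


def knn_expression_field(coords, expr, k=14):
    """Average expression-weighted direction per cell.

    coords : list of (x, y) UMAP coordinates, one per cell
    expr   : list of expression values (e.g. total Sendai OSKM), one per cell
    k      : number of nearest neighbours, excluding the cell itself
    """
    field = []
    for i, (xi, yi) in enumerate(coords):
        # Brute-force distances to every other cell
        dists = sorted(
            (math.hypot(xj - xi, yj - yi), j)
            for j, (xj, yj) in enumerate(coords)
            if j != i
        )
        neighbours = dists[:k]
        vx = vy = 0.0
        for d, j in neighbours:
            if d == 0.0:
                continue  # coincident points have no defined direction
            weight = expr[j] - expr[i]  # assumed sign convention
            vx += weight * (coords[j][0] - xi) / d  # unit vector * weight
            vy += weight * (coords[j][1] - yi) / d
        n = len(neighbours)
        field.append((vx / n, vy / n) if n else (0.0, 0.0))
    return field


# Toy example: five cells on a line with expression increasing along x
toy_coords = [(float(i), 0.0) for i in range(5)]
toy_expr = [0.0, 1.0, 2.0, 3.0, 4.0]
toy_field = knn_expression_field(toy_coords, toy_expr, k=2)
```

In the toy example every arrow points along the positive x axis, the direction of increasing expression. The subsequent grid discretisation of the UMAP is not shown.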
- Expression and ChromVAR:
  - `analysis/20200522_OSK_frip/OSKM_ChromVAR.ipynb`: Running ChromVAR using JASPAR/HOCOMOCO TF motifs for OSKM [commit] [outputs]
  - `analysis/20200522_OSK_frip/data/pfms/`: PFMs used for ChromVAR [pfms]
  - `OSKM_ChromVAR_plot.ipynb`: ChromVAR plots + per cell state box-plots. For box-plots, ChromVAR deviations are min-max normalised per motif to 0.1-0.95 quantiles [commit]
  - `figures_factory/ExprPlots.ipynb`: Expression plots per gene [commit]
  - `Expr_ChromVAR_collate.ipynb`: Collate expression and ChromVAR plots [commit]
- `OSKM_States.ipynb`: k-means based on OSKM expression only [commit]
Supplement
- `Per_state_expr.ipynb`: Per cell state normalised expression box-plots. For box-plots, log expressions are min-max normalised per gene to 0-0.99 quantiles [commit]
- `OSKM_per_state_per_day_expr.ipynb`: Median cell OSKM endogenous/exogenous TPM by day x cell state. Retained day x cell state combinations with >50 cells. Note exogenous median is computed from median total - median endogenous. [commit]
- `OSKM_Corr_ChromVAR_Expr.ipynb`: Expression and ChromVAR correlation across matched ATAC-RNA cells [commit]
- Gene sets:
  - `analysis/20200828_RNA_Seurat/FineClustering.ipynb`: Top 5000 variable genes as per `FindVariableFeatures`, with at least 50 transcripts across all cells (total 4306 cells). Perform k-means clustering on genes. Perform GO enrichment on peak sets (gprofiler2 version `e104_eg51_p15_3922dba`) [commit] [gene sets and GO]
The code used for model training, evaluation and interpretation is available at `src/chrombpnet`. However, if you are interested in training ChromBPNet models, it is recommended to use the up-to-date full feature version at https://github.com/kundajelab/chrombpnet.
- Scripts used: `20200626_modeling_runs/chrombpnet-lite/jobscripts`, jobscripts for all folds [commit]
- Workflow described in README. [commit]
Curation:
- extract modisco counts motifs PFMs (adding profile gives the same, so instead stuck with counts only). Ignored negative motifs for now.
- Cluster using GimmeMotifs' `gimme cluster` command
- Match the clustered motifs with hits from Vierstra's list of motifs using TomTom
- Manually curate the list to include/exclude TFs (`tfs_final.txt`)
- Added representative motif for each cluster by choosing the constituent motif (across clusters) with the highest number of seqlets [commit] (`tfs_final_w_rep.txt`)
- Added names for clusters and put in a [meme file]
- Mapped all motifs back to curated motifs using TomTom (`20230607_re_modisco_breakdown`). Manually resolved minor discrepancies. Mostly re-labeled motifs that were dimers of the 30 curated motifs (using MOTIFA:MOTIFB format). Use a q-value cutoff of 0.05. [commit] [annotated_motifs]
- `Misc/modisco/Supp_cell_state_x_motif.ipynb`: Plotting fraction of seqlets of each motif (out of all seqlets for that cell state's modisco, transformed to log10 scale) for 30 main motifs (no dimers). Note that KLF and SP seem exclusive, because the motifs are very similar and assignment is made to only one of them. [commit]
Scanning:
- subset Vierstra motif hits to peaks and to selected TFs
- Use importance thresholding to reduce to important hits. This works by creating a null distribution of importance scores for a given motif by computing the summed absolute importance score (weighted by the "silhouette" of the motif, i.e. its max IC score per position) and then thresholding to top 0.99 of this
- Creating a matrix of peak x motif
- `20200607_ChromVar/ChromVAR.ipynb`: ran ChromVAR on peak x motif matrix [commit]
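A rough sketch of the importance-thresholding idea from the scanning steps is given below. Several details are assumptions made for illustration only: the null is built from every window of a single importance track, "top 0.99" is read as the 0.99 quantile of the null, and `silhouette` is passed in as a precomputed per-position weight vector. All function names are hypothetical.

```python
def weighted_importance(window, silhouette):
    """Sum of absolute importance scores, weighted per motif position."""
    return sum(w * abs(v) for w, v in zip(silhouette, window))


def null_threshold(track, silhouette, quantile=0.99):
    """Threshold taken from a null of weighted scores over all windows."""
    width = len(silhouette)
    if len(track) < width:
        raise ValueError("importance track is shorter than the motif")
    null = sorted(
        weighted_importance(track[i:i + width], silhouette)
        for i in range(len(track) - width + 1)
    )
    index = min(len(null) - 1, int(quantile * len(null)))
    return null[index]


def keep_important_hits(hit_starts, track, silhouette, threshold):
    """Keep motif hits whose weighted importance reaches the threshold."""
    width = len(silhouette)
    return [
        start for start in hit_starts
        if weighted_importance(track[start:start + width], silhouette) >= threshold
    ]


# Toy example: a flat track with one important 3 bp site at position 50
toy_track = [0.0] * 100
toy_track[50:53] = [1.0, 2.0, 1.0]
toy_silhouette = [1.0, 2.0, 1.0]
toy_threshold = null_threshold(toy_track, toy_silhouette)
toy_kept = keep_important_hits([10, 49, 50], toy_track, toy_silhouette, toy_threshold)
```

Weighting by the silhouette means positions where the motif is informative contribute more to the score than degenerate positions, so a hit is kept only if the model assigns importance where the motif itself is specific.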
- `Supp_Vignette_CTCF_footprint_py.ipynb` and `Supp_Vignette_CTCF_footprint_R.ipynb`: Vignette with bias model prediction, per state w/ and w/o bias preds, importances at a CTCF motif [commit]
- `Fig5/Aff_vs_conc_footprints_all_TFs.ipynb`: Footprinting of all TFs for all cell states. Uses the representative version for each of the 30 curated motifs and splits seqlets into 3 bins (low: quantiles 0.1-0.4, medium 0.4-0.7, high 0.7-1). Introduces seqlets into random negative regions for each cell type. Seqlets may be repeated till at least 1000 total sequences are scored. These are stored in extended figure, and a subset of those are in the supplements (filtered and edited in Illustrator). [commit] [corresponding PWMs]
- `Supp/ModelFreeFootprints.ipynb`: Generates footprints by aggregating insertions. Uses MoDISco PWMs (xOSK KLF, OCTSOX; Fibroblast AP1). Performed PWM scanning. Split into log-odds bins (these bins are slightly different from those used above since it was done before the all motifs x all cell state footprinting). After getting an example's 2000 bp counts matrix (per cell state, bin pair) centered at motifs after accounting for orientation, row normalized the matrix and then averaged across columns to generate footprint [NB: motifs PWMs come from `Aff_vs_conc_footprints.ipynb`, which was later not used] [commit] [processing]
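The aggregation step described for `Supp/ModelFreeFootprints.ipynb` (row normalise, then average) can be sketched in a few lines. This is an illustration under stated assumptions: the matrix has one row per motif instance and one column per position, "averaged across columns" is interpreted as a per-position mean over instances, and rows with no insertions are skipped. The function name is hypothetical.

```python
def aggregate_footprint(count_matrix):
    """Row-normalise an instances x positions matrix and average per position.

    count_matrix : list of rows, each a list of insertion counts centred on a
                   motif instance (already flipped for orientation)
    """
    normalised = []
    for row in count_matrix:
        total = sum(row)
        if total == 0:
            continue  # assumption: instances without insertions are dropped
        normalised.append([count / total for count in row])
    if not normalised:
        return []
    n_rows = len(normalised)
    # zip(*rows) iterates over positions (columns)
    return [sum(column) / n_rows for column in zip(*normalised)]


# Toy example with three instances and three positions
toy_footprint = aggregate_footprint([[1, 1, 2], [0, 0, 0], [2, 2, 0]])
```

Row normalisation gives every motif instance equal weight regardless of local coverage, so a few highly accessible sites do not dominate the aggregate profile.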
- `LocusExploreShot.ipynb`: NANOG locus with expression plots. [commit]
- `figures_factory/LocusImportanceScores.ipynb`: See Fig4
- `PerGeneEnhTF.ipynb`: TF-enhancer links for genes of interest ("TF2G plot"). Plots dynamic enhancer-TF connections over cell states using correlated peaks of gene of interest. Involves manual interventions: SOX motifs overlapping with OCTSOX are removed, TFAP2 overlapping with KLF are removed, KLF overlapping with CTCF are removed. Motifs are turned off in the cell states in which they are not recovered by TF-MoDISco. [commit]
  - NB: Assumes that importance scores are only computed in peaks of a given sample. Therefore enhancers that are closed in a cell state will not have any links.
- `MicroC.ipynb`: Plotting 4DN MicroC data from fibroblasts and hESC at NANOG locus [commit]
- `MoDISco_Breakdown_Plot.ipynb`: Builds on `20211118_modisco_breakdown`. Curate top motifs across cell types, assign MoDISco motifs to them using TomTom followed by manual adjustments. Plots breakdown of seqlets per cell state to these top motifs. [commit] [preprocessing]
  - NB: For some reason I used a different set of curated motifs instead of the 30 in `tfs_final.txt`
  - This has been superseded by `20230607_re_modisco_breakdown` [commit]. Updated plot can be found in `MoDISco_Re_Breakdown_Plot.ipynb` [commit]
  - This has been further superseded by `MoDISco_Re_Breakdown_Hits_Plot.ipynb`. In this, we use the motif hits, filter them down to motifs picked up by MoDISco in each cell type, and handle SP and KLF manually since their sites tend to overlap a lot. [commit]
Supplement
- `Misc/performance/ChromBPNet_performance.ipynb`: Plot counts and profile performance of ChromBPNet models for all folds. Includes plot of read depth of each sample, and best bound for profile performance based on pseudo reps (see below) [commit]
- `Misc/performance/Pseudorep_JSD.ipynb`: Compute JSD of pseudo reps. First considered a strategy of splitting per base counts into 2 exclusive pseudo reps, but the performance of this doesn't seem very good. Instead, take 2 independent pseudo reps (with half the reads of the original sample) and compute JSD [commit]
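For reference, the Jensen-Shannon divergence between two pseudo-replicate profiles can be computed as in the sketch below. It assumes each pseudo-replicate is available as a list of per-base counts over the same region, and it uses base-2 logarithms so that the value lies in [0, 1]. The function name is hypothetical and how the pseudo-replicates are generated is not shown.

```python
import math


def jensen_shannon_divergence(counts_a, counts_b):
    """JSD (base 2) between two count profiles of equal length."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("both profiles need at least one read")
    p = [c / total_a for c in counts_a]
    q = [c / total_b for c in counts_b]
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL divergence
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Identical shapes give 0; completely disjoint profiles give 1
same_shape = jensen_shannon_divergence([2, 4, 6], [1, 2, 3])
disjoint = jensen_shannon_divergence([5, 0], [0, 7])
```

Because the profiles are normalised to probabilities first, the divergence compares profile shape only, which is why it can serve as a bound for profile (rather than counts) performance.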
- `figures_factory/Fig4_new`: `Py_Vignette` and `R_Vignette` for LEFTY1 enhancer region and CDH1 promoter analyses. Includes in-silico motif mutagenesis for CDH1 promoter.
- `Py_Vignette.ipynb` and `R_Vignette.ipynb`: Importance score vignette for selected regions [commit]. Computes and plots for all cell types irrespective of peak status.
- `PeaksToGene.ipynb`: Heatmaps of peak set x cell state, linked gene sets x cell state, as well as a grid based UMAP plot of per peak set accessibility [commit]
- `CurateGo.ipynb`: GO enrichment using gProf for every gene set linked to peaks in every peak set. Used version `e104_eg51_p15_3922dba` of gProf. [commit]
- `BPNetHitsPeakSetMotifs.ipynb`: Counted motifs per peak set, normalised them and plotted a matrix [commit]
Supplement
- `FineClusteringSuppFig.ipynb`: Fine clusters used for peak clustering [commit]
- `MotifExpressionHeatmap.ipynb`: For main pseudotime trajectory, plot ChromVAR score + expression of correlated TFs [commit]
- `GenePseudotime.ipynb`: For selected TFs, plot ChromVAR score, expression and gene activity score (ArchR derived) along main trajectory pseudotime [commit]
- For Venn diagram, merged overlap peaks for each cell state and computed intersection count using bedtools intersect (-wa -u)
Related
- `20200514_Homer_summary_plot_factory/HomerSummary.ipynb`: Summarised HOMER knownMotif runs on all peak sets in the style of Knaupp et al 2017. Manually selected HOMER motifs matching the TF list (JASPAR/HOCOMOCO) used for other analyses [commit] [directory]
- `dhs_overlap.sh` and `DHS_overlap.ipynb` [dhs_overlap.sh, DHS_overlap.ipynb]: Compute overlap of peak sets with peaks from DNase Index
- `make_early_on_peak_set.sh`: Set of peaks that is not in peak sets 6,7,8 or fibroblast peaks but is open along all other cell states along the primary reprogramming trajectory [script]
- `VortexPlots.ipynb`: Vortex plots with integrative analysis of bulk TF and histone ChIP-seq datasets [commit]
- `high_OSK_vs_iPSC_modisco.ipynb`: Compares OCT-SOX motifs between xOSK and iPSC states. Manually removes a subcluster of OCT-SOX in iPSC which resembled OCT-only motif. Computes log odds scores of instances [notebook]. Plots heatmap of instances. Exports aggregate PWM, CWM, and per subcluster PWM. `OS_motifs.ipynb` plots the aggregate PWM and CWM, and per subcluster PWMs [notebook].
- `high_OSK_vs_iPSC_modisco_BPNet_logodds.ipynb`: Same as above, but computes log odds scores of instances using the BPNet OCTSOX motif [notebook]
- `FindThreshold.ipynb`: Uses the BPNet derived motif for OSK and the HOMER motif for AP1 (GSE21512). Computes log-odds thresholds for canonical OSK and AP1 motifs by scoring xOSK state seqlets and taking the 10th percentile of scores as threshold. [commit]
- `MotifScanImportanceThreshold.ipynb`: Threshold PWM called motifs by importance scores. Uses xOSK state importance score. Note that not all peaks in peak sets 6,7,8 are also peaks for xOSK, and so hits falling in those are eliminated in this analysis, i.e. only peaks within xOSK state are considered. [notebook] [scan_output]
- `Peak_set_x_affinity.ipynb`: Bar plots of affinity for OSK + AP1 in early transient peak sets [commit]
- `/oak/stanford/groups/akundaje/surag/projects/scATAC-reprog/clusters/20210714_n64913/extra/D2_c11/`: Subsample of xOSK cluster with Day 2 cells only, with fragment file and bulk pipeline output.
- `ChromVAR_affinity.ipynb`: Pseudo-binding saturation curves. Stratified newly opened peaks at day 2 (based on peaks called on xOSK cells subsetted to day 2 only) into 3 bins (low, medium, high) based on the affinity of the OCTSOX motif present. Assigned label corresponding to strongest motif in a peak. Ran ChromVAR on those peak sets and min-max normalized the scores. Repeated analysis for KLF. [commit]
- `ChromVAR_affinity_expr_companion.ipynb`: Per cluster log expression for OSK. [commit]
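The thresholding idea in `FindThreshold.ipynb` (score seqlets with a motif and take the 10th percentile) can be illustrated with the sketch below. It is not the notebook's code: the PWM is represented as a list of per-position probability dictionaries with strictly positive entries, the background is assumed uniform (0.25 per base), and the percentile uses linear interpolation. All names are hypothetical.

```python
import math


def log_odds_score(sequence, pwm, background=0.25):
    """Sum of per-position log2 odds of the sequence under the PWM."""
    return sum(
        math.log2(column[base] / background)
        for base, column in zip(sequence, pwm)
    )


def percentile(values, q):
    """Linearly interpolated quantile q (between 0 and 1) of a list."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] * (1 - frac) + ordered[upper] * frac


def seqlet_threshold(seqlets, pwm, q=0.10):
    """Log-odds threshold set at the q-th quantile of seqlet scores."""
    return percentile([log_odds_score(s, pwm) for s in seqlets], q)


# Toy 2 bp motif that prefers "AA"
toy_pwm = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}] * 2
toy_seqlets = ["AA"] * 9 + ["CC"]
toy_cutoff = seqlet_threshold(toy_seqlets, toy_pwm)
```

Taking a low percentile of seqlet scores, rather than a fixed score, ties the cutoff to sites the model actually used, so roughly 90% of the seqlets would pass the resulting threshold.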
Supplement
- `VortextPlotsESC.ipynb`: Plot with GSE101074 integration. Plotted fibroblast, pre-iPSC and iPSC signal at naive and primed specific ESC peaks borrowed from GSE101074. [commit]
Related
- `Supp/SuppVortexPlots.ipynb`: H3K27ac plot for early transient + E-ON peaks [commit]
- `UMAP_stats.ipynb`: UMAPs [commit]
- `FibrExpr.ipynb`: Expression of fibroblast-specific genes in different multiome clusters. Gene set used came from differential genes between fibroblast and iPSC from scRNA-seq. [commit] [notebook for diff gene calling] [diff genes]
- `PileupPlots.ipynb`: Plot of signal in different cell types and peak sets from D1M and D2M. [commit]
- `AP1_transient_Footprint.ipynb`: Generate aggregate footprints of AP1 at transient sites. Use bias model to get an aggregate null and use that for correction. See README for how regions were generated. [commit] [README]
- `AP1_transient_Footprint_scATAC.ipynb`: Same as `AP1_transient_Footprint.ipynb` but using scATAC-seq xOSK and fibroblast data + model. [commit]
- `TransAP1vsFibrExpr.ipynb`: AP1 sequestration/retention plots plotting fraction of reads in transient/existing peaks with AP1 motifs and fibroblast-specific gene expression. [commit]
- `README`: code for how stats were derived for the text [README]:
  - AP1 peaks overlapping fibroblast peaks and xOSK peaks that are not fibroblast peaks
  - xOSK non fibroblast AP1 instances that overlap with early transient peak sets
  - fibroblast-specific AP1 instances that are closed early during reprogramming. Overall xOSK seems to have fewer peaks called, so to get a conservative estimate (i.e. underestimate how many peaks are closed), we considered fibroblast specific AP1 instances in peaks that are not open in both xOSK and Intermediate (C11) states.
  - note that the hit calling (motif and thresholding) for fibroblast-specific AP1 and xOSK specific AP1 are different. However, using the same one (e.g. from `motifs.importance.thresholded.bed`) results in slightly more xOSK-specific hits. The current approach is thus conservative (i.e. underestimating new non-fibroblast AP1 sites). So for consistency with other plots (footprint, sequestration plots etc) it has been left as is.
Supplement
- `D2_Mixing.ipynb`: Shows how cross-modality nearest neighbors (scATAC vs multiome snATAC) change post Harmony [commit]
- `20230213_design_AP1_sink_vectors`: Simple method to generate AP1 sink vector. Extract genomic instances of AP1 with flank sequence (21bp total). Stitch together and scan for matches to known relevant motifs (using HOMER). Manually edit the sequences to remove non-AP1 matches. Replace AP1 with either `CCCAAACCC` or `GGGTTTGGG`, as scrambling AP1 created sites for OCT or FOX. [commit]
Code used to streamline/organize files to be exported and exposed.
- `20230814_exports/models/models_mtbatchgen.ipynb` and `20230814_exports/models/models_tf2.ipynb`: Renamed and saved files using the `mtbatchgen` env (py3.7, tf 1.14) to the `out/` dir. Re-created models in the `tf2` env (py3.10, tf 2.8) and loaded weights, and saved in SavedModel format. Also saved the weights. [commit]
- `20230814_exports/scATAC_data.ipynb`: Export cell and peak data. Replaced with new cluster IDs and attached peak set labels to peaks. Note that ~5k peaks are not assigned to any set because of low reads/variance in peak set curation that was performed earlier [commit]
- `20230814_exports/scRNA_data.ipynb`: Export Seurat object after removing unnecessary metadata and adding in cluster labels. Also stored raw counts matrix (matrix market format), genes, cell metadata, and PCA coordinates separately. [commit]
- `20230814_exports/snMultiome_ATAC_data.ipynb` and `20230814_exports/snMultiome_RNA_data.ipynb`: Same as above for snMultiome. For ATAC, used integration with D2 and provided transferred cluster labels. No labels provided for RNA. Note that order of cells is different between ATAC and RNA [commit]
- `20230814_exports/prep_vitessce.ipynb`: Prepared scRNA object for Vitessce browser GitHub pages deployment. Added endogenous and Sendai OSKM expression to matrix. [commit] [reprogramming-vitessce]
- `20230814_exports/GO_sheet.ipynb`: Exported GO terms for peak set → gene sets along with the gene sets as an Excel sheet [commit] [sheet]
- `20230814_exports/scATAC_cluster_data.ipynb` and `20230814_exports/models_data.ipynb`: Exported fragment files and peaks called per cluster, and for models: training bigwigs, peak + non peaks, outputs: shap + modisco. [commit]
- `20230814_exports/motifs.ipynb`: Exported consolidated motifs, their representative PFMs, and motif assignments for counts motifs of each cluster. [commit]