-
Notifications
You must be signed in to change notification settings - Fork 27
Tutorial CONICSmat; Dataset: SmartSeq2 scRNA seq of Oligodendroglioma
CONICSmat (COpy-Number analysis In single-Cell RNA-Sequencing from an expression matrix ) is a tool to infer large-scale copy number variations (CNVs) from single-cell RNA-seq data. Typically, tumor biopsies consist of a mixture of neoplastic cells and non-neoplastic cells from the tumor micro-environment. While clustering of cells based on expression of highly variable genes may provide preliminary evidence for their identity, only the presence/absence of somatic mutations can be interpreted as proof for cell identity. To infer the copy number status of each cell, CONICSmat fits a two component Gaussian Mixture Model for each user-provided chromosomal region. The mixture model is fit to the average gene expression of genes within a region, for example all genes on chromosome 10, across all cells. Cells with a deletion of the region will show an on average lower expression from the region than cells without the deletion. The posterior probabilities for each cell belonging to one of the components can then be used to construct a heatmap that visualizes the copy number status of each cell. It is important to note that CONICSmat can be run without a definitive normal control, so single cell RNA-seq of paired non-malignant tissue is not needed. The inferred CNV profiles can be used to triage malignant and non-malignant cells, an to subsequently infer mutational phylogenies of cancer cells.
For questions regarding the usage of CONICSmat use the issues section or email Soren.
First we need to install the mandatory libraries:
install.packages("beanplot")
install.packages("mixtools")
install.packages("pheatmap")
install.packages("zoo")
install.packages("squash")
source("https://bioconductor.org/biocLite.R")
biocLite("biomaRt")
And to install and load CONCICSmat
install.packages("devtools")
devtools::install_github("diazlab/CONICS/CONICSmat", dep = FALSE)
Optional: If the user wants to generate tSNE plots based on the transcriptomic data using CONICSmat:
install.packages("Rtsne")
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("scran")
CONICSmat can be run with at least one and up to three different input files:
- Mandatory: A genes X cells matrix of log2(CPM/10+1) normalized expression counts.
- Optional: A list of regions that have evidence for genomic copy number alterations, for example derived from exome-seq
- Optional: A table with a predicted cell type for each cell. Typically, these predictions are made based on marker gene expression subsequent to clustering in a dimensionality-reduced space. The Seurat R package provides a multitude of elegant functions for this initial, marker-based cell type prediction. We highly recommend to first predict the celltypes based on transcriptomics. Tumor biopsies will in general contain some normal "contamination". These normal cells can be useful in CNV detection. For users that don't want to install Seurat, CONICSmat also provides some functions to generate tSNE plots.
As example, we will use a dataset containing 4,347 cells from 6 adult oligodendroglioma patients.
Tirosh, I., et al. (2016). Single-cell RNA-seq supports a developmental hierarchy in human oligodendroglioma . Nature 539, 309–313 (10 November 2016). doi: doi:10.1038/nature20123
First, we load the CONICSmat R package.
library(CONICSmat)
The expression matrix can be downloaded from the broad single cell portal here.
The expression is provided as log2(CPM/10+1) in the file "OG_processed_data_portal.txt". Furthermore, the cell-type classification of he authors is available in the file "cell_type_assignment_portal.txt".
In a first step we load the expression matrix:
suva_expr = as.matrix(read.table("OG_processed_data_portal.txt",sep="\t",header=T,row.names=1,check.names=F))
suva_expr [which(is.na(suva_expr ))]=0
The matrix holds the expression of 23,686 genes in 4,347 cells.
dim(suva_expr)
[1] 23686 4347
Each row of is identified by a gene name in HUGO format, each column has a unique cell identifier (here shown for the first 5 genes and cells):
suva_expr[1:5,1:5]
MGH36_P6_A12 MGH36_P6_H09 MGH53_P4_G04 MGH36_P10_G12 MGH53_P2_H12
A1BG 0.0000 0.000 0.0000 0.0000 0.00000
A1BG-AS1 0.0000 0.000 0.0000 0.0000 0.00000
A1CF 0.0000 0.000 0.0000 0.0000 0.02148
A2M 5.7056 4.437 8.0276 5.6288 0.00000
A2M-AS1 0.0000 0.000 4.5347 0.0000 0.00000
In addition, we would like to infer a vector that holds the information from which patient the cells were derived from:
patients=unlist(strsplit(colnames(suva_expr),"_",fixed=TRUE))[seq(1,(3*ncol(suva_expr))-1,3)]
unique(patients)
[1] "MGH36" "MGH53" "MGH54" "MGH60" "97" "MGH97" "93" "MGH93"
To correct the assignment of cells starting with 97 and 93 to the right patients, we modify their patient label
patients[which(patients=="93")]="MGH93"
patients[which(patients=="97")]="MGH97"
Finally, we can download a matrix containing the coordinates of the human chromosome arms here:
regions=read.table("chromosome_arm_positions_grch38.txt",sep="\t",row.names = 1,header = T)
head(regions,n=5)
Chrom Start End Length
1p 1 0 122026459 122026459
1q 1 124932724 248956422 124023698
2p 2 0 92188145 92188145
2q 2 94090557 242193529 148102972
3p 3 0 90772458 90772458
This matrix contains the chromosomal coordinates of all chromosome arms on autosomes. If no information on chromosomal alterations based on DNA sequencing (e.g. exome-seq) is available, this file can be used to test for chromosome arm-scale copy number alterations. If exome-seq is available, this file should be replaced with a file holding large-scale CNVs.
The first step of data processing is to obtain the chromosomal positions of genes in the expression matrix.
gene_pos=getGenePositions(rownames(suva_expr))
Subsequently, we can filter uniformative genes. These are genes which are expressed in only very few cells (default >5 cells).
suva_expr=filterMatrix(suva_expr,gene_pos[,"hgnc_symbol"],minCells=5)
Now, we calculate a normalization factor for each cell. Because the average gene expression in each cell depends on the number of expressed genes (the more genes expressed in one cell, the less reads are "available" per gene), the normalization factor centers the gene expression in each cell around the mean.
normFactor=calcNormFactors(suva_expr)
The next step is to determine if the average gene expression any of the regions show a bimodal distribution across cells. First, the average expression in each cell is centered using the previously calculated normalization factor. Then, the z-score of the centered gene expression across all cells is calculated. Based on these z-scores, a Gaussian mixture model is calculated with the mixtools package. Important: We only calculate results for regions harboring more than 100 expressed genes (as defined by the initial filtering step) to make sure the predictions are not influenced by a few deferentially expressed genes in a small region.
l=plotAll(suva_expr,normFactor,regions,gene_pos,"SUVA_CNVs")
The result are visualized in a pdf file as given by the last argument of the function. In addition a matrix with the Bayesian Information Criterion (BIC) and the adjusted likelihood ratio test (LRT) adjusted p-values for each region is written. In the pdf for each region four plots are being displayed, here as an example for chromosome 1p (deletion of 1p/19q is a hallmark of oligodendrogioma).
Output of the gaussian mixture model (GMM) fitting step. On the top left: Individual Gaussian density components in the mixture distribution, each scaled by the estimated probability of a cell being drawn from that component distribution. The model log likelihood and the chromosomal region are given on topTop right: Cells (dots) and their average expression z-score. Bottom left: Barplot of the Bayesian Information Criterion (BIC) for a one-component and a two-component mixture model. Bottom right: Distribution of posterior probabilities assigned to cells for component two.
We observe a clearly bimodal distribution, reflected by the well-separated density components and posterior probabilities. This suggests that there is cells with a chromosome 1p deletion. In case exome-seq is available, this information can easily be validated. Chromosome 1q on the other hand, shows a different result:
Output as decribed above, for chromosome 1q.
We observe a unimodal distribution and the log likelihood of the two component gaussian mixture model is much lower than for chromosome 1p. Also, we see that the BIC is lower for the two component mixture model on chromosome 1p, but not on chromosome 1q. The BIC is one of the most standard measures to decide the number of components for a mixture model and the model with the lowest BIC is preferred. Therefore, chromosome 1q might not be affected by copy number alterations. This further manifests in the adjusted p-value of a LHR test comparing the log likelihoods of the one component to the likelihood of the two component model. For convenience, a pdf with results for all arms can also be downloaded here.
Now, we can visualize a heatmap of the posterior probabilities of cells for component2 of each region. Here, component2 is defined as the component with the larger mean. Therefore cells with a higher expression at that locus will appear in red, cells with a lower expression in blue. Posterior probabilities are given as a row-wise z-score.
hi=plotHistogram(l,suva_expr,clusters=2,zscoreThreshold=4,patients)
Heatmap of z-scored posterior probabilities for cells (columns) across regions (rows). Columns are clustered via hierarchical clustering using a euclidean distance measure.
The dendrogram cut was set to identify 2 clusters. Cluster 2 (right) is characterized by cells having higher expression form chromosomes 1p and 19q than all other cells. This indicates they might be non-malignant cells not harboring these founder mutations.
To visualize if the CNV clusters are related to transcriptional signatures, we can generate a tSNE plot and visualize the expression of marker genes as well as posterior probabilities. The tSNE plot is based on the expression of the 500 most variable genes in the dataset, estimated via the decomposeVar() function in the scran bioconductor package. Based on theses genes a dimensionality reduction is performed as implemented in the Rtsne package. The results are visualized as a 2D scatterplot. In this scatterplot, the expression of marker genes can be visualized.
vg=detectVarGenes(suva_expr,500)
ts=calculateTsne(suva_expr,vg)
plotTsneGene(ts,suva_expr,c("MBP","CSF1R","ALDOC","OLIG1"))
Tsne visualization of all cells from the Suva dataset utilizing the implementation in the Rtsne package. Cells are colored according to their expression of indicated marker genes (log2(CPM/10+1).
The tSNE plot suggests that there is a subset of normal Oligodendrocytes (MBP+ cells), a subset of microglia (CSF1R+), astrocyte-like tumor cells (ALDOC+) and oligodendroctye-like tumor cells (OLIG1+). But, OLIG1 and ALDOC are also expressed in the MBP+ cluster, so another layer of information is needed to classify the cells into malignant and non-malignant.By crosslinking this marker-based classification with our GMM-based CNV predictions (here chromosome 1q as an example) via the same Tsne, we obtain the follwoing result:
plotTsneProbabilities(ts,suva_expr,l[,"1p"],"Tsne Chr1p")
Tsne plot as shown before, now colored by posterior probability for component 1 of the gaussian mixture model fit on average expression per cell from chromosome 1p.
Strikingly, the two clusters of putative non-malignant cells are those cells which exhibit higher levels of expression from chromosome 1q, suggesting they do not have a loss of this chromosome arm. Notably, CSF1R is encoded on chromosome 5q, MBP on chromosome 18q, so the expression of these genes is not directly affected by chr1 alterations.
To obtain a final assignmant as malignant or non-malignant cells, we first filter uninformative, noisy regions based on the results of the likelihood ratio test and the BIC for each region.
lrbic=read.table("SUVA_CNVs_BIC_LR.txt",sep="\t",header=T,row.names=1,check.names=F)
colnames(lrbic)
candRegions=rownames(lrbic)[which(lrbic[,"BIC difference"]>200 & lrbic[,"LRT adj. p-val"]<0.01)]
For these regions we again generate a heatmap of posterior probabilities.
hi=plotHistogram(l[,candRegions],suva_expr,clusters=4,zscoreThreshold=4,patients)
Heatmap as described before, filtered for a specific set of regions based on significance of the LRT and diffence in BIC. The ID for each cluster is given in grey on the bottom right
Since these results support our previous assignments, we can now assign a label as malignant or non-malignant to each cell based on the clusters returned by the heatmap function. The cluster ID for each cell is returned by the plotHistogram () function and the leftmost cluster has ID 1 (as given by the ID information on the bottom right).
normal= which(hi==1)
tumor=which(hi!=1)
Now, we can plot the posterior probabilities again, but with statistics for normal and tumor cells:
redu=plotAll(suva_expr,normFactor,regions[candRegions,],gene_pos,"SUVA_CNVs_with_info.pdf",normal=normal,tumor=tumor)
Visualization of the GMM step with tumor/normal assignments. On the top left: Individual Gaussian density components in the mixture distribution, each scaled by the estimated probability of a cell being drawn from that component distribution. The model log likelihood and the chromosomal region are given on topTop right: Cells (dots) and their average expression z-score colored by their assigned class. Bottom left: BIC of a 1 component and two component mixture model. Bottom right: Visualization of cells (dots) and their expression z-score. Color indicates the assigned class.
By thresholding on the posterior probabilities we can next generate a binary matrix, where 1 indicates the presence of a CNV and 0 the absence. Based on the average expression in the normal cells, we know if an alteration is either a copy number gain or a loss.
bin_mat=binarizeMatrix(redu,normal,tumor,0.8)
plotBinaryMat(bin_mat,patients,normal,tumor,patient="MGH97")
Heatmap as described before, now only for cells of MGH97 and after transforming the posterior probabilities to 0 or 1. 1 indicates the presence (black), 0 the absence (grey)of a CNV. White indicates that the posterior probability for component 1 was 0.2<p<0.8 and the presence of a CNV could not be called but also not rejected.
Plotting the thresholded matrix for patient MGH97 reveals three groups of cells: Cells without CNVs, one clone with 1p/19q loss, and one clone with 1p/19 loss, chr7 gain and chr4 loss. Interestingly, the exome-sequencing of the tumor DNA and matched blood controls validates the presence of copy number alterations on chromosomes 1p/19, chr4 and chr7 in MGH97 (compare figure1 of Tirosh et al). This underlines the robustness of our approach.
To visualize the breakpoints for each chromosome (and these visualizations can be used to re-generate the figure above with non a more defined set of regions, not just focused on chromosome arms) we can execute the detectBreakPoints function.
detectBreakPoints (suva_expr,normal,tumor,windowsize=101,gene_pos=gene_pos,chr=1,patients=patients,patient="MGH36",breakpoints=regions)
detectBreakPoints (suva_expr,normal,tumor,windowsize=101,gene_pos=gene_pos,chr=4,patients=patients,patient="MGH36",breakpoints=regions)
detectBreakPoints (suva_expr,normal,tumor,windowsize=101,gene_pos=gene_pos,chr=7,patients=patients,patient="MGH36",breakpoints=regions)
detectBreakPoints (suva_expr,normal,tumor,windowsize=101,gene_pos=gene_pos,chr=8,patients=patients,patient="MGH36",breakpoints=regions)
detectBreakPoints (suva_expr,normal,tumor,windowsize=101,gene_pos=gene_pos,chr=10,patients=patients,patient="MGH36",breakpoints=regions)
detectBreakPoints (suva_expr,normal,tumor,windowsize=101,gene_pos=gene_pos,chr=12,patients=patients,patient="MGH36",breakpoints=regions)
Visualization of the smoothed ratio of expression from all tumor vs normal cells (y-axis) on indicated chromosomes. To obtain a smoothed expression we averaged the expression in the pooled normal and the pooled tumor cells in a sliding window of size 101 genes. Genes were ordered according to their chromosomal location. The numbers on the x-axis represent the window position on the sorted list of genes and do not reflect real genomic coordinates. The gene which is closest to the centromere of each chromosome is highlighted with a horizontal line, indicating the break point between p and q arm.
This approach pools all expression in normal cells and compares their ratio to the pooled expression in tumor cells after smoothing the expression of genes in a sliding window of user-defined size (here 101). The ratios are median-centered prior to plotting, assuming that less than 50% of the genome are altered by CNVs.
Alternatively, we can also generate a pdf file containing plots for all regions.
plotAllChromosomes (mat = suva_expr,normal = normal,tumor = tumor,windowsize = 101,gene_pos = gene_pos,fname = "MGH36",patients = patients,patient = "MGH36",breakpoints = regions)
The approach shows clear differences in the normal/tumor ratio for chr1p/19q, 4, 7, and 12 in MGH36. These are the regions which the authors also identify as altered by copy number in their manuscript. We identified the large regions altered by copy number even without the use of the information from exome-sequencing. If exome sequencing is available, we recommend to focus the analysis on regions predicted to be affected by copy number alterations, and check chromosomes without CNVs detected in exome-seq to serve as control regions.
To obtain a matrix of corrected p-values for each cell and each CNV, we identify the component representing the cell population without CNVs for each region. Subsequently, we utilize the estimated parameters (mean and standard deviation) of this component to calculate a p-value for each of the tumor cells based on its z-scored expression with the pnorm() function.
r=generatePvalMat(suva_expr,regions[candRegions,],normFactor,normal,tumor,gene_pos,threshold=0.8)
binr=ifelse(r>0.1,0,1)
boxplot(r)
The matrix r holds all adjusted p-values (Benjamini-Hochberg) for each cell and each CNV candidate region. The matrix binr is a binarized version of r, where the presence of a CNV is thresholded on an adjusted p-val<0.1.
Distribution of adjusted p-values for CNV presence in tumor cells for selected regions.
We observe that of the two oligodendroglioma hallmarks and putative founder events (1p/19q deletion), the 1p deletion is detected in all cells at adj. p <0.05, and the 19q deletion is detected in almost all cells with a few exceptions. These could be false-negative calls, or the 1p deletion might precede the 19q deletion in the development of some tumors.
The binarized matrix can be used subsequently to generate a pylogenetic tree as described here.
Last, we can generate a visualization of chromosomal alterations in each single cell across the genome for each patient. In the underlying function we compute the average expression in normal cells in a smoothing window (again, genes are sorted by genomic position). To reduce noise we filter for genes that are robustly measured in normal and tumor cells. In each cell, we center the expression ratio to this normal control by the mean, assuming that less than 50% of loci will be altered by copy number.
par(mfrow=c(1,1))
plotChromosomeHeatmap(suva_expr,normal = normal, plotcells = which(patients=="MGH36"), gene_pos = gene_pos, windowsize = 121, chr=T, expThresh=0.2, thresh = 1)
plotChromosomeHeatmap(suva_expr,normal = normal, plotcells = which(patients=="MGH97") ,gene_pos = gene_pos,chr=T,windowsize = 121, expThresh=0.2, thresh = 1)
Visualization of chromosomal alterations in cells (rows) along the genome (columns). Colors indicate copy number gains (red) and losses (blue). Expression in each cell was smoothed in a sliding window along genes ordered by their chromosomal location. The average expression in non-malignant cells was used to normalize the expression data. Therefore, non-malignant cells, like the ones on the bottom of the figure for MGH36, show no alterations. Additionally, clones withing one tumor can be visually identified, for example cells harboring 1p/19q, chr4, chr7, chr12 alterations vs cells with just chr1p/19q, chr7 alterations in MGH36.
Here, the windowsize determines the size of the smoothing window, the chromosome vector indicates the chromosome(s) used to sort the cells (for example to identify clones with or without CNVs on chr12 and chr4), expThresh determines the minimum average expression in control and tumor cells for a gene to be considered, and thresh is a visualization threshold. Centered ratios above this threshold will be set to threshold.