Skip to content

Clustering by variable and not by UMAP after RunHarmony #266

@RRingquist

Description

@RRingquist

Hello, I am attempting to integrate a dataset comprised of 22 samples (~1M cells total) and I specifically would like to regress out two covariates (GCResponse and Subject). I have attempted this two ways with separate issues arising from each. I followed the standard Seurat pipeline for QC on the merged object and found that the two covariates in question showed signficant batch effects:

`

Run QC on the Merged Seurat

Combined[["percent.mt"]] <- PercentageFeatureSet(Combined, pattern = "^MT-")
VlnPlot(Combined, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3, pt.size = 0)
Combined <- subset(Combined, subset = nFeature_RNA >= 250 & nCount_RNA >= 500 & percent.mt < 15) #Need to adjust these parameters based on violin plot
Combined <- NormalizeData(Combined)

Identify the most variable genes in the merged object

Combined <- FindVariableFeatures(Combined,
selection.method = "vst",
nfeatures = 2000,
verbose = FALSE)

Scale the counts

Combined <- ScaleData(Combined)

Perform PCA

Combined <- RunPCA(Combined, npcs = 50)

Visualize the unintegrated dataset

Combined <- FindNeighbors(Combined, dims = 1:20, reduction = "pca")
Combined <- FindClusters(Combined, resolution = 0.3, cluster.name = "unintegrated_clusters")
Combined <- RunUMAP(Combined, dims = 1:20, reduction = "pca", reduction.name = "umap.unintegrated")
DimPlot(Combined, reduction = "umap.unintegrated", group.by = 'Subject')`

The unintegrated UMAP shows signficant batch effects arising from the 'Subject' covariate (and in turn 'GCResponse'):

Image

Image

I then attempted to use harmony integration via IntegrateLayers:

Integrate the dataset

`library(parallel)
detectCores() #24
num_cores = 30
combined.hy <- IntegrateLayers(
object = Combined, method = HarmonyIntegration,
orig.reduction = "pca", new.reduction = "harmony",
theta = 4, lambda = 0.5,
max.iter.harmony = 20L, max.iter.cluster = 50L,
verbose = TRUE, num.threads = num_cores)

combined.hy <- FindNeighbors(combined.hy, dims = 1:30)
combined.hy <- FindClusters(combined.hy, resolution = 0.3)
combined.hy <- RunUMAP(combined.hy, reduction = "harmony", dims = 1:30, reduction.name = "umap.harmony")
DimPlot(combined.hy, reduction = "umap.harmony", split.by = 'GCResponse', label.size = 2)`

This somewhat mitigated the effects of the 'Subject' covariate, but still resulted in fairly substantial 'GCResponse' effects (especially between the Healthy and High Responder groups, which is unexpected biologically):

Image

Image

Image

I then attempted to instead run harmony integration directly via RunHarmony:

`

Integrate the dataset

combined.Hy <- RunHarmony(Combined,
group.by.vars = c("Subject", "GCResponse"),
reduction = "pca", assay.use = "SCT", reduction.save = "harmony",
dims = 1:30, theta = c(4,4), lambda = 0.5, max_iter = 50)

combined.Hy <- FindNeighbors(combined.Hy, dims = 1:30)
combined.Hy <- FindClusters(combined.Hy, resolution = 0.3)
combined.Hy <- RunUMAP(combined.Hy, reduction = "harmony", dims = 1:30, reduction.name = "umap.harmony")
DimPlot(combined.Hy, reduction = "umap.harmony", split.by = 'GCResponse', label.size = 2) `

After some tuning of the parameters, I achieved a more uniform UMAP with minimal effects from Subject or GCResponse. My issue here is that the clustering now seems to be heavily influenced by GCResponse instead of the UMAP space.

Image

Image

I am at a loss for how to correct for this, as adjusting the dims and resolution parameters in FindNeighbords and FindClusters, respectively, have not helped. Any advice would be highly appreciated, thank you!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions