Skip to content

genieclust is very sensitive to dimensionality of the input matrix #71

Open
@gokceneraslan

Description

@gokceneraslan

Hi,

I have learned about this nice method recently, and have been exploring it since then. I have a question about some weird results I am getting with some high dimensional gene expression datasets.

image

Here I am clustering a matrix of size 2700 × 1000 using default parameters (e.g. Genie(n_clusters=3)) and visualizing the results using 2D UMAP view. I think it's very clear that there are 3 big clusters in this dataset (which is also in concordance with known biology) however genie fails to properly cluster the dataset and assign most points to a single cluster.

When I reduce the dimensionality to 50 using PCA, results make more sense:

image

So I was wondering if there is a better way to set the parameters of the method so that clustering results make sense regardless of the dimensionality of the input matrix. What do you think? Interestingly, the original dimensionality in gene expression datasets is around 20,000 which is roughly the number of protein coding genes. So the initial matrix of size 2700 × 1000 is actually already subsetted to 1k genes.

In addition, here I am plotting the graph-based clustering result using the Louvain method (which is the default clustering method in our field):

image

This result looks very clear and also consistent with biology, I think it'd be cool to use this as some sort of ground truth.

Code for reproducing the plots are given below. It requires scanpy and louvain python packages.

#### Generate the input matrix ####

# pip install scanpy
import scanpy as sc

sc.set_figure_params(dpi=100)

ad = sc.datasets.pbmc3k()
sc.pp.filter_genes(ad, min_cells=10)

ad.layers['counts'] = ad.X.copy()
sc.pp.normalize_total(ad, target_sum=10000)
sc.pp.log1p(ad)

sc.pp.highly_variable_genes(ad, n_top_genes=1000, flavor='seurat_v3', subset=True, layer='counts')
sc.pp.scale(ad, max_value=8)
sc.pp.pca(ad)
sc.pp.neighbors(ad)
sc.tl.umap(ad)

# pip install louvain
sc.tl.louvain(ad, resolution=0.2)

X_hidim = ad.X
X_lodim = ad.obsm['X_pca']


#### Clustering ####

import genieclust

g = genieclust.Genie(n_clusters=3)
labels = g.fit_predict(X_hidim)
ad.obs['genie_labels'] = labels.astype(str)
sc.pl.umap(ad, color='genie_labels')

g = genieclust.Genie(n_clusters=3)
labels = g.fit_predict(X_lodim)
ad.obs['genie_labels'] = labels.astype(str)
sc.pl.umap(ad, color='genie_labels')

sc.pl.umap(ad, color='louvain')

Metadata

Metadata

Assignees

No one assigned

    Labels

    discussionA frequently asked question/interesting behaviour/etc.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions