Description
Hi,
I have learned about this nice method recently, and have been exploring it since then. I have a question about some weird results I am getting with some high dimensional gene expression datasets.
Here I am clustering a matrix of size 2700 × 1000
using default parameters (e.g. Genie(n_clusters=3)
) and visualizing the results using 2D UMAP view. I think it's very clear that there are 3 big clusters in this dataset (which is also in concordance with known biology) however genie fails to properly cluster the dataset and assign most points to a single cluster.
When I reduce the dimensionality to 50 using PCA, results make more sense:
So I was wondering if there is a better way to set the parameters of the method so that clustering results make sense regardless of the dimensionality of the input matrix. What do you think? Interestingly, the original dimensionality in gene expression datasets is around 20,000 which is roughly the number of protein coding genes. So the initial matrix of size 2700 × 1000
is actually already subsetted to 1k genes.
In addition, here I am plotting the graph-based clustering result using the Louvain method (which is the default clustering method in our field):
This result looks very clear and also consistent with biology, I think it'd be cool to use this as some sort of ground truth.
Code for reproducing the plots are given below. It requires scanpy and louvain python packages.
#### Generate the input matrix ####
# pip install scanpy
import scanpy as sc
sc.set_figure_params(dpi=100)
ad = sc.datasets.pbmc3k()
sc.pp.filter_genes(ad, min_cells=10)
ad.layers['counts'] = ad.X.copy()
sc.pp.normalize_total(ad, target_sum=10000)
sc.pp.log1p(ad)
sc.pp.highly_variable_genes(ad, n_top_genes=1000, flavor='seurat_v3', subset=True, layer='counts')
sc.pp.scale(ad, max_value=8)
sc.pp.pca(ad)
sc.pp.neighbors(ad)
sc.tl.umap(ad)
# pip install louvain
sc.tl.louvain(ad, resolution=0.2)
X_hidim = ad.X
X_lodim = ad.obsm['X_pca']
#### Clustering ####
import genieclust
g = genieclust.Genie(n_clusters=3)
labels = g.fit_predict(X_hidim)
ad.obs['genie_labels'] = labels.astype(str)
sc.pl.umap(ad, color='genie_labels')
g = genieclust.Genie(n_clusters=3)
labels = g.fit_predict(X_lodim)
ad.obs['genie_labels'] = labels.astype(str)
sc.pl.umap(ad, color='genie_labels')
sc.pl.umap(ad, color='louvain')