genieclust is very sensitive to dimensionality of the input matrix

Hi,

I learned about this nice method recently and have been exploring it since. I have a question about some odd results I am getting with high-dimensional gene expression datasets.

<img width="305" alt="image" src="https://user-images.githubusercontent.com/1140359/97889034-da4f5a00-1cf9-11eb-8a88-0210c97f42bf.png">


Here I am clustering a matrix of size `2700 × 1000` using default parameters (e.g. `Genie(n_clusters=3)`) and visualizing the results in a 2D [UMAP](http://umap-learn.readthedocs.io) view. I think it's very clear that there are 3 big clusters in this dataset (which also agrees with known biology). However, Genie fails to cluster the dataset properly and assigns most points to a single cluster.

When I reduce the dimensionality to 50 using PCA, results make more sense:

<img width="293" alt="image" src="https://user-images.githubusercontent.com/1140359/97889201-11be0680-1cfa-11eb-831b-850621f8e95b.png">

So I was wondering if there is a better way to set the parameters of the method so that the clustering results make sense regardless of the dimensionality of the input matrix. What do you think? Interestingly, the original dimensionality of gene expression datasets is around 20,000, which is roughly the number of protein-coding genes. So the initial `2700 × 1000` matrix is already a subset restricted to 1,000 genes.

For comparison, here is the graph-based clustering result from the Louvain method (the default clustering method in our field):

<img width="301" alt="image" src="https://user-images.githubusercontent.com/1140359/97889606-8ee97b80-1cfa-11eb-86a5-b975ba6a48bb.png">

This result looks very clear and is also consistent with the biology. I think it'd be cool to use it as some sort of ground truth.
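If it helps, agreement with the Louvain labels could be quantified with the adjusted Rand index (ARI) rather than judged by eye. Below is a minimal, standard-library-only sketch of that idea. The function name `adjusted_rand_index` is my own, not part of genieclust or scanpy, and in practice `sklearn.metrics.adjusted_rand_score` computes the same quantity.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same points.

    Hypothetical helper for illustration only.
    """
    labels_a, labels_b = list(labels_a), list(labels_b)
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same points")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Contingency table (as a counter of label pairs) and its marginals.
    pair_counts = Counter(zip(labels_a, labels_b))
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)

    # Number of point pairs placed together by both, by A, and by B.
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in counts_a.values())
    sum_b = sum(comb(c, 2) for c in counts_b.values())

    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (sum_pairs - expected) / (max_index - expected)


# Identical partitions up to a renaming of the labels.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

It could be applied to the `louvain` column of `ad.obs` and the labels returned by `fit_predict`. A value near 1 would mean Genie recovers the Louvain partition, and a value near 0 would mean agreement is no better than chance.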

Code for reproducing the plots is given below. It requires the scanpy and louvain Python packages.

<details>

```python
#### Generate the input matrix ####

# pip install scanpy
import scanpy as sc

sc.set_figure_params(dpi=100)

ad = sc.datasets.pbmc3k()
sc.pp.filter_genes(ad, min_cells=10)

ad.layers['counts'] = ad.X.copy()
sc.pp.normalize_total(ad, target_sum=10000)
sc.pp.log1p(ad)

sc.pp.highly_variable_genes(ad, n_top_genes=1000, flavor='seurat_v3', subset=True, layer='counts')
sc.pp.scale(ad, max_value=8)
sc.pp.pca(ad)
sc.pp.neighbors(ad)
sc.tl.umap(ad)

# pip install louvain
sc.tl.louvain(ad, resolution=0.2)

X_hidim = ad.X
X_lodim = ad.obsm['X_pca']


#### Clustering ####

import genieclust

# Genie on the full 2700 x 1000 scaled expression matrix
g = genieclust.Genie(n_clusters=3)
labels = g.fit_predict(X_hidim)
ad.obs['genie_labels'] = labels.astype(str)
sc.pl.umap(ad, color='genie_labels')

# Genie on the 50 principal components (overwrites the labels above)
g = genieclust.Genie(n_clusters=3)
labels = g.fit_predict(X_lodim)
ad.obs['genie_labels'] = labels.astype(str)
sc.pl.umap(ad, color='genie_labels')

# Louvain clustering for comparison
sc.pl.umap(ad, color='louvain')
```

</details>
