scPoli dimensionality and other parameters #260

Open
officialprofile opened this issue Dec 19, 2024 · 0 comments

officialprofile commented Dec 19, 2024

Hello everyone,

I am trying to integrate around 500,000 cells from around a dozen GSE datasets. I'm not sure, however, how to assess the optimal parameter values; the results I get are more or less OK, but not great. I would be very grateful if you could clarify several things and suggest what to do and what to avoid.

  1. I assume we should rely only on HVGs (around 2,000-5,000, I guess), not on all genes? Should the count matrix be normalized?
  2. Does the number of total and pre-training epochs significantly affect the final embedding? For example, will changing the defaults of 100 total and 70 pre-training epochs to 200 and 150 noticeably improve the result?
  3. Is there any rule of thumb for choosing the number of embedding and latent dimensions? Is 50 and 20 a reasonable choice, or maybe 50 and 50? I have some intuition for regular PC or Harmony dimensionality selection, but your neural network is fundamentally different in nature; the latent dimensionality in particular is a mystery to me.
  4. I don't want to transfer any labels, so I removed the cell_type_keys parameter from the scPoli constructor and set labeled_indices to []. This is not the correct approach, right? We need to set labeled_indices anyway (scPoli Model for Unsupervised Use #224)?
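For context on question 1, the basic idea of restricting to highly variable genes can be sketched in plain NumPy. This is only a toy variance-based ranking on log-normalized counts, not scPoli's or scanpy's actual HVG method, and all names here are illustrative:

```python
import numpy as np

def top_hvg_indices(counts: np.ndarray, n_top: int) -> np.ndarray:
    """Toy HVG selection: rank genes (columns) by the variance of
    log1p library-size-normalized counts and keep the n_top most variable."""
    # Library-size normalize each cell (row) to 10,000 counts, then log-transform.
    lib = counts.sum(axis=1, keepdims=True)
    norm = np.log1p(counts / np.clip(lib, 1, None) * 1e4)
    var = norm.var(axis=0)
    return np.argsort(var)[::-1][:n_top]

# Tiny deterministic example: 4 cells x 5 genes; gene 2 varies the most.
X = np.ones((4, 5))
X[:, 2] = [0.0, 50.0, 0.0, 50.0]
idx = top_hvg_indices(X, n_top=2)  # gene 2 is ranked most variable
```

In practice one would of course use an established method (e.g. a dispersion- or Seurat-style selection) on the raw count matrix, then subset the AnnData object before passing it to scPoli.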

My code looks like this:

scpoli_model = scPoli(
    adata=new_adata,
    condition_keys=['GSE'],
    # cell_type_keys=cell_type_key,  # removed: no label transfer wanted
    embedding_dims=50,
    latent_dim=20,
    recon_loss='nb',
)

scpoli_model.train(
    n_epochs=100,
    pretraining_epochs=70,
    early_stopping_kwargs=early_stopping_kwargs,  # defined earlier (not shown)
    eta=5,
)

scpoli_query = scPoli.load_query_data(
    adata=new_adata,
    reference_model=scpoli_model,
    labeled_indices=[],
)

data_latent = scpoli_query.get_latent(new_adata, mean=True)
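As a point of comparison for question 3, the PCA-style heuristic I have intuition for (keep enough components to explain, say, 90% of the variance) can be sketched in NumPy. This is only an analogy for choosing a dimensionality, not how scPoli's latent space works, and the function name is made up for illustration:

```python
import numpy as np

def n_components_for_variance(X: np.ndarray, threshold: float = 0.9) -> int:
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `threshold`."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values of centered data
    ratio = (s ** 2) / np.sum(s ** 2)        # explained-variance ratios
    return int(np.searchsorted(np.cumsum(ratio), threshold) + 1)

# Toy data: 6 features whose standard deviations are 10, 5, 3, 0.1, 0.1, 0.1,
# so almost all variance lives in the first two or three directions.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 6)) * np.array([10.0, 5.0, 3.0, 0.1, 0.1, 0.1])
k = n_components_for_variance(X, threshold=0.9)
# with these scales, two components already explain >90% of the variance
```

For a nonlinear encoder there is no such clean criterion, which is presumably why the latent dimensionality feels like a mystery; in practice people seem to compare a few settings empirically.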