Cell type from gene expression #33

matanninio · 2025-01-21T14:05:56Z

This code supports cell-type annotation in MAMMAL. Data is assumed to come in AnnData format (h5ad) and to be pre-processed prior to training. A script for performing the pre-processing is also included, along with a notebook explaining the data processing. A model trained for 24h on a single GPU is used for testing, but we may want to train it for several more days and export it to HF.

mosheraboh

Matan, looks good!
See some inline comments.

mosheraboh · 2025-02-05T08:53:36Z

mammal/examples/scrna_cell_type/anndata_op.py

+        key = sample_dict[self._key_name]
+
+        # locate the required item
+        sample_dict[f"{prefix}.scrna"] = self._data[key, :].X


Where do you convert it to geneformer format?

The GeneFormer conversion is done in two places: preprocess_ann_data handles the binning and standardization, and data_preprocessing handles the sorting and truncating.

added comments
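The sorting and truncating step mentioned above can be sketched roughly as follows. This is a minimal illustration, not the code in data_preprocessing: the function name `rank_and_truncate`, the choice to drop genes in bin 0, and the alphabetical tie-break are all assumptions made for the example.

```python
def rank_and_truncate(gene_names, binned_expression, max_genes):
    """Order genes by descending bin value and keep at most max_genes of them."""
    # Assumption: unexpressed genes (bin 0) are dropped before ranking.
    expressed = [
        (name, bin_id)
        for name, bin_id in zip(gene_names, binned_expression)
        if bin_id > 0
    ]
    # Sort by bin (highest first); ties are broken alphabetically for determinism.
    expressed.sort(key=lambda pair: (-pair[1], pair[0]))
    return [name for name, _ in expressed[:max_genes]]


cell = rank_and_truncate(
    gene_names=["CD3E", "MS4A1", "NKG7", "LYZ", "GNLY"],
    binned_expression=[4, 0, 7, 4, 2],
    max_genes=3,
)
print(cell)
```

The result is a per-cell list of gene names ordered from most to least expressed, which is the kind of sequence a rank-based tokenizer would consume.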

mosheraboh · 2025-02-05T08:54:37Z

mammal/examples/scrna_cell_type/pl_data_module.py

+        test_size=0.1,
+        stratify_by=stratify_by,
+    )
+    anndata_dict["train"], anndata_dict["valid"] = anndata_train_test_split(


Isn't the split predefined?

Not in this version, but this can be added. Is this important for now?
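For reference, a stratified split of the kind the `test_size` and `stratify_by` arguments suggest could look like the sketch below. It works on plain label lists rather than AnnData objects, and `stratified_split` is a hypothetical helper, not the `anndata_train_test_split` function from the PR.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.1, seed=0):
    """Return (train_idx, test_idx) keeping each label's share roughly equal."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Aim for at least one test sample per label, but always leave one for training.
        n_test = max(1, round(len(indices) * test_size))
        n_test = min(n_test, len(indices) - 1)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = ["B cell"] * 20 + ["NK cell"] * 10
train_idx, test_idx = stratified_split(labels, test_size=0.1)
print(len(train_idx), len(test_idx))
```

A predefined split, as asked about above, would replace the shuffling with a lookup of a stored train/test assignment per cell.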

mosheraboh · 2025-02-05T08:58:55Z

mammal/examples/scrna_cell_type/task.py

+        return ans
+
+
+def load_cell_type_mapping(


Can you remind me what is cell type mapping?

This is used to convert the names from the ones in the input anndata to the
ones that are known to the tokenizer.
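As a rough illustration of what such a mapping does, the sketch below translates dataset names into tokenizer names with a plain dictionary. The function `map_cell_types` and the example names are hypothetical; the real mapping is whatever `load_cell_type_mapping` reads.

```python
def map_cell_types(raw_names, mapping):
    """Translate dataset-specific cell-type names to tokenizer-known names."""
    unknown = sorted({name for name in raw_names if name not in mapping})
    if unknown:
        # Fail loudly: an unmapped name could not be tokenized later on.
        raise KeyError(f"no tokenizer name for cell types: {unknown}")
    return [mapping[name] for name in raw_names]


# Hypothetical mapping; the actual names live in the example's mapping file.
example_mapping = {
    "CD19+ B": "b_cell",
    "CD56+ NK": "natural_killer_cell",
}
mapped = map_cell_types(["CD56+ NK", "CD19+ B"], example_mapping)
print(mapped)
```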

mosheraboh · 2025-02-05T09:00:40Z

mammal/examples/scrna_cell_type/task.py

+        #     positive_token_id: 1,
+        # }
+        classification_position = 1
+        if decoder_output_scores is not None:


Should we return just the pred if scores are not available?

I don't think I understand your question. The output of this is used by code (the metrics), so I would rather we not return potentially confusing answers.

mosheraboh · 2025-02-05T09:03:50Z

mammal/examples/scrna_cell_type/pl_data_module.py

+        data_path = Path(__file__).parent / data_path
+    # read files
+    anndata_object = anndata.read_h5ad(data_path)
+    preprocess_ann_data(anndata_object)


Is it possible to move all the necessary processing to the pre-processing function?

The split between the two preprocessing stages is needed because some of the preprocessing does not naturally produce an AnnData-compatible result. The part done in preprocess_ann_data can be saved as an AnnData file.

I can move all the processing there, but that would not be as efficient. Maybe with a cache/memoization it would work nicely.
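The cache/memoization idea could be sketched with the standard library as below. This is only an illustration of the technique: `preprocess_cell` is a hypothetical stand-in for the per-cell work, and it takes a tuple because `functools.lru_cache` requires hashable arguments.

```python
from functools import lru_cache

call_count = 0


@lru_cache(maxsize=None)
def preprocess_cell(raw_counts):
    """Stand-in for an expensive per-cell transformation (here: sort and truncate)."""
    global call_count
    call_count += 1
    return tuple(sorted(raw_counts, reverse=True)[:3])


first = preprocess_cell((5, 0, 9, 2))   # computed
second = preprocess_cell((5, 0, 9, 2))  # served from the cache
print(first, call_count)
```

An in-memory cache like this only helps within one process; persisting results across runs would need an on-disk cache instead.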

yoavkt

Mainly, there are some gaps in the README: it is unclear how to run this example and what to expect while it is running or once it is done. Some comments were made in the code and in the README itself.

mammal/examples/scrna_cell_type/data/README.md

mammal/examples/scrna_cell_type/pl_data_module.py

mammal/examples/scrna_cell_type/task.py

yoavkt

As we discussed, we want to have two data scripts:

From Zheng data to anndata (including but anndata preprocessing)
Modifies anndata to the binned format with the cell types.

The data preparation part of the README should reflect this. I suggest:
Title 1: Getting the data ready for training
Sub title 1.1: Download the Zheng data (optional)
Sub title 1.2: Zheng data to anndata (optional)
Sub title 1.3: From anndata to MAMMAL-ready anndata
The scripts should not include README the numbering is just to order them.

mammal/examples/scrna_cell_type/data/Zheng68k_to_anndata.py

mammal/examples/scrna_cell_type/pl_data_module.py

mammal/examples/scrna_cell_type/scRNA_infer.py

mammal/examples/scrna_cell_type/test_h5ad_data_file.sh

mammal/examples/scrna_cell_type/README.md

yoavkt · 2025-03-31T07:49:59Z

Hi, so in the last meeting we talked about having two scripts:

From Zheng data to anndata (including the (I had a typo, I meant to write "the" and not "but") anndata preprocessing)
Modifies anndata to the binned format with the cell types.

Now, as I understand it, we have two scripts: one that creates the h5ad file and does the preprocessing, and another that only does the preprocessing and is called from the first one. So this is not what we discussed.

Again, one is Zheng-specific (take the data and pack it into an h5ad file) and the other is MAMMAL-specific (add the cell type and do the binning).

matanninio · 2025-03-31T08:59:55Z

Cell type observations are part of the normal AnnData, and come with the AnnData file (or they don't, if the cell types have not been checked). In this specific Zheng data, the expression data and the cell-type data come from different servers, but this is not the typical case. The cell type is not an add-on; it is an integral part of the data, as are any other observations that may have been made on the cells.
The one non-standard part of the cell-type part of the AnnData is the specific key used to mark the data in the file. This parameter is externalized with the `task.data_module_kwargs.label_name` parameter in the config file (which is used for fine-tuning, but is not meant to be used in other parts of the workflow).

Added a comment on this to the README file (line 36)
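To make the role of the label key concrete, here is a small sketch using plain dictionaries in place of the real config file and the AnnData observation table. Only the key path `task.data_module_kwargs.label_name` comes from the discussion; the value `"cell_type"`, the `obs` contents, and the helper `get_labels` are assumptions for illustration.

```python
# Hypothetical config fragment; only the key path comes from the discussion.
config = {"task": {"data_module_kwargs": {"label_name": "cell_type"}}}

# Stand-in for AnnData's per-cell observation table (`obs`): column -> values.
obs = {
    "n_counts": [1200, 950, 1810],
    "cell_type": ["CD19+ B", "CD56+ NK", "CD19+ B"],
}


def get_labels(obs_table, cfg):
    """Pick the label column named by task.data_module_kwargs.label_name."""
    label_name = cfg["task"]["data_module_kwargs"]["label_name"]
    if label_name not in obs_table:
        raise KeyError(f"observation column {label_name!r} not found")
    return obs_table[label_name]


cell_labels = get_labels(obs, config)
print(cell_labels)
```

With a dataset that stores its labels under a different observation key, only the config value would need to change.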

I'm not sure what exact split we wanted between the two scripts. Specifically, running the pre-processing step can be done inside or outside the Zheng build script. I think the decision was to do both, but not to call one script from the other. If you can verify this, I will make the change.

yoavkt

The decision was to create two separate scripts: one Zheng-specific script with preprocessing etc., and one MAMMAL-specific script that adds the cell labels for prediction.

The idea was that even if the user decides not to use the Zheng data, they will still have something.

mammal/examples/scrna_cell_type/README.md

mammal/examples/scrna_cell_type/data/Zheng68k_to_anndata.py

matanninio · 2025-03-31T12:18:48Z

To do after meeting:

Two scripts:
I. A script for zheng_68k that creates a standard AnnData h5ad file with counts and cell type (name of the key set via click).
II. A general script for preprocessing with filtering (params in click), normalization to 1.0, log(1+p) and binning.
Script I does not run the preprocessing step on zheng_68k (I suggest it print a message about this needing to be done).
Write a clear message stating that the preprocessing here is what we found to work well with this task, but that it may need to be revised if the data process is different.
- Not sure where we wanted this to be. @yoavkt, I will be happy for help with this issue.
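The four steps listed for the general script (filtering, normalization, log(1+p), binning) can be sketched in pure Python as below. This is an illustration under assumptions, not the code in process_h5ad_data.py: the parameter names, the filter threshold, and the equal-width binning relative to the largest value are all hypothetical choices.

```python
import math


def preprocess_cells(counts, min_total=3, target_sum=1.0, num_bins=10):
    """Filter shallow cells, normalize depth, apply log(1+x), then bin."""
    # 1. Filter: drop cells whose total count is below min_total.
    kept = [row for row in counts if sum(row) >= min_total]
    # 2. Normalize each cell to the same total, then 3. take log(1 + x).
    logged = [
        [math.log1p(value * target_sum / sum(row)) for value in row]
        for row in kept
    ]
    # 4. Bin: zero stays in bin 0, positive values map to bins 1..num_bins - 1,
    #    spread evenly up to the largest value seen in the data (an assumption).
    top = max((value for row in logged for value in row), default=0.0)
    if top == 0.0:
        return [[0 for _ in row] for row in logged]
    return [
        [0 if v == 0 else max(1, math.ceil(v / top * (num_bins - 1))) for v in row]
        for row in logged
    ]


raw = [[0, 2, 8], [1, 0, 0], [5, 5, 0]]
binned = preprocess_cells(raw)
print(binned)
```

A different dataset or sequencing protocol may call for other thresholds or a different binning scheme, which is why these are exposed as parameters in the sketch.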

…iomedSciAI/biomed-multi-alignment into cell_type_from_gene_expression

matanninio · 2025-03-31T13:35:51Z

The two scripts are in place as mentioned, including the message at the end of the first (if in verbose mode).
The README needs to be updated. @yoavkt, can you verify that the scripts are as intended?

matanninio · 2025-03-31T13:55:42Z

The README has been updated to reflect the new data processing. I did add a message about the parameters of the binning, but I think we wanted something stronger. I would be happy if you could contribute a paragraph explaining that the preprocessing presented here is replaceable.

yoavkt

Two small changes; I do not view them as a must. From my point of view, you can pull.

yoavkt · 2025-03-31T14:32:33Z

mammal/examples/scrna_cell_type/data/process_h5ad_data.py

+
+    anndata_object = anndata.read_h5ad(input_h5ad_file)
+    # process the data - filter out cells with shallow reads, normelize depth and change to log scale of about 0-10 (log_2(1001)~=10)
+    preprocess_ann_data(


I think it makes more sense to put the method preprocess_ann_data here than in pl_data_module.

It's the same issue of script/no script. I would rather leave it here, if that's OK with you. This is a CLI script, very minimal, so it's not going to be confusing this way, IMHO.

I will add that in a better implementation of this code, that function would be a parameter of the data process and would happen at read time rather than at prep time: a parameter/callback to allow the user to perform the transformation as she pleases. But I don't think this is the time.
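The parameter/callback design described here might look like the following sketch, where the transformation is passed in and applied as each cell is read. The names `iter_cells`, `log1p_transform`, and `Transform` are hypothetical and do not exist in the PR.

```python
import math
from typing import Callable, Iterator, List, Sequence

Transform = Callable[[Sequence[float]], List[float]]


def log1p_transform(row: Sequence[float]) -> List[float]:
    """Default transform: log(1 + x) on every value."""
    return [math.log1p(value) for value in row]


def iter_cells(
    rows: Sequence[Sequence[float]],
    transform: Transform = log1p_transform,
) -> Iterator[List[float]]:
    """Yield cells one at a time, applying the supplied transform at read time."""
    for row in rows:
        yield transform(row)


rows = [[0.0, 1.0], [3.0, 0.0]]
# A user who prefers a different scaling swaps in their own callable.
doubled = list(iter_cells(rows, transform=lambda row: [2 * v for v in row]))
print(doubled)
```

The trade-off is that the transform runs on every read instead of once at prep time, which is where caching would come in.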

mammal/examples/scrna_cell_type/pl_data_module.py

matanninio · 2025-04-02T05:05:19Z

Old comment that was stuck in limbo for some days

Added a test script. To try it out, run `./mammal/examples/scrna_cell_type/test_h5ad_data_file.sh ./mammal/examples/scrna_cell_type/data/Zheng_68k.h5ad`. It will fail if the data cannot be used for training, and will report anything that looks suspicious.

Consider this a framework. If you feel we should add or remove tests, soft or hard, please let me know.
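The hard/soft distinction could be organized along the lines of the sketch below, where hard checks produce errors that block training and soft checks produce warnings. The specific checks and the `check_dataset` helper are hypothetical examples, not what test_h5ad_data_file.sh actually runs.

```python
def check_dataset(labels, total_counts, label_name="cell_type"):
    """Return (errors, warnings): errors block training, warnings only flag oddities."""
    errors, warnings = [], []
    # Hard checks: training cannot proceed without these.
    if not labels:
        errors.append(f"no '{label_name}' labels found")
    if len(labels) != len(total_counts):
        errors.append("number of labels does not match number of cells")
    # Soft checks: suspicious, but not fatal.
    if labels and len(set(labels)) < 2:
        warnings.append("only one cell type present")
    if any(total == 0 for total in total_counts):
        warnings.append("some cells have zero total counts")
    return errors, warnings


errors, warnings = check_dataset(["B", "B", "B"], [120, 0, 87])
print(errors, warnings)
```

A wrapper script would exit with a non-zero status when `errors` is non-empty and simply print the warnings otherwise.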

MATAN NINIO added 8 commits November 24, 2024 16:59

initial example of using mammal for scRNA cell type idenfication. Not…

21e14da

… quite ready for PR

Merge branch 'main' of https://github.com/BiomedSciAI/biomed-multi-al…

91d3812

…ignment into cell_type_from_gene_expression

merged anndata_op code (changed in more than one branch)

43bd58d

Merge branch 'main' into cell_type_from_gene_expression

7a22365

Merge branch 'main' into cell_type_from_gene_expression

acb98d8

cell type finetune runs. mostly some minor naming tweeks and the such…

5ef3c69

…, fixing code rot. Results have not been checked due to ccc outage

mostly comments

f316245

cleanup "saved" file

8ab388a

matanninio requested a review from mosheraboh January 21, 2025 14:05

commited prior to pull of main

c308b29

mosheraboh reviewed Feb 5, 2025

yoavkt requested changes Feb 17, 2025

MATAN NINIO and others added 18 commits February 17, 2025 15:12

'scalars' was misspelled as 'scalers', but actual scalers are to be a…

bfff2d2

…dded.

Merge remote-tracking branch 'origin/spelling-fix' into cell_type_fro…

a4c65a1

…m_gene_expression

'scalars' was misspelled as 'scalers', but actual scalers are to be a…

69d40e3

…dded.

Merge branch 'spelling-fix' into cell_type_from_gene_expression

fb5814f

added anndata to requirements for the examples

fc0cdc8

comments + cleanup cell + new notebook for anndata processing

b9bf3d9

Merge remote-tracking branch 'origin/cell_type_from_gene_expression' …

1f400fe

…into cell_type_from_gene_expression

Added notebook for preparing the data. Processing is done twice, so t…

655fa23

…his is not yet ready

cleanup and a clear run

74461cc

added missing link

ff67c50

small script to filter and normelize/log1p a anndata file and save th…

6d2bfb0

…e result in a new anndata file.

expanded preprocess ann data function and changed script to call it

c832877

num bind should be the number of bins, not the number of bin-endpoint…

3a2fa76

…s, so added +1

load_cell_type_mapping appeared in two places with the exact same cod…

aafd425

…e. Cleared.

cleared comments

456a4df

added documentation

fa0d544

cleanup and clear run on the data prep notebook and removed data prep…

da4846e

… readme

clear run on the notebook. Pre-commit removed cell output index for s…

0ed194b

…ome reason

yoavkt requested changes Mar 27, 2025

matanninio and others added 5 commits March 30, 2025 11:09

Apply suggestions from Yoav's code review

4f2b303

Mostly cleanup and a little proper constants instead of inline strings. Co-authored-by: YoavKT

readme and comments, and celltype->cell_type, cleanup

a569878

Readme edited to reflect the different stages of data processing for …

21d8277

…runnning the example data

removed test_h4ad_data_file.sh script

308b0e4

added section on inference and trouble shooting

044c178

matanninio requested a review from yoavkt March 30, 2025 11:06

yoavkt reviewed Mar 31, 2025

mammal/examples/scrna_cell_type/README.md

MATAN NINIO added 2 commits March 31, 2025 11:35

removed debug print

3a5ac4e

readme clearup on scripts

30663f2

label name parameter in config explained in readme and changed (the n…

aadaada

…ame was missed in the last set of renamings

yoavkt requested changes Mar 31, 2025

matanninio and others added 3 commits March 31, 2025 13:13

Update mammal/examples/scrna_cell_type/README.md

dcf2110

Co-authored-by: YoavKT

Update mammal/examples/scrna_cell_type/README.md

06b8f1d

Co-authored-by: YoavKT

Update mammal/examples/scrna_cell_type/data/Zheng68k_to_anndata.py

795096f

Co-authored-by: YoavKT

MATAN NINIO added 4 commits March 31, 2025 15:56

Zheng to anndata file as decided in meeting

f7bf9da

cleanup of preprocess script

b282dc6

normelization value changed to 1.0 (has no effect on the bins.

025820e

Merge branch 'cell_type_from_gene_expression' of https://github.com/B…

1dde6a4

…iomedSciAI/biomed-multi-alignment into cell_type_from_gene_expression

README edited to reflect the changes in the scripts

13569f5

yoavkt approved these changes Mar 31, 2025

Update mammal/examples/scrna_cell_type/pl_data_module.py

a6f6862

Co-authored-by: YoavKT

matanninio merged commit ff9962d into main Apr 2, 2025
2 checks passed

matanninio deleted the cell_type_from_gene_expression branch April 2, 2025 05:05
