Skip to content

Cell type from gene expression #33

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 76 commits into from
Apr 2, 2025
Merged

Conversation

matanninio
Copy link
Collaborator

@matanninio matanninio commented Jan 21, 2025

This code supports cell-type annotation in mammal. Data is assumed to come in AnnData format (h5ad) and to be per-processed prior to training. A script for performing the per-proceesing is also included, along with a notebook explaining the data processing. A model, trained for 24h on a single GPU is used for testing, but we may want to train it for several more days and export to HF.

@matanninio matanninio requested a review from mosheraboh January 21, 2025 14:05
Copy link
Contributor

@mosheraboh mosheraboh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matan, looks good!
See some inline comments.

key = sample_dict[self._key_name]

# locate the required item
sample_dict[f"{prefix}.scrna"] = self._data[key, :].X
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do you convert it to geneformer format?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the GeneFormer is done in preprocess_ann_data (binning and standardization) and data_preprocessing for the sorting and truncating.

added comments

test_size=0.1,
stratify_by=stratify_by,
)
anndata_dict["train"], anndata_dict["valid"] = anndata_train_test_split(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the split predefined?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not it this version, but this can be added. Is this important for now?

return ans


def load_cell_type_mapping(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remind me what is cell type mapping?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used to convert the names from the ones in the input anndata to the
ones that are known to the tokenizer.

# positive_token_id: 1,
# }
classification_position = 1
if decoder_output_scores is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we return just the pred if scores are not available?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I understand you question. The output of this is used by code (metrics) so I would rather we not have potentially confusing answers.

data_path = Path(__file__).parent / data_path
# read files
anndata_object = anndata.read_h5ad(data_path)
preprocess_ann_data(anndata_object)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to move all the necessary processing to the pre-processing function?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The split between the two preprocessing stages is needed as some of the prepocessing will not naturally result in an AnnData compatible result. The part done in preprocess_ann_data can be saved as an anndata.

I can move all the processing, which would not be as efficient. Maybe with a cache/Memorization it would work nicely

Copy link
Collaborator

@yoavkt yoavkt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mainly there are some gaps in the readme, it is unclear how to run this example and what to expect once it is running or done. Some comments were made in the code and readme itself.

Copy link
Collaborator

@yoavkt yoavkt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed we want to have two data scripts.

  1. From Zhang data to anndata (including but anndata preprocessing)
  2. Modifies anndata to the binned format with the cell types.

The readme data preparation part should reflect this. I suggest:
Title 1: Getting the data ready for training:
Sub title 1.1: Downlaod the Zhang data (optional)
Sub title 1.2: Zheng data to anndata (optional)
Sub title 1.3: From anndata to MAMMAL ready anndata
The scripts should not include README the numbering is just to order them.

@matanninio matanninio requested a review from yoavkt March 30, 2025 11:06
@yoavkt
Copy link
Collaborator

yoavkt commented Mar 31, 2025

Hi so last meeting we talked about having two scripts:

  1. From Zhang data to anndata (including the (I had a typo meant to write the and not but) anndata preprocessing)
  2. Modifies anndata to the binned format with the cell types.

Now as I understood it we have two scripts one that creates the h5ad file and does the preprocessing and another that only does the preprocessing and is called from the first one. So this is not what we discussed.

Again one is Zhang specific, take the data pack it into h5ad file and the other is Mammal specific, add the cell type do the binning.

@matanninio
Copy link
Collaborator Author

Call type observations are part of the normal AnnData, and come with the ann-data file (or they don't, if the cell-types had not been checked). In this specific Zheng data, the expression data and the cell-type data come from different servers, but this is not the typical case. The cell-type is not an add-on, it's an integral part of the data, as are any other observations that may have been made on the cells.
The one non-standard part of the cell-type part of the AnnData is the specific key used to mark the data in the file. This parameter is externalized with the `task.data_module_kwargs.label_name" parameter in the config file (which is used for fine-tuning, but is not ment to be used in other parts of the workflow)

Added a comment on this to the README file (line 36)

I'm not sure what the exact split we wanted between the two scripts. Specifically, running the pre-processing step can be done inside or outside the Zheng build script. I think the decision was to do both, but not call one script from the other. If you can verify this I will make the change

…ame was missed in the last set of renamings
Copy link
Collaborator

@yoavkt yoavkt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The decision was to create two separate scripts. One Zhang specific script with preprocessing etc.. and one to a MAMMAL specific that adds the cell labels for prediction.

The idea was that even if the user decides not to use zhang data he will still have something.

@matanninio
Copy link
Collaborator Author

matanninio commented Mar 31, 2025

to do after meeting:

  1. two scripts
    I. script for zheng_68k that creates a standard AnnData h5ad file with count and cell type (name of key in click)
    II. general script for preprocessing with filtering (params in click), normalization to 1.0, log(1+p) and binning.

  2. script number 1. does not run the preprocessing step on the zheng_68k, (I suggest it will print a message about this needing to be done).

  3. write a clear message stating that this is the preprocessing here is what we found to works well with this task, but may need to be revised if the data process is different.

    • not sure where we wanted this to be. @yoavkt, I will be happy for help with this issue.

@matanninio
Copy link
Collaborator Author

two scripts in place as mentions, including message at the end of the first (if in verbose)
README needs to be updated. @yoavkt, can you verify that the scripts are as intended?

@matanninio
Copy link
Collaborator Author

README has been updated to reflect the new data processing. I did add a message about the parameters of the binning, but I think we wanted something stronger. Would be happy if you could donate a paragraph with an explanation the preprocessing process presented being replaceable.

Copy link
Collaborator

@yoavkt yoavkt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two small changes I do not view them as a most. You can pull from my point of view,


anndata_object = anndata.read_h5ad(input_h5ad_file)
# process the data - filter out cells with shallow reads, normelize depth and change to log scale of about 0-10 (log_2(1001)~=10)
preprocess_ann_data(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes more sense to put the method preprocess_ann_data here then in pl_data_module

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's the same issue of script/no script. I would rather leave it here, if that's OK with you. This is a CLI interface script, very minimal,so it's not going to be confusing this way IMHO.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add that in a better implementation of this code that function would be a parameter of the data process, and would happen with read rather than in prep time. Parameter/callback to allow the user to perform the transformation as she pleases. But I don't this this is the time.

@matanninio
Copy link
Collaborator Author

Old comment that was stuck in limbo for some days

Added test script- use ./mammal/examples/scrna_cell_type/test_h5ad_data_file.sh ./mammal/examples/scrna_cell_type/data/Zheng_68k.h5ad to try it out. It will fail if the data can not be used for training, and will post suspicious things if things are suspicious.

Consider this a framework, if you feel we should add or removes tests to it, soft or hard, please let me know

@matanninio matanninio merged commit ff9962d into main Apr 2, 2025
2 checks passed
@matanninio matanninio deleted the cell_type_from_gene_expression branch April 2, 2025 05:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants