-
Notifications
You must be signed in to change notification settings - Fork 15
Cell type from gene expression #33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… quite ready for PR
…ignment into cell_type_from_gene_expression
…, fixing code rot. Results have not been checked due to ccc outage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Matan, looks good!
See some inline comments.
key = sample_dict[self._key_name] | ||
|
||
# locate the required item | ||
sample_dict[f"{prefix}.scrna"] = self._data[key, :].X |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do you convert it to geneformer format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the GeneFormer is done in preprocess_ann_data (binning and standardization) and data_preprocessing for the sorting and truncating.
added comments
test_size=0.1, | ||
stratify_by=stratify_by, | ||
) | ||
anndata_dict["train"], anndata_dict["valid"] = anndata_train_test_split( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the split predefined?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not it this version, but this can be added. Is this important for now?
return ans | ||
|
||
|
||
def load_cell_type_mapping( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remind me what is cell type mapping?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is used to convert the names from the ones in the input anndata to the
ones that are known to the tokenizer.
# positive_token_id: 1, | ||
# } | ||
classification_position = 1 | ||
if decoder_output_scores is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we return just the pred if scores are not available?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I understand you question. The output of this is used by code (metrics) so I would rather we not have potentially confusing answers.
data_path = Path(__file__).parent / data_path | ||
# read files | ||
anndata_object = anndata.read_h5ad(data_path) | ||
preprocess_ann_data(anndata_object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to move all the necessary processing to the pre-processing function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The split between the two preprocessing stages is needed as some of the prepocessing will not naturally result in an AnnData compatible result. The part done in preprocess_ann_data can be saved as an anndata.
I can move all the processing, which would not be as efficient. Maybe with a cache/Memorization it would work nicely
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mainly there are some gaps in the readme, it is unclear how to run this example and what to expect once it is running or done. Some comments were made in the code and readme itself.
…m_gene_expression
…into cell_type_from_gene_expression
…his is not yet ready
…e result in a new anndata file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed we want to have two data scripts.
- From Zhang data to anndata (including but anndata preprocessing)
- Modifies anndata to the binned format with the cell types.
The readme data preparation part should reflect this. I suggest:
Title 1: Getting the data ready for training:
Sub title 1.1: Downlaod the Zhang data (optional)
Sub title 1.2: Zheng data to anndata (optional)
Sub title 1.3: From anndata to MAMMAL ready anndata
The scripts should not include README the numbering is just to order them.
Mostly cleanup and a little proper constants instead of inline strings. Co-authored-by: YoavKT <[email protected]>
…runnning the example data
Hi so last meeting we talked about having two scripts:
Now as I understood it we have two scripts one that creates the h5ad file and does the preprocessing and another that only does the preprocessing and is called from the first one. So this is not what we discussed. Again one is Zhang specific, take the data pack it into h5ad file and the other is Mammal specific, add the cell type do the binning. |
Call type observations are part of the normal AnnData, and come with the ann-data file (or they don't, if the cell-types had not been checked). In this specific Zheng data, the expression data and the cell-type data come from different servers, but this is not the typical case. The cell-type is not an add-on, it's an integral part of the data, as are any other observations that may have been made on the cells. Added a comment on this to the README file (line 36) I'm not sure what the exact split we wanted between the two scripts. Specifically, running the pre-processing step can be done inside or outside the Zheng build script. I think the decision was to do both, but not call one script from the other. If you can verify this I will make the change |
…ame was missed in the last set of renamings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The decision was to create two separate scripts. One Zhang specific script with preprocessing etc.. and one to a MAMMAL specific that adds the cell labels for prediction.
The idea was that even if the user decides not to use zhang data he will still have something.
Co-authored-by: YoavKT <[email protected]>
Co-authored-by: YoavKT <[email protected]>
Co-authored-by: YoavKT <[email protected]>
to do after meeting:
|
…iomedSciAI/biomed-multi-alignment into cell_type_from_gene_expression
two scripts in place as mentions, including message at the end of the first (if in verbose) |
README has been updated to reflect the new data processing. I did add a message about the parameters of the binning, but I think we wanted something stronger. Would be happy if you could donate a paragraph with an explanation the preprocessing process presented being replaceable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two small changes I do not view them as a most. You can pull from my point of view,
|
||
anndata_object = anndata.read_h5ad(input_h5ad_file) | ||
# process the data - filter out cells with shallow reads, normelize depth and change to log scale of about 0-10 (log_2(1001)~=10) | ||
preprocess_ann_data( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes more sense to put the method preprocess_ann_data
here then in pl_data_module
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's the same issue of script/no script. I would rather leave it here, if that's OK with you. This is a CLI interface script, very minimal,so it's not going to be confusing this way IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add that in a better implementation of this code that function would be a parameter of the data process, and would happen with read rather than in prep time. Parameter/callback to allow the user to perform the transformation as she pleases. But I don't this this is the time.
Co-authored-by: YoavKT <[email protected]>
Old comment that was stuck in limbo for some days Added test script- use Consider this a framework, if you feel we should add or removes tests to it, soft or hard, please let me know |
This code supports cell-type annotation in mammal. Data is assumed to come in AnnData format (h5ad) and to be per-processed prior to training. A script for performing the per-proceesing is also included, along with a notebook explaining the data processing. A model, trained for 24h on a single GPU is used for testing, but we may want to train it for several more days and export to HF.