-
Notifications
You must be signed in to change notification settings - Fork 5
Description
In cytomining/pycytominer#78 I am working towards integrating DeepProfiler processing into pycytominer. Currently and by default, DeepProfiler outputs .npz files storing numpy arrays of single cell profiles. In cytomining/DeepProfiler#229 we discuss a potential update to the .npz file output to also include metadata information.
There are a couple of decision points that we need to make to move the integration forward, which will be partially driven by the goals in the DeepProfilerExperiments repo. In cytomining/DeepProfiler#229 (comment) I bring up two different points of consideration: 1) How to use index.csv and 2) Feature prefix style.
I think both of these decision points are relatively minor, and any pycytominer code will be flexible to handle multiple metadata options and enable a customizable feature prefix. The question about feature prefix is most directly related to what we think the default prefix should be (DP or DP_ are two options)
Additional topics
I think that these topics are more pressing than the first two listed above: Will the profiles be updated for each dataset to include the metadata .npz format? Or, will we proceed without recalculating? If we proceed without recalculating (which I think is the likely scenario), we need to settle on pycytominer strategy.
Strategy
I do not think that pycytominer should include code to parse plate, well, and site information from filenames. This is a very fragile way of storing these variables - I believe that they should come from an internal source or be stored in an external file that includes file path information pointing to files with corresponding metadata. The latter is also fragile (file names are mutable!), but not as fragile as the metadata-in-file name paradigm.
However, since we probably won't recompute profiles, we require a strategy to incorporate metadata from file names. Therefore, I propose that we take multiple pycytominer steps to integrate these metadata (instead of dealing with all of the processing internally in pycytominer).
The proposed workflow is as follows:
- Ingest current
.npzfiles in pycytominer - Extract out plate, well, and site from file name
- Append these metadata to a pycytominer
load_npz()output - Reingest this file with metadata back into pycytominer and proceed with standard downstream processing
I will proceed with this strategy for now, but please do suggest alternatives! We can always pivot strategies later on if this ends up being clunky or doesn't reduce code.