pycytominer integration

In cytomining/pycytominer#78 I am working towards integrating DeepProfiler processing into pycytominer. Currently, and by default, DeepProfiler outputs [`.npz`](https://numpy.org/doc/stable/reference/generated/numpy.savez.html) files that store NumPy arrays of single-cell profiles. In cytomining/DeepProfiler#229 we discuss a potential update to the `.npz` file output so that it also includes metadata.

There are a couple of decisions that we need to make to move the integration forward, and they will be partially driven by the goals in the DeepProfilerExperiments repo. In https://github.com/cytomining/DeepProfiler/issues/229#issuecomment-673839751 I bring up two points for consideration: 1) How to use `index.csv` and 2) Feature prefix style.

I think both of these decision points are relatively minor, and any pycytominer code will be flexible enough to handle multiple metadata options and to enable a customizable feature prefix. The question about the feature prefix is mostly about what we think the **default** prefix should be (`DP` or `DP_` are two options).
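To make the prefix question concrete, here is a minimal sketch of what a customizable prefix could look like. The helper name `make_feature_names` and the choice of `DP_` as the default are assumptions for illustration only; neither is existing pycytominer or DeepProfiler API.

```python
def make_feature_names(n_features, prefix="DP_"):
    """Build one column name per feature index.

    Hypothetical helper: the default prefix shown here is only one of the
    two candidates under discussion (`DP` or `DP_`).
    """
    return [f"{prefix}{index}" for index in range(n_features)]


print(make_feature_names(3))               # ['DP_0', 'DP_1', 'DP_2']
print(make_feature_names(3, prefix="DP"))  # ['DP0', 'DP1', 'DP2']
```

Whichever default we pick, callers could still override it, so the choice mainly affects how readable the default column names are.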

## Additional topics

I think the following questions are more pressing than the two listed above: Will the profiles be recalculated for each dataset so that they use the `.npz` format that includes metadata? Or will we proceed without recalculating? If we proceed without recalculating (which I think is the likely scenario), we need to settle on a pycytominer strategy.

### Strategy

I do not think that pycytominer should include code to parse plate, well, and site information from file names. This is a very fragile way of storing these variables. I believe that they should come from an internal source or be stored in an external file that includes file path information pointing to files with corresponding metadata. The latter is also fragile (file names are mutable!), but not as fragile as the metadata-in-file-name paradigm.

However, since we probably won't recompute profiles, we need a strategy to incorporate metadata from file names. Therefore, I propose that we take multiple steps around pycytominer to integrate these metadata (instead of dealing with all of the processing internally in pycytominer).

The proposed workflow is as follows:

1. Ingest current `.npz` files in pycytominer
2. Extract plate, well, and site from the file name
3. Append these metadata to a pycytominer `load_npz()` output
4. Reingest this file with metadata back into pycytominer and proceed with standard downstream processing
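Steps 1 to 3 of this workflow could be sketched roughly as follows. This is only an illustration under several assumptions that are not settled anywhere in this issue: the file name layout `<plate>_<well>_<site>.npz`, the array key `features` inside the `.npz` file, the `Metadata_` column naming, and the function names themselves. It reads the file with NumPy directly rather than relying on the signature of pycytominer's `load_npz()`.

```python
import re
from pathlib import Path

import numpy as np
import pandas as pd

# ASSUMPTION: file names look like "<plate>_<well>_<site>.npz".
# The real DeepProfiler layout may differ; adjust the pattern accordingly.
FILENAME_PATTERN = re.compile(r"^(?P<plate>[^_]+)_(?P<well>[^_]+)_(?P<site>[^_]+)$")


def parse_metadata_from_filename(npz_path):
    """Return plate, well, and site parsed from the file name (step 2)."""
    match = FILENAME_PATTERN.match(Path(npz_path).stem)
    if match is None:
        raise ValueError(f"Unexpected file name layout: {npz_path}")
    return match.groupdict()


def load_npz_with_metadata(npz_path, array_key="features", prefix="DP_"):
    """Load one .npz file and prepend file-name metadata columns (steps 1-3).

    ASSUMPTION: the profiles are stored as a 2D array (cells x features)
    under `array_key`.
    """
    with np.load(npz_path) as npz:
        features = npz[array_key]

    columns = [f"{prefix}{index}" for index in range(features.shape[1])]
    df = pd.DataFrame(features, columns=columns)

    # Insert metadata columns at the front; each value repeats for every cell.
    metadata = parse_metadata_from_filename(npz_path)
    for position, (name, value) in enumerate(metadata.items()):
        df.insert(position, f"Metadata_{name.capitalize()}", value)
    return df
```

For step 4, the resulting data frame could be written out (for example with `df.to_csv(...)`) and then read back in for standard downstream processing.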

I will proceed with this strategy for now, but please do suggest alternatives! We can always pivot later on if this ends up being clunky or doesn't reduce code.
