
Mishandled sorting provenance during WaveformExtractor to SortingAnalyzer conversion #4017

@grahamfindlay

Description

(I had a call with @samuelgarcia this morning about this issue -- filing it here for tracking purposes.)

I am revisiting some of my old unit data, which was sorted and postprocessed using, I think, spikeinterface == 0.98.0.dev0, or thereabouts. I am trying to convert some old WaveformExtractor folders to the newer SortingAnalyzer format. I can load a MockWaveformExtractor using:

waveform_sorting = spikeinterface.extractors.read_kilosort(sorter_output_dir)
we = si.load_waveforms(waveform_output_dir, with_recording=False, sorting=waveform_sorting) # Takes ~15 min.

Then I can write the SortingAnalyzer to disk like so:

sa = we.sorting_analyzer
analyzer_output_dir = waveform_output_dir.parent / "si_sorting_analyzer"
sa.save_as(folder=analyzer_output_dir, format="binary_folder") # Takes ~2min

If I try to write the SortingAnalyzer to disk using format="zarr", I get an error: ValueError: Codec does not support buffers of > 2147483647 bytes. I think this is because the SortingAnalyzer is trying to write chunks larger than 2 GiB, which numcodecs.Pickle() does not support, in the call to zarr_root.create_dataset("sorting_provenance", ...) in SortingAnalyzer.create_zarr() (line 614). So I tried writing as a binary folder instead, which succeeds, but I suspect this zarr failure is related to another issue (the main issue) that I am about to describe.
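For context on that magic number: 2147483647 is the signed 32-bit integer maximum, so any single chunk that pickles to more than 2 GiB cannot be encoded by a codec that indexes its buffer with an int32. A minimal sketch (min_chunks is a hypothetical helper for illustration, not a spikeinterface or numcodecs API):

```python
import math

# The limit quoted in the ValueError is the signed int32 maximum:
# codecs that index buffers with an int32 cannot encode a larger chunk.
INT32_MAX = 2**31 - 1
print(INT32_MAX)  # 2147483647

def min_chunks(payload_bytes: int, limit: int = INT32_MAX) -> int:
    """Hypothetical helper: how many chunks keep each piece under the limit."""
    return math.ceil(payload_bytes / limit)

# A 5.61 GB sorting_provenance payload would need at least 3 chunks
# to stay under the codec limit.
print(min_chunks(int(5.61 * 1024**3)))  # 3
```

This is consistent with the guess below that the dataset is being written as a single unchunked buffer.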

The main issue: the analyzer_output_dir that gets created is 22.85 GB, where the waveform_output_dir was only 6.01 GB.
This is where the disk usage is coming from:

- analyzer_output_dir = 22.85 GB
    - /extensions = 6.01 GB # same size as original waveform_output_dir
    - /sorting_provenance.pickle = 5.61 GB
    - /sorting/provenance.pkl = 5.61 GB (seems duplicated?)
    - /sorting/spikes.npy = 5.61 GB (triple duplication? sus)
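As a sanity check, the three 5.61 GB copies of the sorting plus the extensions folder fully account for the total (up to rounding):

```python
# Reported sizes from the listing above.
extensions_gb = 6.01      # same size as the original waveform_output_dir
sorting_copy_gb = 5.61    # one copy of the serialized sorting

# Three copies of the sorting + extensions = the whole analyzer folder.
total_gb = extensions_gb + 3 * sorting_copy_gb
print(round(total_gb, 2))  # 22.84, matching the observed 22.85 GB
```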

My guess is that these 5.61 GB files hold the same data that zarr_root.create_dataset("sorting_provenance", ...) above was trying to write to disk, probably as a single unchunked buffer.

For reference, the original sorter_output_dir is 131.19 GB, of which template_features.npy makes up the largest share (29.94 GB), followed by amplitudes.npy (1.87 GB) and spike_times.npy (1.87 GB).

Here is what I think is happening when loading (i.e. creating the MockWaveformExtractor):

  1. si.load_waveforms receives an extractor object, which is passed on line 425 to _read_old_waveform_extractor_binary.
  2. _read_old_waveform_extractor_binary passes this extractor as the first argument to SortingAnalyzer.create_memory on line 498.
  3. SortingAnalyzer.create_memory converts this to a NumpySorting on line 391. Even though with_metadata=True, provenance is lost.
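The provenance loss in step 3 can be sketched with a toy illustration (these classes are stand-ins, not spikeinterface code): the conversion copies the spike data and metadata, but nothing in the new object records where the original sorting came from.

```python
class KiloSortSorting:
    """Stand-in for the extractor returned by read_kilosort."""
    def __init__(self, spikes, sorter_output_dir):
        self.spikes = spikes
        self.sorter_output_dir = sorter_output_dir  # enough to reload the source

class NumpySorting:
    """Stand-in for the in-memory copy made in create_memory."""
    @classmethod
    def from_sorting(cls, sorting, with_metadata=True):
        obj = cls()
        obj.spikes = list(sorting.spikes)  # spike data survives the conversion
        # with_metadata copies annotations/properties, but nothing here
        # records sorter_output_dir, so provenance is gone.
        return obj

original = KiloSortSorting(spikes=[10, 20, 30], sorter_output_dir="/data/ks_output")
in_memory = NumpySorting.from_sorting(original, with_metadata=True)
print(hasattr(in_memory, "sorter_output_dir"))  # False: no way back to the source
```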

Here is what I think is happening when saving the SortingAnalyzer:

  1. sa.save_as calls sa._save_or_select_or_merge, which tries to ascertain sorting provenance on line 965 using sa.get_sorting_provenance().
  2. sa.get_sorting_provenance() checks sa.format, finds that sa.format == "memory", and therefore returns None. Apparently an in-memory SortingAnalyzer cannot have sorting provenance.
  3. Because sa.get_sorting_provenance() returned None, the provenance gets set in SortingAnalyzer._save_or_select_or_merge to sa.sorting (the NumpySorting) on line 968.
  4. SortingAnalyzer._save_or_select_or_merge passes the NumpySorting to SortingAnalyzer.create_binary_folder on line 1002.
  5. The NumpySorting gets written to disk twice from a single call to sorting.save on line 422 of SortingAnalyzer.create_binary_folder.
    • BaseExtractor.save_to_folder tests self.check_serializability("pickle") on line 963, which passes, so self.dump_to_pickle writes the first copy of the sorting, provenance.pkl, on line 965.
    • BaseExtractor.save_to_folder also calls self._save without a format argument on line 972, which is supplied by BaseSorting._save, whose default format="numpy_folder" kwarg triggers NumpyFolderSorting.write_sorting on line 257, which writes the second copy of the sorting, spikes.npy.
  6. The NumpySorting gets written to disk a third time on line 439 of SortingAnalyzer.create_binary_folder by sorting.dump, after sorting.check_serializability("pickle") passes. This writes sorting_provenance.pickle.
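Steps 5 and 6 can be reconstructed with a toy sketch (not spikeinterface code; plain pickle stands in for the real serialization, and spikes.npy would actually be written with np.save): because the in-memory NumpySorting is pickle-serializable, the save path emits the same sorting three times under different names.

```python
import pickle
import tempfile
from pathlib import Path

sorting = {"spikes": list(range(1000))}  # stand-in for the NumpySorting

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "sorting").mkdir()
    # Step 5a: BaseExtractor.save_to_folder -> dump_to_pickle
    (folder / "sorting" / "provenance.pkl").write_bytes(pickle.dumps(sorting))
    # Step 5b: BaseSorting._save with format="numpy_folder" -> spikes.npy
    (folder / "sorting" / "spikes.npy").write_bytes(pickle.dumps(sorting))
    # Step 6: create_binary_folder -> sorting.dump -> sorting_provenance.pickle
    (folder / "sorting_provenance.pickle").write_bytes(pickle.dumps(sorting))

    copies = sorted(p.name for p in folder.rglob("*") if p.is_file())
    print(copies)  # three files, each holding the same serialized sorting
```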

😵 !
