
Mishandled sorting provenance during WaveformExtractor to SortingAnalyzer conversion #4017

@grahamfindlay

Description

(I had a call with @samuelgarcia this morning about this issue -- filing it here for tracking purposes.)

I am revisiting some of my old unit data, which was sorted and postprocessed using, I think, spikeinterface == 0.98.0.dev0, or thereabouts. I am trying to convert some old WaveformExtractor folders to the newer SortingAnalyzer format. I can load a MockWaveformExtractor using:

waveform_sorting = spikeinterface.extractors.read_kilosort(sorter_output_dir)
we = si.load_waveforms(waveform_output_dir, with_recording=False, sorting=waveform_sorting) # Takes ~15 min.

Then I can write the SortingAnalyzer to disk like so:

sa = we.sorting_analyzer
analyzer_output_dir = waveform_output_dir.parent / "si_sorting_analyzer"
sa.save_as(folder=analyzer_output_dir, format="binary_folder") # Takes ~2min

If I try to write the SortingAnalyzer to disk using format="zarr", I get an error: ValueError: Codec does not support buffers of > 2147483647 bytes. I think this is because the SortingAnalyzer is trying to write chunks larger than 2 GiB, which numcodecs.Pickle() does not support, in the call to zarr_root.create_dataset("sorting_provenance", ...) in SortingAnalyzer.create_zarr() (line 614). So I tried writing as a binary folder instead, which succeeds, but I suspect this zarr failure is related to another issue (the main issue) that I am about to describe.
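For context on that magic number: 2147483647 is the signed 32-bit integer maximum, so any single chunk that pickles to more than 2 GiB cannot be encoded by a codec that indexes its buffer with an int32. A minimal sketch (min_chunks is a hypothetical helper for illustration, not a spikeinterface or numcodecs API):

```python
import math

# The limit quoted in the ValueError is the signed int32 maximum:
# codecs that index buffers with an int32 cannot encode a larger chunk.
INT32_MAX = 2**31 - 1
print(INT32_MAX)  # 2147483647

def min_chunks(payload_bytes: int, limit: int = INT32_MAX) -> int:
    """Hypothetical helper: how many chunks keep each piece under the limit."""
    return math.ceil(payload_bytes / limit)

# A 5.61 GB sorting_provenance payload would need at least 3 chunks
# to stay under the codec limit.
print(min_chunks(int(5.61 * 1024**3)))  # 3
```

This is consistent with the guess below that the dataset is being written as a single unchunked buffer.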

The main issue: the analyzer_output_dir that gets created is 22.85 GB, where the waveform_output_dir was only 6.01 GB.
This is where the disk usage is coming from:

- analyzer_output_dir = 22.85 GB
    - /extensions = 6.01 GB # same size as original waveform_output_dir
    - /sorting_provenance.pickle = 5.61 GB
    - /sorting/provenance.pkl = 5.61 GB (seems duplicated?)
    - /sorting/spikes.npy = 5.61 GB (triple duplication? sus)
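As a sanity check, the three 5.61 GB copies of the sorting plus the extensions folder fully account for the total (up to rounding):

```python
# Reported sizes from the listing above.
extensions_gb = 6.01      # same size as the original waveform_output_dir
sorting_copy_gb = 5.61    # one copy of the serialized sorting

# Three copies of the sorting + extensions = the whole analyzer folder.
total_gb = extensions_gb + 3 * sorting_copy_gb
print(round(total_gb, 2))  # 22.84, matching the observed 22.85 GB
```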

My guess is that these 5.61 GB files hold the same data that zarr_root.create_dataset("sorting_provenance", ...) above was trying to write to disk, probably as a single unchunked buffer.

For reference, the original sorter_output_dir is 131.19 GB, of which template_features.npy makes up the largest share (29.94 GB), followed by amplitudes.npy (1.87 GB) and spike_times.npy (1.87 GB).

Here is what I think is happening when loading (i.e. creating the MockWaveformExtractor):

  1. si.load_waveforms receives an extractor object, which is passed on line 425 to _read_old_waveform_extractor_binary.
  2. _read_old_waveform_extractor_binary passes this extractor as the first argument to SortingAnalyzer.create_memory on line 498.
  3. SortingAnalyzer.create_memory converts this to a NumpySorting on line 391. Even though with_metadata=True, provenance is lost.
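The provenance loss in step 3 can be sketched with a toy illustration (these classes are stand-ins, not spikeinterface code): the conversion copies the spike data and metadata, but nothing in the new object records where the original sorting came from.

```python
class KiloSortSorting:
    """Stand-in for the extractor returned by read_kilosort."""
    def __init__(self, spikes, sorter_output_dir):
        self.spikes = spikes
        self.sorter_output_dir = sorter_output_dir  # enough to reload the source

class NumpySorting:
    """Stand-in for the in-memory copy made in create_memory."""
    @classmethod
    def from_sorting(cls, sorting, with_metadata=True):
        obj = cls()
        obj.spikes = list(sorting.spikes)  # spike data survives the conversion
        # with_metadata copies annotations/properties, but nothing here
        # records sorter_output_dir, so provenance is gone.
        return obj

original = KiloSortSorting(spikes=[10, 20, 30], sorter_output_dir="/data/ks_output")
in_memory = NumpySorting.from_sorting(original, with_metadata=True)
print(hasattr(in_memory, "sorter_output_dir"))  # False: no way back to the source
```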

Here is what I think is happening when saving the SortingAnalyzer:

  1. sa.save_as calls sa._save_or_select_or_merge, which tries to ascertain sorting provenance on line 965 using sa.get_sorting_provenance().
  2. sa.get_sorting_provenance() checks sa.format, finds that sa.format == "memory", and therefore returns None. Apparently an in-memory SortingAnalyzer cannot have sorting provenance.
  3. Because sa.get_sorting_provenance() returned None, the provenance gets set in SortingAnalyzer._save_or_select_or_merge to sa.sorting (the NumpySorting) on line 968.
  4. SortingAnalyzer._save_or_select_or_merge passes the NumpySorting to SortingAnalyzer.create_binary_folder on line 1002.
  5. The NumpySorting gets written to disk twice from a single call to sorting.save on line 422 of SortingAnalyzer.create_binary_folder.
    • BaseExtractor.save_to_folder tests self.check_serializability("pickle") on line 963, which passes, so self.dump_to_pickle writes the first copy of the sorting, provenance.pkl, on line 965.
    • BaseExtractor.save_to_folder also calls self._save without a format argument on line 972, which is supplied by BaseSorting._save, whose default format="numpy_folder" kwarg triggers NumpyFolderSorting.write_sorting on line 257, which writes the second copy of the sorting, spikes.npy.
  6. The NumpySorting gets written to disk a third time on line 439 of SortingAnalyzer.create_binary_folder by sorting.dump, after sorting.check_serializability("pickle") passes. This writes sorting_provenance.pickle.
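Steps 5 and 6 can be reconstructed with a toy sketch (not spikeinterface code; plain pickle stands in for the real serialization, and spikes.npy would actually be written with np.save): because the in-memory NumpySorting is pickle-serializable, the save path emits the same sorting three times under different names.

```python
import pickle
import tempfile
from pathlib import Path

sorting = {"spikes": list(range(1000))}  # stand-in for the NumpySorting

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "sorting").mkdir()
    # Step 5a: BaseExtractor.save_to_folder -> dump_to_pickle
    (folder / "sorting" / "provenance.pkl").write_bytes(pickle.dumps(sorting))
    # Step 5b: BaseSorting._save with format="numpy_folder" -> spikes.npy
    (folder / "sorting" / "spikes.npy").write_bytes(pickle.dumps(sorting))
    # Step 6: create_binary_folder -> sorting.dump -> sorting_provenance.pickle
    (folder / "sorting_provenance.pickle").write_bytes(pickle.dumps(sorting))

    copies = sorted(p.name for p in folder.rglob("*") if p.is_file())
    print(copies)  # three files, each holding the same serialized sorting
```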

😵 !
