(I had a call with @samuelgarcia this morning about this issue -- filing it here for tracking purposes.)
I am revisiting some of my old unit data, which was sorted and postprocessed using, I think, `spikeinterface == 0.98.0.dev0`, or thereabouts. I am trying to convert some old `WaveformExtractor` folders to the newer `SortingAnalyzer` folders. I can load a `MockWaveformExtractor` using:
```python
waveform_sorting = spikeinterface.extractors.read_kilosort(sorter_output_dir)
we = si.load_waveforms(waveform_output_dir, with_recording=False, sorting=waveform_sorting)  # Takes ~15 min.
```
Then I can write the `SortingAnalyzer` to disk like so:
```python
sa = we.sorting_analyzer
analyzer_output_dir = waveform_output_dir.parent / "si_sorting_analyzer"
sa.save_as(folder=analyzer_output_dir, format="binary_folder")  # Takes ~2 min.
```
If I try to write the `SortingAnalyzer` to disk using `format="zarr"`, I get an error: `ValueError: Codec does not support buffers of > 2147483647 bytes`.
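For reference, this is roughly the call that raises (same analyzer as above, only the format changes; the zarr output path is a placeholder of my own, not a spikeinterface convention):

```python
# Same analyzer as above; the folder name here is just a placeholder.
analyzer_zarr_dir = waveform_output_dir.parent / "si_sorting_analyzer.zarr"
sa.save_as(folder=analyzer_zarr_dir, format="zarr")
# -> ValueError: Codec does not support buffers of > 2147483647 bytes
```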
I think this is because the `SortingAnalyzer` is trying to write chunks larger than 2 GB, which is not supported by `numcodecs.Pickle()`, in the call to `zarr_root.create_dataset("sorting_provenance", ...)` in `SortingAnalyzer.create_zarr()` (line 614). So I just wrote it as a binary folder instead, which succeeds, but I suspect that this error is related to another issue (the main issue) that I am about to describe.
The main issue: the `analyzer_output_dir` that gets created is 22.85 GB, whereas the `waveform_output_dir` was only 6.01 GB.
This is where the disk usage is coming from:
- `analyzer_output_dir` = 22.85 GB
  - `/extensions` = 6.01 GB (same size as the original `waveform_output_dir`)
  - `/sorting_provenance.pickle` = 5.61 GB
  - `/sorting/provenance.pkl` = 5.61 GB (seems duplicated?)
  - `/sorting/spikes.npy` = 5.61 GB (triple duplication? sus)
My guess is that these 5.61 GB files are also the files that the `zarr_root.create_dataset("sorting_provenance", ...)` call above was trying to write to disk, probably unchunked.
For reference, the original `sorter_output_dir` is 131.19 GB, of which `template_features.npy` makes up the largest share (29.94 GB), followed by `amplitudes.npy` (1.87 GB) and `spike_times.npy` (1.87 GB).
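This kind of breakdown can be reproduced with a recursive size sum along these lines (a minimal sketch using only the standard library; `analyzer_output_dir` is the `Path` defined earlier):

```python
from pathlib import Path

def total_size_gb(path: Path) -> float:
    """Recursive on-disk size of a file or directory tree, in GB."""
    if path.is_file():
        return path.stat().st_size / 1e9
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file()) / 1e9

# Per-entry breakdown of the analyzer folder written by sa.save_as(...) above.
for child in sorted(analyzer_output_dir.iterdir()):
    print(f"{total_size_gb(child):7.2f} GB  {child.name}")
print(f"{total_size_gb(analyzer_output_dir):7.2f} GB  total")
```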
Here is what I think is happening when loading (i.e. creating the `MockWaveformExtractor`); a quick confirmation is sketched after this list:
- `si.load_waveforms` receives an extractor object, which is passed on line 425 to `_read_old_waveform_extractor_binary`.
- `_read_old_waveform_extractor_binary` passes this extractor as the first argument to `SortingAnalyzer.create_memory` on line 498.
- `SortingAnalyzer.create_memory` converts this to a `NumpySorting` on line 391. Even though `with_metadata=True`, provenance is lost.
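If my reading above is right, the provenance loss is visible directly on the loaded object, using only calls already mentioned in this report:

```python
# After the load_waveforms() call above:
sa = we.sorting_analyzer

print(type(sa.sorting))              # NumpySorting, not the kilosort extractor that was passed in
print(sa.format)                     # "memory"
print(sa.get_sorting_provenance())   # None -- provenance is already gone here
```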
Here is what I think is happening when saving the `SortingAnalyzer` (a condensed sketch of this write path follows the list):
- `sa.save_as` calls `sa._save_or_select_or_merge`, which tries to ascertain sorting provenance on line 965 using `sa.get_sorting_provenance()`.
- `sa.get_sorting_provenance()` checks `sa.format`, finds that `sa.format == "memory"`, and therefore returns `None`. Apparently an in-memory `SortingAnalyzer` cannot have sorting provenance.
- Because `sa.sorting_provenance == None`, it gets set in `SortingAnalyzer._save_or_select_or_merge` to `sa.sorting` (the `NumpySorting`) on line 968. `SortingAnalyzer._save_or_select_or_merge` passes the `NumpySorting` to `SortingAnalyzer.create_binary_folder` on line 1002.
- The `NumpySorting` gets written to disk twice from a single call to `sorting.save` on line 422 of `SortingAnalyzer.create_binary_folder`. `BaseExtractor.save_to_folder` tests `self.check_serializability("pickle")` on line 963, which passes, so `self.dump_to_pickle` writes the first copy of the sorting, `provenance.pkl`, on line 965. `BaseExtractor.save_to_folder` also calls `self._save` without a `format` argument on line 972, which is supplied by `BaseSorting._save`, whose default `format="numpy_folder"` kwarg triggers `NumpyFolderSorting.write_sorting` on line 257, which writes the second copy of the sorting, `spikes.npy`.
- The `NumpySorting` gets written to disk a third time on line 439 of `SortingAnalyzer.create_binary_folder` by `sorting.dump`, after `sorting.check_serializability("pickle")` passes. This writes `sorting_provenance.pickle`.
😵 !
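To summarize, here is a condensed paraphrase of that write path as I read it (this is not the actual spikeinterface source; it just collapses the call chain above into a sketch, with `_sketch` suffixes to make that clear):

```python
def _save_or_select_or_merge_sketch(sa, folder):
    # Provenance lookup returns None because sa.format == "memory" ...
    sorting_provenance = sa.get_sorting_provenance()
    if sorting_provenance is None:
        # ... so the fallback is the full in-memory NumpySorting (line 968).
        sorting_provenance = sa.sorting
    create_binary_folder_sketch(sa, folder, sorting_provenance)

def create_binary_folder_sketch(sa, folder, sorting_provenance):
    # Copies 1 and 2: sorting.save() pickles the sorting (sorting/provenance.pkl)
    # and also writes it as a numpy folder (sorting/spikes.npy).
    sa.sorting.save(folder=folder / "sorting")
    # Copy 3: the "provenance" -- the very same NumpySorting -- is pickled again
    # as sorting_provenance.pickle (line 439).
    if sorting_provenance.check_serializability("pickle"):
        sorting_provenance.dump(folder / "sorting_provenance.pickle")
```

The same in-memory `NumpySorting` (containing the full spike vector) ends up on disk three times, which matches the three ~5.61 GB entries in the breakdown above.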