Conversation

h-mayorquin (Contributor)

This PR fixes the memory inefficiency issue reported in #1610 where parsing NIDQ files with digital channels would allocate excessive memory during parse_header(). For a 2.8 GiB NIDQ file, initialization would allocate ~3.9 GiB of RSS, making it impossible to work with large recordings without running out of memory.

The fix implements lazy loading of event data by deferring the np.unpackbits() operation from _parse_header() to _get_event_timestamps(). During initialization, we now only create a memmap view of the digital word channel (virtual memory, no RAM allocation). The actual unpacking into individual event bits only happens when events are accessed, which significantly reduces the memory footprint during initialization. Additionally, sample_length is now calculated from the actual file size instead of metadata so that stub/modified files are handled correctly (we did something similar before; I don't know if you remember).
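In rough outline, the new flow looks something like the sketch below. This is a simplified illustration rather than the actual diff; the class name, the channel count, and the digital channel index are placeholders.

```python
import os
import numpy as np

# Simplified sketch of the lazy-loading pattern (not the actual neo code);
# n_channels and digital_channel_index are placeholders.
class LazyDigitalEvents:
    def __init__(self, filename, n_channels, digital_channel_index, dtype="int16"):
        # sample_length is derived from the real file size rather than metadata,
        # so stub/truncated files are handled correctly.
        itemsize = np.dtype(dtype).itemsize
        self.sample_length = os.path.getsize(filename) // (itemsize * n_channels)
        # The memmap view only costs virtual memory, no RSS at init.
        raw = np.memmap(filename, dtype=dtype, mode="r",
                        shape=(self.sample_length, n_channels))
        self._events_memmap_digital_word = raw[:, digital_channel_index]

    def get_event_data(self):
        # The unpacking (and the associated allocation) happens only here,
        # when events are actually requested.
        return np.unpackbits(
            self._events_memmap_digital_word.astype(np.uint8)[:, None], axis=1
        )
```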

I ran some benchmarks with memray on one of my work files (an 844 MB NIDQ file): peak memory dropped from 2.9 GB to 906 MB (a 68.7% reduction), and real RSS at initialization dropped from ~3.9 GB to ~20 MB (a 99.5% reduction).
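For reference, measurements like this can be reproduced with memray's tracker API, roughly as follows (the path is a placeholder and the exact benchmark script may differ):

```python
import memray
from neo.rawio import SpikeGLXRawIO

# Record all allocations made while parsing the header; inspect afterwards
# with `memray flamegraph parse_header.bin`. The dirname is a placeholder.
with memray.Tracker("parse_header.bin"):
    reader = SpikeGLXRawIO(dirname="/path/to/spikeglx_session")
    reader.parse_header()
```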

```diff
 if len(dig_ch) > 0:
-    event_data = self._events_memmap
+    # Unpack bits on-demand - this is when memory allocation happens
+    event_data = np.unpackbits(self._events_memmap_digital_word.astype(np.uint8)[:, None], axis=1)
```
Contributor

Should we cache this?

Contributor

I think I agree with this. We were previously caching this, so if we are doing the unpacking we should cache, right?

Contributor Author

We don't cache any other call to data. Why would this be different?

It's not that we were deliberately caching before; we were just loading the data at init because the original PR came from someone outside the team, I think.

I prefer resources to be freed once the call to get_data is done. How are you guys thinking about it?

Contributor

I think that's a fair point. How expensive is unpackbits? I don't know that well. It wouldn't be too hard to say:

```python
if self.event_data is not None:
    event_data = self.event_data
else:
    self.event_data = event_data = np.unpackbits(...)
```

I guess the ultimate question is: if this is a cheap operation, then it doesn't hurt to do it in the init, it's just contrary to our strategy. If it is expensive (in time or memory) we should move it. If expensive in memory, then caching is expensive. If expensive in time, and possibly something a user would want multiple times, it would be nice to have a cache.

I'm curious to hear Sam on this one. I could be convinced not to cache it.

Contributor

Why the giant allocation? Is the memory actually being used to store the events? I just don't know what events look like for this reader.

Contributor

My question was more:

  1. Is the demultiplexing that occurs with the unpackbits (plus the memory allocation) relatively slow or relatively fast, and how does it scale with the length of the recording?
  2. Does it always demultiplex to the same n_channels, such that one could estimate the memory allocation from the length of the recording, or is this variable based on what the user does?

I think moving this out of parse_header (sorry, I said init earlier) is smart because we just need the memmap. My remaining question about caching is the speed and memory allocation afterwards. I don't use events from the neo API at all, so for me it's hard to imagine use cases. But maybe someone would try to get the events multiple times, in which case it might be nice to have a cache if the process is relatively slow. If it's fast, then absolutely no caching.

h-mayorquin (Contributor Author), Oct 9, 2025

The general question is whether we cache data requests in general. There is a trade-off between memory allocation (the allocation remains after you do the data request, which is bad) and making it faster the next time you request the data. So this is the lens through which to discuss it, I think:

To my surprise, unpackbits turned out to be more expensive than I expected. However, it's not unusually costly compared to other operations we perform, such as demultiplexing signals (see Intan) or scaling them (see SpikeInterface's return_scaled). In those cases we don't use caching, so I think we should be consistent: either apply caching across all of them or not use it at all.

Here are some performance comparisons:

echo "=== np.unpackbits ===" && python3 -m timeit -n 3 -r 3 "import numpy as np; d = np.random.randint(-32768, 32767, 216_000_000, dtype='int16'); np.unpackbits(d.astype(np.uint8)[:, None], axis=1)" && \
echo "=== bitwise_and ===" && python3 -m timeit -n 3 -r 3 "import numpy as np; d = np.random.randint(-32768, 32767, 216_000_000, dtype='int16'); np.bitwise_and(d, 1) > 0" && \
echo "=== stim decode ===" && python3 -m timeit -n 3 -r 3 "import numpy as np; d = np.random.randint(-32768, 32767, 216_000_000, dtype='int16'); mag = np.bitwise_and(d, 0xFF); sign = np.bitwise_and(np.right_shift(d, 8), 0x01); np.where(sign == 1, -mag.astype(np.int16), mag.astype(np.int16))" && \
echo "=== arithmetic ===" && python3 -m timeit -n 3 -r 3 "import numpy as np; d = np.random.randint(-32768, 32767, 216_000_000, dtype='int16'); d * 0.195"
=== np.unpackbits ===
3 loops, best of 3: 2.25 sec per loop
=== bitwise_and ===
3 loops, best of 3: 461 msec per loop
=== stim decode ===
3 loops, best of 3: 2.89 sec per loop
=== arithmetic ===
3 loops, best of 3: 1.27 sec per loop
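As a rough sense of scale (a back-of-the-envelope estimate, not a measured number): with the pattern above, the unpacking allocates an (N, 8) uint8 array plus a temporary uint8 copy of the digital word, i.e. about 9 bytes per sample:

```python
n_samples = 216_000_000              # same length as the timeit runs above
unpacked_bytes = n_samples * 8       # (N, 8) uint8 output of np.unpackbits
temp_cast_bytes = n_samples          # transient uint8 copy from astype(np.uint8)
print(f"~{(unpacked_bytes + temp_cast_bytes) / 1e9:.1f} GB")  # ~1.9 GB per call
```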

I think we should merge this as it is, because it is consistent with the way the library works: there are similarly expensive data requests left uncached in other places. It is OK to change this, but we should adopt a general policy and change it consistently in all the places.

Contributor Author

Actually, I think that rawio should remain simple, without these types of complications. People can still cache the data on the spot if they want.
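For example, something along these lines on the caller side (a sketch; I am going from memory on the rawio signature, so double-check the argument names against the docs):

```python
from neo.rawio import SpikeGLXRawIO

# The dirname is a placeholder; argument names below follow the rawio API
# as I recall it.
reader = SpikeGLXRawIO(dirname="/path/to/spikeglx_session")
reader.parse_header()

# Fetch the events once and keep the result instead of re-reading each time.
timestamps, durations, labels = reader.get_event_timestamps(
    block_index=0, seg_index=0, event_channel_index=0
)
cached_events = (timestamps, durations, labels)
```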

Contributor

I'm fine with this argument in general. Let's wait until tomorrow in case Sam wants to strongly argue for the other direction, and if not, we merge tomorrow afternoon?

Contributor Author

Ok.

samuelgarcia (Contributor)

OK for me.
Maybe we should make whether the unpacked data is cached or not an attribute.
Maybe not, so we can merge.

@zm711 zm711 added this to the 0.14.3 milestone Oct 9, 2025
zm711 (Contributor) left a comment

Let's discuss the cache before merge

@zm711 zm711 merged commit 550dc8d into NeuralEnsemble:master Oct 10, 2025
3 checks passed
@h-mayorquin h-mayorquin deleted the fix_spikeglx_events_memmap branch October 11, 2025 16:44