Options for real-time storage of calculated features #321

toniESAT · 2024-05-05T10:18:58Z

To support real-time online feature calculation during long experiments, or offline processing of big datasets, we need a way to store data efficiently. Here are some options that came up during the weekly meeting:

Text-based: your good old CSV. Bad: awful space efficiency. Good: super convenient for anyone to work with afterwards, since you can easily look at it or read it anywhere (even MS Excel, for that matter).
Binary encoding: the classic HDF5 format, widely used in science and supported in a variety of languages, so very cross-compatible and safe for long-term storage. Open source too: https://github.com/HDFGroup/hdf5
Specialized time-series DBs: the big advantage of using a database is that it's much easier to query when specific parts of the data need to be retrieved. They might also provide other benefits; for example, the InfluxDB website talks about percentage calculation, which is a big bottleneck in some of the feature calculations in PyNM. Also, with some DBs there is the option to hold the database entirely in memory, so we could potentially use it in place of Pandas.
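
To make the "query specific parts" and "entirely in-memory" points concrete, here is a rough sketch. Note that it uses SQLite from the Python standard library purely as a stand-in, since it needs no server; it is not a time-series DB, and the table layout, channel names, and feature names are made up for illustration.

```python
import sqlite3

# Hypothetical feature rows: (time in seconds, channel, feature name, value)
rows = [
    (0.0, "ch1", "beta_power", 1.2),
    (0.1, "ch1", "beta_power", 1.5),
    (0.2, "ch1", "beta_power", 0.9),
    (0.2, "ch2", "beta_power", 2.1),
]

# ":memory:" keeps the whole database in RAM; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE features (time REAL, channel TEXT, feature TEXT, value REAL)"
)
conn.executemany("INSERT INTO features VALUES (?, ?, ?, ?)", rows)
conn.commit()


def query_window(conn, channel, t_start, t_end):
    """Return (time, value) pairs for one channel within [t_start, t_end]."""
    cursor = conn.execute(
        "SELECT time, value FROM features "
        "WHERE channel = ? AND time BETWEEN ? AND ? ORDER BY time",
        (channel, t_start, t_end),
    )
    return cursor.fetchall()


# Only the requested slice is retrieved, not the whole table.
window = query_window(conn, "ch1", 0.1, 0.2)
```

A dedicated time-series DB would offer the same kind of time-window query, plus whatever built-in aggregations it ships with.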

There are quite a few time-series databases to choose from; some are listed here: https://db-engines.com/en/ranking/time+series+dbms
InfluxDB seems like the best free, open-source option.

At any rate, I think ideally we should give the user the option to store the results in a variety of data formats.
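
One way this could look is a small writer interface with one implementation per format. This is only a sketch: `FeatureWriter`, `CSVWriter`, `SQLiteWriter`, and `make_writer` are hypothetical names and not part of PyNM, and SQLite again stands in for a real database backend so that the example runs with the standard library alone.

```python
import csv
import sqlite3
from typing import Protocol


class FeatureWriter(Protocol):
    """Hypothetical interface every storage backend would implement."""

    def write(self, time: float, features: dict[str, float]) -> None: ...
    def close(self) -> None: ...


class CSVWriter:
    """Appends one row per time step to a CSV file."""

    def __init__(self, path, feature_names):
        self._names = list(feature_names)
        self._file = open(path, "w", newline="")
        self._writer = csv.writer(self._file)
        self._writer.writerow(["time", *self._names])

    def write(self, time, features):
        self._writer.writerow([time, *(features[n] for n in self._names)])
        self._file.flush()  # keep the file readable while the experiment runs

    def close(self):
        self._file.close()


class SQLiteWriter:
    """Stores one (time, name, value) row per feature in a SQLite file."""

    def __init__(self, path):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS features "
            "(time REAL, name TEXT, value REAL)"
        )

    def write(self, time, features):
        self._conn.executemany(
            "INSERT INTO features VALUES (?, ?, ?)",
            [(time, name, value) for name, value in features.items()],
        )
        self._conn.commit()

    def close(self):
        self._conn.close()


def make_writer(fmt, path, feature_names):
    """Pick a backend from a user-supplied format string."""
    if fmt == "csv":
        return CSVWriter(path, feature_names)
    if fmt == "sqlite":
        return SQLiteWriter(path)
    raise ValueError(f"unsupported format: {fmt!r}")
```

The calling code would only ever see `write` and `close`, so adding an HDF5 or InfluxDB backend later would not require touching the feature calculation itself.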
