To support real-time online feature calculation during long experiments, as well as offline processing of big datasets, we need a way to store data efficiently. Here are some options that already came up during the weekly meeting:
Text based: your good old CSV. Bad: awful space efficiency. Good: super convenient for anyone to work with afterwards, since you can easily look at it or read it anywhere (even MS Excel for that matter).
Binary encoding: the classic HDF5 format, widely used in science and supported in a variety of languages, so it is very cross-compatible and safe for long-term storage. Open source too: https://github.com/HDFGroup/hdf5
Specialized time-series DBs: the big advantage of using a database is that it's much easier to query when only specific parts of the data need to be retrieved. They might also provide other benefits; for example, the InfluxDB website talks about percentage calculation, which is a big bottleneck in some of the feature calculations in PyNM. Also, with some DBs there is the option to hold the database entirely in memory, so we could potentially use it in place of Pandas.
There are a bunch of options for time-series databases; some are listed here: https://db-engines.com/en/ranking/time+series+dbms
InfluxDB seems like the best free, open-source option.
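To make the "easier to query" and "entirely in-memory" points concrete, here is a minimal sketch. It uses Python's built-in `sqlite3` as a stand-in, not InfluxDB or any other time-series DB, and the table layout and the helper names `build_store` and `query_window` are made up for illustration. It only shows the idea of pulling out a time window without loading everything.

```python
import sqlite3


def build_store(samples):
    """Load (timestamp_ms, value) pairs into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")  # lives in RAM only, nothing hits the disk
    conn.execute("CREATE TABLE features (t_ms INTEGER, value REAL)")
    conn.execute("CREATE INDEX idx_features_t ON features (t_ms)")
    conn.executemany("INSERT INTO features VALUES (?, ?)", samples)
    conn.commit()
    return conn


def query_window(conn, start_ms, stop_ms):
    """Return the rows with start_ms <= t_ms < stop_ms, ordered by time."""
    cur = conn.execute(
        "SELECT t_ms, value FROM features "
        "WHERE t_ms >= ? AND t_ms < ? ORDER BY t_ms",
        (start_ms, stop_ms),
    )
    return cur.fetchall()


# Hypothetical data: one feature value every 100 ms for 10 s.
samples = [(i * 100, float(i)) for i in range(100)]
conn = build_store(samples)

# Retrieve only the samples between 2.0 s and 2.5 s.
window = query_window(conn, 2000, 2500)
```

A real time-series DB would add things like retention policies and built-in aggregation on top of this, but the access pattern (filter by time range, get back only that slice) is the same.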
At any rate, I think ideally we should give the user the option to store the results in a variety of data formats.
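One way this could look is a small registry of writer functions keyed by format name. The sketch below is only an assumption about the design, not existing PyNM code: `save_features`, `WRITERS` and both writer functions are hypothetical names, and the raw float64 writer is a stdlib-only stand-in for a real binary format such as HDF5. It also records the file sizes, which illustrates the space-efficiency gap between text and binary storage.

```python
import csv
import os
import struct
import tempfile


def write_csv(path, rows):
    """Text format: one comma-separated line per row."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)


def write_raw_float64(path, rows):
    """Binary stand-in for HDF5: little-endian float64, 8 bytes per value."""
    with open(path, "wb") as f:
        for row in rows:
            f.write(struct.pack(f"<{len(row)}d", *row))


# Registry of available output formats; adding a format means adding one entry.
WRITERS = {"csv": write_csv, "raw_float64": write_raw_float64}


def save_features(path, rows, fmt):
    """Dispatch to the writer the user selected."""
    try:
        writer = WRITERS[fmt]
    except KeyError:
        raise ValueError(
            f"unknown format {fmt!r}; choose from {sorted(WRITERS)}"
        ) from None
    writer(path, rows)


# Hypothetical feature matrix: 1000 rows of 3 float values each.
rows = [(i / 7.0, i / 3.0, i / 11.0) for i in range(1, 1001)]

sizes = {}
with tempfile.TemporaryDirectory() as tmp:
    for fmt in WRITERS:
        path = os.path.join(tmp, f"features.{fmt}")
        save_features(path, rows, fmt)
        sizes[fmt] = os.path.getsize(path)
```

Here the binary file is exactly 1000 × 3 × 8 = 24,000 bytes, while the CSV is larger because every float is spelled out as text. An HDF5 or database backend would slot in as one more entry in `WRITERS`.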