Is it possible to get a larger data set, say 2 TB or 5 TB? Testing on a 200 GB data set that is easily compressible down to 50 GB with modern compression algorithms might exclude disk IO from the equation on systems with large caches (even if those are simple disk caches).
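To make the compressibility argument concrete, here is a minimal sketch (not part of ClickBench) that estimates the compression ratio from a sample of the dataset. The file name `hits.tsv`, the 256 MB sample size, and the use of Python's built-in zlib are assumptions for illustration; columnar codecs such as LZ4 or ZSTD usually compress analytical data at least as well, so the ratio below is a lower bound.

```python
# Minimal sketch: estimate how compressible a dataset sample is.
# SAMPLE_PATH and SAMPLE_BYTES are hypothetical; zlib stands in for
# modern codecs (LZ4, ZSTD) that typically do better.
import zlib

SAMPLE_PATH = "hits.tsv"           # hypothetical slice of the benchmark dataset
SAMPLE_BYTES = 256 * 1024 * 1024   # read only the first 256 MB as a sample

with open(SAMPLE_PATH, "rb") as f:
    raw = f.read(SAMPLE_BYTES)

compressed = zlib.compress(raw, 6)
ratio = len(raw) / len(compressed)

print(f"sample size:       {len(raw) / 1e6:.1f} MB")
print(f"compressed size:   {len(compressed) / 1e6:.1f} MB")
print(f"compression ratio: {ratio:.1f}x")
print(f"rough on-disk size of a 200 GB dataset: {200 / ratio:.0f} GB")
```

Extrapolating the sample ratio to the full 200 GB gives a rough idea of whether the compressed working set would fit entirely in a machine's page cache, which is exactly the concern raised above.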
But such larger datasets are not used in ClickBench, because testing all ~30 database management systems on them would be too slow.
For example, if you try to load Wikipedia page views (a typical time-series dataset) into TimescaleDB (a typical time-series DBMS), it will take months, making the benchmark impractical. If you try to load it into DuckDB, it will not load, because DuckDB is not a production-quality database. If you try to use Druid or Pinot, you will need a long time to recover from the PTSD.
> Testing on a 200 GB data set that is easily compressible down to 50 GB with modern compression algorithms might exclude disk IO from the equation on systems with large caches (even if those are simple disk caches).
In fact, ClickHouse compresses it to only 9.28 GB. But the benchmark methodology requires one cold run with flushed caches, so it does test the IO subsystem. Also keep in mind that the benchmark requires 500 GB gp2 EBS volumes, which have a well-known IO profile (TL;DR: they are slow).
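For illustration only, here is a minimal sketch of that "flush caches, then one cold run plus hot runs" pattern, assuming a Linux host with sudo rights; `clickhouse-client` and the example query are stand-ins for whichever system is under test, and this is not the actual ClickBench harness.

```python
# Minimal sketch of a cold-run/hot-run benchmark loop on Linux.
# Assumptions: sudo rights, clickhouse-client installed, a 'hits' table loaded.
import subprocess
import time

def drop_os_caches() -> None:
    """Flush dirty pages and drop the Linux page cache before the cold run."""
    subprocess.run(["sync"], check=True)
    subprocess.run(
        ["sudo", "tee", "/proc/sys/vm/drop_caches"],
        input=b"3\n",
        stdout=subprocess.DEVNULL,
        check=True,
    )

def run_query(sql: str) -> None:
    # Example client; swap in the client of whichever DBMS is under test.
    subprocess.run(
        ["clickhouse-client", "--format", "Null", "--query", sql],
        check=True,
    )

def benchmark(sql: str, hot_runs: int = 2) -> list[float]:
    timings = []
    drop_os_caches()               # the cold run has to read from the EBS volume
    for _ in range(1 + hot_runs):  # one cold run, then hot runs from the page cache
        start = time.monotonic()
        run_query(sql)
        timings.append(time.monotonic() - start)
    return timings

if __name__ == "__main__":
    print(benchmark("SELECT COUNT(*) FROM hits WHERE AdvEngineID <> 0"))
```

The point of the pattern: only the cold run exercises the slow gp2 volume, while the subsequent hot runs mostly read from the OS page cache, so both sides of the IO question get measured.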