Is it possible to get a larger data set, say 2 TB or 5 TB? Testing on a 200 GB data set that is easily compressible down to 50 GB with modern compression algorithms might exclude disk IO from the equation on systems with large caches (even if those are simple disk caches).
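To make the compressibility argument concrete, here is a minimal sketch (not part of ClickBench) that estimates the compression ratio from a sample of the dataset. The file name `hits.tsv`, the 256 MB sample size, and the use of Python's built-in zlib are assumptions for illustration; columnar codecs such as LZ4 or ZSTD usually compress analytical data at least as well, so the ratio below is a lower bound.

```python
# Minimal sketch: estimate how compressible a dataset sample is.
# SAMPLE_PATH and SAMPLE_BYTES are hypothetical; zlib stands in for
# modern codecs (LZ4, ZSTD) that typically do better.
import zlib

SAMPLE_PATH = "hits.tsv"           # hypothetical slice of the benchmark dataset
SAMPLE_BYTES = 256 * 1024 * 1024   # read only the first 256 MB as a sample

with open(SAMPLE_PATH, "rb") as f:
    raw = f.read(SAMPLE_BYTES)

compressed = zlib.compress(raw, 6)
ratio = len(raw) / len(compressed)

print(f"sample size:       {len(raw) / 1e6:.1f} MB")
print(f"compressed size:   {len(compressed) / 1e6:.1f} MB")
print(f"compression ratio: {ratio:.1f}x")
print(f"rough on-disk size of a 200 GB dataset: {200 / ratio:.0f} GB")
```

Extrapolating the sample ratio to the full 200 GB gives a rough idea of whether the compressed working set would fit entirely in a machine's page cache, which is exactly the concern raised above.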
But such larger datasets are not used in ClickBench, because testing all ~30 database management systems on them would be too slow.
For example, if you try to load Wikipedia page views (a typical time-series dataset) into TimescaleDB (a typical time-series DBMS), it will take months, making the benchmark impractical. If you try to load it into DuckDB, it will not load, because DuckDB is not a production-quality database. If you try to use Druid or Pinot, you will need a long time to recover from the PTSD.
> Testing on a 200 GB data set that is easily compressible down to 50 GB with modern compression algorithms might exclude disk IO from the equation on systems with large caches (even if those are simple disk caches).
In fact, ClickHouse compresses it to only 9.28 GB. But the benchmark methodology requires one cold run with flushed caches, so it does test the IO subsystem. Also keep in mind that the benchmark requires 500 GB gp2 EBS volumes, which have a well-known IO profile (TL;DR: they are slow).
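For illustration only, here is a minimal sketch of that "flush caches, then one cold run plus hot runs" pattern, assuming a Linux host with sudo rights; `clickhouse-client` and the example query are stand-ins for whichever system is under test, and this is not the actual ClickBench harness.

```python
# Minimal sketch of a cold-run/hot-run benchmark loop on Linux.
# Assumptions: sudo rights, clickhouse-client installed, a 'hits' table loaded.
import subprocess
import time

def drop_os_caches() -> None:
    """Flush dirty pages and drop the Linux page cache before the cold run."""
    subprocess.run(["sync"], check=True)
    subprocess.run(
        ["sudo", "tee", "/proc/sys/vm/drop_caches"],
        input=b"3\n",
        stdout=subprocess.DEVNULL,
        check=True,
    )

def run_query(sql: str) -> None:
    # Example client; swap in the client of whichever DBMS is under test.
    subprocess.run(
        ["clickhouse-client", "--format", "Null", "--query", sql],
        check=True,
    )

def benchmark(sql: str, hot_runs: int = 2) -> list[float]:
    timings = []
    drop_os_caches()               # the cold run has to read from the EBS volume
    for _ in range(1 + hot_runs):  # one cold run, then hot runs from the page cache
        start = time.monotonic()
        run_query(sql)
        timings.append(time.monotonic() - start)
    return timings

if __name__ == "__main__":
    print(benchmark("SELECT COUNT(*) FROM hits WHERE AdvEngineID <> 0"))
```

The point of the pattern: only the cold run exercises the slow gp2 volume, while the subsequent hot runs mostly read from the OS page cache, so both sides of the IO question get measured.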