Skip to content

Latest commit

 

History

History
225 lines (199 loc) · 16.4 KB

datasketches.md

File metadata and controls

225 lines (199 loc) · 16.4 KB
warning title excerpt docs extension repo extension_star_count extension_star_count_pretty extension_download_count extension_download_count_pretty image layout
DO NOT CHANGE THIS MANUALLY, THIS IS GENERATED BY https://github/duckdb/community-extensions repository, check README there
datasketches
DuckDB Community Extensions By utilizing the Apache DataSketches library this extension can efficiently compute approximate distinct item counts and estimations of quantiles, while allowing the sketches to be serialized.
extended_description hello_world
This extension provides an interface to the [Apache DataSketches](https://datasketches.apache.org/) library. This extension enables users to efficiently compute approximate results for large datasets directly within DuckDB, using state-of-the-art streaming algorithms for distinct counting, quantile estimation, and more. ## Why use this extension? DuckDB already has great implementations of [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog) via `approx_count_distinct(x)` and [TDigest](https://arxiv.org/abs/1902.04023) via `approx_quantile(x, pos)`, but it doesn't expose the internal state of the aggregates nor allow the the user to tune all of the parameters of the sketches. This extension allows data sketches to be serialized as `BLOB`s which can be stored and shared across different systems, processes, and environments without loss of fidelity. This makes data sketches highly useful in distributed data processing pipelines. This extension has implemented these sketches from Apache DataSketches. - Quantile Estimation - [TDigest](https://datasketches.apache.org/docs/tdigest/tdigest.html) - [Classic Quantile](https://datasketches.apache.org/docs/Quantiles/ClassicQuantilesSketch.html) - [Relative Error Quantile (REQ)](https://datasketches.apache.org/docs/REQ/ReqSketch.html) - [KLL](https://datasketches.apache.org/docs/KLL/KLLSketch.html) - Approximate Distinct Count - [Compressed Probability Counting (CPC)](https://datasketches.apache.org/docs/CPC/CpcSketches.html) - [HyperLogLog (HLL)](https://datasketches.apache.org/docs/HLL/HllSketches.html) For more information and information regarding usage, see the [README](https://github.com/rustyconover/duckdb-datasketches).
-- This is just a demonstration of a single sketch type, -- see the README for more sketches. -- -- Lets simulate a temperature sensor CREATE TABLE readings(temp integer); INSERT INTO readings(temp) select unnest(generate_series(1, 10)); -- Create a sketch by aggregating id over the readings table. SELECT datasketch_tdigest_rank(datasketch_tdigest(10, temp), 5) from readings; ┌────────────────────────────────────────────────────────────┐ │ datasketch_tdigest_rank(datasketch_tdigest(10, "temp"), 5) │ │ double │ ├────────────────────────────────────────────────────────────┤ │ 0.45 │ └────────────────────────────────────────────────────────────┘ -- Put some more readings in at the high end. INSERT INTO readings(temp) values (10), (10), (10), (10); -- Now the rank of 5 is moved down. SELECT datasketch_tdigest_rank(datasketch_tdigest(10, temp), 5) from readings; ┌────────────────────────────────────────────────────────────┐ │ datasketch_tdigest_rank(datasketch_tdigest(10, "temp"), 5) │ │ double │ ├────────────────────────────────────────────────────────────┤ │ 0.32142857142857145 │ └────────────────────────────────────────────────────────────┘ -- Lets get the cumulative distribution function from the sketch. SELECT datasketch_tdigest_cdf(datasketch_tdigest(10, temp), [1,5,9]) from readings; ┌──────────────────────────────────────────────────────────────────────────────────┐ │ datasketch_tdigest_cdf(datasketch_tdigest(10, "temp"), main.list_value(1, 5, 9)) │ │ double[] │ ├──────────────────────────────────────────────────────────────────────────────────┤ │ [0.03571428571428571, 0.32142857142857145, 0.6071428571428571, 1.0] │ └──────────────────────────────────────────────────────────────────────────────────┘ -- The sketch can be persisted and updated later when more data -- arrives without having to rescan the previously aggregated data. SELECT datasketch_tdigest(10, temp) from readings; datasketch_tdigest(10, "temp") = \x02\x01\x14\x0A\x00\x04\x00...
build description language license maintainers name version
cmake
By utilizing the Apache DataSketches library this extension can efficiently compute approximate distinct item counts and estimations of quantiles, while allowing the sketches to be serialized.
C++
MIT
rustyconover
datasketches
0.0.2
github ref
rustyconover/duckdb-datasketches
30da391b41248f24daf23d905407c5faf32148ac
17
17
677
677
/images/community_extensions/social_preview/preview_community_extension_datasketches.png
community_extension_doc

Installing and Loading

INSTALL {{ page.extension.name }} FROM community;
LOAD {{ page.extension.name }};

{% if page.docs.hello_world %}

Example

{{ page.docs.hello_world }}```
{% endif %}

{% if page.docs.extended_description %}
### About {{ page.extension.name }}
{{ page.docs.extended_description }}
{% endif %}

### Added Functions

<div class="extension_functions_table"></div>

|               function_name                | function_type | description | comment | examples |
|--------------------------------------------|---------------|-------------|---------|----------|
| datasketch_cpc                             | aggregate     | NULL        | NULL    | [NULL]   |
| datasketch_cpc_describe                    | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_cpc_estimate                    | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_cpc_is_empty                    | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_cpc_lower_bound                 | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_cpc_union                       | aggregate     | NULL        | NULL    | [NULL]   |
| datasketch_cpc_upper_bound                 | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_hll                             | aggregate     | NULL        | NULL    | [NULL]   |
| datasketch_hll_describe                    | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_hll_estimate                    | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_hll_is_compact                  | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_hll_is_empty                    | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_hll_lg_config_k                 | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_hll_lower_bound                 | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_hll_union                       | aggregate     | NULL        | NULL    | [NULL]   |
| datasketch_hll_upper_bound                 | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_kll                             | aggregate     | NULL        | NULL    | [NULL]   |
| datasketch_kll_cdf                         | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_kll_describe                    | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_kll_is_empty                    | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_kll_is_estimation_mode          | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_kll_k                           | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_kll_max_item                    | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_kll_min_item                    | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_kll_n                           | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_kll_normalized_rank_error       | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_kll_num_retained                | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_kll_pmf                         | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_kll_quantile                    | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_kll_rank                        | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_quantiles                       | aggregate     | NULL        | NULL    | [NULL]   |
| datasketch_quantiles_cdf                   | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_quantiles_describe              | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_quantiles_is_empty              | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_quantiles_is_estimation_mode    | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_quantiles_k                     | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_quantiles_max_item              | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_quantiles_min_item              | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_quantiles_n                     | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_quantiles_normalized_rank_error | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_quantiles_num_retained          | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_quantiles_pmf                   | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_quantiles_quantile              | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_quantiles_rank                  | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_req                             | aggregate     | NULL        | NULL    | [NULL]   |
| datasketch_req_cdf                         | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_req_describe                    | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_req_is_empty                    | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_req_is_estimation_mode          | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_req_k                           | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_req_max_item                    | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_req_min_item                    | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_req_n                           | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_req_num_retained                | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_req_pmf                         | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_req_quantile                    | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_req_rank                        | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_tdigest                         | aggregate     | NULL        | NULL    | [NULL]   |
| datasketch_tdigest_cdf                     | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_tdigest_describe                | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_tdigest_is_empty                | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_tdigest_k                       | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_tdigest_pmf                     | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_tdigest_quantile                | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_tdigest_rank                    | scalar        | NULL        | NULL    | [NULL]   |
| datasketch_tdigest_total_weight            | scalar        | NULL        | NULL    | [NULL]   |

### Added Types

<div class="extension_types_table"></div>

|         type_name          | type_size | logical_type | type_category | internal |
|----------------------------|----------:|--------------|---------------|---------:|
| sketch_cpc                 | 16        | BLOB         | NULL          | true     |
| sketch_hll                 | 16        | BLOB         | NULL          | true     |
| sketch_kll_bigint          | 16        | BLOB         | NULL          | true     |
| sketch_kll_double          | 16        | BLOB         | NULL          | true     |
| sketch_kll_float           | 16        | BLOB         | NULL          | true     |
| sketch_kll_integer         | 16        | BLOB         | NULL          | true     |
| sketch_kll_smallint        | 16        | BLOB         | NULL          | true     |
| sketch_kll_tinyint         | 16        | BLOB         | NULL          | true     |
| sketch_kll_ubigint         | 16        | BLOB         | NULL          | true     |
| sketch_kll_uinteger        | 16        | BLOB         | NULL          | true     |
| sketch_kll_usmallint       | 16        | BLOB         | NULL          | true     |
| sketch_kll_utinyint        | 16        | BLOB         | NULL          | true     |
| sketch_quantiles_bigint    | 16        | BLOB         | NULL          | true     |
| sketch_quantiles_double    | 16        | BLOB         | NULL          | true     |
| sketch_quantiles_float     | 16        | BLOB         | NULL          | true     |
| sketch_quantiles_integer   | 16        | BLOB         | NULL          | true     |
| sketch_quantiles_smallint  | 16        | BLOB         | NULL          | true     |
| sketch_quantiles_tinyint   | 16        | BLOB         | NULL          | true     |
| sketch_quantiles_ubigint   | 16        | BLOB         | NULL          | true     |
| sketch_quantiles_uinteger  | 16        | BLOB         | NULL          | true     |
| sketch_quantiles_usmallint | 16        | BLOB         | NULL          | true     |
| sketch_quantiles_utinyint  | 16        | BLOB         | NULL          | true     |
| sketch_req_bigint          | 16        | BLOB         | NULL          | true     |
| sketch_req_double          | 16        | BLOB         | NULL          | true     |
| sketch_req_float           | 16        | BLOB         | NULL          | true     |
| sketch_req_integer         | 16        | BLOB         | NULL          | true     |
| sketch_req_smallint        | 16        | BLOB         | NULL          | true     |
| sketch_req_tinyint         | 16        | BLOB         | NULL          | true     |
| sketch_req_ubigint         | 16        | BLOB         | NULL          | true     |
| sketch_req_uinteger        | 16        | BLOB         | NULL          | true     |
| sketch_req_usmallint       | 16        | BLOB         | NULL          | true     |
| sketch_req_utinyint        | 16        | BLOB         | NULL          | true     |
| sketch_tdigest_double      | 16        | BLOB         | NULL          | true     |
| sketch_tdigest_float       | 16        | BLOB         | NULL          | true     |