Replies: 5 comments
-
Yes - the way you're interpreting the source code is correct. Isn't the cartesian product too big for your proposed strategy to work? To take an example, suppose there are 1e6 input rows. Then the cartesian product is 1e12 rows, which is usually too large to calculate. So I'm assuming that attempting to calculate it and then sampling from it would fail. (I might be wrong about that; the SQL engine may optimise it?) The current strategy is to sample (say) 1e4 random rows from the input dataset and compute the cartesian product of the sample, which is 1e8, i.e. computationally tractable. (The actual formula is probably n(n-1)/2 rather than n^2, but you get the idea.)
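For concreteness, the pair counts for those illustrative figures (1e6 input rows vs. a 1e4-row sample), using the n(n-1)/2 formula, can be checked directly in a SQL shell:

```sql
-- Distinct-pair counts for the illustrative row counts above.
SELECT
    1e6 * (1e6 - 1) / 2 AS pairs_full_input,   -- ~5e11 pairs: not tractable
    1e4 * (1e4 - 1) / 2 AS pairs_from_sample;  -- ~5e7 pairs: tractable
```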
-
I thought that engines worked in a streaming fashion so that they don't materialize the whole thing, but I just tried CREATE TABLE t AS FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/lineitem.parquet';
SELECT * FROM t, t AS t2 USING SAMPLE 5; at https://shell.duckdb.org/ and it took 11 seconds, so I think it materialized all 3621030625 rows :( So yes, that really is a problem. BUT:
-
One of the things going on there is that the first time you run that query in the DuckDB shell, it downloads the dataset and caches it locally. Subsequent queries then run faster. My guess is that if the operation is a join, then at a minimum it will have to compute the 'list of rows' before then sampling from this list, even if it can somehow avoid calculating any derived fields in these rows (e.g. a Levenshtein distance).
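As a rough sketch of how pre-join sampling avoids that cost (this assumes the `t` table created in the earlier query, and uses levenshtein() on l_comment purely as an illustrative derived field, not as anything the package actually computes):

```sql
-- Draw one sample of the input, then self-join it, so derived fields such as
-- levenshtein() are only computed on ~1e6 sampled pairs rather than on all
-- ~3.6e9 pairs. MATERIALIZED keeps both references to the CTE pointing at the
-- same sample rather than re-sampling.
WITH sampled AS MATERIALIZED (SELECT * FROM t USING SAMPLE 1000)
SELECT levenshtein(a.l_comment, b.l_comment) AS comment_distance
FROM sampled AS a, sampled AS b;
```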
-
If I run the query multiple times, the times do change a bit to reflect the caching, but not by much. The essence is still the same, I think. I've asked a question on Stack Overflow; we'll see if we get any bites.
-
Also, wait, is the current implementation susceptible to the "oversampling" problem? Depending on which records are sampled before the join, those ones are going to show up in the sample over and over again, and any record not caught in that sample won't be present. I guess since all records have equal probability, there isn't really any bias one way or another. It's just that we don't get as diverse of a spread as we might ideally want.
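One way to see that effect concretely (a hypothetical sketch against the `t` table from earlier; the sample size of 1000 is arbitrary) is to count how many sampled pairs each record ends up in:

```sql
-- How often does each input record appear among the sampled pairs when we
-- sample before the self-join?
WITH sampled AS MATERIALIZED (SELECT * FROM t USING SAMPLE 1000)
SELECT a.l_orderkey, a.l_linenumber, count(*) AS n_pairs
FROM sampled AS a, sampled AS b
GROUP BY a.l_orderkey, a.l_linenumber
ORDER BY n_pairs DESC;
-- Each record caught in the sample appears in exactly 1000 pairs; every
-- record outside the sample appears in none - the limited-diversity effect
-- described above, even though selection into the sample is unbiased.
```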
-
Currently, from reading the source, it looks like we sample, and then block together. Wouldn't it be simpler to do a cartesian product and then sample from this? Am I missing something?
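To make the two alternatives concrete, here is a minimal sketch against a hypothetical `input` table (names and sample sizes are illustrative, not the package's actual SQL):

```sql
-- Alternative raised in the question: form the full cartesian product,
-- then sample pairs from it. Simple, but the engine has to enumerate
-- every pair first.
SELECT * FROM input AS a, input AS b USING SAMPLE 10000;

-- What the source appears to do instead: sample the input rows once,
-- then form pairs only within that sample.
WITH sampled AS MATERIALIZED (SELECT * FROM input USING SAMPLE 10000)
SELECT * FROM sampled AS a, sampled AS b;
```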