Replies: 5 comments
-
Yes - the way you're interpreting the source code is correct. Isn't the cartesian product too big for your proposed strategy to work? To take an example, suppose there are 1e6 input rows. Then the cartesian product is 1e12 rows, which is usually too large to calculate. So I'm assuming that attempting to calculate it and then sampling from it would fail. (I might be wrong about that; the SQL engine may optimise it?) The current strategy is to sample (say) 1e4 random rows from the input dataset and compute the cartesian product of the sample, which is 1e8, i.e. computationally tractable. (The actual formula is probably n(n-1)/2 rather than n^2, but you get the idea.)
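For concreteness, the pair counts for those illustrative figures (1e6 input rows vs. a 1e4-row sample), using the n(n-1)/2 formula, can be checked directly in a SQL shell:

```sql
-- Distinct-pair counts for the illustrative row counts above.
SELECT
    1e6 * (1e6 - 1) / 2 AS pairs_full_input,   -- ~5e11 pairs: not tractable
    1e4 * (1e4 - 1) / 2 AS pairs_from_sample;  -- ~5e7 pairs: tractable
```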
-
I thought that engines worked in a streaming fashion so that they don't materialize the whole thing, but I just tried CREATE TABLE t AS FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/lineitem.parquet';
SELECT * FROM t, t AS t2 USING SAMPLE 5; at https://shell.duckdb.org/ and it took 11 seconds, so I think it materialized all 3621030625 rows :( So yes, that really is a problem. BUT:
-
One of the things going on there is that the first time you run that query in the DuckDB shell, it downloads the dataset and caches it locally. Subsequent queries then run faster. My guess is that if the operation is a join, then at a minimum it will have to compute the 'list of rows' before then sampling from this list, even if it can somehow avoid calculating any derived fields in these rows (e.g. a Levenshtein distance).
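As a rough sketch of how pre-join sampling avoids that cost (this assumes the `t` table created in the earlier query, and uses levenshtein() on l_comment purely as an illustrative derived field, not as anything the package actually computes):

```sql
-- Draw one sample of the input, then self-join it, so derived fields such as
-- levenshtein() are only computed on ~1e6 sampled pairs rather than on all
-- ~3.6e9 pairs. MATERIALIZED keeps both references to the CTE pointing at the
-- same sample rather than re-sampling.
WITH sampled AS MATERIALIZED (SELECT * FROM t USING SAMPLE 1000)
SELECT levenshtein(a.l_comment, b.l_comment) AS comment_distance
FROM sampled AS a, sampled AS b;
```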
-
If I run the query multiple times, the times do change a bit to reflect the caching, but not by much. The essence is still the same, I think. I've asked a question on Stack Overflow; we'll see if we get any bites.
-
Also, wait, is the current implementation susceptible to the "oversampling" problem? Depending on which records are sampled before the join, those ones are going to show up in the sample over and over again, and any record not caught in that sample won't be present. I guess since all records have equal probability, there isn't really any bias one way or another. It's just that we don't get as diverse of a spread as we might ideally want.
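One way to see that effect concretely (a hypothetical sketch against the `t` table from earlier; the sample size of 1000 is arbitrary) is to count how many sampled pairs each record ends up in:

```sql
-- How often does each input record appear among the sampled pairs when we
-- sample before the self-join?
WITH sampled AS MATERIALIZED (SELECT * FROM t USING SAMPLE 1000)
SELECT a.l_orderkey, a.l_linenumber, count(*) AS n_pairs
FROM sampled AS a, sampled AS b
GROUP BY a.l_orderkey, a.l_linenumber
ORDER BY n_pairs DESC;
-- Each record caught in the sample appears in exactly 1000 pairs; every
-- record outside the sample appears in none - the limited-diversity effect
-- described above, even though selection into the sample is unbiased.
```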
-
Currently, from reading the source, it looks like we sample, and then block together. Wouldn't it be simpler to do a cartesian product and then sample from this? Am I missing something?
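To make the two alternatives concrete, here is a minimal sketch against a hypothetical `input` table (names and sample sizes are illustrative, not the package's actual SQL):

```sql
-- Alternative raised in the question: form the full cartesian product,
-- then sample pairs from it. Simple, but the engine has to enumerate
-- every pair first.
SELECT * FROM input AS a, input AS b USING SAMPLE 10000;

-- What the source appears to do instead: sample the input rows once,
-- then form pairs only within that sample.
WITH sampled AS MATERIALIZED (SELECT * FROM input USING SAMPLE 10000)
SELECT * FROM sampled AS a, sampled AS b;
```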