Skip to content

Commit

Permalink
Merge pull request #4377 from soerenwolfers/patch-71
Browse files Browse the repository at this point in the history
Update samples.md
  • Loading branch information
szarnyasg authored Dec 17, 2024
2 parents 3aacb04 + 05b96f9 commit 73e338c
Showing 1 changed file with 21 additions and 11 deletions.
32 changes: 21 additions & 11 deletions docs/sql/samples.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,31 +8,41 @@ Samples are used to randomly select a subset of a dataset.

### Examples

Select a sample of 5 rows from `tbl` using system sampling (cluster sampling):
Select a sample of exactly 5 rows from `tbl` using `reservoir` sampling:

```sql
SELECT *
FROM tbl
USING SAMPLE 5;
```

Select a sample of 10% of the table using system sampling (cluster sampling):
Select a sample of *approximately* 10% of the table using `system` sampling:

```sql
SELECT *
FROM tbl
USING SAMPLE 10%;
```

Select a sample of 10% of the table using bernoulli sampling:
> Warning By default, when you specify a percentage, each [*vector*]({% link docs/internals/vector.md %}) is included in the sample with that probability. If your table contains fewer than ~10k rows, it makes sense to specify the `bernoulli` sampling option instead, which applies the probability to each row independently. Even then, you'll sometimes get more and sometimes less than the specified percentage of the number of rows, but it is much less likely that you get no rows at all. To get exactly 10% of rows (up to rounding), you must use the `reservoir` sampling option.
Select a sample of *approximately* 10% of the table using `bernoulli` sampling:

```sql
SELECT *
FROM tbl
USING SAMPLE 10 PERCENT (bernoulli);
```

Select a sample of 50 rows of the table using reservoir sampling with a fixed seed (100):
Select a sample of *exactly* 10% (up to rounding) of the table using `reservoir` sampling:

```sql
SELECT *
FROM tbl
USING SAMPLE 10 PERCENT (reservoir);
```

Select a sample of exactly 50 rows of the table using reservoir sampling with a fixed seed (100):

```sql
SELECT *
Expand All @@ -41,23 +51,23 @@ USING SAMPLE reservoir(50 ROWS)
REPEATABLE (100);
```

Select a sample of 20% of the table using system sampling with a fixed seed (377):
Select a sample of 20% of the table using `system` sampling with a fixed seed (377):

```sql
SELECT *
FROM tbl
USING SAMPLE 20% (system, 377);
```

Select a sample of 20% of `tbl` **before** the join with `tbl2`:
Select a sample of *approximately* 20% of `tbl` **before** the join with `tbl2`:

```sql
SELECT *
FROM tbl TABLESAMPLE reservoir(20%), tbl2
WHERE tbl.i = tbl2.i;
```

Select a sample of 20% of `tbl` **after** the join with `tbl2`:
Select a sample of *approximately* 20% of `tbl` **after** the join with `tbl2`:

```sql
SELECT *
Expand All @@ -74,7 +84,7 @@ Samples allow you to randomly extract a subset of a dataset. Samples are useful

DuckDB supports three different types of sampling methods: `reservoir`, `bernoulli` and `system`. By default, DuckDB uses `reservoir` sampling when an exact number of rows is sampled, and `system` sampling when a percentage is specified. The sampling methods are described in detail below.

Samples require a *sample size*, which is an indication of how many elements will be sampled from the total population. Samples can either be given as a percentage (`10%`) or as a fixed number of rows (`10 rows`). All three sampling methods support sampling over a percentage, but **only** reservoir sampling supports sampling a fixed number of rows.
Samples require a *sample size*, which is an indication of how many elements will be sampled from the total population. Samples can either be given as a percentage (`10%` or `10 PERCENT`) or as a fixed number of rows (`10` or `10 ROWS`). All three sampling methods support sampling over a percentage, but **only** reservoir sampling supports sampling a fixed number of rows.

Samples are probablistic, that is to say, samples can be different between runs *unless* the seed is specifically specified. Specifying the seed *only* guarantees that the sample is the same if multi-threading is not enabled (i.e., `SET threads = 1`). In the case of multiple threads running over a sample, samples are not necessarily consistent even with a fixed seed.

Expand All @@ -91,15 +101,15 @@ Reservoir sampling also incurs an additional performance penalty when multi-proc
### `bernoulli`

Bernoulli sampling can only be used when a sampling percentage is specified. It is rather straightforward: every tuple in the underlying table is included with a chance equal to the specified percentage. As a result, bernoulli sampling can return a different number of tuples even if the same percentage is specified. The amount of rows will generally be more or less equal to the specified percentage of the table, but there will be some variance.
Bernoulli sampling can only be used when a sampling percentage is specified. It is rather straightforward: every row in the underlying table is included with a chance equal to the specified percentage. As a result, bernoulli sampling can return a different number of tuples even if the same percentage is specified. The *expected* number of rows is equal to the specified percentage of the table, but there will be some *variance*.

Because bernoulli sampling is completely independent (there is no shared state), there is no penalty for using bernoulli sampling together with multiple threads.

### `system`

System sampling is a variant of bernoulli sampling with one crucial difference: every *vector* is included with a chance equal to the sampling percentage. This is a form of cluster sampling. System sampling is more efficient than bernoulli sampling, as no per-tuple selections have to be performed. There is almost no extra overhead for using system sampling, whereas bernoulli sampling can add additional cost as it has to perform random number generation for every single tuple.
System sampling is a variant of bernoulli sampling with one crucial difference: every *vector* is included with a chance equal to the sampling percentage. This is a form of cluster sampling. System sampling is more efficient than bernoulli sampling, as no per-tuple selections have to be performed.

System sampling is not suitable for smaller data sets as the granularity of the sampling is on the order of ~1000 tuples. That means that if system sampling is used for small data sets (e.g., 100 rows) either all the data will be filtered out, or all the data will be included.
The *expected* number of rows is still equal to the specified percentage of the table, but the variance is `vectorSize` times higher. As such, system sampling is not suitable for data sets with fewer than ~10k rows, where it can happen that all rows will be filtered out, or all the data will be included, even when you ask for `50 PERCENT`.

## Table Samples

Expand Down

0 comments on commit 73e338c

Please sign in to comment.