improve de-dupe docs #2925

Merged · 1 commit · Dec 18, 2024

docs/en/guides/developer/deduplicating-inserts-on-retries.md: 96 changes (86 additions, 10 deletions)

When an insert is retried, ClickHouse tries to determine whether the data has already been inserted.

**Only `*MergeTree` engines support deduplication on insertion.**

For `*ReplicatedMergeTree` engines, insert deduplication is enabled by default and is controlled by the [`replicated_deduplication_window`](/docs/en/operations/settings/merge-tree-settings#replicated-deduplication-window) and [`replicated_deduplication_window_seconds`](/docs/en/operations/settings/merge-tree-settings#replicated-deduplication-window-seconds) settings. For non-replicated `*MergeTree` engines, deduplication is controlled by the [`non_replicated_deduplication_window`](/docs/en/operations/settings/merge-tree-settings#non-replicated-deduplication-window) setting.

The settings above determine the parameters of the deduplication log for a table. The deduplication log stores a finite number of `block_id`s, which determine how deduplication works (see below).
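
As a rough sketch of how the window is sized per table (the table and column names here are illustrative, not from this guide), the setting goes in the `SETTINGS` clause of `CREATE TABLE`:

```sql
-- Keep the last 100 block_ids in the deduplication log of a
-- non-replicated MergeTree table; 0 disables insert deduplication.
CREATE TABLE dedup_example
(
    key Int64,
    value String
)
ENGINE = MergeTree
ORDER BY key
SETTINGS non_replicated_deduplication_window = 100;
```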

For `INSERT ... SELECT` queries, it is important that the `SELECT` part of the query returns the same data in the same order on every retry; otherwise the resulting blocks, and therefore their `block_id`s, will differ and deduplication will not apply.

When a table has one or more materialized views, the inserted data is also inserted into the destination of those views with the defined transformations. The transformed data is also deduplicated on retries. ClickHouse performs deduplications for materialized views in the same way it deduplicates data inserted into the target table.

You can control this process using the following settings for the source table:
- [`replicated_deduplication_window`](/docs/en/operations/settings/merge-tree-settings#replicated-deduplication-window)
- [`replicated_deduplication_window_seconds`](/docs/en/operations/settings/merge-tree-settings#replicated-deduplication-window-seconds)
- [`non_replicated_deduplication_window`](/docs/en/operations/settings/merge-tree-settings#non-replicated-deduplication-window)

You can also use the user profile setting [`deduplicate_blocks_in_dependent_materialized_views`](/docs/en/operations/settings/settings#deduplicate_blocks_in_dependent_materialized_views).
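
As a minimal sketch (the `dst` table is the one used in the examples below), the profile setting can be enabled for the current session before retrying an insert:

```sql
-- Also deduplicate the transformed blocks written to dependent
-- materialized views; this can equally be set in a user profile.
SET deduplicate_blocks_in_dependent_materialized_views = 1;

INSERT INTO dst VALUES (1, 'A');
INSERT INTO dst VALUES (1, 'A'); -- retried insert: deduplicated in dst and in the view targets
```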

When inserting blocks into tables under materialized views, ClickHouse calculates the `block_id` by hashing a string that combines the `block_id`s from the source table and additional identifiers. This ensures accurate deduplication within materialized views, allowing data to be distinguished based on its original insertion, regardless of any transformations applied before reaching the destination table under the materialized view.

```sql
SELECT
    *,
    _part
FROM dst
ORDER by all;

┌─key─┬─value─┬─_part─────┐
│   1 │ B     │ all_0_0_0 │
│   2 │ B     │ all_1_1_0 │
└─────┴───────┴───────────┘
```

Here we see that two parts have been inserted into the `dst` table: two blocks from the select result in two parts on insert. The parts contain different data.

```sql
SELECT
    *,
    _part
FROM mv_dst
ORDER by all;

┌─key─┬─value─┬─_part─────┐
│   0 │ B     │ all_0_0_0 │
│   0 │ B     │ all_1_1_0 │
└─────┴───────┴───────────┘
```

Here we see that two parts have been inserted into the `mv_dst` table. Those parts contain the same data; however, they are not deduplicated.

```sql
SELECT
    *,
    _part
FROM dst
ORDER by all;

┌─key─┬─value─┬─_part─────┐
│   1 │ B     │ all_0_0_0 │
│   2 │ B     │ all_1_1_0 │
└─────┴───────┴───────────┘

SELECT
    *,
    _part
FROM mv_dst
ORDER by all;

┌─key─┬─value─┬─_part─────┐
│   0 │ B     │ all_0_0_0 │
│   0 │ B     │ all_1_1_0 │
└─────┴───────┴───────────┘
```

Here we see that when we retry the inserts, all data is deduplicated. Deduplication works for both the `dst` and `mv_dst` tables.
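
If a retried insert is not deduplicated the way it is here, one thing worth checking is the deduplication window in effect. A quick sketch of how to inspect the server-wide defaults (per-table overrides are visible in `SHOW CREATE TABLE`):

```sql
SELECT name, value, changed
FROM system.merge_tree_settings
WHERE name LIKE '%deduplication_window%';
```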

### Identical blocks on insertion

```sql
CREATE TABLE dst
-- ...

SELECT
    'from dst',
    *,
    _part
FROM dst
ORDER by all;

┌─'from dst'─┬─key─┬─value─┬─_part─────┐
│ from dst   │   0 │ A     │ all_0_0_0 │
└────────────┴─────┴───────┴───────────┘
```

With the settings above, the select produces two blocks, so there should be two blocks to insert into table `dst`. However, we see that only one block has been inserted into table `dst`. This happened because the second block was deduplicated: it has the same data, and therefore the same deduplication key `block_id`, which is calculated as a hash of the inserted data. This behaviour is not what was expected. Such cases are rare, but they are theoretically possible. To handle them correctly, the user has to provide an `insert_deduplication_token`. Let's fix this in the following examples:
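
The token is passed as a query-level setting on the `INSERT`. The statements of the next example are collapsed in this diff view, so here is a minimal sketch of the syntax (the token and values are hypothetical):

```sql
-- With a token, the block_id is derived from the token rather than from a
-- hash of the inserted data.
INSERT INTO dst
SETTINGS insert_deduplication_token = 'some_user_token'
VALUES (0, 'A');
```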

### Identical blocks in insertion with `insert_deduplication_token`

```sql
CREATE TABLE dst
-- ...

SELECT
    'from dst',
    *,
    _part
FROM dst
ORDER by all;

┌─'from dst'─┬─key─┬─value─┬─_part─────┐
│ from dst   │   0 │ A     │ all_2_2_0 │
│ from dst   │   0 │ A     │ all_3_3_0 │
└────────────┴─────┴───────┴───────────┘
```

Two identical blocks have been inserted as expected.

```sql
SELECT
    'from dst',
    *,
    _part
FROM dst
ORDER by all;

┌─'from dst'─┬─key─┬─value─┬─_part─────┐
│ from dst   │   0 │ A     │ all_2_2_0 │
│ from dst   │   0 │ A     │ all_3_3_0 │
└────────────┴─────┴───────┴───────────┘
```

Retried insertion is deduplicated as expected.

```sql
SELECT
    'from dst',
    *,
    _part
FROM dst
ORDER by all;

┌─'from dst'─┬─key─┬─value─┬─_part─────┐
│ from dst   │   0 │ A     │ all_2_2_0 │
│ from dst   │   0 │ A     │ all_3_3_0 │
└────────────┴─────┴───────┴───────────┘
```

That insertion is also deduplicated even though it contains different inserted data. Note that `insert_deduplication_token` has higher priority: ClickHouse does not use the hash sum of data when `insert_deduplication_token` is provided.
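
A compact sketch of that priority (hypothetical values): the second statement reuses the token, so it is skipped even though its data differs.

```sql
INSERT INTO dst SETTINGS insert_deduplication_token = 'token-1' VALUES (1, 'A');
-- Same token, different data: treated as a duplicate and not inserted.
INSERT INTO dst SETTINGS insert_deduplication_token = 'token-1' VALUES (2, 'B');
```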

### Different insert operations generate the same data after transformation in the underlying table of the materialized view

```sql
CREATE TABLE dst
-- ...

SELECT
    'from dst',
    *,
    _part
FROM dst
ORDER by all;

┌─'from dst'─┬─key─┬─value─┬─_part─────┐
│ from dst   │   1 │ A     │ all_0_0_0 │
└────────────┴─────┴───────┴───────────┘

SELECT
    'from mv_dst',
    *,
    _part
FROM mv_dst
ORDER by all;

┌─'from mv_dst'─┬─key─┬─value─┬─_part─────┐
│ from mv_dst   │   0 │ A     │ all_0_0_0 │
└───────────────┴─────┴───────┴───────────┘

select 'second attempt';

INSERT INTO dst VALUES (2, 'A');
-- ...

SELECT
    'from dst',
    *,
    _part
FROM dst
ORDER by all;

┌─'from dst'─┬─key─┬─value─┬─_part─────┐
│ from dst   │   1 │ A     │ all_0_0_0 │
│ from dst   │   2 │ A     │ all_1_1_0 │
└────────────┴─────┴───────┴───────────┘

SELECT
    'from mv_dst',
    *,
    _part
FROM mv_dst
ORDER by all;

┌─'from mv_dst'─┬─key─┬─value─┬─_part─────┐
│ from mv_dst   │   0 │ A     │ all_0_0_0 │
│ from mv_dst   │   0 │ A     │ all_1_1_0 │
└───────────────┴─────┴───────┴───────────┘
```

We insert different data each time. However, the same data is inserted into the `mv_dst` table. Data is not deduplicated because the source data was different.
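
The table and view definitions are collapsed in this diff view. A plausible sketch of a materialized view that behaves this way (the actual definition in the guide may differ) maps every source row to the same key:

```sql
-- Every row inserted into dst arrives in mv_dst as (0, value), so inserts
-- with different keys but the same value produce identical rows downstream.
CREATE MATERIALIZED VIEW mv_example TO mv_dst
AS SELECT 0 AS key, value FROM dst;
```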

### Different materialized view inserts into one underlying table with equivalent data

```sql
CREATE TABLE dst
-- ...

SELECT
    'from dst',
    *,
    _part
FROM dst
ORDER by all;

┌─'from dst'─┬─key─┬─value─┬─_part─────┐
│ from dst   │   1 │ A     │ all_0_0_0 │
└────────────┴─────┴───────┴───────────┘

SELECT
    'from mv_dst',
    *,
    _part
FROM mv_dst
ORDER by all;

┌─'from mv_dst'─┬─key─┬─value─┬─_part─────┐
│ from mv_dst   │   0 │ A     │ all_0_0_0 │
│ from mv_dst   │   0 │ A     │ all_1_1_0 │
└───────────────┴─────┴───────┴───────────┘
```

Two equal blocks have been inserted into the `mv_dst` table (as expected).
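
The setup is collapsed in this diff view; a sketch of what it may look like (the view names are assumptions) is two materialized views applying the same transformation into one underlying table:

```sql
-- One insert into dst goes through both views, producing two equal blocks
-- in mv_dst that are not deduplicated against each other.
CREATE MATERIALIZED VIEW mv_first TO mv_dst
AS SELECT 0 AS key, value FROM dst;

CREATE MATERIALIZED VIEW mv_second TO mv_dst
AS SELECT 0 AS key, value FROM dst;
```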

```sql
SELECT
    'from dst',
    *,
    _part
FROM dst
ORDER by all;

┌─'from dst'─┬─key─┬─value─┬─_part─────┐
│ from dst   │   1 │ A     │ all_0_0_0 │
└────────────┴─────┴───────┴───────────┘

SELECT
    'from mv_dst',
    *,
    _part
FROM mv_dst
ORDER by all;

┌─'from mv_dst'─┬─key─┬─value─┬─_part─────┐
│ from mv_dst   │   0 │ A     │ all_0_0_0 │
│ from mv_dst   │   0 │ A     │ all_1_1_0 │
└───────────────┴─────┴───────┴───────────┘
```

That retry operation is deduplicated in both the `dst` and `mv_dst` tables.