## Snapshot duplicates

* https://docs.getdbt.com/docs/build/snapshots
* https://docs.getdbt.com/docs/build/snapshots#snapshot-query-best-practices
* https://noahlk.medium.com/debug-diaries-duplicate-records-in-dbt-snapshots-2f1d961e3fd2
* https://gist.github.com/jeremyyeo/7282a2e25d86fe8b449ed70e8cdf10ff

Example scenarios of how dbt snapshots come to have duplicates.

> The following examples are on Snowflake, but the same concepts should apply across different data warehouses/databases.
### The unique key is not unique

https://docs.getdbt.com/docs/build/snapshots#ensure-your-unique-key-is-really-unique

The unique key in the data we are snapshotting is actually not unique - there is just an assumption that it would always be unique. Keep in mind that simply specifying a `unique_key` config does not mean that dbt will automatically make our raw data unique by that key.

Additionally, the source data we are snapshotting could have been unique at one time but not at another. For example, our raw data looks like:

```
id,name
1,alice
1,alice
2,bob
```

There is an obvious duplicate. Our snapshot runs and it either introduces a duplicate into our snapshot table OR it fails with an error like `UPDATE/MERGE must match at most one source row for each target row`. Before we manage to check the raw data (i.e. come into the office for the day to start our work), our EL tool has come along and updated the raw data to be:

```
id,name
2,bob
3,eve
```

We check our raw data, see that it is now really unique, and wonder why our snapshot errored the night before. That's because it was indeed not unique the night before - the EL tool simply tidied it up before we arrived at the office - which is why we assume the raw data is always unique all the time. This is difficult to track because the nature of snapshots is that raw data is always changing, so our best bet is to make a backup of the raw data prior to snapshotting it. That way, when we come into the office the next day, we can double-check the condition of our raw data without it being affected by the nightly EL run.
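For example, a pre-snapshot check and backup could look like the following - a minimal sketch against the example `raw` table above (the `raw_backup` table name is hypothetical):

```sql
-- Sketch: verify the unique_key assumption on the raw data right before snapshotting.
-- Any rows returned mean the snapshot is about to ingest duplicates.
select id, count(*) as n
from development_jyeo.dbt_jyeo.raw
group by id
having count(*) > 1;

-- Sketch: keep a zero-copy backup of the raw data (Snowflake clone) so that
-- last night's state can still be inspected after the EL run rewrites the table.
create or replace table development_jyeo.dbt_jyeo.raw_backup clone development_jyeo.dbt_jyeo.raw;
```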
### The snapshot is running in parallel / concurrently

If we are running multiple jobs that include the snapshot, there is a chance that both jobs coincide with one another and attempt to run the snapshot at the exact same time - or close to it. Because of this, two `merge into <your snapshot table> using ...` statements run in close proximity to one another, inserting the same data twice. If we have many dbt jobs, it can be difficult to identify which two (or more) job runs are responsible for running our snapshots in parallel - in those cases, the query history logs provided by our data warehouse can come in handy.
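On Snowflake, for example, the account usage views can surface merges against the snapshot table that ran at (nearly) the same time - a minimal sketch, assuming a snapshot table named `snappy` as in the examples below:

```sql
-- Sketch: list merge statements against the snapshot table, most recent first.
-- Two rows with overlapping start_time/end_time windows point at parallel snapshot runs.
select query_id, start_time, end_time, query_text
from snowflake.account_usage.query_history
where query_text ilike '%merge into%snappy%'
order by start_time desc;
```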
### There is a data type mismatch between the existing snapshot and the incoming raw data

This is a really tricky one, but essentially you have a column in the snapshot that is, say, of a `DATE` type while the incoming data is of a `TIMESTAMP` type - let's have a look at an example.
First we create our "raw" data:

```sql
create or replace table development_jyeo.dbt_jyeo.raw as
select 1 id, 'alice' as first_name, '1970-01-01 01:01:01'::timestamp as updated_at
```
Then we create a snapshot that uses that raw data:

```sql
{% snapshot snappy %}
{{ config(target_schema='dbt_jyeo', unique_key='id', strategy='check', check_cols='all') }}
select id, first_name, updated_at::date as updated_at
from development_jyeo.dbt_jyeo.raw
{% endsnapshot %}
```
We run our snapshot for the first time:

```sh
$ dbt --debug snapshot
03:09:26 1 of 1 START snapshot dbt_jyeo.snappy .......................................... [RUN]
03:09:26 Re-using an available connection from the pool (formerly list_development_jyeo_dbt_jyeo, now snapshot.my_dbt_project.snappy)
03:09:26 Began compiling node snapshot.my_dbt_project.snappy
03:09:26 Began executing node snapshot.my_dbt_project.snappy
03:09:26 Writing runtime sql for node "snapshot.my_dbt_project.snappy"
03:09:26 Using snowflake connection "snapshot.my_dbt_project.snappy"
03:09:26 On snapshot.my_dbt_project.snappy: /* {"app": "dbt", "dbt_version": "1.8.3", "profile_name": "all", "target_name": "sf", "node_id": "snapshot.my_dbt_project.snappy"} */
create or replace transient table development_jyeo.dbt_jyeo.snappy
as
(
    select *,
        md5(coalesce(cast(id as varchar ), '')
            || '|' || coalesce(cast(to_timestamp_ntz(convert_timezone('UTC', current_timestamp())) as varchar ), '')
        ) as dbt_scd_id,
        to_timestamp_ntz(convert_timezone('UTC', current_timestamp())) as dbt_updated_at,
        to_timestamp_ntz(convert_timezone('UTC', current_timestamp())) as dbt_valid_from,
        nullif(to_timestamp_ntz(convert_timezone('UTC', current_timestamp())), to_timestamp_ntz(convert_timezone('UTC', current_timestamp()))) as dbt_valid_to
    from (
        select id, first_name, updated_at::date as updated_at
        from development_jyeo.dbt_jyeo.raw
    ) sbq
);
03:09:26 Opening a new connection, currently in state closed
03:09:28 SQL status: SUCCESS 1 in 2.0 seconds
03:09:28 On snapshot.my_dbt_project.snappy: Close
03:09:29 Sending event: {'category': 'dbt', 'action': 'run_model', 'label': '748bc67e-f903-468a-926c-8d2b775fd7e5', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x16236f710>]}
03:09:29 1 of 1 OK snapshotted dbt_jyeo.snappy .......................................... [SUCCESS 1 in 2.31s]
```
Our snapshot is created successfully, and when we query it, things look good:

![alt text](image.png)
Now, we don't change our raw data, but we do change our snapshot code slightly:

```sql
{% snapshot snappy %}
{{ config(target_schema='dbt_jyeo', unique_key='id', strategy='check', check_cols='all') }}
select id, first_name, updated_at
from development_jyeo.dbt_jyeo.raw
{% endsnapshot %}
```
Here we removed the casting of the `updated_at` column to `date` - perhaps because we saw that the snapshot already contained a `date` type, or for some other reason.

We then resnapshot:
```sh
$ dbt --debug snapshot
03:22:10 1 of 1 START snapshot dbt_jyeo.snappy .......................................... [RUN]
03:22:10 Re-using an available connection from the pool (formerly list_development_jyeo_dbt_jyeo, now snapshot.my_dbt_project.snappy)
03:22:10 Began compiling node snapshot.my_dbt_project.snappy
03:22:10 Began executing node snapshot.my_dbt_project.snappy
03:22:10 Using snowflake connection "snapshot.my_dbt_project.snappy"
03:22:10 On snapshot.my_dbt_project.snappy: /* {"app": "dbt", "dbt_version": "1.8.3", "profile_name": "all", "target_name": "sf", "node_id": "snapshot.my_dbt_project.snappy"} */
select * from (
    select id, first_name, updated_at
    from development_jyeo.dbt_jyeo.raw
) as __dbt_sbq
where false
limit 0
03:22:10 Opening a new connection, currently in state closed
03:22:12 SQL status: SUCCESS 0 in 1.0 seconds
03:22:12 Using snowflake connection "snapshot.my_dbt_project.snappy"
03:22:12 On snapshot.my_dbt_project.snappy: /* {"app": "dbt", "dbt_version": "1.8.3", "profile_name": "all", "target_name": "sf", "node_id": "snapshot.my_dbt_project.snappy"} */
describe table "DEVELOPMENT_JYEO"."DBT_JYEO"."SNAPPY"
03:22:12 SQL status: SUCCESS 7 in 0.0 seconds
03:22:12 Using snowflake connection "snapshot.my_dbt_project.snappy"
03:22:12 On snapshot.my_dbt_project.snappy: /* {"app": "dbt", "dbt_version": "1.8.3", "profile_name": "all", "target_name": "sf", "node_id": "snapshot.my_dbt_project.snappy"} */
describe table "DEVELOPMENT_JYEO"."DBT_JYEO"."SNAPPY"
03:22:12 SQL status: SUCCESS 7 in 0.0 seconds
03:22:12 Using snowflake connection "snapshot.my_dbt_project.snappy"
03:22:12 On snapshot.my_dbt_project.snappy: /* {"app": "dbt", "dbt_version": "1.8.3", "profile_name": "all", "target_name": "sf", "node_id": "snapshot.my_dbt_project.snappy"} */
create or replace temporary table "DEVELOPMENT_JYEO"."DBT_JYEO"."SNAPPY__dbt_tmp"
as
(with snapshot_query as (

    select id, first_name, updated_at
    from development_jyeo.dbt_jyeo.raw

),

snapshotted_data as (

    select *,
        id as dbt_unique_key

    from "DEVELOPMENT_JYEO"."DBT_JYEO"."SNAPPY"
    where dbt_valid_to is null

),

insertions_source_data as (

    select
        *,
        id as dbt_unique_key,
        to_timestamp_ntz(convert_timezone('UTC', current_timestamp())) as dbt_updated_at,
        to_timestamp_ntz(convert_timezone('UTC', current_timestamp())) as dbt_valid_from,
        nullif(to_timestamp_ntz(convert_timezone('UTC', current_timestamp())), to_timestamp_ntz(convert_timezone('UTC', current_timestamp()))) as dbt_valid_to,
        md5(coalesce(cast(id as varchar ), '')
            || '|' || coalesce(cast(to_timestamp_ntz(convert_timezone('UTC', current_timestamp())) as varchar ), '')
        ) as dbt_scd_id

    from snapshot_query
),

updates_source_data as (

    select
        *,
        id as dbt_unique_key,
        to_timestamp_ntz(convert_timezone('UTC', current_timestamp())) as dbt_updated_at,
        to_timestamp_ntz(convert_timezone('UTC', current_timestamp())) as dbt_valid_from,
        to_timestamp_ntz(convert_timezone('UTC', current_timestamp())) as dbt_valid_to

    from snapshot_query
),

insertions as (

    select
        'insert' as dbt_change_type,
        source_data.*

    from insertions_source_data as source_data
    left outer join snapshotted_data on snapshotted_data.dbt_unique_key = source_data.dbt_unique_key
    where snapshotted_data.dbt_unique_key is null
        or (
            snapshotted_data.dbt_unique_key is not null
            and (
                (snapshotted_data."ID" != source_data."ID"
                or
                (
                    ((snapshotted_data."ID" is null) and not (source_data."ID" is null))
                    or
                    ((not snapshotted_data."ID" is null) and (source_data."ID" is null))
                ) or snapshotted_data."FIRST_NAME" != source_data."FIRST_NAME"
                or
                (
                    ((snapshotted_data."FIRST_NAME" is null) and not (source_data."FIRST_NAME" is null))
                    or
                    ((not snapshotted_data."FIRST_NAME" is null) and (source_data."FIRST_NAME" is null))
                ) or snapshotted_data."UPDATED_AT" != source_data."UPDATED_AT"
                or
                (
                    ((snapshotted_data."UPDATED_AT" is null) and not (source_data."UPDATED_AT" is null))
                    or
                    ((not snapshotted_data."UPDATED_AT" is null) and (source_data."UPDATED_AT" is null))
                ))
            )
        )

),

updates as (

    select
        'update' as dbt_change_type,
        source_data.*,
        snapshotted_data.dbt_scd_id

    from updates_source_data as source_data
    join snapshotted_data on snapshotted_data.dbt_unique_key = source_data.dbt_unique_key
    where (
        (snapshotted_data."ID" != source_data."ID"
        or
        (
            ((snapshotted_data."ID" is null) and not (source_data."ID" is null))
            or
            ((not snapshotted_data."ID" is null) and (source_data."ID" is null))
        ) or snapshotted_data."FIRST_NAME" != source_data."FIRST_NAME"
        or
        (
            ((snapshotted_data."FIRST_NAME" is null) and not (source_data."FIRST_NAME" is null))
            or
            ((not snapshotted_data."FIRST_NAME" is null) and (source_data."FIRST_NAME" is null))
        ) or snapshotted_data."UPDATED_AT" != source_data."UPDATED_AT"
        or
        (
            ((snapshotted_data."UPDATED_AT" is null) and not (source_data."UPDATED_AT" is null))
            or
            ((not snapshotted_data."UPDATED_AT" is null) and (source_data."UPDATED_AT" is null))
        ))
    )
)

select * from insertions
union all
select * from updates

);
03:22:13 SQL status: SUCCESS 1 in 1.0 seconds
03:22:13 Using snowflake connection "snapshot.my_dbt_project.snappy"
03:22:13 On snapshot.my_dbt_project.snappy: /* {"app": "dbt", "dbt_version": "1.8.3", "profile_name": "all", "target_name": "sf", "node_id": "snapshot.my_dbt_project.snappy"} */
describe table "DEVELOPMENT_JYEO"."DBT_JYEO"."SNAPPY__dbt_tmp"
03:22:13 SQL status: SUCCESS 9 in 0.0 seconds
03:22:13 Using snowflake connection "snapshot.my_dbt_project.snappy"
03:22:13 On snapshot.my_dbt_project.snappy: /* {"app": "dbt", "dbt_version": "1.8.3", "profile_name": "all", "target_name": "sf", "node_id": "snapshot.my_dbt_project.snappy"} */
describe table "DEVELOPMENT_JYEO"."DBT_JYEO"."SNAPPY"
03:22:14 SQL status: SUCCESS 7 in 0.0 seconds
03:22:14 Using snowflake connection "snapshot.my_dbt_project.snappy"
03:22:14 On snapshot.my_dbt_project.snappy: /* {"app": "dbt", "dbt_version": "1.8.3", "profile_name": "all", "target_name": "sf", "node_id": "snapshot.my_dbt_project.snappy"} */
describe table "DEVELOPMENT_JYEO"."DBT_JYEO"."SNAPPY__dbt_tmp"
03:22:14 SQL status: SUCCESS 9 in 0.0 seconds
03:22:14 Using snowflake connection "snapshot.my_dbt_project.snappy"
03:22:14 On snapshot.my_dbt_project.snappy: /* {"app": "dbt", "dbt_version": "1.8.3", "profile_name": "all", "target_name": "sf", "node_id": "snapshot.my_dbt_project.snappy"} */
describe table "DEVELOPMENT_JYEO"."DBT_JYEO"."SNAPPY"
03:22:14 SQL status: SUCCESS 7 in 0.0 seconds
03:22:14 Using snowflake connection "snapshot.my_dbt_project.snappy"
03:22:14 On snapshot.my_dbt_project.snappy: /* {"app": "dbt", "dbt_version": "1.8.3", "profile_name": "all", "target_name": "sf", "node_id": "snapshot.my_dbt_project.snappy"} */
describe table "DEVELOPMENT_JYEO"."DBT_JYEO"."SNAPPY__dbt_tmp"
03:22:14 SQL status: SUCCESS 9 in 0.0 seconds
03:22:14 Writing runtime sql for node "snapshot.my_dbt_project.snappy"
03:22:14 Using snowflake connection "snapshot.my_dbt_project.snappy"
03:22:14 On snapshot.my_dbt_project.snappy: /* {"app": "dbt", "dbt_version": "1.8.3", "profile_name": "all", "target_name": "sf", "node_id": "snapshot.my_dbt_project.snappy"} */
BEGIN
03:22:15 SQL status: SUCCESS 1 in 0.0 seconds
03:22:15 Using snowflake connection "snapshot.my_dbt_project.snappy"
03:22:15 On snapshot.my_dbt_project.snappy: /* {"app": "dbt", "dbt_version": "1.8.3", "profile_name": "all", "target_name": "sf", "node_id": "snapshot.my_dbt_project.snappy"} */
merge into "DEVELOPMENT_JYEO"."DBT_JYEO"."SNAPPY" as DBT_INTERNAL_DEST
using "DEVELOPMENT_JYEO"."DBT_JYEO"."SNAPPY__dbt_tmp" as DBT_INTERNAL_SOURCE
on DBT_INTERNAL_SOURCE.dbt_scd_id = DBT_INTERNAL_DEST.dbt_scd_id

when matched
    and DBT_INTERNAL_DEST.dbt_valid_to is null
    and DBT_INTERNAL_SOURCE.dbt_change_type in ('update', 'delete')
    then update
    set dbt_valid_to = DBT_INTERNAL_SOURCE.dbt_valid_to

when not matched
    and DBT_INTERNAL_SOURCE.dbt_change_type = 'insert'
    then insert ("ID", "FIRST_NAME", "UPDATED_AT", "DBT_UPDATED_AT", "DBT_VALID_FROM", "DBT_VALID_TO", "DBT_SCD_ID")
    values ("ID", "FIRST_NAME", "UPDATED_AT", "DBT_UPDATED_AT", "DBT_VALID_FROM", "DBT_VALID_TO", "DBT_SCD_ID")

;
03:22:15 SQL status: SUCCESS 2 in 1.0 seconds
03:22:15 Using snowflake connection "snapshot.my_dbt_project.snappy"
03:22:15 On snapshot.my_dbt_project.snappy: /* {"app": "dbt", "dbt_version": "1.8.3", "profile_name": "all", "target_name": "sf", "node_id": "snapshot.my_dbt_project.snappy"} */
COMMIT
03:22:16 SQL status: SUCCESS 1 in 1.0 seconds
03:22:16 On snapshot.my_dbt_project.snappy: Close
03:22:16 Sending event: {'category': 'dbt', 'action': 'run_model', 'label': '2186ba81-1dde-4ef5-a6ba-0335d4cd78cd', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x126c93c10>]}
03:22:16 1 of 1 OK snapshotted dbt_jyeo.snappy .......................................... [SUCCESS 2 in 5.80s]
```
Now when we query our snapshot:

![alt text](image-1.png)
We can see we have added a row that is pretty much identical in its dimensions (`first_name`, `updated_at`). In fact, every time we snapshot, we will introduce a new row of the exact same data - for example, after 2 more snapshot operations:

![alt text](image-2.png)
The reason for this is the query that compares the columns:

```sql
snapshotted_data."UPDATED_AT" != source_data."UPDATED_AT"
```

We're comparing the `updated_at` column of the selection in the snapshot code (raw data) to the column that already exists in the snapshot - and as we can see from this simple query, they are not equivalent:

```sql
select '1970-01-01 01:01:01'::timestamp = '1970-01-01 01:01:01'::date as is_identical
-- FALSE
```

Because they are not equivalent, the incoming data appears to be new and distinct from the old record. And during the merge, the `timestamp` data was neatly truncated into a `date` - resulting in a brand new record in the snapshot table that looks identical to all those before it.
> The `timestamp` > `date` truncation can be seen in this very quick example:
> ```sql
> create table test_insert as select '1970-01-01'::date as c;
> insert into test_insert values ('1970-01-01 01:01:01'::timestamp);
> select * from test_insert;
> -- 1970-01-01, 1970-01-01
> ```
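One way to catch this class of issue early is to compare the column types of the existing snapshot against what the snapshot query now produces - a minimal sketch using Snowflake's information schema and the snapshot table from the example above:

```sql
-- Sketch: inspect the snapshot's current column types; compare these against the
-- types of the snapshot query's output (e.g. via describe or a select ... limit 0).
select column_name, data_type
from development_jyeo.information_schema.columns
where table_schema = 'DBT_JYEO'
  and table_name = 'SNAPPY'
order by ordinal_position;
```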
Note that this example isn't exclusive to `date`/`timestamp` types - it also applies to other data types like `float`/`decimal`. Try snapshotting this for the first time:

```sql
{% snapshot snappy %}
{{ config(target_schema='dbt_jyeo', unique_key='id', strategy='check', check_cols='all') }}
select 1 as id, 1.11 as c
{% endsnapshot %}
```
Modify the snapshot to be:

```sql
{% snapshot snappy %}
{{ config(target_schema='dbt_jyeo', unique_key='id', strategy='check', check_cols='all') }}
select 1 as id, 1.111 as c
{% endsnapshot %}
```
And then snapshot again. The outcome will be similar to the above - a seemingly new row with the exact same data:

![alt text](image-3.png)
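If we suspect this has happened, a quick check is to look for repeated versions of the same dimensions in the snapshot - a minimal sketch against the `snappy` example above, where `id` is the `unique_key` and the other columns are the check columns:

```sql
-- Sketch: each (unique_key, check columns) combination should normally appear once;
-- repeated identical versions suggest a type mismatch re-inserting the "same" row.
select id, first_name, updated_at, count(*) as versions
from development_jyeo.dbt_jyeo.snappy
group by id, first_name, updated_at
having count(*) > 1;
```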
