-
Notifications
You must be signed in to change notification settings - Fork 11
Open
Labels
bugSomething isn't workingSomething isn't working
Description
An edge case has come up where two things conspired to produce duplicates in the model:
- Some users' data have different session_ids at the same time (we think because of a race condition between tabs)
- Latency between device and collector results in exactly the same derived_tstamp for these events
These two factors mean that a user has two sessions with exactly the same start_tstamp, but different domain_sessionid's - and these happen to be the first tstamps for the user.
This produces duplicates in the users table when we join on start_tstamp:
data-models/web/v1/redshift/sql-runner/sql/standard/04-users/01-main/06-users.sql
Line 86 in a38e76b
AND a.start_tstamp = b.start_tstamp |
The same issue may exist when we join on end_tstamp for aggregates.
This seems very rare, but we should introduce some means of breaking a tie in the case where derived_tstamps happen to evaluate to exactly the same thing.
event 1:
"collector_tstamp": "2021-06-25 16:51:55.559 UTC",
"dvce_sent_tstamp": "2021-06-25 16:51:54.919 UTC",
"dvce_created_tstamp": "2021-06-25 16:51:54.911 UTC",
"derived_tstamp": "2021-06-25 16:51:55.551 UTC",
event 2:
"collector_tstamp": "2021-06-25 16:51:55.557 UTC",
"dvce_sent_tstamp": "2021-06-25 16:51:55.077 UTC",
"dvce_created_tstamp": "2021-06-25 16:51:55.071 UTC",
"derived_tstamp": "2021-06-25 16:51:55.551 UTC",
(first reported on ZD ticket 27522)
Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't working
Type
Projects
Milestone
Relationships
Development
Select code repository
Activity
awoehrl commentedon Jul 9, 2021
Hi,
just to add another data point:
I just ran the model on 100 days of our data and this happened with two user_ids. Both have two session ids, but exactly the same start_time.
Cheers
bill-warner commentedon Sep 7, 2021
Raised the same issue on the dataform model. I have included a proposed solution there which I think we should also be able to use on the SQL runner version.
snowplow-archive/dataform-data-models#10