Skip to content

Break Ties when joining on timestamps #106

@colmsnowplow

Description

@colmsnowplow
Collaborator

An edge case has come up where two things conspired to produce duplicates in the model:

  1. Some users' data have different session_ids at the same time (we think because of a race condition between tabs)
  2. Latency between device and collector results in exactly the same derived_tstamp for these events

These two factors mean that a user has two sessions with exactly the same start_tstamp, but different domain_sessionid's - and these happen to be the first tstamps for the user.

This produces duplicates in the users table when we join on start_tstamp:

The same issue may exist when we join on end_tstamp for aggregates.

This seems very rare, but we should introduce some means of breaking a tie in the case where derived_tstamps happen to evaluate to exactly the same thing.

event 1:
    "collector_tstamp": "2021-06-25 16:51:55.559 UTC",
    "dvce_sent_tstamp": "2021-06-25 16:51:54.919 UTC",
    "dvce_created_tstamp": "2021-06-25 16:51:54.911 UTC",
    "derived_tstamp": "2021-06-25 16:51:55.551 UTC",

event 2:
    "collector_tstamp": "2021-06-25 16:51:55.557 UTC",
    "dvce_sent_tstamp": "2021-06-25 16:51:55.077 UTC",
    "dvce_created_tstamp": "2021-06-25 16:51:55.071 UTC",
    "derived_tstamp": "2021-06-25 16:51:55.551 UTC",

(first reported on ZD ticket 27522)

Activity

awoehrl

awoehrl commented on Jul 9, 2021

@awoehrl

Hi,

just to add another data point:
I just ran the model on 100 days of our data and this happened with two user_ids. Both have two session ids, but exactly the same start_time.

Cheers

bill-warner

bill-warner commented on Sep 7, 2021

@bill-warner
Contributor

Raised the same issue on the dataform model. I have included a proposed solution there which I think we should also be able to use on the SQL runner version.

snowplow-archive/dataform-data-models#10

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @colmsnowplow@awoehrl@bill-warner

        Issue actions

          Break Ties when joining on timestamps · Issue #106 · snowplow/data-models