Incremental replication doesn't respect the current state #196

Open
emishas opened this issue Apr 27, 2023 · 1 comment
emishas commented Apr 27, 2023

The tap doesn't respect the existing replication state: it doesn't filter out records older than the replication key value stored in the state.

How to reproduce

GitHub tap configuration

  - name: tap-github-repos
    inherit_from: tap-github
    pip_url: git+https://github.com/MeltanoLabs/tap-github.git
    config:
      user_agent: ''
      start_date: '2023-01-01T00:00:00Z'
      searches:
      - name: All repos
        query: apache/*
    variant: meltanolabs
    select:
    - repositories.*
    metadata:
      repositories:
        replication-method: INCREMENTAL

Run a sync that produces 1000 (limit for the 'repositories' stream) records and a state record.

meltano run tap-github-repos target-jsonl

Run the same sync one more time

meltano run tap-github-repos target-jsonl

As a result, the target JSON file contains 2000 records, with every record duplicated in full.

The issue can be reproduced on the repositories stream.
I couldn't reproduce this on the issues stream.
I haven't tested other streams.

If the GitHub API does not allow fetching data from a specific replication point (at least for the repositories stream), then the tap should filter out those records itself instead of sending them down the pipeline.

@laurentS
Contributor

Hi @emishas, thanks for reporting this. I'm not entirely sure how Meltano calls the tap with the above config, as I don't use Meltano at all, but my initial hunch is that the repositories stream is not really incremental. The stream returns at most a single record per repo (if the repo name does not exist, the tap skips it and returns nothing).

I'm not entirely sure that updated_at, which is set as the replication key for the stream, is actually updated for all changes of the record.

Reading the docs:

Using the state, the tap should then skip returning rows where the replication key comes strictly before than previous maximal replication key value stored in the state.

As you only have a single record for each repo, the newer record value for updated_at cannot be strictly before itself. So I think it makes sense that all records are returned by the tap.
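To make the "strictly before" rule concrete, here is a rough sketch of that kind of filter. It is not the tap's or the SDK's actual code: `parse_ts`, `filter_by_bookmark`, and the sample records are made up for illustration, and it assumes the bookmark is stored as an ISO 8601 UTC timestamp like the `updated_at` values.

```python
from datetime import datetime, timezone


def parse_ts(value: str) -> datetime:
    """Parse a timestamp such as '2023-04-27T10:00:00Z' (assumed format)."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def filter_by_bookmark(records, bookmark, key="updated_at"):
    """Hypothetical helper: drop records strictly before the bookmark.

    A record whose replication key equals the bookmark is kept, which is
    why an unchanged repo is emitted again on the next run.
    """
    if bookmark is None:
        # No state yet: everything is new.
        return list(records)
    cutoff = parse_ts(bookmark)
    return [r for r in records if parse_ts(r[key]) >= cutoff]


# Illustrative data only, not real API output.
records = [
    {"name": "repo-a", "updated_at": "2023-04-01T00:00:00Z"},
    {"name": "repo-b", "updated_at": "2023-04-27T10:00:00Z"},
]

# Bookmark equal to repo-b's own updated_at, as it would be after a first run.
kept = filter_by_bookmark(records, "2023-04-27T10:00:00Z")
print([r["name"] for r in kept])
```

With the bookmark equal to the record's own `updated_at`, that record passes the filter, so re-emitting it is consistent with the rule rather than a violation of it.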

Then my guess is that in your case the deduplication logic is incorrect for the stream, which would explain why all records end up being duplicated. There is no primary_key nor state_partitioning_keys on the stream, which might explain this. Can you share what (a couple of) your records look like, and what the state file content is? It might help understand what is happening.
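For illustration, this is roughly what key-based deduplication looks like on the receiving side. It is a sketch under assumptions: `upsert` is a hypothetical helper, and `id` is used as the key only as an example, since the stream currently declares no primary key.

```python
def upsert(store: dict, records, primary_keys):
    """Hypothetical helper: keep one record per primary-key value.

    A later record with the same key replaces the earlier one instead of
    being appended next to it.
    """
    for record in records:
        key = tuple(record[k] for k in primary_keys)
        store[key] = record
    return store


# Illustrative data only: the same record arriving in two consecutive runs.
run_output = [{"id": 1, "name": "repo-a", "updated_at": "2023-04-27T10:00:00Z"}]

store = {}
upsert(store, run_output, ["id"])  # first sync
upsert(store, run_output, ["id"])  # second sync re-emits the same record
print(len(store))
```

Without a key to match on, a loader can only append, which would be consistent with the doubled record count reported above.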
