The tap does not respect existing replication state: it re-emits records older than the replication key value stored in the state instead of filtering them out.

How to reproduce (tap configured in Meltano):

1. Run a sync that produces 1000 records (the limit for the `repositories` stream) and a state record: `meltano run tap-github-repos target-jsonl`
2. Run the same sync one more time: `meltano run tap-github-repos target-jsonl`

Result: the target JSONL file contains 2000 records, and every record is fully duplicated.

Notes:

- The issue can be reproduced on the `repositories` stream.
- I couldn't reproduce it on the `issues` stream.
- I haven't tested other streams.

If the GitHub API does not allow fetching data from a specific replication point (at least for the `repositories` stream), then the tap should filter those records itself instead of sending them down the pipeline.
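The filtering the report expects can be sketched as a small post-processing step. This is a hypothetical helper, not the tap's actual code: it drops any record whose replication key is not strictly newer than the bookmark stored in state, so a repeated sync with an unchanged bookmark emits nothing.

```python
from datetime import datetime
from typing import Iterable, Iterator, Optional


def _parse(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp such as GitHub's `updated_at`."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def filter_new_records(
    records: Iterable[dict],
    bookmark: Optional[str],
    replication_key: str = "updated_at",
) -> Iterator[dict]:
    """Yield only records strictly newer than the previous sync's bookmark.

    On the first sync there is no bookmark, so every record passes through.
    """
    if bookmark is None:
        yield from records
        return
    cutoff = _parse(bookmark)
    for record in records:
        if _parse(record[replication_key]) > cutoff:
            yield record
```

With this rule, a second sync whose records all carry the same `updated_at` as the stored bookmark would yield zero records instead of 1000 duplicates.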
Hi @emishas, thanks for reporting this. I'm not entirely sure how Meltano calls the tap from the above config, as I don't use Meltano at all, but my initial hunch is that the repositories stream is not really incremental. The stream returns at most a single record per repo (if the repo name does not exist, the tap skips it and returns nothing).
I'm also not entirely sure that `updated_at`, which is set as the replication key for the stream, is actually updated for every change to the record.
Using the state, the tap should then skip rows whose replication key comes strictly before the maximum replication key value stored in the state.
Since you only have a single record per repo, the record's `updated_at` value cannot be strictly before itself. So I think it makes sense that all records are returned by the tap.
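That comparison rule can be illustrated with a tiny sketch (the function name is illustrative, not the SDK's actual API): a repo whose `updated_at` equals its own stored bookmark is not strictly before it, so it passes the check and is emitted again on every sync.

```python
from datetime import datetime

# GitHub-style ISO-8601 timestamp format, e.g. "2023-05-01T12:00:00Z".
FMT = "%Y-%m-%dT%H:%M:%SZ"


def keep_record(updated_at: str, bookmark: str) -> bool:
    """Drop only records strictly before the bookmark (inclusive lower bound)."""
    return datetime.strptime(updated_at, FMT) >= datetime.strptime(bookmark, FMT)
```

With a single record per repo, the bookmark after the first sync equals that record's own `updated_at`, so `keep_record` returns True on the second sync and the record is re-emitted.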
Then my guess is that in your case the deduplication logic for the stream is incorrect, which would explain why all records end up duplicated. There is no `primary_key` nor `state_partitioning_keys` on the stream, which might explain this. Can you share what a couple of your records look like, and the contents of your state file? That would help in understanding what is happening.