Skip to content

source-mongodb: tie-break with _id in batch incremental mode #3723

@danielnelson

Description

@danielnelson

Current Behavior

MongoDB timeseries collections do not support change streams, so they are processed using the batch incremental capture mode. They also do not typically have an index on the _id field, though one can be manually created.

Our connector has a cursorField option that is a string which is used to sort the results and limit them to the latest values.

This works okay if the values are strictly increasing on this cursorField, such as when using _id. But if they are not then values can be skipped anytime the cursor is recreated. This is because the recreated cursor selects values greater than the last seen value, so it won't capture any with the same value. The cursor is recreated if the connector is restarted and also for each batch incremental capture, in the case of a full backfill it can restart multiple times due to inferred schema updates.

While our documentation does say that the cursorField must be on a strictly increasing value, it seems that this is not the case for many databases. When timeseries collections do not meet this requirement the only option is to accept missing data or add an additional index on _id, which would hurt write performance.

Desired Behavior

This could be fixed if we had a tie-breaker field. The most unique one would be _id, which I believe is always set except in a rare case in MongoDB 3 and older. Although it is not indexed automatically in the case of timeseries collections, I think it could still be used as a secondary sort term.

This would decrease the performance of batch incremental mode on timeseries collections, at least somewhat. As I understand it, the performance hit would depend on how many values have an exactly equal cursorField. Other collection types typically have an index on _id since this is automatic on collection creation, or are already being sorted by this field, so performance impact should be minimal or none.

The query would change from something like:

db.mytime.find({ts: {$gt: ISODate("2026-01-06T01:17:20.633Z")}}).sort({ts: 1})

To:

db.mytime.find(
    {
        $or: [
            {ts: {$gt: ISODate("2026-01-06T01:17:20.633Z")}},
            {ts: ISODate("2026-01-06T01:17:20.633Z"), _id: {$gt: ObjectId('695c62ec03a4e4c239e9496c')}},
        ]
    }
).sort({ts: 1, _id: 1})

Another option, which could be implemented independently, is to allow multiple cursorFields and sort and limit on them both. This could add additional uniqueness so there aren't so many ties. For timeseries collections it would make sense to compare on timeField, then metaField, and then finally on _id. We could combine this with using _id so that this field is very rarely needed to break ties.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions