[source-mongodb] Push binding-based $match immediately after $changeStream to reduce oplog load #3745

@prashar32

Description

Affected source: source-mongodb

The MongoDB source connector currently builds a change stream pipeline where:

  • $project is always applied
  • $match on collections is optional and only enabled when exclusiveCollectionFilter is set

As a result, MongoDB:

  • Reads all change events from the oplog
  • Serializes and sends events that the connector later discards
  • Performs unnecessary CPU, memory, and network work

Even when downstream bindings are scoped to specific collections, the server still processes:

  • Unrelated collections
  • Unused change events

Insert a $match stage immediately after $changeStream, derived from the known bindings (db + collection), and apply it by default, unless explicitly disabled.

Conceptually, the pipeline always restricts events to the bound collections unless that is explicitly disabled:

if !allowAllCollections {
    $match on ns.db + ns.coll
}

Pipeline order:

  1. $changeStream
  2. $match <- NEW (early)
  3. $project
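
The ordering above could be sketched in Go roughly as follows. This is an illustration only, not the connector's actual code: `Binding`, `Stage`, and `buildPipeline` are hypothetical names, stages are modeled as plain maps instead of driver BSON types, and `allowAllCollections` is the opt-out flag from the pseudocode above. In the official drivers the `$changeStream` stage is implicit, so the stages built here are the ones that would follow it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Binding is a hypothetical stand-in for the connector's binding metadata.
type Binding struct {
	Database   string
	Collection string
}

// Stage models one aggregation stage as a plain map (assumption: a real
// implementation would use the driver's BSON types instead).
type Stage map[string]any

// buildPipeline returns the stages that follow the implicit $changeStream.
// Unless allowAllCollections is set, a $match on ns.db + ns.coll comes first,
// so the server filters events before $project runs.
func buildPipeline(bindings []Binding, allowAllCollections bool, project map[string]any) []Stage {
	var pipeline []Stage

	// An empty $or is rejected by MongoDB, so skip the stage when there are
	// no bindings to match on.
	if !allowAllCollections && len(bindings) > 0 {
		clauses := make([]any, 0, len(bindings))
		for _, b := range bindings {
			clauses = append(clauses, map[string]any{
				"ns.db":   b.Database,
				"ns.coll": b.Collection,
			})
		}
		pipeline = append(pipeline, Stage{"$match": map[string]any{"$or": clauses}})
	}

	pipeline = append(pipeline, Stage{"$project": project})
	return pipeline
}

func main() {
	bindings := []Binding{
		{Database: "shop", Collection: "orders"},
		{Database: "shop", Collection: "customers"},
	}
	// The projected fields here are placeholders, not the connector's real projection.
	pipeline := buildPipeline(bindings, false, map[string]any{"fullDocument": 1, "ns": 1})

	out, err := json.MarshalIndent(pipeline, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

One point to check with a filter of this shape: events that carry no `ns.coll` field (for example `dropDatabase`, whose `ns` contains only `db`) would not match, so whether the connector relies on any such events should be confirmed before enabling the filter by default.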

Why this is safe

  1. Does not change connector semantics
  2. Only removes events that would be ignored downstream anyway
  3. Uses existing binding metadata already known to the connector
  4. Matches CDC best practices (server-side filtering first)

Benefits

  1. Reduced oplog scanning work
  2. Lower MongoDB CPU and memory usage
  3. Reduced network traffic
  4. Lower connector-side processing cost
