@jwhartley jwhartley commented Mar 17, 2025

Made materialization error more specific: key needs to be required and non-nullable, or have a default value


willdonnelly and others added 30 commits February 27, 2025 13:54
…erations

Adds information to the error and logging for failed table operations to assist
in their resolution.
Contacts contain multiple sub-resources, like tags. There can be a large
number of sub-resources attached to one contact, but only a subset are
included when fetching a contact from Intercom's API. For example, only
the first 10 tags are included in a contact. To get all sub-resources,
the connector would need to make an additional request to fetch them
then hydrate the contact with the full set of sub-resources.

This commit adds this hydration functionality for tags within a contact.
There are other possible sub-resources that we could hydrate (notes,
companies, opted_out_subscription_types, etc.), but users are only
asking about tags right now. So I'm only adding functionality for tags.
Adding support for hydrating other sub-resources should be fairly
straightforward if we need to do so later.
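A minimal Python sketch of the hydration step described above. The field names follow Intercom's addressable-list shape as I understand it, and `fetch_all_tags` is a hypothetical stand-in for the follow-up API request:

```python
def hydrate_contact_tags(contact: dict, fetch_all_tags) -> dict:
    """If the embedded tag list is truncated (Intercom embeds only the
    first 10 tags on a contact), replace it with the full set fetched
    via a follow-up request. `fetch_all_tags` stands in for that call."""
    tags = contact.get("tags", {})
    if tags.get("total_count", 0) > len(tags.get("data", [])):
        tags["data"] = fetch_all_tags(contact["id"])
    return contact
```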
This reverts commit e57acbc
and un-fixes estuary#2383

Turns out the particular function we're using here doesn't
work on a standby replica, which is...bad.
Modify GraphQL and models to be more flexible:
    - Update USERS GraphQL query to support pagination
    - Remove strict field definitions in models
    - Simplify model structures for Tag, Team, User, and Item
    - Adjust type hints and remove unnecessary validators
Extends the SSH client config with two more key-exchange algorithms
which golang.org/x/crypto/ssh supports but doesn't include in its
default list. This means we support all currently-supported ones,
but it's been written like this so that if the package itself adds
new ones in the future we don't have to explicitly update our code.
Makes it so a setting like `do_foo, do_bar` parses the same as
`do_foo,do_bar`. Without this fix it instead parses as a flag
setting `" do_bar"` which is clearly very distinct from the
`"do_bar"` flag.
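The intended parsing can be sketched like this (the connectors themselves are Go; this is an illustrative Python equivalent, not the actual implementation):

```python
def parse_flags(raw: str) -> list[str]:
    """Split a comma-separated feature-flag string, trimming whitespace so
    `do_foo, do_bar` yields the same flags as `do_foo,do_bar`. Previously
    the second entry would parse as the distinct flag " do_bar"."""
    return [part.strip() for part in raw.split(",") if part.strip()]
```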
Currently source-postgres captures array columns as a JSON object with
dimensions and elements properties, each of which is an array. This is
because PostgreSQL arrays are inherently multidimensional, and we're
trying to preserve them as faithfully as possible.

But it turns out that nobody ever wants that, they just want simple,
boring one-dimensional arrays of values to translate into a JSON
array of equivalent values. So we should do that by default going
forward.
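A sketch of the translation for the common one-dimensional case (Python for illustration; the connector is Go, and the field names mirror the legacy capture shape described above):

```python
def flatten_array(value: dict) -> list:
    """Convert the legacy {"dimensions": [...], "elements": [...]}
    representation of a PostgreSQL array column into a plain JSON array.
    For a one-dimensional array the elements list is already flat."""
    return list(value["elements"])
```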
The main purpose of this flag is so that when doing a bulk flag
flip in production, one can go and re-pull just the tasks that still
need to be modified.

And my reasoning is that if I'm going and setting a flag like
`no_foobar` in prod, there are three cases:
 - A task has no feature flag setting for `foobar`. These are
   the ones we want to modify.
 - A task has setting `no_foobar`. These don't need to be touched
   since they already have the value we want.
 - A task has setting `foobar`. These _really_ shouldn't be
   touched since somebody has explicitly indicated that they
   want the opposite of what we're doing.

Previously the 'missing flag' logic was written such that one
might go `--missing=no_foobar` to get all tasks without the
specific flag setting `no_foobar` in their feature flags. And
since that's still the typical desire and it's weird to have
to write `--missing=foobar` and then `--set_flag=no_foobar`
for different commands and one could easily screw that up,
the logic also normalizes so `--missing=no_foobar` and
`--missing=foobar` do the same thing.
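The normalization amounts to matching on the flag's base name, ignoring the `no_` prefix. A hypothetical Python sketch (the actual scripts are shell/Go tooling):

```python
def base_name(flag: str) -> str:
    """`no_foobar` and `foobar` share the base name `foobar`."""
    return flag[3:] if flag.startswith("no_") else flag

def is_missing(task_flags: list[str], query: str) -> bool:
    """True if the task has no setting for the flag in either polarity,
    so `--missing=no_foobar` and `--missing=foobar` select the same tasks."""
    return base_name(query) not in {base_name(f) for f in task_flags}
```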
Another feature-flagged default change. This one will make new
captures going forward discover array columns as just the JSON
array of flattened element values.

As with other changes of this kind, this will only be merged
after I have set the corresponding `no_<flag>` on all tasks in
production to preserve current behavior.
…ics for custom reports

Users have quickly run into errors when creating custom reports. We
weren't validating the dimensions or metrics specified in custom reports,
so the connector was failing after it was published. This commit adds
validation of all dimensions and metrics specified in any custom reports.
The Google API has a handy `/metadata` endpoint that returns all valid
dimensions & metrics, so that list is used when validating the user's
custom report input.

This validation does NOT extend to any filters included in custom
reports. We need to have a better model for the various types of valid
filters before we even think about validating their contents, and
developing that model would take a lot more time than I think it's worth
investing right now.
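The shape of the validation, roughly (a simplified sketch; the real connector's report model and error wording will differ):

```python
def validate_custom_report(report: dict, valid_dims: set, valid_metrics: set) -> list:
    """Check a custom report's dimensions & metrics against the valid sets
    fetched from the API's /metadata endpoint, returning errors so the
    capture can fail at publish-time validation rather than afterward."""
    errors = [f"invalid dimension {d!r}" for d in report.get("dimensions", [])
              if d not in valid_dims]
    errors += [f"invalid metric {m!r}" for m in report.get("metrics", [])
               if m not in valid_metrics]
    return errors
```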
…data

Turns out that sometimes those other fields in the `MetadataResponse`
aren't present, and the connector fails validation when they're missing.
This commit strips `MetadataResponse` down to only the fields we need.
This adds a new script which is a lightweight wrapper around
the `list-tasks`, `bulk-config-editor`, and `bulk-publish`
scripts to make it even easier to set a `no_foobar` flag on
all tasks using a particular set of connectors. Example:

    $ ./feature-flag-automation.sh no_keyless_row_id source-postgres-batch source-mysql-batch source-redshift-batch source-bigquery-batch
    Setting flag 'no_keyless_row_id' on connectors: source-postgres-batch source-mysql-batch source-redshift-batch source-bigquery-batch
    Working directory: specs_no_keyless_row_id_20250304_114701
    Plan:

        ./list-tasks.sh --connector=source-postgres-batch --pull --missing=no_keyless_row_id --dir=specs_no_keyless_row_id_20250304_114701
        ./list-tasks.sh --connector=source-mysql-batch --pull --missing=no_keyless_row_id --dir=specs_no_keyless_row_id_20250304_114701
        ./list-tasks.sh --connector=source-redshift-batch --pull --missing=no_keyless_row_id --dir=specs_no_keyless_row_id_20250304_114701
        ./list-tasks.sh --connector=source-bigquery-batch --pull --missing=no_keyless_row_id --dir=specs_no_keyless_row_id_20250304_114701
        ./bulk-config-editor.sh --set_flag=no_keyless_row_id --dir=specs_no_keyless_row_id_20250304_114701
        ./bulk-publish.sh --mark --dir=specs_no_keyless_row_id_20250304_114701

    Proceed? (y/N)

This script is even more specialized for the specific case of the
thing I've had to do like half a dozen times now, but it makes
that operation very easy indeed and reduces the chance of messing
things up due to a typo or copy-paste error in one command.
This adds a connector startup check to verify that the resume
LSN is less than or equal to the current server WAL LSN. This
should be the case 100% of the time except when we suddenly get
pointed at a different database entirely, or if the WAL gets
deleted and restarts from zero as happens in a DB version upgrade.

This check is in warn-only mode for now, and I tweaked the wording
on the other "is this resume LSN consistent with the current state
of the server" check we do so they'll both include the substring
`"resume cursor mismatch"`. My plan is to check back in a week to
see if either warning triggered when it shouldn't have, and if
not we'll (finally) promote these checks to hard failures.
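The comparison itself is simple once the LSNs are parsed. A Python sketch of the check (the connector is Go; PostgreSQL LSNs are two hex words joined by a slash):

```python
def parse_lsn(lsn: str) -> int:
    """Parse a PostgreSQL LSN like '16/B374D848' (hex high/low words)
    into a single 64-bit integer for comparison."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def resume_lsn_ok(resume: str, current_wal: str) -> bool:
    """The resume LSN should never be ahead of the server's WAL position."""
    return parse_lsn(resume) <= parse_lsn(current_wal)
```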
Eases the constraints on key locations for sql materializations in order to
allow materializations to roll up based on a subset of keyed locations. This is
used in the ops catalog in order to roll up events by catalog name (ignoring
shard key/rclock ranges) when materializing events.
I needed to get some better logs in order to troubleshoot an issue locally, and
it seemed useful enough to commit.
willdonnelly and others added 29 commits April 2, 2025 09:25
As described in the flag comment, when `tolerate_missed_changes`
is true the connector will tolerate missed changes in the CDC
stream and will not trigger an automatic re-backfill if changes
go missing. This may be useful if the CDC event data starts to
expire before it can be captured, but should generally only be
needed in exceptional circumstances when recovering from some
sort of major breakage.
This commit adds two hidden features for internal use:
- For all three captures, setting the "Skip Backfills" advanced
  option to `*.*` will cause all tables to immediately transition
  to active without any backfilling. This lets us avoid needing
  to modify the capture mode for a whole bunch of bindings.
- For all three captures, it is possible to put a setting like
  `initial_backfill_cursor=12345` into the feature flags list.
  If this is done, the value `12345` will be used as the initial
  cursor value when there is no resume cursor, instead of the
  normal behavior of taking the current end of the WAL.

As a reminder, this is what cursor values look like for each DB:

    // MySQL
    binlog.12345:7890

    // PostgreSQL
    123/45678901

    // SQL Server
    ALUrCQADl2AAGA==
Include the collection name in this error from Validate to help with finding a
resolution.
S3 table buckets now support an Iceberg REST API, and we can use that to
materialize to Iceberg tables in S3 table buckets.

To do this we need to be able to specify the "signing name" for SigV4
authentication. For most users who use Glue as a catalog this will default to
'glue', but it needs to be set to 's3tables' to use an S3 table bucket.

Another peculiar quirk of S3 tables is that when dropping a table, you must
request a purge of the table. So there's a bit of implementation-specific code
creeping into our REST catalog client to accommodate this, but at least it's only
for SigV4 authentication.

An end-to-end test configuration is included here, but it doesn't work fully. It
will create tables and run transactions against them just fine, but the iceberg
helper program that reads the table results will need to be updated to read from S3
tables. I think that will be possible once we are able to update to use the
latest version of duckdb, so I will try to do that later.
Bumps [github.com/golang-jwt/jwt/v5](https://github.com/golang-jwt/jwt) from 5.2.1 to 5.2.2.
- [Release notes](https://github.com/golang-jwt/jwt/releases)
- [Changelog](https://github.com/golang-jwt/jwt/blob/main/VERSION_HISTORY.md)
- [Commits](golang-jwt/jwt@v5.2.1...v5.2.2)

---
updated-dependencies:
- dependency-name: github.com/golang-jwt/jwt/v5
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
This is like the 'initial_backfill_cursor=XYZ' flag hackery, but
it doesn't even require that you perform a backfill to trigger it.
This should generally be set, and then you watch it take effect,
and then you immediately publish another change to unset it again.
This commit introduces the Chargebee native connector, including the necessary resource schemas, API integration, and configuration files.

This foundational setup enables further development and testing of the Chargebee integration. There are multiple TODO items to revisit after this initial release.
Data from Mixpanel sometimes inserts `�` bytes in response data. This
has happened before in the `insert_id` property name (75c33a2), and
evidently it also happens within records.
`response.iter_lines(decode_unicode=True)` interprets `�` as a line
break, and the connector tries to yield separate parts of a record.

Airbyte must have encountered this & attempted to address this with the
`parts`-related logic in `iter_dicts`. But they didn't anticipate a single
part being valid JSON. For example, if the record has a property like:
```
{
    "description": "Sing
99
Luftballons."
}
```

Then the `99` part will be seen as valid JSON, the connector will try to
transform & yield it, then fail with a `TypeError`:
```
TypeError: 'int' object is not subscriptable
```

I've improved the handling so `iter_dicts` only yields actual, dictionary
records that contain the two properties that should always be present in
every `export` record.
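The improved filtering looks roughly like this. The commit doesn't name the two required properties; `event` and `properties` below are my assumption for illustration:

```python
import json

# Assumed always-present keys in every Mixpanel export record (hypothetical
# stand-ins; the commit doesn't specify which two properties it checks).
REQUIRED_KEYS = {"event", "properties"}

def iter_dicts(parts):
    """Yield only complete dict records; skip fragments (like a bare `99`
    split off by a stray line break) that happen to parse as valid JSON
    but aren't full records."""
    for part in parts:
        try:
            obj = json.loads(part)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys():
            yield obj
```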
…failures

The apply action that fails will cause a context cancellation, and all other
actions will then fail due to that context being cancelled. This causes a bunch
of log spam and makes it very difficult to see which action it was that failed
for something other than a context cancellation.

This change will make it so only the erroring action will log out its action.
The `http.DefaultClient` uses chunked transfer encoding when the
Content-Length header is not set & the request body doesn't have a known
length (ex: an `*io.PipeReader`). Some webhook destinations, like Azure
Logic Apps, don't work well with chunked transfer encoding. This commit
reverts the connector's behavior to not stream the request body as
chunks & instead send the request all at once. Request bodies are maxed
at ~1 MiB before they are sent.
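The buffering approach, sketched in Python (the connector is Go; `post` is a hypothetical stand-in for the HTTP client's post function):

```python
import io

MAX_BODY_BYTES = 1 << 20  # ~1 MiB cap, matching the connector's batch limit

def send_all_at_once(post, url, body_stream):
    """Buffer the whole request body before sending, so the HTTP client
    can set an explicit Content-Length instead of falling back to
    chunked transfer encoding."""
    body = body_stream.read()
    if len(body) > MAX_BODY_BYTES:
        raise ValueError("request body exceeds the ~1 MiB limit")
    return post(url, body)  # a bytes body gets an explicit Content-Length
```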
…hargebeeResource

Modified the date filtering logic in the _fetch_resource_data function to utilize the cursor field specific to IncrementalChargebeeResource.
I've seen a handful of imported connectors hit the 6 hour session
timeout limit recently. Reducing the timeout to 2 hours should help
reduce the impact when this happens. I'd like to reduce it even further,
but some imported connectors (at least source-mixpanel-native and its
export stream) do take over an hour to complete a single request when it's
streaming in a lot of records.
This information was already available from the MongoDB timestamps, but this
makes it easier to see at a glance how far behind a given change stream is in
processing events vs. the latest cluster operation time.
…database change streams

Sometimes a MongoDB collection is not being captured, but is part of a database
that is being captured. Normally this isn't a problem if the number and/or size
of change events for that MongoDB collection are small, but it can be
problematic otherwise & actually bog down the change stream processing.

This adds an advanced configuration option for these cases to allow MongoDB
collections to be excluded from the database change stream by way of a
pipeline filter.
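A pipeline filter of this kind can be built as a `$match` stage against the change event's `ns.coll` field, something like (a sketch; the connector is Go, and its actual pipeline may differ):

```python
def exclusion_pipeline(excluded_collections: list) -> list:
    """Build a change-stream aggregation pipeline that drops events for
    the named collections while keeping the rest of the database stream."""
    return [{"$match": {"ns.coll": {"$nin": excluded_collections}}}]
```

With pymongo this would be passed as the first argument to `db.watch(...)`.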
This commit implements a new fencing mechanism for SQL Server
which should work against a read replica. The new implementation
is used automatically when the target DB is detected as being a
read-only standby, or it can be explicitly forced on using the
`replica_fencing` feature flag.

This mechanism has a minor downside, which is that it doesn't
(and can't, in every possible scenario) guarantee that when a
row is modified concurrently with a backfill, the backfilled
row state and the concurrent changes from replication will be
captured in the same Flow transaction.

Other than this caveat the capture output is still entirely
correct, so any users with standard reduction behavior into
a materialization wouldn't even be able to notice a difference,
but it's just a _tiny_ bit weaker of a correctness guarantee
compared to what our SQL CDC connectors usually offer.

But I have proven to my own satisfaction that it is literally
impossible to make the more precise guarantee in all situations
when looking solely at the available data on a read replica, so
if we want to support read replica CDC for SQL Server at all
this is the best we're going to be able to do. And we do want
to support that.
In tests we're actually much more demanding than production uses
are of our capture's ability to catch up to the end of the WAL in
a single streaming cycle, because we have test-only logic which
tells the connector to shut down as soon as it's caught up.

But since the replica fence mechanism can't directly observe the
end of the WAL (since, again, there is no way for us to actually
get that information which works on a read replica) we have to
somehow ensure that we check _after_ long enough has passed for
the CDC agent to run and copy the data into the CDC tables.

The simplest solution is to just add a short delay before running
each test capture. This is actually pretty well motivated in general
in the context of replica captures -- if we're testing against
a read replica there is a nonzero amount of latency between when
we make a change on the master and when it shows up for us to see
on the replica, so we need to wait a bit after making changes and
before expecting a capture to see them.

Currently this delay is only added for SQL Server and only when
the `replica_fencing` flag is explicitly enabled in a test. This
seemed reasonable since `replica_fencing` isn't the default today
for most production captures, and we can only really test one or
the other mechanism in CI builds, so manually testing the feature
currently works by setting that flag.
I was fiddling around in our dev Intercom account when I was trying to replicate an issue, and I ended up modifying some of the data. This commit updates the capture snapshot to reflect the conversation parts I added to our dev Intercom account.
`conversation_parts` works by fetching parts of updated conversations,
then only yielding the parts that have been updated since the last
log cursor. This was done to avoid yielding conversation parts that were
previously yielded. However, this strategy has potential to miss data if
another part is added to a conversation that's beyond the date window
the connector is currently checking & the conversation is updated again
before the connector checks the next date window.

For example, say there's a conversation `123` that had a part added to
it at `1742999957`. The connector is checking the window of
`1742999856` - `1742999956`. Before the connector finishes that date
window, conversation `123` is updated again and its `updated_at` is
pushed to a later time `1743000057`.

Next, the connector checks the date window `1742999956` - `1743000056`.
Conversation `123` is not included in this date window, so it's not
checked for parts updated after `1742999956`.

Next, the connector checks the date window `1743000056` - `1743000356`.
Conversation `123` is included, but the part that was added at
`1742999957` is not yielded because it was updated before the most
recent log cursor.

This commit greatly simplifies this logic for `conversation_parts` and
aligns its checkpointing strategy with `conversations`. If we detect an
updated conversation, we yield all of its parts no matter when they were
updated. This should ensure the connector captures all parts for updated
conversations.
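The simplified logic boils down to this (a sketch with a hypothetical `fetch_parts` helper standing in for the parts API request):

```python
def yield_updated_parts(updated_conversations, fetch_parts):
    """Yield every part of every updated conversation, regardless of each
    part's own updated_at. Re-emitting previously seen parts is harmless
    downstream, while filtering by part timestamp risked missed data."""
    for convo in updated_conversations:
        yield from fetch_parts(convo["id"])
```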
If the `isReplicaDatabase()` function encounters an error, it will
now assume that this is not a replica instead of erroring out. The
main reason we're likely to see this is on databases which don't
even have a `sys.dm_hadr_database_replica_states` table, so erring
on the side of assuming it's the primary DB should always be fine.
As currently implemented, XMIN backfills have a problem.

The fundamental issue with XMIN backfills as they exist today is that we
literally just slap a `WHERE {xmin is within the specified range}` filter (or
filters) into the backfill queries without changing anything else about our
behavior. And the problem with this is that our backfill logic is built around
the assumption that we are getting result rows back.

This turns out to not be the case when doing an XMIN backfill of a large table
(the only kind worth using XMIN backfills on) whose primary keys are mostly
ordered (that is, most writes will be at the very end of the table).

What happens in this case is we tell the DB "hey can you give me any rows newer
than {min_xid} and stop after N" and the DB says "Sure, I can do that!" and
then it goes off and scans over the entire table and if we're lucky it
eventually finishes that, many hours later, and gives us back the first N new
rows, after which point we can start issuing much faster queries to read out
the rest of the new data after those ones.

But if anything interrupts that incremental scan, we have no recourse but to
start all over again from the beginning.

And this is clearly dumb, because our backfill queries proceed according to
primary key (or CTID) order, so if we _knew_ how far the scan had gotten we
could tell the DB to start from that point. But the only way we learn about
that progress is when we get back a result row, so that doesn't work right when
the result set is heavily filtered.

This commit removes the DB-side filtering in favor of doing that filtering on
the client side, and implements a bit of additional plumbing so that the
backfill implementation can keep track of a "resume from here" key separate
from any actual ChangeEvents it outputs.
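The core of the client-side approach, sketched in Python (the connector is Go, and real xmin comparison must handle 32-bit transaction-ID wraparound, which is ignored here):

```python
def xmin_backfill_chunk(rows, min_xid, max_xid):
    """Filter by xmin on the client instead of in SQL: every scanned row
    advances the resume key, even when it is filtered out, so an
    interrupted scan restarts from the last key seen rather than from
    the beginning of the table."""
    resume_key, emitted = None, []
    for key, xmin, row in rows:
        resume_key = key  # progress tracking is independent of the filter
        if min_xid <= xmin <= max_xid:
            emitted.append(row)
    return emitted, resume_key
```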
We have a ton of fancy logic in SQL Server for attempting to
encode text strings such that the encoded FDB tuples sort in a
lexicographic order which actually matches the source DB's
collation ordering of those strings.

And that mostly works! Except apparently it has a subtle flaw we
completely overlooked for like a year or two now. See, we treat
the encoded "sort key" as a one-way function of the input, so we
encode a string as a tuple (<tag>, <sort key>, <original string>)
and then throw away the sorting key and use the original string
as the key on decode. And this works, as long as you assume that
strings are necessarily either less than or greater than each
other. Which is mostly the case for primary keys!

But it's not the case if the string is a non-final element of a
composite primary key. In that case you could have two strings
which are considered equal under the source DB collation yet have
different representations. For instance "foo" and "FOO". And then
even though the encoded _sort keys_ are considered equal, the value
parts are unequal and that part comes before any _subsequent_ parts
of the composite primary key and screws up the overall key ordering.

And while it's possible to fix that, at this point I'm thinking that
we have invested way too much time and effort into SQL Server text
collation, and this is getting really dumb considering that MySQL
and PostgreSQL just treat all text collation ordering as unable to
be predicted and default to imprecise backfill semantics, and *we
have never had a single complaint about that*.

So let's just be consistent. SQL Server text primary keys will also
cause us to default to imprecise backfill order.
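The flaw can be demonstrated with a toy encoding, using `str.lower` as a stand-in for a case-insensitive collation's key function (illustrative only; the real FDB tuple encoding differs):

```python
def encode_text(s: str, collation_key=str.lower) -> tuple:
    """Encode as (sort key, original string): the sort key orders the
    tuple, and the original string is kept so decoding can recover it."""
    return (collation_key(s), s)

# Composite key (text, int). Under the collation "FOO" == "foo", so the
# second component should decide the order, putting 0 before 1...
a = (encode_text("FOO"), 1)
b = (encode_text("foo"), 0)
# ...but the tie falls through to the original strings, and "FOO" < "foo"
# in byte order, so the row with second component 1 sorts first.
```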
…alogs

This commit expands the connector's resource handling capabilities:

Resource Additions:
- Added new Pydantic models and bindings for:
  * Comments (with incremental sync support via created_at cursor)
  * Contacts
  * Gifts
  * Promotional Credits
  * Quote Line Groups (requires Performance/Enterprise plan)
  * Subscriptions with Scheduled Changes
  * Unbilled Charges

Configuration Improvements:
- Updated resource configurations to support both Product Catalog 1.0 and 2.0
- Enhanced validation logic for resource-specific configurations
- Added proper type hints and validation rules for all new schemas

API Integration Enhancements:
- Improved handling of associated resources with parent-child relationships
- Added support for filtering in associated resource queries
- Implemented proper cursor handling for incremental syncs

This enhancement provides broader coverage of Chargebee's API capabilities while
maintaining proper error handling through the API's native responses.
We only need this prerequisite check when in read-only non-replica
mode and it's incorrect to let that break a replica capture.