@jwhartley jwhartley commented Mar 17, 2025

Made materialization error more specific: key needs to be required and non-nullable, or have a default value


willdonnelly and others added 30 commits February 27, 2025 13:54
…erations

Adds information to the error and logging for failed table operations to assist
in their resolution.
Contacts contain multiple sub-resources, like tags. There can be a large
number of sub-resources attached to one contact, but only a subset are
included when fetching a contact from Intercom's API. For example, only
the first 10 tags are included in a contact. To get all sub-resources,
the connector would need to make an additional request to fetch them
then hydrate the contact with the full set of sub-resources.

This commit adds this hydration functionality for tags within a contact.
There are other possible sub-resources that we could hydrate (notes,
companies, opted_out_subscription_types, etc.), but users are only
asking about tags right now. So I'm only adding functionality for tags.
Adding support for hydrating other sub-resources should be fairly
straightforward if we need to do so later.
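A minimal Python sketch of the hydration step described above. The field names follow Intercom's addressable-list shape as I understand it, and `fetch_all_tags` is a hypothetical stand-in for the follow-up API request:

```python
def hydrate_contact_tags(contact: dict, fetch_all_tags) -> dict:
    """If the embedded tag list is truncated (Intercom embeds only the
    first 10 tags on a contact), replace it with the full set fetched
    via a follow-up request. `fetch_all_tags` stands in for that call."""
    tags = contact.get("tags", {})
    if tags.get("total_count", 0) > len(tags.get("data", [])):
        tags["data"] = fetch_all_tags(contact["id"])
    return contact
```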
This reverts commit e57acbc
and un-fixes estuary#2383

Turns out the particular function we're using here doesn't
work on a standby replica, which is...bad.
Modify GraphQL and models to be more flexible:
    - Update USERS GraphQL query to support pagination
    - Remove strict field definitions in models
    - Simplify model structures for Tag, Team, User, and Item
    - Adjust type hints and remove unnecessary validators
Extends the SSH client config with two more key-exchange algorithms
which golang.org/x/crypto/ssh supports but doesn't include in its
default list. This means we support all currently-supported ones,
but it's been written like this so that if the package itself adds
new ones in the future we don't have to explicitly update our code.
Makes it so a setting like `do_foo, do_bar` parses the same as
`do_foo,do_bar`. Without this fix it instead parses as a flag
setting `" do_bar"` which is clearly very distinct from the
`"do_bar"` flag.
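The intended parsing can be sketched like this (the connectors themselves are Go; this is an illustrative Python equivalent, not the actual implementation):

```python
def parse_flags(raw: str) -> list[str]:
    """Split a comma-separated feature-flag string, trimming whitespace so
    `do_foo, do_bar` yields the same flags as `do_foo,do_bar`. Previously
    the second entry would parse as the distinct flag " do_bar"."""
    return [part.strip() for part in raw.split(",") if part.strip()]
```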
Currently source-postgres captures array columns as a JSON object with
dimensions and elements properties, each of which is an array. This is
because PostgreSQL arrays are inherently multidimensional, and we're
trying to preserve them as faithfully as possible.

But it turns out that nobody ever wants that, they just want simple,
boring one-dimensional arrays of values to translate into a JSON
array of equivalent values. So we should do that by default going
forward.
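A sketch of the translation for the common one-dimensional case (Python for illustration; the connector is Go, and the field names mirror the legacy capture shape described above):

```python
def flatten_array(value: dict) -> list:
    """Convert the legacy {"dimensions": [...], "elements": [...]}
    representation of a PostgreSQL array column into a plain JSON array.
    For a one-dimensional array the elements list is already flat."""
    return list(value["elements"])
```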
The main purpose of this flag is so that when doing a bulk flag
flip in production, one can go and re-pull just the tasks that still
need to be modified.

And my reasoning is that if I'm going and setting a flag like
`no_foobar` in prod, there are three cases:
 - A task has no feature flag setting for `foobar`. These are
   the ones we want to modify.
 - A task has setting `no_foobar`. These don't need to be touched
   since they already have the value we want.
 - A task has setting `foobar`. These _really_ shouldn't be
   touched since somebody has explicitly indicated that they
   want the opposite of what we're doing.

Previously the 'missing flag' logic was written such that one
might go `--missing=no_foobar` to get all tasks without the
specific flag setting `no_foobar` in their feature flags. And
since that's still the typical desire and it's weird to have
to write `--missing=foobar` and then `--set_flag=no_foobar`
for different commands and one could easily screw that up,
the logic also normalizes so `--missing=no_foobar` and
`--missing=foobar` do the same thing.
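The normalization amounts to matching on the flag's base name, ignoring the `no_` prefix. A hypothetical Python sketch (the actual scripts are shell/Go tooling):

```python
def base_name(flag: str) -> str:
    """`no_foobar` and `foobar` share the base name `foobar`."""
    return flag[3:] if flag.startswith("no_") else flag

def is_missing(task_flags: list[str], query: str) -> bool:
    """True if the task has no setting for the flag in either polarity,
    so `--missing=no_foobar` and `--missing=foobar` select the same tasks."""
    return base_name(query) not in {base_name(f) for f in task_flags}
```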
Another feature-flagged default change. This one will make new
captures going forward discover array columns as just the JSON
array of flattened element values.

As with other changes of this kind, this will only be merged
after I have set the corresponding `no_<flag>` on all tasks in
production to preserve current behavior.
…ics for custom reports

Users have quickly run into errors when creating custom reports. We
weren't validating the dimensions or metrics specified in custom reports,
so the connector was failing after it was published. This commit adds
validation of all dimensions and metrics specified in any custom reports.
The Google API has a handy `/metadata` endpoint that returns all valid
dimensions & metrics, so that list is used when validating the user's
custom report input.

This validation does NOT extend to any filters included in custom
reports. We need to have a better model for the various types of valid
filters before we even think about validating their contents, and
developing that model would take a lot more time than I think it's worth
investing right now.
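The shape of the validation, roughly (a simplified sketch; the real connector's report model and error wording will differ):

```python
def validate_custom_report(report: dict, valid_dims: set, valid_metrics: set) -> list:
    """Check a custom report's dimensions & metrics against the valid sets
    fetched from the API's /metadata endpoint, returning errors so the
    capture can fail at publish-time validation rather than afterward."""
    errors = [f"invalid dimension {d!r}" for d in report.get("dimensions", [])
              if d not in valid_dims]
    errors += [f"invalid metric {m!r}" for m in report.get("metrics", [])
               if m not in valid_metrics]
    return errors
```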
…data

Turns out that sometimes those other fields in the `MetadataResponse`
aren't present, and the connector fails validation when they're missing.
This commit strips `MetadataResponse` down to only the fields we need.
This adds a new script which is a lightweight wrapper around
the `list-tasks`, `bulk-config-editor`, and `bulk-publish`
scripts to make it even easier to set a `no_foobar` flag on
all tasks using a particular set of connectors. Example:

    $ ./feature-flag-automation.sh no_keyless_row_id source-postgres-batch source-mysql-batch source-redshift-batch source-bigquery-batch
    Setting flag 'no_keyless_row_id' on connectors: source-postgres-batch source-mysql-batch source-redshift-batch source-bigquery-batch
    Working directory: specs_no_keyless_row_id_20250304_114701
    Plan:

        ./list-tasks.sh --connector=source-postgres-batch --pull --missing=no_keyless_row_id --dir=specs_no_keyless_row_id_20250304_114701
        ./list-tasks.sh --connector=source-mysql-batch --pull --missing=no_keyless_row_id --dir=specs_no_keyless_row_id_20250304_114701
        ./list-tasks.sh --connector=source-redshift-batch --pull --missing=no_keyless_row_id --dir=specs_no_keyless_row_id_20250304_114701
        ./list-tasks.sh --connector=source-bigquery-batch --pull --missing=no_keyless_row_id --dir=specs_no_keyless_row_id_20250304_114701
        ./bulk-config-editor.sh --set_flag=no_keyless_row_id --dir=specs_no_keyless_row_id_20250304_114701
        ./bulk-publish.sh --mark --dir=specs_no_keyless_row_id_20250304_114701

    Proceed? (y/N)

This script is even more specialized for the specific case of the
thing I've had to do like half a dozen times now, but it makes
that operation very easy indeed and reduces the chance of messing
things up due to a typo or copy-paste error in one command.
This adds a connector startup check to verify that the resume
LSN is less than or equal to the current server WAL LSN. This
should be the case 100% of the time except when we suddenly get
pointed at a different database entirely, or if the WAL gets
deleted and restarts from zero as happens in a DB version upgrade.

This check is in warn-only mode for now, and I tweaked the wording
on the other "is this resume LSN consistent with the current state
of the server" check we do so they'll both include the substring
`"resume cursor mismatch"`. My plan is to check back in a week to
see if either warning triggered when it shouldn't have, and if
not we'll (finally) promote these checks to hard failures.
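The comparison itself is simple once the LSNs are parsed. A Python sketch of the check (the connector is Go; PostgreSQL LSNs are two hex words joined by a slash):

```python
def parse_lsn(lsn: str) -> int:
    """Parse a PostgreSQL LSN like '16/B374D848' (hex high/low words)
    into a single 64-bit integer for comparison."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def resume_lsn_ok(resume: str, current_wal: str) -> bool:
    """The resume LSN should never be ahead of the server's WAL position."""
    return parse_lsn(resume) <= parse_lsn(current_wal)
```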
Eases the constraints on key locations for sql materializations in order to
allow materializations to roll up based on a subset of keyed locations. This is
used in the ops catalog in order to roll up events by catalog name (ignoring
shard key/rclock ranges) when materializing events.
I needed to get some better logs in order to troubleshoot an issue locally, and
it seemed useful enough to commit.
willdonnelly and others added 29 commits April 2, 2025 09:25
As described in the flag comment, when `tolerate_missed_changes`
is true the connector will tolerate missed changes in the CDC
stream and will not trigger an automatic re-backfill if changes
go missing. This may be useful if the CDC event data starts to
expire before it can be captured, but should generally only be
needed in exceptional circumstances when recovering from some
sort of major breakage.
This commit adds two hidden features for internal use:
- For all three captures, setting the "Skip Backfills" advanced
  option to `*.*` will cause all tables to immediately transition
  to active without any backfilling. This lets us avoid needing
  to modify the capture mode for a whole bunch of bindings.
- For all three captures, it is possible to put a setting like
  `initial_backfill_cursor=12345` into the feature flags list.
  If this is done, the value `12345` will be used as the initial
  cursor value when there is no resume cursor, instead of the
  normal behavior of taking the current end of the WAL.

As a reminder, this is what cursor values look like for each DB:

    // MySQL
    binlog.12345:7890

    // PostgreSQL
    123/45678901

    // SQL Server
    ALUrCQADl2AAGA==
Include the collection name in this error from Validate to help with finding a
resolution.
S3 table buckets now support an Iceberg REST API, and we can use that to
materialize to Iceberg tables in S3 table buckets.

To do this we need to be able to specify the "signing name" for SigV4
authentication. For most users who use Glue as a catalog this will default to
'glue', but it needs to be set to 's3tables' to use an S3 table bucket.

Another peculiar quirk of S3 tables is that when dropping a table, you must
request a purge of the table. So there's a bit of implementation-specific code
creeping into our REST catalog client to accommodate this, but at least it's only
for SigV4 authentication.

An end-to-end test configuration is included here, but it doesn't work fully. It
will create tables and run transactions against them just fine, but the iceberg
helper program that reads the table results will need to be updated to read from S3
tables. I think that will be possible once we are able to update to use the
latest version of duckdb, so I will try to do that later.
Bumps [github.com/golang-jwt/jwt/v5](https://github.com/golang-jwt/jwt) from 5.2.1 to 5.2.2.
- [Release notes](https://github.com/golang-jwt/jwt/releases)
- [Changelog](https://github.com/golang-jwt/jwt/blob/main/VERSION_HISTORY.md)
- [Commits](golang-jwt/jwt@v5.2.1...v5.2.2)

---
updated-dependencies:
- dependency-name: github.com/golang-jwt/jwt/v5
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
This is like the 'initial_backfill_cursor=XYZ' flag hackery, but
it doesn't even require that you perform a backfill to trigger it.
This should generally be set, and then you watch it take effect,
and then you immediately publish another change to unset it again.
This commit introduces the Chargebee native connector, including the necessary resource schemas, API integration, and configuration files.

This foundational setup enables further development and testing of the Chargebee integration. There are multiple TODO items to revisit after this initial release.
Data from Mixpanel sometimes inserts `�` bytes in response data. This
has happened before in the `insert_id` property name (75c33a2), and
evidently it also happens within records.
`response.iter_lines(decode_unicode=True)` interprets `�` as a line
break, and the connector tries to yield separate parts of a record.

Airbyte must have encountered this & attempted to address this with the
`parts`-related logic in `iter_dicts`. But they didn't anticipate a single
part being valid JSON. For example, if the record has a property like:
```
{
    "description": "Sing
99
Luftballons."
}
```

Then the `99` part will be seen as valid JSON, the connector will try to
transform & yield it, then fail with a `TypeError`:
```
TypeError: 'int' object is not subscriptable
```

I've improved the handling so `iter_dicts` only yields actual, dictionary
records that contain the two properties that should always be present in
every `export` record.
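The improved filtering looks roughly like this. The commit doesn't name the two required properties; `event` and `properties` below are my assumption for illustration:

```python
import json

# Assumed always-present keys in every Mixpanel export record (hypothetical
# stand-ins; the commit doesn't specify which two properties it checks).
REQUIRED_KEYS = {"event", "properties"}

def iter_dicts(parts):
    """Yield only complete dict records; skip fragments (like a bare `99`
    split off by a stray line break) that happen to parse as valid JSON
    but aren't full records."""
    for part in parts:
        try:
            obj = json.loads(part)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys():
            yield obj
```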
…failures

The apply action that fails will cause a context cancellation, and all other
actions will then fail due to that context being cancelled. This causes a bunch
of log spam and makes it very difficult to see which action it was that failed
for something other than a context cancellation.

This change will make it so only the erroring action will log out its action.
The `http.DefaultClient` uses chunked transfer encoding when the
Content-Length header is not set & the request body doesn't have a known
length (ex: an `*io.PipeReader`). Some webhook destinations, like Azure
Logic Apps, don't work well with chunked transfer encoding. This commit
reverts the connector's behavior to not stream the request body as
chunks & instead send the request all at once. Request bodies are maxed
at ~1 MiB before they are sent.
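The buffering approach, sketched in Python (the connector is Go; `post` is a hypothetical stand-in for the HTTP client's post function):

```python
import io

MAX_BODY_BYTES = 1 << 20  # ~1 MiB cap, matching the connector's batch limit

def send_all_at_once(post, url, body_stream):
    """Buffer the whole request body before sending, so the HTTP client
    can set an explicit Content-Length instead of falling back to
    chunked transfer encoding."""
    body = body_stream.read()
    if len(body) > MAX_BODY_BYTES:
        raise ValueError("request body exceeds the ~1 MiB limit")
    return post(url, body)  # a bytes body gets an explicit Content-Length
```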
…hargebeeResource

Modified the date filtering logic in the _fetch_resource_data function to utilize the cursor field specific to IncrementalChargebeeResource.
I've seen a handful of imported connectors hit the 6 hour session
timeout limit recently. Reducing the timeout to 2 hours should help
reduce the impact when this happens. I'd like to reduce it even further,
but some imported connectors (at least source-mixpanel-native and its
export stream) do take over an hour to complete a single request when it's
streaming in a lot of records.
This information was already available from the MongoDB timestamps, but this
makes it easier to see at a glance how far behind a given change stream is in
processing events vs. the latest cluster operation time.
…database change streams

Sometimes a MongoDB collection is not being captured, but is part of a database
that is being captured. Normally this isn't a problem if the number and/or size
of change events for that MongoDB collection are small, but it can be
problematic otherwise & actually bog down the change stream processing.

This adds an advanced configuration option for these cases to allow MongoDB
collections to be excluded from the database change stream by way of a
pipeline filter.
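A pipeline filter of this kind can be built as a `$match` stage against the change event's `ns.coll` field, something like (a sketch; the connector is Go, and its actual pipeline may differ):

```python
def exclusion_pipeline(excluded_collections: list) -> list:
    """Build a change-stream aggregation pipeline that drops events for
    the named collections while keeping the rest of the database stream."""
    return [{"$match": {"ns.coll": {"$nin": excluded_collections}}}]
```

With pymongo this would be passed as the first argument to `db.watch(...)`.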
This commit implements a new fencing mechanism for SQL Server
which should work against a read replica. The new implementation
is used automatically when the target DB is detected as being a
read-only standby, or it can be explicitly forced on using the
`replica_fencing` feature flag.

This mechanism has a minor downside, which is that it doesn't
(and can't, in every possible scenario) guarantee that when a
row is modified concurrently with a backfill, the backfilled
row state and the concurrent changes from replication will be
captured in the same Flow transaction.

Other than this caveat the capture output is still entirely
correct, so any users with standard reduction behavior into
a materialization wouldn't even be able to notice a difference,
but it's just a _tiny_ bit weaker of a correctness guarantee
compared to what our SQL CDC connectors usually offer.

But I have proven to my own satisfaction that it is literally
impossible to make the more precise guarantee in all situations
when looking solely at the available data on a read replica, so
if we want to support read replica CDC for SQL Server at all
this is the best we're going to be able to do. And we do want
to support that.
In tests we're actually much more demanding than production uses
are of our capture's ability to catch up to the end of the WAL in
a single streaming cycle, because we have test-only logic which
tells the connector to shut down as soon as it's caught up.

But since the replica fence mechanism can't directly observe the
end of the WAL (since, again, there is no way for us to actually
get that information which works on a read replica) we have to
somehow ensure that we check _after_ long enough has passed for
the CDC agent to run and copy the data into the CDC tables.

The simplest solution is to just add a short delay before running
each test capture. This is actually pretty well motivated in general
in the context of replica captures -- if we're testing against
a read replica there is a nonzero amount of latency between when
we make a change on the master and when it shows up for us to see
on the replica, so we need to wait a bit after making changes and
before expecting a capture to see them.

Currently this delay is only added for SQL Server and only when
the `replica_fencing` flag is explicitly enabled in a test. This
seemed reasonable since `replica_fencing` isn't the default today
for most production captures, and we can only really test one or
the other mechanism in CI builds, so manually testing the feature
currently works by setting that flag.
I was fiddling around in our dev Intercom account when I was trying to replicate an issue, and I ended up modifying some of the data. This commit updates the capture snapshot to reflect the conversation parts I added to our dev Intercom account.
`conversation_parts` works by fetching parts of updated conversations,
then only yielding the parts that have been updated since the last
log cursor. This was done to avoid yielding conversation parts that were
previously yielded. However, this strategy has potential to miss data if
another part is added to a conversation that's beyond the date window
the connector is currently checking & the conversation is updated again
before the connector checks the next date window.

For example, say there's a conversation `123` that had a part added to
it at `1742999957`. The connector is checking the window of
`1742999856` - `1742999956`. Before the connector finishes that date
window, conversation `123` is updated again and its `updated_at` is
pushed to a later time `1743000057`.

Next, the connector checks the date window `1742999956` - `1743000056`.
Conversation `123` is not included in this date window, so it's not
checked for parts updated after `1742999956`.

Next, the connector checks the date window `1743000056` - `1743000356`.
Conversation `123` is included, but the part that was added at
`1742999957` is not yielded because it was updated before the most
recent log cursor.

This commit greatly simplifies this logic for `conversation_parts` and
aligns its checkpointing strategy with `conversations`. If we detect an
updated conversation, we yield all of its parts no matter when they were
updated. This should ensure the connector captures all parts for updated
conversations.
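The simplified logic boils down to this (a sketch with a hypothetical `fetch_parts` helper standing in for the parts API request):

```python
def yield_updated_parts(updated_conversations, fetch_parts):
    """Yield every part of every updated conversation, regardless of each
    part's own updated_at. Re-emitting previously seen parts is harmless
    downstream, while filtering by part timestamp risked missed data."""
    for convo in updated_conversations:
        yield from fetch_parts(convo["id"])
```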
If the `isReplicaDatabase()` function encounters an error, it will
now assume that this is not a replica instead of erroring out. The
main reason we're likely to see this is on databases which don't
even have a `sys.dm_hadr_database_replica_states` table, so erring
on the side of assuming it's the primary DB should always be fine.
As currently implemented, XMIN backfills have a problem.

The fundamental issue with XMIN backfills as they exist today is that we
literally just slap a `WHERE {xmin is within the specified range}` filter (or
filters) into the backfill queries without changing anything else about our
behavior. And the problem with this is that our backfill logic is built around
the assumption that we are getting result rows back.

This turns out to not be the case when doing an XMIN backfill of a large table
(the only kind worth using XMIN backfills on) whose primary keys are mostly
ordered (that is, most writes will be at the very end of the table).

What happens in this case is we tell the DB "hey can you give me any rows newer
than {min_xid} and stop after N" and the DB says "Sure, I can do that!" and
then it goes off and scans over the entire table and if we're lucky it
eventually finishes that, many hours later, and gives us back the first N new
rows, after which point we can start issuing much faster queries to read out
the rest of the new data after those ones.

But if anything interrupts that incremental scan, we have no recourse but to
start all over again from the beginning.

And this is clearly dumb, because our backfill queries proceed according to
primary key (or CTID) order, so if we _knew_ how far the scan had gotten we
could tell the DB to start from that point. But the only way we learn about
that progress is when we get back a result row, so that doesn't work right when
the result set is heavily filtered.

This commit removes the DB-side filtering in favor of doing that filtering on
the client side, and implements a bit of additional plumbing so that the
backfill implementation can keep track of a "resume from here" key separate
from any actual ChangeEvents it outputs.
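The core of the client-side approach, sketched in Python (the connector is Go, and real xmin comparison must handle 32-bit transaction-ID wraparound, which is ignored here):

```python
def xmin_backfill_chunk(rows, min_xid, max_xid):
    """Filter by xmin on the client instead of in SQL: every scanned row
    advances the resume key, even when it is filtered out, so an
    interrupted scan restarts from the last key seen rather than from
    the beginning of the table."""
    resume_key, emitted = None, []
    for key, xmin, row in rows:
        resume_key = key  # progress tracking is independent of the filter
        if min_xid <= xmin <= max_xid:
            emitted.append(row)
    return emitted, resume_key
```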
We have a ton of fancy logic in SQL Server for attempting to
encode text strings such that the encoded FDB tuples sort in a
lexicographic order which actually matches the source DB's
collation ordering of those strings.

And that mostly works! Except apparently it has a subtle flaw we
completely overlooked for like a year or two now. See, we treat
the encoded "sort key" as a one-way function of the input, so we
encode a string as a tuple (<tag>, <sort key>, <original string>)
and then throw away the sorting key and use the original string
as the key on decode. And this works, as long as you assume that
strings are necessarily either less than or greater than each
other. Which is mostly the case for primary keys!

But it's not the case if the string is a non-final element of a
composite primary key. In that case you could have two strings
which are considered equal under the source DB collation yet have
different representations. For instance "foo" and "FOO". And then
even though the encoded _sort keys_ are considered equal, the value
parts are unequal and that part comes before any _subsequent_ parts
of the composite primary key and screws up the overall key ordering.

And while it's possible to fix that, at this point I'm thinking that
we have invested way too much time and effort into SQL Server text
collation, and this is getting really dumb considering that MySQL
and PostgreSQL just treat all text collation ordering as unable to
be predicted and default to imprecise backfill semantics, and *we
have never had a single complaint about that*.

So let's just be consistent. SQL Server text primary keys will also
cause us to default to imprecise backfill order.
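The flaw can be demonstrated with a toy encoding, using `str.lower` as a stand-in for a case-insensitive collation's key function (illustrative only; the real FDB tuple encoding differs):

```python
def encode_text(s: str, collation_key=str.lower) -> tuple:
    """Encode as (sort key, original string): the sort key orders the
    tuple, and the original string is kept so decoding can recover it."""
    return (collation_key(s), s)

# Composite key (text, int). Under the collation "FOO" == "foo", so the
# second component should decide the order, putting 0 before 1...
a = (encode_text("FOO"), 1)
b = (encode_text("foo"), 0)
# ...but the tie falls through to the original strings, and "FOO" < "foo"
# in byte order, so the row with second component 1 sorts first.
```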
…alogs

This commit expands the connector's resource handling capabilities:

Resource Additions:
- Added new Pydantic models and bindings for:
  * Comments (with incremental sync support via created_at cursor)
  * Contacts
  * Gifts
  * Promotional Credits
  * Quote Line Groups (requires Performance/Enterprise plan)
  * Subscriptions with Scheduled Changes
  * Unbilled Charges

Configuration Improvements:
- Updated resource configurations to support both Product Catalog 1.0 and 2.0
- Enhanced validation logic for resource-specific configurations
- Added proper type hints and validation rules for all new schemas

API Integration Enhancements:
- Improved handling of associated resources with parent-child relationships
- Added support for filtering in associated resource queries
- Implemented proper cursor handling for incremental syncs

This enhancement provides broader coverage of Chargebee's API capabilities while
maintaining proper error handling through the API's native responses.
We only need this prerequisite check when in read-only non-replica
mode and it's incorrect to let that break a replica capture.