Scrape retained catalog data as CSVs #31976

Merged · 5 commits · Mar 27, 2025
Conversation

@SangJunBak (Contributor) commented Mar 20, 2025

Based on PR #31859

Chain of upstream PRs as of 2025-03-25

See commit messages for details

Motivation

Tips for reviewer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@SangJunBak SangJunBak requested a review from ParkMyCar March 20, 2025 22:09
@ParkMyCar (Member) left a comment

My only comment is that I'm not really sure why we want to use a SUBSCRIBE over a SELECT?

Comment on lines 727 to 744:

```rust
let copy_query = format!(
    "COPY (SUBSCRIBE TO (SELECT * FROM {})) TO STDOUT WITH (FORMAT CSV);",
    relation.name
);

let copy_fut = write_copy_stream(transaction, &copy_query, &mut file, relation.name);

// We use a timeout to cut the SUBSCRIBE query short since it's expected to run indefinitely.
// Alternatively, we could use a `DECLARE...FETCH ALL` for the same effect, but then we'd have
// to format the result as CSV ourselves, leading to more code. Another alternative is to
// specify an UPTO, but it gets finicky to get the UPTO frontier right since we can't rely on
// wallclock time.
let res = timeout(SUBSCRIBE_SCRAPE_TIMEOUT, copy_fut).await;
```
Member:
Why are we switching from a SELECT to a SUBSCRIBE? Especially if we're using a timeout to cancel a SUBSCRIBE I don't think there are any guarantees about what we'll actually receive?

Contributor (author):
To get the retained history, including the snapshot! I don't think there's any other way of getting retained data, but I might be wrong?

I don't think there are any guarantees about what we'll actually receive?

So if it's empty, I think that's fine! But I think you're right that there's no guarantee we'll get the full result set using this method. My reasoning, though, is that it isn't much worse than using DECLARE ... FETCH ALL, given we'd need to supply a timeout to FETCH ALL anyways. I think FETCH ALL does have the stronger guarantee that, if the fetch doesn't time out, it terminates once all the rows present at the time of the FETCH are received. But I didn't think that guarantee was worth the tradeoff of parsing the result stream into a CSV ourselves. Curious for your take, though!

Member:
I think it's definitely worth it to use FETCH ALL, because otherwise reasoning about what data we get is pretty difficult. Also tuning the SUBSCRIBE_SCRAPE_TIMEOUT would be hard because we'd need to account for network lag, data set sizes, etc.

If you'd rather merge as-is and follow up, that works for me, but I would really like to see this change! Happy to help write the logic to transform the result stream into CSVs :)

Contributor (author):
Also tuning the SUBSCRIBE_SCRAPE_TIMEOUT would be hard because we'd need to account for network lag, data set sizes, etc.

But wouldn't we still need the timeout though when using FETCH ALL?

because otherwise reasoning about what data we get is pretty difficult

I agree with you here! I'll push up a change to this PR~

Member:
But wouldn't we still need the timeout though when using FETCH ALL?

What I'm thinking of is FETCH ALL FROM <cursor> WITH (timeout = '0s'); I haven't tested it but I believe this should return a snapshot and all of the retained history that is immediately available. In other words it returns everything from now() - retained history up until now(), which is the goal?

We'll still need to wrap the query in a timeout in case the connection or database is slow, but I don't think we'll ever receive a partial set of data. By contrast, if we time out the SUBSCRIBE, we could receive the initial snapshot and then some indeterminate amount of diffs.

I might be wrong here though! Please feel free to correct me :)
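For concreteness, the flow being suggested could look roughly like this (a hedged sketch; `c` and `my_relation` are placeholder names, not identifiers from this PR):

```sql
-- Open a transaction and declare a cursor over the SUBSCRIBE.
BEGIN;
DECLARE c CURSOR FOR SUBSCRIBE (SELECT * FROM my_relation);

-- With a zero timeout, FETCH ALL should return the initial snapshot plus
-- whatever retained history is immediately available, then return instead
-- of blocking for future updates.
FETCH ALL c WITH (timeout = '0s');
COMMIT;
```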

Contributor (author):
it returns everything from now() - retained history up until now()

Ah, that's a really good point. My reasoning was that because results stream in sequentially (ordered by mz_timestamp), even if the result set is indeterminate, it'll still at least be correct / in order. But having the guarantee that the snapshot is never cut off is definitely important.

Contributor (author):
This commit should fix it! b8d117a (#31976)

@SangJunBak SangJunBak force-pushed the jun/#8908/dump-catalog-4 branch from feb0b83 to 02c630e March 25, 2025 04:31
Base automatically changed from jun/#8908/dump-catalog-3 to jun/#8908/dump-catalog-2 March 25, 2025 04:35
@SangJunBak SangJunBak force-pushed the jun/#8908/dump-catalog-4 branch from 02c630e to e17f192 March 25, 2025 05:10
@SangJunBak SangJunBak requested a review from ParkMyCar March 25, 2025 14:50
Base automatically changed from jun/#8908/dump-catalog-2 to main March 25, 2025 18:24
We were running into this error where if we tried to retry, because the transaction was hoisted, we couldn't rollback without moving ownership of it. By cancelling using the client, we avoid having to hoist the transaction.
- Introduces a timeout for subscribe queries to prevent indefinite execution
- Refactors column name retrieval to not run in the same transaction
- For each ::Retained relation, we add a ::Basic version that solely does a simple SELECT. We need to separate the two because we can't run both a SUBSCRIBE and a SELECT query in the same transaction. There's a way to achieve the same behavior with just a ::Retained relation, but this approach leads to less code, and we'd have to add this logic anyways for our iterators. I also think it's nice to explicitly create a relation per query, and it extends our retries to each of these different queries too.
- Introduced the `csv-async` crate to facilitate asynchronous CSV serialization.
@SangJunBak SangJunBak force-pushed the jun/#8908/dump-catalog-4 branch from 1d13dc0 to b8d117a March 26, 2025 20:34
<!-- start git-machete generated -->

# Based on PR #31976

## Chain of upstream PRs as of 2025-03-25

* PR #31859:
  `main` ← `jun/#8908/dump-catalog-2`

  * PR #31976:
    `jun/#8908/dump-catalog-2` ← `jun/#8908/dump-catalog-4`

    * **PR #32007 (THIS ONE)**:
      `jun/#8908/dump-catalog-4` ← `jun/#8908/emulator-support`

<!-- end git-machete generated -->

This is just to get the base of emulator support going. Things TODO are:
- Allowing catalog scraping via a connection URL
- Modifying the UI to separate self-managed from the emulator
You can see the full list in the linked ticket!

See commit messages for details

### Motivation

* This PR adds a known-desirable feature.

  MaterializeInc/database-issues#8908




### Checklist

- [ ] This PR has adequate test coverage / QA involvement has been duly considered. ([trigger-ci for additional test/nightly runs](https://trigger-ci.dev.materialize.com/))
- [ ] This PR has an associated up-to-date [design doc](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/README.md), is a design doc ([template](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/00000000_template.md)), or is sufficiently small to not require a design.
- [ ] If this PR evolves [an existing `$T ⇔ Proto$T` mapping](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/command-and-response-binary-encoding.md) (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.
- [ ] If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label ([example](MaterializeInc/cloud#5021)).
- [ ] If this PR includes major [user-facing behavior changes](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/guide-changes.md#what-changes-require-a-release-note), I have pinged the relevant PM to schedule a changelog post.

@SangJunBak SangJunBak enabled auto-merge (squash) March 27, 2025 20:13
@SangJunBak SangJunBak merged commit 6e89e33 into main Mar 27, 2025
80 checks passed
@SangJunBak SangJunBak deleted the jun/#8908/dump-catalog-4 branch March 27, 2025 20:47