# Scrape retained catalog data as CSVs #31976

## Conversation
Only comment is I'm not really sure why we want to use a `SUBSCRIBE` over a `SELECT`?
```rust
let copy_query = format!(
    "COPY (SUBSCRIBE TO (SELECT * FROM {})) TO STDOUT WITH (FORMAT CSV);",
    relation.name
);

let copy_fut = write_copy_stream(transaction, &copy_query, &mut file, relation.name);

// We use a timeout to cut the SUBSCRIBE query short since it's expected to run indefinitely.
// Alternatively, we could use a `DECLARE...FETCH ALL` for the same effect, but then we'd have
// to format the result as CSV ourselves, leading to more code. Another alternative is to
// specify an UPTO, but it gets finicky to get the UPTO frontier right since we can't rely on
// wallclock time.
let res = timeout(SUBSCRIBE_SCRAPE_TIMEOUT, copy_fut).await;
```
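As an aside, the cut-short-on-timeout pattern in the snippet can be illustrated without a database. The sketch below uses std's `recv_timeout` in place of tokio's `timeout` (purely to keep it dependency-free); a channel that never produces stands in for the indefinite SUBSCRIBE stream:

```rust
use std::sync::mpsc;
use std::time::Duration;

fn main() {
    // The sender is kept alive but never sends, standing in for a
    // SUBSCRIBE stream that runs indefinitely.
    let (_tx, rx) = mpsc::channel::<String>();

    // A bounded wait returns an error once the deadline passes instead of
    // blocking forever, which is the role SUBSCRIBE_SCRAPE_TIMEOUT plays
    // for the COPY future above.
    let res = rx.recv_timeout(Duration::from_millis(50));
    assert!(matches!(res, Err(mpsc::RecvTimeoutError::Timeout)));
    println!("timed out as expected");
}
```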
---
Why are we switching from a `SELECT` to a `SUBSCRIBE`? Especially if we're using a timeout to cancel a `SUBSCRIBE`, I don't think there are any guarantees about what we'll actually receive?
---
To get the retained history, including the snapshot! I don't think there's any other way of getting retained data, but I might be wrong?

> I don't think there are any guarantees about what we'll actually receive?

So if it's empty, I think that's fine! But I think you're right that there's no guarantee we'll get the full result set using this method. My reasoning, though, is that it isn't much worse than using `DECLARE ... FETCH ALL`, given we'd need to supply a timeout to `FETCH ALL` anyway. `FETCH ALL` does have the stronger guarantee that, if the fetch doesn't time out, it will terminate once all the rows at the time of the `FETCH` are received. But I didn't think that guarantee was worth the tradeoff of parsing the result stream into a CSV ourselves. Curious on your take!
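For context on that tradeoff: formatting the result stream as CSV ourselves means reimplementing quoting rules that `COPY ... WITH (FORMAT CSV)` (or a crate) handles for us. A minimal, illustrative sketch of what that hand-rolling involves (not the PR's code):

```rust
/// Quote a field if it contains a comma, quote, or newline, doubling any
/// embedded quotes -- the core of RFC 4180-style CSV escaping.
fn csv_field(field: &str) -> String {
    if field.contains(|c: char| c == ',' || c == '"' || c == '\n') {
        format!("\"{}\"", field.replace('"', "\"\""))
    } else {
        field.to_string()
    }
}

/// Join escaped fields into one CSV record.
fn csv_row(fields: &[&str]) -> String {
    fields.iter().map(|f| csv_field(f)).collect::<Vec<_>>().join(",")
}

fn main() {
    assert_eq!(csv_row(&["a", "b,c", "d\"e"]), "a,\"b,c\",\"d\"\"e\"");
    println!("{}", csv_row(&["plain", "needs,quoting"]));
}
```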
---
I think it's definitely worth it to use `FETCH ALL`, because otherwise reasoning about what data we get is pretty difficult. Also, tuning the `SUBSCRIBE_SCRAPE_TIMEOUT` would be hard because we'd need to account for network lag, data set sizes, etc.
If you'd rather merge as-is and follow up, that works for me, but I would really like to see this change! Happy to help write the logic to transform the result stream into CSVs :)
---
> Also tuning the SUBSCRIBE_SCRAPE_TIMEOUT would be hard because we'd need to account for network lag, data set sizes, etc.

But wouldn't we still need the timeout when using `FETCH ALL`?

> because otherwise reasoning about what data we get is pretty difficult

I agree with you here! I'll push up a change to this PR~
---
> But wouldn't we still need the timeout when using `FETCH ALL`?

What I'm thinking of is `FETCH ALL FROM <cursor> WITH (timeout = '0s');`. I haven't tested it, but I believe this should return a snapshot and all of the retained history that is immediately available. In other words, it returns everything from `now() - retained history` up until `now()`, which is the goal?

We'll still need to wrap the query in a timeout in case the connection or database is slow, but I don't think we'll ever receive a partial set of data. For example, if we time out the `SUBSCRIBE`, we could receive the initial snapshot and then some indeterminate amount of diffs.

I might be wrong here though! Please feel free to correct me :)
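To make the proposal concrete, here's a sketch of the statement pair this approach would issue. The cursor name `c` is arbitrary and the exact syntax is illustrative; the key part is `WITH (timeout = '0s')`, which asks `FETCH ALL` to return the snapshot plus whatever retained history is immediately available, then stop:

```rust
/// Build the DECLARE / FETCH ALL statement pair sketched above (hypothetical
/// helper, not the PR's actual code).
fn fetch_all_statements(relation: &str) -> (String, String) {
    let declare = format!("DECLARE c CURSOR FOR SUBSCRIBE TO (SELECT * FROM {relation});");
    let fetch = "FETCH ALL FROM c WITH (timeout = '0s');".to_string();
    (declare, fetch)
}

fn main() {
    let (declare, fetch) = fetch_all_statements("mz_tables");
    println!("{declare}\n{fetch}");
}
```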
---
> it returns everything from now() - retained history up until now()

Ah, that's a really good point. My reasoning was that because results stream in sequentially (ordered by `mz_timestamp`), even if the cutoff is indeterminate, the output will still at least be correct and in order. But having the guarantee that we'll never get the snapshot cut off is definitely important.
---
This commit should fix it! b8d117a
We were running into an error where, if we tried to retry, we couldn't roll back the hoisted transaction without moving ownership of it. By cancelling via the client, we avoid having to hoist the transaction.
- Introduces a timeout for subscribe queries to prevent indefinite execution
- Refactors column name retrieval to not run in the same transaction
- For each `::Retained` relation, adds a `::Basic` version that solely does a simple SELECT. We need to separate the two because we can't do both a SUBSCRIBE and a SELECT query in the same transaction. There's a way to achieve the same behavior with just a `::Retained` relation, but this approach leads to less code, and we'd have to add this logic anyway for our iterators. I also think it's nice to explicitly create a relation per query, and it extends our retries to each of these different queries too.
- Introduced the `csv-async` crate to facilitate asynchronous CSV serialization.
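The `::Retained` / `::Basic` split described above could look roughly like this (a hypothetical sketch, not the PR's actual types; query shapes are taken from the snippet earlier in the thread):

```rust
/// Hypothetical per-query relation split: retained relations are scraped
/// with SUBSCRIBE, basic ones with a plain SELECT, so the two kinds never
/// have to share a transaction.
enum RelationKind {
    Basic,
    Retained,
}

fn copy_query(name: &str, kind: &RelationKind) -> String {
    match kind {
        RelationKind::Basic => {
            format!("COPY (SELECT * FROM {name}) TO STDOUT WITH (FORMAT CSV);")
        }
        RelationKind::Retained => {
            format!("COPY (SUBSCRIBE TO (SELECT * FROM {name})) TO STDOUT WITH (FORMAT CSV);")
        }
    }
}

fn main() {
    println!("{}", copy_query("mz_tables", &RelationKind::Basic));
    println!("{}", copy_query("mz_tables", &RelationKind::Retained));
}
```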
# Based on PR #31976

## Chain of upstream PRs as of 2025-03-25

* PR #31859 (Add system catalog dumper for self-managed debug tool): `main` ← `jun/#8908/dump-catalog-2`
* PR #31976: `jun/#8908/dump-catalog-2` ← `jun/#8908/dump-catalog-4`
* **PR #32007 (THIS ONE)**: `jun/#8908/dump-catalog-4` ← `jun/#8908/emulator-support`

This is just to get the base of emulator support going. Things TODO are:

- Allowing catalog scraping via a connection URL
- Modifying the UI to separate self-managed from the emulator

You can see the full list in the linked ticket! See commit messages for details.

### Motivation

* This PR adds a known-desirable feature. MaterializeInc/database-issues#8908

### Checklist

- [ ] This PR has adequate test coverage / QA involvement has been duly considered. ([trigger-ci for additional test/nightly runs](https://trigger-ci.dev.materialize.com/))
- [ ] This PR has an associated up-to-date [design doc](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/README.md), is a design doc ([template](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/00000000_template.md)), or is sufficiently small to not require a design.
- [ ] If this PR evolves [an existing `$T ⇔ Proto$T` mapping](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/command-and-response-binary-encoding.md) (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.
- [ ] If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label ([example](MaterializeInc/cloud#5021)).
- [ ] If this PR includes major [user-facing behavior changes](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/guide-changes.md#what-changes-require-a-release-note), I have pinged the relevant PM to schedule a changelog post.