Scrape retained catalog data as CSVs #31976

Merged · 5 commits · Mar 27, 2025
Conversation

@SangJunBak (Contributor) commented Mar 20, 2025

Based on PR #31859

Chain of upstream PRs as of 2025-03-25

See commit messages for details

Motivation

Tips for reviewer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@SangJunBak SangJunBak requested a review from ParkMyCar March 20, 2025 22:09
@ParkMyCar (Member) left a comment

My only comment is that I'm not really sure why we want to use a SUBSCRIBE over a SELECT?

Comment on lines 727 to 744:

```rust
let copy_query = format!(
    "COPY (SUBSCRIBE TO (SELECT * FROM {})) TO STDOUT WITH (FORMAT CSV);",
    relation.name
);

let copy_fut = write_copy_stream(transaction, &copy_query, &mut file, relation.name);

// We use a timeout to cut the SUBSCRIBE query short since it's expected to run indefinitely.
// Alternatively, we could use a `DECLARE...FETCH ALL` for the same effect, but then we'd have
// to format the result as CSV ourselves, leading to more code. Another alternative is to
// specify an UPTO, but it gets finicky to get the UPTO frontier right since we can't rely on
// wallclock time.
let res = timeout(SUBSCRIBE_SCRAPE_TIMEOUT, copy_fut).await;
```
Member:
Why are we switching from a SELECT to a SUBSCRIBE? Especially if we're using a timeout to cancel a SUBSCRIBE I don't think there are any guarantees about what we'll actually receive?

Contributor (author):
To get the retained history, including the snapshot! I don't think there's any other way of getting retained data, but I might be wrong?

I don't think there are any guarantees about what we'll actually receive?

So if it's empty, I think that's fine! But I think you're right that there's no guarantee we'll get the full result set using this method. My reasoning, though, is that it isn't much worse than using DECLARE ... FETCH ALL, given we'd need to supply a timeout to FETCH ALL anyways. I think FETCH ALL does have the stronger guarantee that, if the fetch doesn't time out, it terminates once all the rows present at the time of the FETCH are received. But I didn't think that guarantee was worth the tradeoff of parsing the result stream into a CSV ourselves. Curious for your take, though!

Member:
I think it's definitely worth it to use FETCH ALL, because otherwise reasoning about what data we get is pretty difficult. Also tuning the SUBSCRIBE_SCRAPE_TIMEOUT would be hard because we'd need to account for network lag, data set sizes, etc.

If you'd rather merge as-is and follow up, that works for me, but I would really like to see this change! Happy to help write the logic to transform the result stream into CSVs :)

Contributor (author):
Also tuning the SUBSCRIBE_SCRAPE_TIMEOUT would be hard because we'd need to account for network lag, data set sizes, etc.

But wouldn't we still need the timeout though when using FETCH ALL?

because otherwise reasoning about what data we get is pretty difficult

I agree with you here! I'll push up a change to this PR~

Member:
But wouldn't we still need the timeout though when using FETCH ALL?

What I'm thinking of is FETCH ALL FROM <cursor> WITH (timeout = '0s'); I haven't tested it but I believe this should return a snapshot and all of the retained history that is immediately available. In other words it returns everything from now() - retained history up until now(), which is the goal?

We'll still need to wrap the query in a timeout in case the connection or database is slow, but I don't think we'll ever receive a partial set of data. By contrast, if we time out the SUBSCRIBE, we could receive the initial snapshot and then some indeterminate amount of diffs.

I might be wrong here though! Please feel free to correct me :)
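For concreteness, the flow being suggested could look roughly like this (a hedged sketch; `c` and `my_relation` are placeholder names, not identifiers from this PR):

```sql
-- Open a transaction and declare a cursor over the SUBSCRIBE.
BEGIN;
DECLARE c CURSOR FOR SUBSCRIBE (SELECT * FROM my_relation);

-- With a zero timeout, FETCH ALL should return the initial snapshot plus
-- whatever retained history is immediately available, then return instead
-- of blocking for future updates.
FETCH ALL c WITH (timeout = '0s');
COMMIT;
```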

Contributor (author):
it returns everything from now() - retained history up until now()

Ah, that's a really good point. My reasoning was that because results stream in sequentially (ordered by mz_timestamp), even if the result set is indeterminate, it'll still at least be correct / in order. But having the guarantee that the snapshot is never cut off is definitely important.

Contributor (author):
This commit should fix it! b8d117a (#31976)

@SangJunBak SangJunBak force-pushed the jun/#8908/dump-catalog-4 branch from feb0b83 to 02c630e March 25, 2025 04:31
Base automatically changed from jun/#8908/dump-catalog-3 to jun/#8908/dump-catalog-2 March 25, 2025 04:35
@SangJunBak SangJunBak force-pushed the jun/#8908/dump-catalog-4 branch from 02c630e to e17f192 March 25, 2025 05:10
@SangJunBak SangJunBak requested a review from ParkMyCar March 25, 2025 14:50
Base automatically changed from jun/#8908/dump-catalog-2 to main March 25, 2025 18:24
We were running into this error where if we tried to retry, because the transaction was hoisted, we couldn't rollback without moving ownership of it. By cancelling using the client, we avoid having to hoist the transaction.
- Introduces a timeout for subscribe queries to prevent indefinite execution
- Refactors column name retrieval to not run in the same transaction
- For each ::Retained relation, we add a ::Basic version that solely does a simple SELECT. We need to separate the two because we can't run both a SUBSCRIBE and a SELECT query in the same transaction. There's a way to achieve the same behavior with just a ::Retained relation, but this approach leads to less code, and we'd have to add this logic anyways for our iterators. I also think it's nice to explicitly create a relation per query, and it extends our retries to each of these different queries too.
- Introduced the `csv-async` crate to facilitate asynchronous CSV serialization.
@SangJunBak SangJunBak force-pushed the jun/#8908/dump-catalog-4 branch from 1d13dc0 to b8d117a March 26, 2025 20:34
<!-- start git-machete generated -->

# Based on PR #31976

## Chain of upstream PRs as of 2025-03-25

* PR #31859:
  `main` ← `jun/#8908/dump-catalog-2`

  * PR #31976:
    `jun/#8908/dump-catalog-2` ← `jun/#8908/dump-catalog-4`

    * **PR #32007 (THIS ONE)**:
      `jun/#8908/dump-catalog-4` ← `jun/#8908/emulator-support`

<!-- end git-machete generated -->

This is just to get the base of emulator support going. Things TODO are:
- Allowing catalog scraping via a connection URL
- Modifying the UI to separate self-managed from the emulator
You can see the full list in the linked ticket!

See commit messages for details

### Motivation

* This PR adds a known-desirable feature.

  MaterializeInc/database-issues#8908




### Checklist

- [ ] This PR has adequate test coverage / QA involvement has been duly considered. ([trigger-ci for additional test/nightly runs](https://trigger-ci.dev.materialize.com/))
- [ ] This PR has an associated up-to-date [design doc](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/README.md), is a design doc ([template](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/00000000_template.md)), or is sufficiently small to not require a design.
- [ ] If this PR evolves [an existing `$T ⇔ Proto$T` mapping](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/command-and-response-binary-encoding.md) (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.
- [ ] If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label ([example](MaterializeInc/cloud#5021)).
- [ ] If this PR includes major [user-facing behavior changes](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/guide-changes.md#what-changes-require-a-release-note), I have pinged the relevant PM to schedule a changelog post.

@SangJunBak SangJunBak enabled auto-merge (squash) March 27, 2025 20:13
@SangJunBak SangJunBak merged commit 6e89e33 into main Mar 27, 2025
80 checks passed
@SangJunBak SangJunBak deleted the jun/#8908/dump-catalog-4 branch March 27, 2025 20:47