feat(stream): add dedicated metrics for sync log store #20907

Merged
merged 18 commits into from
Apr 1, 2025

Conversation

kwannoel
Contributor

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Checklist

  • I have written necessary rustdoc comments.
  • I have added necessary unit tests and integration tests.
  • I have added test labels as necessary.
  • I have added fuzzing tests or opened an issue to track them.
  • My PR contains breaking changes.
  • My PR changes performance-critical code, so I will run (micro) benchmarks and present the results.
  • My PR contains critical fixes that are necessary to be merged into the latest release.

Documentation

  • My PR needs documentation updates.
Release note

kwannoel commented Mar 14, 2025

@kwannoel kwannoel changed the title from "record more metrics" to "feat(stream): record more metrics" Mar 14, 2025
@kwannoel kwannoel changed the title from "feat(stream): record more metrics" to "feat(stream): add dedicated metrics for sync log store" Mar 14, 2025
@kwannoel kwannoel force-pushed the kwannoel/sync-log-store-metrics branch 2 times, most recently from 3fd1c42 to 9315111 Compare March 17, 2025 09:59
@kwannoel kwannoel force-pushed the kwannoel/sync-log-store-metrics branch 2 times, most recently from 3f41c4e to 0cbff0f Compare March 21, 2025 18:16
@kwannoel kwannoel marked this pull request as ready for review March 22, 2025 04:19
@kwannoel kwannoel requested review from wenym1 and chenzl25 March 22, 2025 04:20
@wenym1 wenym1 left a comment

Can you include a screenshot of the Grafana dashboard with this PR applied?

@kwannoel kwannoel force-pushed the kwannoel/sync-log-store-metrics branch from 0cbff0f to 83e5903 Compare March 25, 2025 13:39
@kwannoel

Screenshot 2025-03-25 at 11 14 55 PM
Screenshot 2025-03-25 at 11 15 17 PM

@kwannoel kwannoel force-pushed the kwannoel/sync-log-store-metrics branch 2 times, most recently from 32698a4 to fc13ccd Compare March 25, 2025 15:18
@kwannoel kwannoel requested a review from wenym1 March 27, 2025 04:42
@kwannoel kwannoel force-pushed the kwannoel/sync-log-store-metrics branch from fc13ccd to 8d14284 Compare March 27, 2025 04:43
/// `target`: refers to the target of the log store,
/// for instance `MySql` Sink, PG sink, etc...
/// or unaligned join.
pub(crate) fn new(
Contributor

I think the metrics here should only be used by the synced kv log store, so they should have a fixed target? For metrics not shared with the unsynced kv log store, we don't even need the target label.

Contributor Author

It can also be used by a decoupled sink into a table.

Contributor

How are we going to specify the target in the future, when we have multiple usages of the synced kv log store executor? The current stream node doesn't have a field to store the target yet.

Contributor Author

We can add it in the stream node later, when we use it for decoupled sink.

"",
[
panels.target(
f"sum({metric('sync_kv_log_store_state')}) by (type, fragment_id, relation)",
Contributor

We should use rate to measure the rate of transition, instead of the counter value itself.

Contributor

Or if we want to monitor the current state rather than the transition rate, we may use gauge rather than counter.

Contributor Author

@kwannoel kwannoel Mar 28, 2025

I prefer to store two counters and subtract one from the other. With a gauge, if the state occasionally transitions to dirty and then quickly back to clean, Prometheus may miss it, because it only scrapes metrics at a fixed interval; the metric value would just stay at 0. If I record both the clean and the dirty transitions as counters, I can still detect this case, since the transition counts increase even when the scraped state looks unchanged.

I will hide the first panel in this group, and only expose the current state metric panel.
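To make the sampling argument above concrete, here is a minimal std-only Rust sketch. It replays a short dirty/clean flip between two scrapes and compares what a gauge and a pair of counters would report. All names (`Transition`, `Sample`, `scrape`) are hypothetical and not taken from the PR; the scrape times are arbitrary.

```rust
/// A state transition of the log store (hypothetical names for illustration).
#[derive(Clone, Copy)]
enum Transition {
    ToDirty,
    ToClean,
}

/// What one scrape would see: a gauge of the current state, plus two
/// monotonically increasing transition counters.
#[derive(Debug, PartialEq)]
struct Sample {
    gauge_dirty: i64,
    dirty_total: u64,
    clean_total: u64,
}

/// Replays `events` (time in ms, sorted ascending) and returns what a scraper
/// polling at `scrape_times_ms` would observe at each scrape.
fn scrape(events: &[(u64, Transition)], scrape_times_ms: &[u64]) -> Vec<Sample> {
    scrape_times_ms
        .iter()
        .map(|&t| {
            let mut s = Sample { gauge_dirty: 0, dirty_total: 0, clean_total: 0 };
            // Apply every event that happened at or before this scrape.
            for &(_, tr) in events.iter().filter(|e| e.0 <= t) {
                match tr {
                    Transition::ToDirty => {
                        s.gauge_dirty = 1;
                        s.dirty_total += 1;
                    }
                    Transition::ToClean => {
                        s.gauge_dirty = 0;
                        s.clean_total += 1;
                    }
                }
            }
            s
        })
        .collect()
}
```

With a flip to dirty at 1200 ms and back to clean at 1300 ms, scrapes at 1000 ms and 2000 ms both see the gauge at 0, while both counters have moved from 0 to 1. The current state is still recoverable as `dirty_total - clean_total`.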

Contributor

I see. If so, I think we can replace the sum in the first panel with rate, and unhide it to show the transition rate into the two states.

Contributor Author

fixed
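For context on why a rate is more readable than the raw counter: a counter only ever grows, so its summed value mostly reflects uptime, while the per-second increase shows how often transitions happen. The sketch below is a deliberately simplified stand-in for that idea, not PromQL's `rate` (which also handles counter resets and extrapolates to the window boundaries). The function name and sample layout are assumptions.

```rust
/// Simplified per-second increase between the first and last sample.
/// Each sample is `(timestamp_seconds, counter_value)`.
/// Assumption: no counter resets inside the window.
fn simple_rate(samples: &[(f64, f64)]) -> Option<f64> {
    let (t0, v0) = *samples.first()?;
    let (t1, v1) = *samples.last()?;
    if t1 <= t0 {
        // Fewer than two distinct points in time: no rate can be computed.
        return None;
    }
    Some((v1 - v0) / (t1 - t0))
}
```

For example, a counter that reads 100 at 0 s and 103 at 30 s has a raw value that says little on a dashboard, while the rate of 0.1 transitions per second is directly interpretable.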

@kwannoel kwannoel force-pushed the kwannoel/sync-log-store-metrics branch 2 times, most recently from 003ed21 to 34ad60a Compare March 28, 2025 06:30
@kwannoel kwannoel requested a review from wenym1 March 28, 2025 09:29
@wenym1 wenym1 left a comment

Can you share a Grafana screenshot from the latest code?

yield Message::Barrier(barrier);
self.metrics
Contributor

In the current implementation, we have to carefully ensure that we take this measurement at every yield point. We could make the back-pressure measurement logic more general and simpler.

It can be a wrapper over any inner stream. When it implements Stream, it starts a timer every time its poll_next returns ready, and records the time elapsed until the next call to poll_next.

Contributor Author

Fixed
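The wrapper idea from this thread can be sketched roughly as follows. This is not the code merged in the PR: to stay std-only it uses a hypothetical `PollStream` trait that mirrors the shape of `futures::Stream::poll_next`, and the names `MeasureWait`, `VecStream`, `NoopWake`, and `run_demo` are invented for illustration. A plain atomic stands in for the metrics counter.

```rust
use std::pin::Pin;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::time::{Duration, Instant};

/// Hypothetical stand-in for `futures::Stream`, so the sketch needs only std.
trait PollStream {
    type Item;
    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>>;
}

/// Wraps an inner stream and accumulates the time between handing an item
/// downstream and being polled again, i.e. how long downstream held us up.
struct MeasureWait<S> {
    inner: S,
    last_ready: Option<Instant>,
    wait_ns: Arc<AtomicU64>,
}

impl<S: PollStream + Unpin> PollStream for MeasureWait<S> {
    type Item = S::Item;

    fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        // If the previous poll yielded an item, record how long it took
        // until this poll arrived.
        if let Some(t) = self.last_ready.take() {
            self.wait_ns
                .fetch_add(t.elapsed().as_nanos() as u64, Ordering::Relaxed);
        }
        let res = Pin::new(&mut self.inner).poll_next(cx);
        // Start the timer only when an item is actually handed downstream.
        if matches!(res, Poll::Ready(Some(_))) {
            self.last_ready = Some(Instant::now());
        }
        res
    }
}

/// Trivial inner stream that is always ready.
struct VecStream(std::vec::IntoIter<u32>);

impl PollStream for VecStream {
    type Item = u32;
    fn poll_next(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Option<u32>> {
        Poll::Ready(self.0.next())
    }
}

/// Waker that does nothing; enough to drive an always-ready stream by hand.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Drives three items through the wrapper with a slow consumer and returns
/// the accumulated wait time in nanoseconds.
fn run_demo(consumer_delay: Duration) -> u64 {
    let wait_ns = Arc::new(AtomicU64::new(0));
    let mut stream = MeasureWait {
        inner: VecStream(vec![1, 2, 3].into_iter()),
        last_ready: None,
        wait_ns: wait_ns.clone(),
    };
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    while let Poll::Ready(Some(_item)) = Pin::new(&mut stream).poll_next(&mut cx) {
        std::thread::sleep(consumer_delay); // simulate a slow downstream
    }
    wait_ns.load(Ordering::Relaxed)
}
```

Because the timing lives in `poll_next`, the executor body no longer needs a measurement at each yield point; any stream passed through the wrapper is covered.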

tracing::trace!("resuming paused future");
if let Some(sleep_future) = sleep_future {
let deadline = sleep_future.deadline();
let now = Instant::now();
Contributor

The check here seems unnecessary. Neither the warning log nor the trace log seems to provide notable information.

Contributor Author

removed

@kwannoel kwannoel force-pushed the kwannoel/sync-log-store-metrics branch from 16c7e1e to 96449c2 Compare March 28, 2025 15:31
@kwannoel

Screenshot 2025-03-29 at 12 26 36 AM
Screenshot 2025-03-29 at 12 27 04 AM
Screenshot 2025-03-29 at 12 27 21 AM

@kwannoel kwannoel requested a review from wenym1 April 1, 2025 02:12
@kwannoel kwannoel force-pushed the kwannoel/sync-log-store-metrics branch from 7152be4 to 7a074cf Compare April 1, 2025 06:26
@wenym1 wenym1 left a comment

Rest LGTM.

@@ -113,12 +113,12 @@ pub mod metrics {
// state of the log store
pub unclean_state: LabelGuardedIntCounter<5>,
pub clean_state: LabelGuardedIntCounter<5>,
-pub wait_next_poll_ns: LabelGuardedIntCounter<4>,
+pub wait_next_poll_ns: Option<LabelGuardedIntCounter<4>>, // Allow us to take it later.
Contributor

Can clone it for simplicity when we use it.
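A rough sketch of the suggestion: instead of wrapping the field in `Option` so it can be taken out later, keep it as a plain field and clone the handle where it is needed. The `Counter` type below is a hypothetical stand-in backed by `Arc<AtomicU64>`; whether `LabelGuardedIntCounter` clones as cheaply and shares state the same way is an assumption drawn from the reviewer's comment, not something verified here.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a counter handle; clones share one value.
#[derive(Clone, Default)]
struct Counter(Arc<AtomicU64>);

impl Counter {
    fn inc_by(&self, n: u64) {
        self.0.fetch_add(n, Ordering::Relaxed);
    }
    fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// The metrics struct keeps a plain field; no `Option` is needed.
#[derive(Default)]
struct Metrics {
    wait_next_poll_ns: Counter,
}

/// Hypothetical consumer that needs its own handle to the counter.
struct Wrapper {
    wait_next_poll_ns: Counter,
}

fn build_wrapper(metrics: &Metrics) -> Wrapper {
    // Cloning copies the handle, not the value: both sides update one counter.
    Wrapper {
        wait_next_poll_ns: metrics.wait_next_poll_ns.clone(),
    }
}
```

The trade-off is one extra reference-count increment in exchange for dropping the `Option` and the `take()` bookkeeping.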

@@ -427,10 +430,10 @@ impl<S: LocalStateStore> WriteFuture<S> {
stream: BoxedMessageStream,
write_state: LogStoreWriteState<S>,
) -> Self {
let instant = Instant::now() + duration;
-tracing::trace!(?instant, ?duration, "write_future_pause");
+tracing::trace!(now = ?Instant::now(), ?duration, "write_future_pause");
Contributor

It seems redundant to call Instant::now() in three consecutive lines. We can call it once and reuse the value.
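A minimal sketch of that suggestion, assuming a hypothetical helper named `pause_window`: read the clock once, then derive the deadline and any logged value from that single reading, so the values stay mutually consistent.

```rust
use std::time::{Duration, Instant};

/// Reads the clock once and derives the pause deadline from that reading.
/// Returns `(now, deadline)` so callers can log both without re-reading the clock.
fn pause_window(duration: Duration) -> (Instant, Instant) {
    let now = Instant::now();
    let deadline = now + duration;
    (now, deadline)
}
```

Separate `Instant::now()` calls can each return a slightly later time, so a logged `now` and a computed deadline would not differ by exactly `duration`; with a single reading they do.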

@kwannoel kwannoel force-pushed the kwannoel/sync-log-store-metrics branch from c064c27 to 364fba8 Compare April 1, 2025 08:20
@kwannoel kwannoel added this pull request to the merge queue Apr 1, 2025
Merged via the queue into main with commit 7f11530 Apr 1, 2025
28 of 29 checks passed
@kwannoel kwannoel deleted the kwannoel/sync-log-store-metrics branch April 1, 2025 11:01