Record ZooKeeperCommandExecutor timings #1191

Conversation
Motivation:

Currently, there are no metrics to identify which part of a `Command` execution is slow. To improve the write performance of Central Dogma, I propose adding three new metrics in addition to the existing lock acquisition metric:

- Command execution time
- Log replay time
- Log store time

Modifications:

- Add `ReplicationTimings` to record the timings.
- Add `ReplicationMetrics` to export the recorded metrics per project through Micrometer.

Result:

You can now use the following four metrics to measure the write performance of Central Dogma:

- `replication.lock.waiting`
- `replication.command.execution`
- `replication.log.replay`
- `replication.log.store`
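As a rough illustration of how per-project timers like these can be registered and fed nanosecond samples through Micrometer, consider the sketch below. The class and field names are assumptions for the example; only the metric names come from the list above, and the actual `ReplicationMetrics` in this PR may be shaped differently.

```java
import java.util.concurrent.TimeUnit;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

// Hypothetical sketch, not the PR's ReplicationMetrics: one Micrometer timer
// per replication phase, tagged with the project name.
final class ReplicationTimersSketch {
    final Timer commandExecution;
    final Timer logReplay;
    final Timer logStore;

    ReplicationTimersSketch(MeterRegistry registry, String projectName) {
        commandExecution = Timer.builder("replication.command.execution")
                                .tag("project", projectName)
                                .register(registry);
        logReplay = Timer.builder("replication.log.replay")
                         .tag("project", projectName)
                         .register(registry);
        logStore = Timer.builder("replication.log.store")
                        .tag("project", projectName)
                        .register(registry);
    }

    public static void main(String[] args) throws InterruptedException {
        final ReplicationTimersSketch timers =
                new ReplicationTimersSketch(new SimpleMeterRegistry(), "myProject");
        final long start = System.nanoTime();
        Thread.sleep(10); // stand-in for the measured work
        timers.commandExecution.record(System.nanoTime() - start, TimeUnit.NANOSECONDS);
    }
}
```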
Walkthrough

Adds a replication timing subsystem (metrics, interface, noop and default implementations), instruments ZooKeeperCommandExecutor with timing hooks across executor queue, lock acquisition/release, log replay, command execution, and log store phases, and introduces tests validating recorded metrics.
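From the walkthrough and the sequence diagram below, the hook interface presumably looks roughly like this sketch; the method set is inferred from the diagram's calls, so treat it as an approximation rather than the exact source:

```java
// Inferred shape of the timing hook interface. Per the walkthrough, a no-op
// implementation backs non-repository commands and a default implementation
// records the measured phases to Micrometer.
interface ReplicationTimings {
    void startExecutorSubmit();
    void startExecutorExecution();
    void startLockAcquisition(long nowNanos);
    void endLockAcquisition(boolean acquired);
    void startLockRelease();
    void endLockRelease();
    void startLogReplay();
    void endLogReplay();
    void startCommandExecution();
    void endCommandExecution();
    void startLogStore();
    void endLogStore();
    void record();   // flush recorded durations to ReplicationMetrics
    String toText(); // human-readable timing breakdown for debug logging
}
```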
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant ZKExec as ZooKeeperCommandExecutor
    participant Timings as ReplicationTimings
    participant Metrics as ReplicationMetrics
    participant Lock as InterProcessMutex
    Client->>ZKExec: doExecute(command)
    ZKExec->>ZKExec: timings = newReplicationTimings(command)
    ZKExec->>Timings: startExecutorSubmit()
    ZKExec->>Timings: startExecutorExecution()
    ZKExec->>ZKExec: blockingExecute(command, timings)
    alt acquire lock
        ZKExec->>Timings: startLockAcquisition(now)
        ZKExec->>Lock: acquire()
        Lock-->>ZKExec: acquired
        ZKExec->>Timings: endLockAcquisition(true)
    else fail to acquire
        ZKExec->>Timings: endLockAcquisition(false)
    end
    ZKExec->>Timings: startLogReplay()
    ZKExec->>ZKExec: replayLogs()
    ZKExec->>Timings: endLogReplay()
    ZKExec->>Timings: startCommandExecution()
    ZKExec->>ZKExec: command.execute()
    ZKExec->>Timings: endCommandExecution()
    ZKExec->>Timings: startLogStore()
    ZKExec->>ZKExec: storeLogs()
    ZKExec->>Timings: endLogStore()
    ZKExec->>Timings: startLockRelease()
    ZKExec->>Lock: release()
    ZKExec->>Timings: endLockRelease()
    ZKExec->>Timings: record()
    Timings->>Metrics: recordTimer(duration) %% Metrics stores to Micrometer
    Metrics-->>Timings: ok
    ZKExec->>Timings: toText()
```
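The bracketing shown in the diagram maps onto plain try/finally blocks, so every started phase is always ended even when it throws. A minimal sketch of that pattern; the enclosing method name and the `replayLogs()`/`storeLogs()` helpers are assumptions, while the timing calls come from the diagram:

```java
// Sketch: each phase is opened before the work and closed in a finally block,
// so the end call runs even when the phase throws.
private <T> T timedPhases(Command<T> command, ReplicationTimings timings) throws Exception {
    timings.startLogReplay();
    try {
        replayLogs();                // assumed helper for the log-replay phase
    } finally {
        timings.endLogReplay();
    }

    timings.startCommandExecution();
    final T result;
    try {
        result = command.execute();  // as in the diagram above
    } finally {
        timings.endCommandExecution();
    }

    timings.startLogStore();
    try {
        storeLogs();                 // assumed helper for the log-store phase
    } finally {
        timings.endLogStore();
    }
    return result;
}
```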
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationMetrics.java (1 hunk)
- server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimings.java (1 hunk)
- server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (1)
server/src/main/java/com/linecorp/centraldogma/server/command/RepositoryCommand.java (1)
RepositoryCommand(35-95)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: build-ubuntu-latest-jdk-17-coverage
- GitHub Check: build-ubuntu-latest-jdk-11
- GitHub Check: build-ubuntu-latest-jdk-17-min-java-11
- GitHub Check: build-macos-latest-jdk-21
- GitHub Check: build-windows-latest-jdk-21
- GitHub Check: build-ubuntu-latest-jdk-21-snapshot
- GitHub Check: flaky-tests
- GitHub Check: lint
```java
void record() {
    if (metrics == null) {
        return;
    }
    if (lockAcquired) {
        metrics.lockAcquireSuccessTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
    } else {
        metrics.lockAcquireFailureTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
    }
    metrics.commandExecutionTimer().record(commandExecutionDurationNanos, TimeUnit.NANOSECONDS);
    metrics.logReplayTimer().record(logReplayDurationNanos, TimeUnit.NANOSECONDS);
    metrics.logStoreTimer().record(logStoreDurationNanos, TimeUnit.NANOSECONDS);
}
```
Avoid recording metrics for phases that never started.
If we fail before `startCommandExecution()`/`startLogReplay()`/`startLogStore()` runs (e.g., lock acquisition timeout), these timers still `record(0, …)`, bumping the sample count and hiding that the phase never happened. Guard each record with `*_StartNanos != 0` so only completed phases contribute measurements, leaving lock failures to be represented solely by the lock timer.
Apply this diff:

```diff
@@
-        if (lockAcquired) {
-            metrics.lockAcquireSuccessTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
-        } else {
-            metrics.lockAcquireFailureTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
-        }
-        metrics.commandExecutionTimer().record(commandExecutionDurationNanos, TimeUnit.NANOSECONDS);
-        metrics.logReplayTimer().record(logReplayDurationNanos, TimeUnit.NANOSECONDS);
-        metrics.logStoreTimer().record(logStoreDurationNanos, TimeUnit.NANOSECONDS);
+        if (lockAcquisitionStartNanos != 0) {
+            if (lockAcquired) {
+                metrics.lockAcquireSuccessTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
+            } else {
+                metrics.lockAcquireFailureTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
+            }
+        }
+        if (commandExecutionStartNanos != 0) {
+            metrics.commandExecutionTimer().record(commandExecutionDurationNanos, TimeUnit.NANOSECONDS);
+        }
+        if (logReplayStartNanos != 0) {
+            metrics.logReplayTimer().record(logReplayDurationNanos, TimeUnit.NANOSECONDS);
+        }
+        if (logStoreStartNanos != 0) {
+            metrics.logStoreTimer().record(logStoreDurationNanos, TimeUnit.NANOSECONDS);
+        }
```

📝 Committable suggestion
```java
void record() {
    if (metrics == null) {
        return;
    }
    if (lockAcquisitionStartNanos != 0) {
        if (lockAcquired) {
            metrics.lockAcquireSuccessTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
        } else {
            metrics.lockAcquireFailureTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
        }
    }
    if (commandExecutionStartNanos != 0) {
        metrics.commandExecutionTimer().record(commandExecutionDurationNanos, TimeUnit.NANOSECONDS);
    }
    if (logReplayStartNanos != 0) {
        metrics.logReplayTimer().record(logReplayDurationNanos, TimeUnit.NANOSECONDS);
    }
    if (logStoreStartNanos != 0) {
        metrics.logStoreTimer().record(logStoreDurationNanos, TimeUnit.NANOSECONDS);
    }
}
```
👍 👍 👍
I'm not an expert of ZooKeeper but how about also collecting the lock release time since we collect all kinds of Replication metrics?
Line 973 in 6a58a9f:

```java
return () -> safeRelease(mtx);
```

...ain/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java
Actionable comments posted: 1
🧹 Nitpick comments (2)
server/src/main/java/com/linecorp/centraldogma/server/internal/api/ContentServiceV1.java (1)
214-220: Clarify the intent of the TODO comment.

The TODO comment indicates this debug logging is temporary and should be removed after debugging. Should this be removed before merging this PR, or is it intentional to keep it temporarily for production debugging?

Since this is debug-level logging, it won't appear in production logs unless explicitly enabled, which may justify keeping it temporarily. However, temporary code with TODOs can accumulate if not tracked properly.
server/src/main/java/com/linecorp/centraldogma/server/internal/api/converter/ChangesRequestConverter.java (1)
56-57: Consider conditional timing capture to avoid overhead when debug is disabled.

The start time is captured unconditionally but only used when debug logging is enabled (line 87). While the overhead of `System.nanoTime()` is minimal, you could optimize by checking `isDebugEnabled()` first.

Apply this diff if you want to avoid the overhead:

```diff
-        final long startTime = System.nanoTime();
-        try {
+        final long startTime;
+        if (logger.isDebugEnabled()) {
+            startTime = System.nanoTime();
+        }
         final JsonNode node = (JsonNode) delegate.convertRequest(ctx, request, JsonNode.class, null);
```

And update the finally block:

```diff
         } finally {
             if (logger.isDebugEnabled()) {
-                // TODO(ikhoon): Remove the logging after debugging.
-                logger.debug("{}: took {} to convert request to changes",
-                             ctx, TextFormatter.elapsed(startTime, System.nanoTime()));
+                final long endTime = System.nanoTime();
+                logger.debug("{}: took {} to convert request to changes",
+                             ctx, TextFormatter.elapsed(startTime, endTime));
             }
         }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- server/src/main/java/com/linecorp/centraldogma/server/internal/api/ContentServiceV1.java (3 hunks)
- server/src/main/java/com/linecorp/centraldogma/server/internal/api/converter/ChangesRequestConverter.java (2 hunks)
- server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationMetrics.java (1 hunk)
- server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimings.java (1 hunk)
- server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimings.java
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (1)
server/src/main/java/com/linecorp/centraldogma/server/command/RepositoryCommand.java (1)
RepositoryCommand(35-95)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: build-windows-latest-jdk-21
- GitHub Check: build-ubuntu-latest-jdk-11
- GitHub Check: docker-build
- GitHub Check: build-ubuntu-latest-jdk-17-coverage
- GitHub Check: build-ubuntu-latest-jdk-17-min-java-11
- GitHub Check: build-macos-latest-jdk-21
- GitHub Check: flaky-tests
- GitHub Check: build-ubuntu-latest-jdk-21-snapshot
- GitHub Check: lint
🔇 Additional comments (13)
server/src/main/java/com/linecorp/centraldogma/server/internal/api/ContentServiceV1.java (2)
43-44: LGTM! Standard SLF4J imports for logging infrastructure.

107-108: LGTM! Logger field follows standard SLF4J patterns.
server/src/main/java/com/linecorp/centraldogma/server/internal/api/converter/ChangesRequestConverter.java (5)
26-28: LGTM! The logging imports and `TextFormatter` are standard additions for timing instrumentation.
Also applies to: 35-35
47-48: LGTM! Standard logger declaration following conventions.

81-85: LGTM! The array processing logic is unchanged, just moved within the try block for timing instrumentation.
86-92: Clarify the intent of the TODO comment.

The TODO comment suggests removing this logging after debugging (line 88), but the PR's objective is to add permanent timing metrics. Please clarify:
- Is this debug logging temporary for initial rollout troubleshooting?
- Or should it be permanent instrumentation, in which case the TODO should be removed?
If it's temporary, consider tracking its removal with an issue. If it's permanent, the comment should be updated or removed.
64-79: All verification checks pass - code changes are sound and backward compatible.

The codebase already exercises single-change scenarios in tests (e.g., ZooKeeperCommandExecutorTest uses `ImmutableList.of(pushChange)` in multiple places), confirming the early-return optimization at line 77 is valid. The stricter validation for malformed input is a beneficial defensive improvement.

server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationMetrics.java (2)
86-98: LGTM! Using `projectName` as the sole equality determinant is appropriate, as each project should have exactly one `ReplicationMetrics` instance. The timer fields are derived from `projectName` and don't need to participate in equality checks.
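For illustration, identity keyed on the project name usually reduces to something like the following sketch (an assumed shape, not the PR's actual code):

```java
// Equality delegates entirely to projectName; the timers are derived from it
// and need not participate.
@Override
public boolean equals(Object o) {
    if (this == o) {
        return true;
    }
    if (!(o instanceof ReplicationMetrics)) {
        return false;
    }
    return projectName.equals(((ReplicationMetrics) o).projectName);
}

@Override
public int hashCode() {
    return projectName.hashCode();
}
```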
38-56: Verify the PR description against the implementation.

The review comment asserts that the PR description specifies four metrics, but I cannot access the actual PR description to confirm this claim. The implementation in ReplicationMetrics.java is internally consistent with seven properly defined and exposed timers, but without verifying the PR description's stated scope, I cannot determine whether:
- The implementation evolved beyond the initial PR scope (legitimate change)
- The review comment's characterization of the PR is inaccurate
- The PR description was updated but the review comment was not
Please confirm the PR description's intended metric set and whether the seven metrics in the current implementation align with the PR's stated objectives.
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (4)
912-969: Verify lock release timing instrumentation.

The lock acquisition timing is correctly instrumented (lines 914, 953), but the lock release timing is unclear. The `SafeCloseable` returned at line 968 invokes `safeRelease(mtx)`, yet I don't see corresponding `startLockRelease`/`endLockRelease` calls, despite `ReplicationMetrics` exposing a `lockReleaseTimer`.

If lock release timing is handled within `ReplicationTimings` or elsewhere, this is fine. Otherwise, consider whether release timing should be instrumented:

```java
return () -> {
    timings.startLockRelease();
    try {
        safeRelease(mtx);
    } finally {
        timings.endLockRelease();
    }
};
```
1082-1092: LGTM! Scoping metrics to `RepositoryCommand` is appropriate, as replication metrics are project-specific. Non-repository commands (e.g., `UpdateServerStatusCommand`) will receive a `ReplicationTimings` with null metrics, which should no-op gracefully.
1096-1118: LGTM! The timing lifecycle is well-structured:

- Create per-command timings (line 1107)
- Instrument execution phases in `blockingExecute` (line 1109)
- Record metrics in `finally` block (line 1113)

The `finally` block ensures metrics are always recorded, even on exceptions. Debug logging (line 1114) aids troubleshooting.
1120-1183: LGTM! The timing instrumentation comprehensively brackets all replication phases:

- Lock acquisition (via `safeLock`, line 1123)
- Log replay (lines 1131-1141)
- Command execution (lines 1143-1149)
- Log store (lines 1151-1171)

Each phase uses try-finally to guarantee timing endpoints are recorded. If an earlier phase fails, subsequent phases will record zero duration, which is acceptable and already acknowledged in past reviews.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- server/src/main/java/com/linecorp/centraldogma/server/internal/replication/DefaultReplicationTimings.java (1 hunk)
- server/src/main/java/com/linecorp/centraldogma/server/internal/replication/NoopReplicationTimings.java (1 hunk)
- server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimings.java (1 hunk)
- server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (7 hunks)
- server/src/test/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimingsTest.java (1 hunk)
🧰 Additional context used
🧬 Code graph analysis (2)
server/src/test/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimingsTest.java (2)
server/src/main/java/com/linecorp/centraldogma/server/CentralDogma.java (1)
CentralDogma(200-1392)
testing-internal/src/main/java/com/linecorp/centraldogma/testing/internal/CentralDogmaReplicationExtension.java (1)
CentralDogmaReplicationExtension(62-266)
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (1)
server/src/main/java/com/linecorp/centraldogma/server/command/RepositoryCommand.java (1)
RepositoryCommand(35-95)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: build-windows-latest-jdk-21
- GitHub Check: build-ubuntu-latest-jdk-17-min-java-11
- GitHub Check: build-ubuntu-latest-jdk-17-coverage
- GitHub Check: build-ubuntu-latest-jdk-21-snapshot
- GitHub Check: build-macos-latest-jdk-21
- GitHub Check: build-ubuntu-latest-jdk-11
- GitHub Check: flaky-tests
- GitHub Check: docker-build
- GitHub Check: lint
🔇 Additional comments (15)
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimings.java (1)
21-57: LGTM! Clean interface design with appropriate factory pattern.

The interface defines a clear contract for replication timing lifecycle events, and the factory method appropriately returns either a no-op or functional implementation based on whether metrics are provided.
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/NoopReplicationTimings.java (1)
19-65: LGTM! Appropriate enum singleton pattern for no-op implementation.

The no-op implementation correctly stubs all timing methods and provides a suitable text representation.
server/src/test/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimingsTest.java (3)
39-49: LGTM! Good test isolation with per-server meter registries.

Using individual `SimpleMeterRegistry` instances for each server enables clear verification of which servers emit replication metrics.
59-68: LGTM! Appropriate async assertion for metric population.

Using `await().untilAsserted` correctly handles the asynchronous nature of metric recording after replication events.
70-79: Ignore this review comment: the assertion logic is correct.

The `noneSatisfy` pattern with `assertThat(key).startsWith("replication.")` is semantically sound. In AssertJ, when an assertion inside the consumer throws (fails), that element is not satisfied. Thus:

- If a key starts with `"replication."` → assertion passes → element satisfied → test fails (correct; we want no replication metrics on non-primary servers)
- If a key doesn't start with `"replication."` → assertion fails → element not satisfied → contributes to `noneSatisfy` pass (correct)

The test correctly verifies that non-primary servers have no replication metrics. This is the intended behavior, not a reversed logic issue.

Likely an incorrect or invalid review comment.
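For readers unfamiliar with this AssertJ behavior, here is a small self-contained demonstration; the metric keys are made up for the demo:

```java
import static org.assertj.core.api.Assertions.assertThat;

import java.util.List;

class NoneSatisfyDemo {
    public static void main(String[] args) {
        // Passes: the startsWith assertion throws for every key, so no
        // element "satisfies" the consumer.
        assertThat(List.of("jvm.memory.used", "http.server.requests"))
                .noneSatisfy(key -> assertThat(key).startsWith("replication."));

        // Would fail: "replication.log.store" satisfies the consumer.
        // assertThat(List.of("replication.log.store"))
        //         .noneSatisfy(key -> assertThat(key).startsWith("replication."));
    }
}
```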
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/DefaultReplicationTimings.java (3)
25-48: LGTM! Well-structured timing state management.

The fields appropriately capture timing information for each replication phase, with dedicated flags to track completion status for command execution, log replay, and log store phases.
54-117: LGTM! Correct timing capture for all lifecycle phases.

Each start method captures `System.nanoTime()` appropriately, and end methods correctly calculate durations and update completion flags.
139-162: LGTM! Comprehensive timing summary formatting.

The `toText()` method appropriately returns a placeholder when incomplete and provides a detailed breakdown of all timing phases when complete, including total elapsed time.

server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (7)
95-95: LGTM! Appropriate import for repository command handling.

The import of `RepositoryCommand` aligns with the new timing subsystem that creates per-project metrics for repository commands.
131-131: LGTM! Consolidation from lock timers to comprehensive replication metrics.

Replacing the `lockAcquiredTimers` map with `replicationTimings` provides a unified metrics structure capturing multiple replication phases per project.
912-969: LGTM! Proper lock acquisition timing instrumentation.

The method correctly instruments lock acquisition timing by calling `startLockAcquisition()` before the lock attempt and `endLockAcquisition()` after completion, passing the success/failure status.
971-980: LGTM! Lock release timing captured with proper exception handling.

The try-finally block ensures `endLockRelease()` is always called even if the mutex release throws an exception.
1085-1095: LGTM! Appropriate per-project metrics creation.

Creating `ReplicationMetrics` per project name for `RepositoryCommand` instances enables granular performance tracking, while returning no-op timings for other command types avoids unnecessary overhead.
1099-1125: LGTM! Proper executor timing and metrics recording.

The timing lifecycle is correctly instrumented:

- Executor submission and queue latency captured
- Timings passed through to `blockingExecute()`
- Metrics recorded in finally block regardless of success/failure
- Debug logging provides visibility into timing breakdowns
1127-1190: LGTM! Comprehensive replication phase timing instrumentation.

All critical replication phases are properly instrumented with try-finally blocks:

- Log replay (lines 1138-1148)
- Command execution (lines 1150-1156)
- Log store (lines 1158-1178)

This ensures timing data is captured even if exceptions occur, providing accurate performance insights.
Actionable comments posted: 0
🧹 Nitpick comments (3)
server/src/test/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimingsTest.java (3)
53-59: Clarify the server indexing relationship.

The test uses `replica.servers().get(0)` to perform operations but retrieves metrics from `meterRegistryMap.get(1)`. While this appears to be correct (since `servers()` is 0-indexed and `serverId` is 1-based), the mixed indexing schemes can be confusing.

Consider adding a comment to clarify the relationship, or retrieve the registry using a more explicit approach:

```diff
+        // Note: servers() list is 0-indexed, but serverId in configureEach starts from 1,
+        // so servers().get(0) corresponds to serverId=1.
         final SimpleMeterRegistry meterRegistry0 = meterRegistryMap.get(1);
```

Or refactor to make the relationship more explicit by storing registries with a consistent key (e.g., using server instance reference or documenting the mapping).
63-63: Consider clarifying the incomplete tag pattern.

The metric pattern `"replication.lock.acquisition.percentile#value{acquired="` intentionally omits the tag value to match both `acquired=true` and `acquired=false` variants using `startsWith`. While functionally correct, this pattern appears incomplete and may confuse readers.

Consider adding a comment to clarify the intent:

```diff
+        // Match both acquired=true and acquired=false tags.
         assertMetricExists(metrics, "replication.lock.acquisition.percentile#value{acquired=");
```

Alternatively, you could check for both tag values explicitly:

```java
assertMetricExists(metrics, "replication.lock.acquisition.percentile#value{acquired=true}");
assertMetricExists(metrics, "replication.lock.acquisition.percentile#value{acquired=false}");
```
71-75: Consider improving assertion error messages.

The current implementation using `anySatisfy` will work correctly, but when it fails, the error message may not clearly indicate which metric name was expected.

Consider enhancing the assertion for better failure diagnostics:

```diff
 private static void assertMetricExists(Map<String, Double> metrics, String metricName) {
-    assertThat(metrics).anySatisfy((name, value) -> {
-        assertThat(name).startsWith(metricName);
-    });
+    assertThat(metrics.keySet())
+            .as("Expected to find metric starting with: %s", metricName)
+            .anyMatch(name -> name.startsWith(metricName));
 }
```

This provides a clearer error message showing the expected metric prefix when the assertion fails.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- server/src/test/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimingsTest.java (1 hunk)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/test/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimingsTest.java (2)
server/src/main/java/com/linecorp/centraldogma/server/CentralDogma.java (1)
CentralDogma(200-1392)
testing-internal/src/main/java/com/linecorp/centraldogma/testing/internal/CentralDogmaReplicationExtension.java (1)
CentralDogmaReplicationExtension(62-266)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: build-windows-latest-jdk-21
- GitHub Check: build-ubuntu-latest-jdk-17-coverage
- GitHub Check: build-macos-latest-jdk-21
- GitHub Check: build-ubuntu-latest-jdk-17-min-java-11
- GitHub Check: flaky-tests
- GitHub Check: build-ubuntu-latest-jdk-11
- GitHub Check: docker-build
- GitHub Check: build-ubuntu-latest-jdk-21-snapshot
- GitHub Check: lint
🔇 Additional comments (1)
server/src/test/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimingsTest.java (1)
62-67: No issues found. The test metrics align correctly with the ReplicationMetrics implementation.

All six metrics checked in lines 62-67 are properly created by ReplicationMetrics, including `replication.lock.release` (line 48 of ReplicationMetrics.java), which is intentionally part of the system. The test assertion at line 63 correctly uses a partial match pattern to verify both the success (`acquired=true`) and failure (`acquired=false`) variants of the `replication.lock.acquisition` metric.
```java
private <T> ReplicationTimings newReplicationTimings(Command<T> command) {
    ReplicationMetrics metrics = null;
    if (command instanceof RepositoryCommand) {
```
nit: If only `RepositoryCommand`s are recorded, I don't see a reason not to also record per repository. As-is it seems fine, though.
We used to record metrics per repository in MetricCollectingService, but it caused excessive memory usage and frequent GC, so I took a more conservative approach. If we later find the project-level metrics are insufficient, we may consider tagging repositories.
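To make the cardinality trade-off concrete: with Micrometer, each distinct tag combination materializes its own meter, so a repository tag multiplies the number of timers held in memory. A rough sketch of the difference, assuming `registry`, `projectName` and `repoName` are in scope:

```java
// Bounded cardinality: one timer per project.
final Timer perProject = Timer.builder("replication.command.execution")
                              .tag("project", projectName)
                              .register(registry);

// One timer per (project, repository) pair; with many repositories this
// multiplies the registered meters, the memory/GC pressure described above.
final Timer perRepository = Timer.builder("replication.command.execution")
                                 .tag("project", projectName)
                                 .tag("repository", repoName)
                                 .register(registry);
```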
minwoox left a comment:
👍
Motivation:

Currently, there are no metrics to identify which part of a `Command` execution is slow. To improve the write performance of Central Dogma, I propose adding four new metrics in addition to the existing lock acquisition metric.

Modifications:

- Add `ReplicationTimings` to record the timings.
- Add `ReplicationMetrics` to export the recorded metrics per project through Micrometer.

Result:

You can now use the following five metrics to measure the write performance of Central Dogma:

- `replication.executor.queue.latency`
- `replication.lock.acquisition`
- `replication.command.execution`
- `replication.log.replay`
- `replication.log.store`