
Conversation

Contributor

@ikhoon ikhoon commented Oct 1, 2025

Motivation:

Currently, there are no metrics to identify which part of a Command execution is slow. To improve the write performance of Central Dogma, I propose adding three new metrics in addition to the existing lock acquisition metric.

  • Command execution time
  • Log replay time
  • Log store time

Modifications:

  • Add ReplicationTimings to record the timings.
  • Add ReplicationMetrics to export the recorded metrics per project through Micrometer.

Result:

You can now use the following five metrics to measure the write performance of Central Dogma (a usage sketch follows the list).

  • replication.executor.queue.latency
  • replication.lock.acquisition
  • replication.command.execution
  • replication.log.replay
  • replication.log.store
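
For illustration, here is a minimal sketch of reading one of these timers from a Micrometer MeterRegistry. The "project" tag key is an assumption, not something this PR confirms; check the tags actually emitted by your deployment.

import java.util.concurrent.TimeUnit;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

// Hypothetical helper: looks up one of the timers listed above and prints
// basic statistics. The "project" tag key is assumed.
static void printCommandExecutionStats(MeterRegistry registry, String project) {
    final Timer timer = registry.get("replication.command.execution")
                                .tag("project", project)
                                .timer();
    System.out.printf("count=%d mean=%.2f ms max=%.2f ms%n",
                      timer.count(),
                      timer.mean(TimeUnit.MILLISECONDS),
                      timer.max(TimeUnit.MILLISECONDS));
}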

Summary by CodeRabbit

  • New Features
    • Improved replication observability with per-project timing metrics covering queue latency, lock acquisition (success/failure), lock release, command execution, log replay, and log storage; timings are recorded and logged end-to-end.
    • Added a no-op timing option for deployments that disable metric collection.
  • Tests
    • End-to-end multi-node test added to validate exposure of the new replication metrics in registries.

Motivation:

Currently, there are no metrics to identify which part of a `Command`
execution is slow. To improve the write performance of Central Dogma, I
propose adding three new metrics in addition to the existing lock
acquisition metric.

- Command execution time
- Log replay time
- Log store time

Modifications:

- Add `ReplicationTimings` to record the timings.
- Add `ReplicationMetrics` to export the recorded metrics per project
  through Micrometer.

Result:

You can now use the following four metrics to measure the write
performance of Central Dogma.
- `replication.lock.waiting`
- `replication.command.execution`
- `replication.log.replay`
- `replication.log.store`
@ikhoon ikhoon added this to the 0.78.0 milestone Oct 1, 2025

coderabbitai bot commented Oct 1, 2025

Walkthrough

Adds a replication timing subsystem (metrics, interface, noop and default implementations), instruments ZooKeeperCommandExecutor with timing hooks across executor queue, lock acquisition/release, log replay, command execution, and log store phases, and introduces tests validating recorded metrics.

Changes

Cohort / File(s) Summary
Timing metrics class
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationMetrics.java
New package-private class holding project-scoped Micrometer Timers for executor queue latency, lock acquire (success/failure), lock release, command execution, log replay, and log store.
Timing interface & factory
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimings.java
New public interface declaring lifecycle methods for timing phases and a static of(@Nullable ReplicationMetrics) factory that returns either a noop or default implementation.
Timing implementations
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/DefaultReplicationTimings.java, server/src/main/java/com/linecorp/centraldogma/server/internal/replication/NoopReplicationTimings.java
DefaultReplicationTimings: concrete package-private implementation that records phase start/end times (using System.nanoTime()), publishes durations to ReplicationMetrics, and formats a textual summary. NoopReplicationTimings: singleton no-op implementation.
Executor instrumentation
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java
Replaces per-project Timer map with Map<String, ReplicationMetrics> replicationTimings; adds newReplicationTimings(...); threads ReplicationTimings through doExecute/blockingExecute/safeLock/safeRelease; inserts timing hooks around executor submit/execution, lock acquisition/release, log replay, command execution, and log store; removes old Timer-based recording and related types.
Tests
server/src/test/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimingsTest.java
New integration test that boots a 3-node replication setup with per-server SimpleMeterRegistry, performs a push, and asserts presence of replication-prefixed percentile metrics (queue latency, lock acquire, lock release, command execution, log replay, log store).
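
Piecing the rows above together, the timing API plausibly has roughly this shape. This is a sketch, not the actual source: the phase method names are taken from the sequence diagram below, while the constructor and the noop singleton's constant name are assumptions.

import javax.annotation.Nullable;

public interface ReplicationTimings {

    // Factory from the table above: returns a no-op when metrics are disabled.
    static ReplicationTimings of(@Nullable ReplicationMetrics metrics) {
        return metrics != null ? new DefaultReplicationTimings(metrics)  // constructor assumed
                               : NoopReplicationTimings.INSTANCE;        // constant name assumed
    }

    void startLockAcquisition();
    void endLockAcquisition(boolean acquired);
    void startLogReplay();
    void endLogReplay();
    void startCommandExecution();
    void endCommandExecution();
    void startLogStore();
    void endLogStore();
    void startLockRelease();
    void endLockRelease();

    void record();   // publishes the collected durations to ReplicationMetrics
    String toText(); // formats the textual summary mentioned above
}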

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant ZKExec as ZooKeeperCommandExecutor
    participant Timings as ReplicationTimings
    participant Metrics as ReplicationMetrics
    participant Lock as InterProcessMutex

    Client->>ZKExec: doExecute(command)
    ZKExec->>ZKExec: timings = newReplicationTimings(command)
    ZKExec->>Timings: startExecutorSubmit()
    ZKExec->>Timings: startExecutorExecution()
    ZKExec->>ZKExec: blockingExecute(command, timings)

    alt acquire lock
        ZKExec->>Timings: startLockAcquisition(now)
        ZKExec->>Lock: acquire()
        Lock-->>ZKExec: acquired
        ZKExec->>Timings: endLockAcquisition(true)
    else fail to acquire
        ZKExec->>Timings: endLockAcquisition(false)
    end

    ZKExec->>Timings: startLogReplay()
    ZKExec->>ZKExec: replayLogs()
    ZKExec->>Timings: endLogReplay()

    ZKExec->>Timings: startCommandExecution()
    ZKExec->>ZKExec: command.execute()
    ZKExec->>Timings: endCommandExecution()

    ZKExec->>Timings: startLogStore()
    ZKExec->>ZKExec: storeLogs()
    ZKExec->>Timings: endLogStore()

    ZKExec->>Timings: startLockRelease()
    ZKExec->>Lock: release()
    ZKExec->>Timings: endLockRelease()

    ZKExec->>Timings: record()
    Timings->>Metrics: recordTimer(duration)  %% Metrics stores to Micrometer
    Metrics-->>Timings: ok
    ZKExec->>Timings: toText()

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Areas needing extra attention:
    • DefaultReplicationTimings.record() conditional logic and unit conversions.
    • ZooKeeperCommandExecutor refactor: method signature changes and correct propagation of timings into lock/replay/store paths.
    • Test: ensure meter registry wiring per-node matches production registration.


Suggested reviewers

  • trustin
  • jrhee17

Poem

🐰
I hop through queues and mutex gates,
I note each nano as it waits,
Replay, execute, and store with cheer,
Timings logged — both far and near,
Metrics hum; the rabbit claps its ears.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 2.90%, which is insufficient; the required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title check (✅ Passed): The title accurately summarizes the main change: adding timing metrics recording to ZooKeeperCommandExecutor for monitoring command execution performance.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dd34da0 and 6a58a9f.

📒 Files selected for processing (3)
  • server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationMetrics.java (1 hunks)
  • server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimings.java (1 hunks)
  • server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (1)
server/src/main/java/com/linecorp/centraldogma/server/command/RepositoryCommand.java (1)
  • RepositoryCommand (35-95)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: build-ubuntu-latest-jdk-17-coverage
  • GitHub Check: build-ubuntu-latest-jdk-11
  • GitHub Check: build-ubuntu-latest-jdk-17-min-java-11
  • GitHub Check: build-macos-latest-jdk-21
  • GitHub Check: build-windows-latest-jdk-21
  • GitHub Check: build-ubuntu-latest-jdk-21-snapshot
  • GitHub Check: flaky-tests
  • GitHub Check: lint

Comment on lines 77 to 90
void record() {
    if (metrics == null) {
        return;
    }

    if (lockAcquired) {
        metrics.lockAcquireSuccessTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
    } else {
        metrics.lockAcquireFailureTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
    }
    metrics.commandExecutionTimer().record(commandExecutionDurationNanos, TimeUnit.NANOSECONDS);
    metrics.logReplayTimer().record(logReplayDurationNanos, TimeUnit.NANOSECONDS);
    metrics.logStoreTimer().record(logStoreDurationNanos, TimeUnit.NANOSECONDS);
}

⚠️ Potential issue | 🟠 Major

Avoid recording metrics for phases that never started.

If we fail before startCommandExecution()/startLogReplay()/startLogStore() runs (e.g., lock acquisition timeout), these timers still record(0, …), bumping the sample count and hiding that the phase never happened. Guard each record with *_StartNanos != 0 so only completed phases contribute measurements, leaving lock failures to be represented solely by the lock timer.

Apply this diff:

@@
-        if (lockAcquired) {
-            metrics.lockAcquireSuccessTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
-        } else {
-            metrics.lockAcquireFailureTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
-        }
-        metrics.commandExecutionTimer().record(commandExecutionDurationNanos, TimeUnit.NANOSECONDS);
-        metrics.logReplayTimer().record(logReplayDurationNanos, TimeUnit.NANOSECONDS);
-        metrics.logStoreTimer().record(logStoreDurationNanos, TimeUnit.NANOSECONDS);
+        if (lockAcquisitionStartNanos != 0) {
+            if (lockAcquired) {
+                metrics.lockAcquireSuccessTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
+            } else {
+                metrics.lockAcquireFailureTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
+            }
+        }
+        if (commandExecutionStartNanos != 0) {
+            metrics.commandExecutionTimer().record(commandExecutionDurationNanos, TimeUnit.NANOSECONDS);
+        }
+        if (logReplayStartNanos != 0) {
+            metrics.logReplayTimer().record(logReplayDurationNanos, TimeUnit.NANOSECONDS);
+        }
+        if (logStoreStartNanos != 0) {
+            metrics.logStoreTimer().record(logStoreDurationNanos, TimeUnit.NANOSECONDS);
+        }
📝 Committable suggestion


Suggested change (before and after):

// Before:
void record() {
    if (metrics == null) {
        return;
    }
    if (lockAcquired) {
        metrics.lockAcquireSuccessTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
    } else {
        metrics.lockAcquireFailureTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
    }
    metrics.commandExecutionTimer().record(commandExecutionDurationNanos, TimeUnit.NANOSECONDS);
    metrics.logReplayTimer().record(logReplayDurationNanos, TimeUnit.NANOSECONDS);
    metrics.logStoreTimer().record(logStoreDurationNanos, TimeUnit.NANOSECONDS);
}

// After:
void record() {
    if (metrics == null) {
        return;
    }
    if (lockAcquisitionStartNanos != 0) {
        if (lockAcquired) {
            metrics.lockAcquireSuccessTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
        } else {
            metrics.lockAcquireFailureTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
        }
    }
    if (commandExecutionStartNanos != 0) {
        metrics.commandExecutionTimer().record(commandExecutionDurationNanos, TimeUnit.NANOSECONDS);
    }
    if (logReplayStartNanos != 0) {
        metrics.logReplayTimer().record(logReplayDurationNanos, TimeUnit.NANOSECONDS);
    }
    if (logStoreStartNanos != 0) {
        metrics.logStoreTimer().record(logStoreDurationNanos, TimeUnit.NANOSECONDS);
    }
}
🤖 Prompt for AI Agents
In
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimings.java
around lines 77 to 90, the record() method currently records zero-duration
metrics for phases that never started; update the method to only call each phase
timer.record(...) if that phase's start timestamp field (e.g.,
commandExecutionStartNanos, logReplayStartNanos, logStoreStartNanos) is non-zero
so that only completed phases contribute measurements; keep the existing lock
timer behavior (it should still record based on
lockAcquired/lockAcquisitionDurationNanos) and add simple guards like if
(commandExecutionStartNanos != 0) { ... } for each of the three phase timers.

Contributor

@minwoox minwoox left a comment


👍 👍 👍

I'm not a ZooKeeper expert, but how about also collecting the lock release time, since we collect all kinds of replication metrics?

@ikhoon ikhoon modified the milestones: 0.78.0, 0.79.0 Nov 12, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
server/src/main/java/com/linecorp/centraldogma/server/internal/api/ContentServiceV1.java (1)

214-220: Clarify the intent of the TODO comment.

The TODO comment indicates this debug logging is temporary and should be removed after debugging. Should this be removed before merging this PR, or is it intentional to keep it temporarily for production debugging?

Since this is debug-level logging, it won't appear in production logs unless explicitly enabled, which may justify keeping it temporarily. However, temporary code with TODOs can accumulate if not tracked properly.

server/src/main/java/com/linecorp/centraldogma/server/internal/api/converter/ChangesRequestConverter.java (1)

56-57: Consider conditional timing capture to avoid overhead when debug is disabled.

The start time is captured unconditionally but only used when debug logging is enabled (line 87). While the overhead of System.nanoTime() is minimal, you could optimize by checking isDebugEnabled() first.

Apply this diff if you want to avoid the overhead:

-        final long startTime = System.nanoTime();
-
+        final long startTime = logger.isDebugEnabled() ? System.nanoTime() : 0;
         try {
             final JsonNode node = (JsonNode) delegate.convertRequest(ctx, request, JsonNode.class, null);

And update the finally block:

         } finally {
             if (logger.isDebugEnabled()) {
-                // TODO(ikhoon): Remove the logging after debugging.
-                logger.debug("{}: took {} to convert request to changes",
-                             ctx, TextFormatter.elapsed(startTime, System.nanoTime()));
+                final long endTime = System.nanoTime();
+                logger.debug("{}: took {} to convert request to changes",
+                             ctx, TextFormatter.elapsed(startTime, endTime));
             }
         }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6a58a9f and 6c53dc9.

📒 Files selected for processing (5)
  • server/src/main/java/com/linecorp/centraldogma/server/internal/api/ContentServiceV1.java (3 hunks)
  • server/src/main/java/com/linecorp/centraldogma/server/internal/api/converter/ChangesRequestConverter.java (2 hunks)
  • server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationMetrics.java (1 hunks)
  • server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimings.java (1 hunks)
  • server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimings.java
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (1)
server/src/main/java/com/linecorp/centraldogma/server/command/RepositoryCommand.java (1)
  • RepositoryCommand (35-95)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: build-windows-latest-jdk-21
  • GitHub Check: build-ubuntu-latest-jdk-11
  • GitHub Check: docker-build
  • GitHub Check: build-ubuntu-latest-jdk-17-coverage
  • GitHub Check: build-ubuntu-latest-jdk-17-min-java-11
  • GitHub Check: build-macos-latest-jdk-21
  • GitHub Check: flaky-tests
  • GitHub Check: build-ubuntu-latest-jdk-21-snapshot
  • GitHub Check: lint
🔇 Additional comments (13)
server/src/main/java/com/linecorp/centraldogma/server/internal/api/ContentServiceV1.java (2)

43-44: LGTM!

Standard SLF4J imports for logging infrastructure.


107-108: LGTM!

Logger field follows standard SLF4J patterns.

server/src/main/java/com/linecorp/centraldogma/server/internal/api/converter/ChangesRequestConverter.java (5)

26-28: LGTM!

The logging imports and TextFormatter are standard additions for timing instrumentation.

Also applies to: 35-35


47-48: LGTM!

Standard logger declaration following conventions.


81-85: LGTM!

The array processing logic is unchanged, just moved within the try block for timing instrumentation.


86-92: Clarify the intent of the TODO comment.

The TODO comment suggests removing this logging after debugging (line 88), but the PR's objective is to add permanent timing metrics. Please clarify:

  1. Is this debug logging temporary for initial rollout troubleshooting?
  2. Or should it be permanent instrumentation, in which case the TODO should be removed?

If it's temporary, consider tracking its removal with an issue. If it's permanent, the comment should be updated or removed.


64-79: All verification checks pass - code changes are sound and backward compatible.

The codebase already exercises single-change scenarios in tests (e.g., ZooKeeperCommandExecutorTest uses ImmutableList.of(pushChange) in multiple places), confirming the early-return optimization at line 77 is valid. The stricter validation for malformed input is a beneficial defensive improvement.

server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationMetrics.java (2)

86-98: LGTM!

Using projectName as the sole equality determinant is appropriate, as each project should have exactly one ReplicationMetrics instance. The timer fields are derived from projectName and don't need to participate in equality checks.
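
As a sketch, projectName-based equality typically looks like this (field and class access assumed from the summary above):

@Override
public boolean equals(Object o) {
    if (this == o) {
        return true;
    }
    if (!(o instanceof ReplicationMetrics)) {
        return false;
    }
    // Timers are derived from projectName, so comparing it alone suffices.
    return projectName.equals(((ReplicationMetrics) o).projectName);
}

@Override
public int hashCode() {
    return projectName.hashCode();
}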


38-56: Verify the PR description against the implementation.

The review comment asserts that the PR description specifies four metrics, but I cannot access the actual PR description to confirm this claim. The implementation in ReplicationMetrics.java is internally consistent with seven properly defined and exposed timers, but without verifying the PR description's stated scope, I cannot determine whether:

  • The implementation evolved beyond the initial PR scope (legitimate change)
  • The review comment's characterization of the PR is inaccurate
  • The PR description was updated but the review comment was not

Please confirm the PR description's intended metric set and whether the seven metrics in the current implementation align with the PR's stated objectives.

server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (4)

912-969: Verify lock release timing instrumentation.

The lock acquisition timing is correctly instrumented (lines 914, 953), but the lock release timing is unclear. The SafeCloseable returned at line 968 invokes safeRelease(mtx), yet I don't see corresponding startLockRelease/endLockRelease calls, despite ReplicationMetrics exposing a lockReleaseTimer.

If lock release timing is handled within ReplicationTimings or elsewhere, this is fine. Otherwise, consider whether release timing should be instrumented:

return () -> {
    timings.startLockRelease();
    try {
        safeRelease(mtx);
    } finally {
        timings.endLockRelease();
    }
};

1082-1092: LGTM!

Scoping metrics to RepositoryCommand is appropriate, as replication metrics are project-specific. Non-repository commands (e.g., UpdateServerStatusCommand) will receive a ReplicationTimings with null metrics, which should no-op gracefully.


1096-1118: LGTM!

The timing lifecycle is well-structured:

  1. Create per-command timings (line 1107)
  2. Instrument execution phases in blockingExecute (line 1109)
  3. Record metrics in finally block (line 1113)

The finally block ensures metrics are always recorded, even on exceptions. Debug logging (line 1114) aids troubleshooting.


1120-1183: LGTM!

The timing instrumentation comprehensively brackets all replication phases:

  • Lock acquisition (via safeLock, line 1123)
  • Log replay (lines 1131-1141)
  • Command execution (lines 1143-1149)
  • Log store (lines 1151-1171)

Each phase uses try-finally to guarantee timing endpoints are recorded. If an earlier phase fails, subsequent phases will record zero duration, which is acceptable and already acknowledged in past reviews.
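
For reference, a sketch of the try-finally bracketing described here; replayLogs and its argument are hypothetical stand-ins:

timings.startLogReplay();
try {
    replayLogs(lastKnownRevision);  // hypothetical stand-in for the real replay call
} finally {
    timings.endLogReplay();         // runs even if replay throws, so the phase is always closed
}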

@ikhoon ikhoon modified the milestones: 0.79.0, 0.78.0 Nov 13, 2025
@ikhoon ikhoon requested review from jrhee17 and minwoox November 13, 2025 07:55

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6c53dc9 and f0b6a8f.

📒 Files selected for processing (5)
  • server/src/main/java/com/linecorp/centraldogma/server/internal/replication/DefaultReplicationTimings.java (1 hunks)
  • server/src/main/java/com/linecorp/centraldogma/server/internal/replication/NoopReplicationTimings.java (1 hunks)
  • server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimings.java (1 hunks)
  • server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (7 hunks)
  • server/src/test/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimingsTest.java (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
server/src/test/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimingsTest.java (2)
server/src/main/java/com/linecorp/centraldogma/server/CentralDogma.java (1)
  • CentralDogma (200-1392)
testing-internal/src/main/java/com/linecorp/centraldogma/testing/internal/CentralDogmaReplicationExtension.java (1)
  • CentralDogmaReplicationExtension (62-266)
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (1)
server/src/main/java/com/linecorp/centraldogma/server/command/RepositoryCommand.java (1)
  • RepositoryCommand (35-95)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: build-windows-latest-jdk-21
  • GitHub Check: build-ubuntu-latest-jdk-17-min-java-11
  • GitHub Check: build-ubuntu-latest-jdk-17-coverage
  • GitHub Check: build-ubuntu-latest-jdk-21-snapshot
  • GitHub Check: build-macos-latest-jdk-21
  • GitHub Check: build-ubuntu-latest-jdk-11
  • GitHub Check: flaky-tests
  • GitHub Check: docker-build
  • GitHub Check: lint
🔇 Additional comments (15)
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimings.java (1)

21-57: LGTM! Clean interface design with appropriate factory pattern.

The interface defines a clear contract for replication timing lifecycle events, and the factory method appropriately returns either a no-op or functional implementation based on whether metrics are provided.

server/src/main/java/com/linecorp/centraldogma/server/internal/replication/NoopReplicationTimings.java (1)

19-65: LGTM! Appropriate enum singleton pattern for no-op implementation.

The no-op implementation correctly stubs all timing methods and provides a suitable text representation.

server/src/test/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimingsTest.java (3)

39-49: LGTM! Good test isolation with per-server meter registries.

Using individual SimpleMeterRegistry instances for each server enables clear verification of which servers emit replication metrics.


59-68: LGTM! Appropriate async assertion for metric population.

Using await().untilAsserted correctly handles the asynchronous nature of metric recording after replication events.
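
For reference, that pattern looks roughly like the following; the meterRegistry variable and the prefix check are illustrative rather than quoted from the test:

import static org.assertj.core.api.Assertions.assertThat;
import static org.awaitility.Awaitility.await;

// Retries the assertion until it passes or Awaitility's timeout elapses,
// absorbing the gap between the push completing and the timers being recorded.
await().untilAsserted(() -> assertThat(meterRegistry.getMeters())
        .anySatisfy(meter -> assertThat(meter.getId().getName())
                .startsWith("replication.")));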


70-79: Ignore this review comment—the assertion logic is correct.

The noneSatisfy pattern with assertThat(key).startsWith("replication.") is semantically sound. In AssertJ, when an assertion inside the consumer throws (fails), that element is not satisfied. Thus:

  • If a key starts with "replication." → assertion passes → element satisfied → test fails (correct; we want no replication metrics on non-primary servers)
  • If a key doesn't start with "replication." → assertion fails → element not satisfied → contributes to noneSatisfy pass (correct)

The test correctly verifies that non-primary servers have no replication metrics. This is the intended behavior, not a reversed logic issue.

Likely an incorrect or invalid review comment.
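
A small, hypothetical illustration of those AssertJ semantics:

import static org.assertj.core.api.Assertions.assertThat;

import java.util.Map;

final Map<String, Double> nonPrimaryMetrics = Map.of(
        "jvm.memory.used#value", 123.0,
        "executor.queue.size#value", 0.0);

// Passes: the inner startsWith assertion fails for every entry, so no entry
// "satisfies" the consumer, which is exactly what noneSatisfy requires.
assertThat(nonPrimaryMetrics).noneSatisfy(
        (key, value) -> assertThat(key).startsWith("replication."));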

server/src/main/java/com/linecorp/centraldogma/server/internal/replication/DefaultReplicationTimings.java (3)

25-48: LGTM! Well-structured timing state management.

The fields appropriately capture timing information for each replication phase, with dedicated flags to track completion status for command execution, log replay, and log store phases.


54-117: LGTM! Correct timing capture for all lifecycle phases.

Each start method captures System.nanoTime() appropriately, and end methods correctly calculate durations and update completion flags.


139-162: LGTM! Comprehensive timing summary formatting.

The toText() method appropriately returns a placeholder when incomplete and provides a detailed breakdown of all timing phases when complete, including total elapsed time.

server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (7)

95-95: LGTM! Appropriate import for repository command handling.

The import of RepositoryCommand aligns with the new timing subsystem that creates metrics per repository.


131-131: LGTM! Consolidation from lock timers to comprehensive replication metrics.

Replacing the lockAcquiredTimers map with replicationTimings provides a unified metrics structure capturing multiple replication phases per project.


912-969: LGTM! Proper lock acquisition timing instrumentation.

The method correctly instruments lock acquisition timing by calling startLockAcquisition() before the lock attempt and endLockAcquisition() after completion, passing the success/failure status.


971-980: LGTM! Lock release timing captured with proper exception handling.

The try-finally block ensures endLockRelease() is always called even if the mutex release throws an exception.


1085-1095: LGTM! Appropriate per-repository metrics creation.

Creating ReplicationMetrics per project name for RepositoryCommand instances enables granular performance tracking, while returning no-op timings for other command types avoids unnecessary overhead.


1099-1125: LGTM! Proper executor timing and metrics recording.

The timing lifecycle is correctly instrumented:

  • Executor submission and queue latency captured
  • Timings passed through to blockingExecute()
  • Metrics recorded in finally block regardless of success/failure
  • Debug logging provides visibility into timing breakdowns

1127-1190: LGTM! Comprehensive replication phase timing instrumentation.

All critical replication phases are properly instrumented with try-finally blocks:

  • Log replay (lines 1138-1148)
  • Command execution (lines 1150-1156)
  • Log store (lines 1158-1178)

This ensures timing data is captured even if exceptions occur, providing accurate performance insights.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
server/src/test/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimingsTest.java (3)

53-59: Clarify the server indexing relationship.

The test uses replica.servers().get(0) to perform operations but retrieves metrics from meterRegistryMap.get(1). While this appears to be correct (since servers() is 0-indexed and serverId is 1-based), the mixed indexing schemes can be confusing.

Consider adding a comment to clarify the relationship, or retrieve the registry using a more explicit approach:

+        // Note: servers() list is 0-indexed, but serverId in configureEach starts from 1
+        // So servers().get(0) corresponds to serverId=1
         final SimpleMeterRegistry meterRegistry0 = meterRegistryMap.get(1);

Or refactor to make the relationship more explicit by storing registries with a consistent key (e.g., using server instance reference or documenting the mapping).


63-63: Consider clarifying the incomplete tag pattern.

The metric pattern "replication.lock.acquisition.percentile#value{acquired=" intentionally omits the tag value to match both acquired=true and acquired=false variants using startsWith. While functionally correct, this pattern appears incomplete and may confuse readers.

Consider adding a comment to clarify the intent:

+            // Match both acquired=true and acquired=false tags
             assertMetricExists(metrics, "replication.lock.acquisition.percentile#value{acquired=");

Alternatively, you could check for both tag values explicitly:

assertMetricExists(metrics, "replication.lock.acquisition.percentile#value{acquired=true}");
assertMetricExists(metrics, "replication.lock.acquisition.percentile#value{acquired=false}");

71-75: Consider improving assertion error messages.

The current implementation using anySatisfy will work correctly, but when it fails, the error message may not clearly indicate which metric name was expected.

Consider enhancing the assertion for better failure diagnostics:

 private static void assertMetricExists(Map<String, Double> metrics, String metricName) {
-    assertThat(metrics).anySatisfy((name, value) -> {
-        assertThat(name).startsWith(metricName);
-    });
+    assertThat(metrics.keySet())
+        .as("Expected to find metric starting with: %s", metricName)
+        .anyMatch(name -> name.startsWith(metricName));
 }

This provides a clearer error message showing the expected metric prefix when the assertion fails.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 92174e6 and f3eb9c8.

📒 Files selected for processing (1)
  • server/src/test/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimingsTest.java (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/test/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimingsTest.java (2)
server/src/main/java/com/linecorp/centraldogma/server/CentralDogma.java (1)
  • CentralDogma (200-1392)
testing-internal/src/main/java/com/linecorp/centraldogma/testing/internal/CentralDogmaReplicationExtension.java (1)
  • CentralDogmaReplicationExtension (62-266)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: build-windows-latest-jdk-21
  • GitHub Check: build-ubuntu-latest-jdk-17-coverage
  • GitHub Check: build-macos-latest-jdk-21
  • GitHub Check: build-ubuntu-latest-jdk-17-min-java-11
  • GitHub Check: flaky-tests
  • GitHub Check: build-ubuntu-latest-jdk-11
  • GitHub Check: docker-build
  • GitHub Check: build-ubuntu-latest-jdk-21-snapshot
  • GitHub Check: lint
🔇 Additional comments (1)
server/src/test/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimingsTest.java (1)

62-67: No issues found. The test metrics align correctly with the ReplicationMetrics implementation.

All six metrics checked in lines 62-67 are properly created by ReplicationMetrics, including replication.lock.release (line 48 of ReplicationMetrics.java), which is intentionally part of the system. The test assertion at line 63 correctly uses a partial match pattern to verify both the success (acquired=true) and failure (acquired=false) variants of the replication.lock.acquisition metric.


private <T> ReplicationTimings newReplicationTimings(Command<T> command) {
    ReplicationMetrics metrics = null;
    if (command instanceof RepositoryCommand) {
Contributor


nit: If only RepositoryCommands are recorded, I don't see a reason not to also record per repository. As-is seems fine, though.

Contributor Author


We used to record metrics per repository in MetricCollectingService, but it caused excessive memory usage and frequent GC, so I took a more conservative approach. If we later find the project-level metrics are insufficient, we may consider tagging repositories.
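
For context, a sketch of the trade-off in plain Micrometer; the tag keys are illustrative. Each additional tag value combination creates a separate Timer with its own histogram state, so repository tagging multiplies memory use by the repository count:

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

final MeterRegistry registry = new SimpleMeterRegistry();

// Project-level (the approach taken in this PR): one timer per project.
final Timer perProject = Timer.builder("replication.command.execution")
                              .tag("project", "myProject")
                              .register(registry);

// Repository-level (deferred): one timer per (project, repository) pair,
// so the timer count, and the GC pressure, grows with repositories.
final Timer perRepo = Timer.builder("replication.command.execution")
                           .tag("project", "myProject")
                           .tag("repository", "myRepo")  // hypothetical tag key
                           .register(registry);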

Contributor

@minwoox minwoox left a comment


👍
