
Conversation

Contributor

@ikhoon ikhoon commented Oct 1, 2025

Motivation:

Currently, there are no metrics to identify which part of a Command execution is slow. To improve the write performance of Central Dogma, I propose adding three new metrics in addition to the existing lock acquisition metric.

  • Command execution time
  • Log replay time
  • Log store time

Modifications:

  • Add ReplicationTimings to record the timings.
  • Add ReplicationMetrics to export the recorded metrics per project through Micrometer.

Result:

You can now use the following five metrics to measure the write performance of Central Dogma (a usage sketch follows the list).

  • replication.executor.queue.latency
  • replication.lock.acquisition
  • replication.command.execution
  • replication.log.replay
  • replication.log.store
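
For illustration, here is a minimal sketch of reading one of these timers from a Micrometer MeterRegistry. The "project" tag key is an assumption, not something this PR confirms; check the tags actually emitted by your deployment.

import java.util.concurrent.TimeUnit;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

// Hypothetical helper: looks up one of the timers listed above and prints
// basic statistics. The "project" tag key is assumed.
static void printCommandExecutionStats(MeterRegistry registry, String project) {
    final Timer timer = registry.get("replication.command.execution")
                                .tag("project", project)
                                .timer();
    System.out.printf("count=%d mean=%.2f ms max=%.2f ms%n",
                      timer.count(),
                      timer.mean(TimeUnit.MILLISECONDS),
                      timer.max(TimeUnit.MILLISECONDS));
}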

Summary by CodeRabbit

  • New Features
    • Improved replication observability with per-project timing metrics covering queue latency, lock acquisition (success/failure), lock release, command execution, log replay, and log storage; timings are recorded and logged end-to-end.
    • Added a no-op timing option for deployments that disable metric collection.
  • Tests
    • End-to-end multi-node test added to validate exposure of the new replication metrics in registries.

Motivation:

Currently, there are no metrics to identify which part of a `Command`
execution is slow. To improve the write performance of Central Dogma, I
propose adding three new metrics in addition to the existing lock
acquisition metric.

- Command execution time
- Log replay time
- Log store time

Modifications:

- Add `ReplicationTimings` to record the timings.
- Add `ReplicationMetrics` to export the recorded metrics per project
  through Micrometer.

Result:

You can now use the following four metrics to measure the write
performance of Central Dogma.
- `replication.lock.waiting`
- `replication.command.execution`
- `replication.log.replay`
- `replication.log.store`
@ikhoon ikhoon added this to the 0.78.0 milestone Oct 1, 2025

coderabbitai bot commented Oct 1, 2025

Walkthrough

Adds a replication timing subsystem (metrics, interface, noop and default implementations), instruments ZooKeeperCommandExecutor with timing hooks across executor queue, lock acquisition/release, log replay, command execution, and log store phases, and introduces tests validating recorded metrics.

Changes

Cohort / File(s) Summary
Timing metrics class
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationMetrics.java
New package-private class holding project-scoped Micrometer Timers for executor queue latency, lock acquire (success/failure), lock release, command execution, log replay, and log store.
Timing interface & factory
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimings.java
New public interface declaring lifecycle methods for timing phases and a static of(@Nullable ReplicationMetrics) factory that returns either a noop or default implementation.
Timing implementations
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/DefaultReplicationTimings.java, server/src/main/java/com/linecorp/centraldogma/server/internal/replication/NoopReplicationTimings.java
DefaultReplicationTimings: concrete package-private implementation that records phase start/end times (using System.nanoTime()), publishes durations to ReplicationMetrics, and formats a textual summary. NoopReplicationTimings: singleton no-op implementation.
Executor instrumentation
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java
Replaces per-project Timer map with Map<String, ReplicationMetrics> replicationTimings; adds newReplicationTimings(...); threads ReplicationTimings through doExecute/blockingExecute/safeLock/safeRelease; inserts timing hooks around executor submit/execution, lock acquisition/release, log replay, command execution, and log store; removes old Timer-based recording and related types.
Tests
server/src/test/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimingsTest.java
New integration test that boots a 3-node replication setup with per-server SimpleMeterRegistry, performs a push, and asserts presence of replication-prefixed percentile metrics (queue latency, lock acquire, lock release, command execution, log replay, log store).
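
Piecing the rows above together, the timing API plausibly has roughly this shape. This is a sketch, not the actual source: the phase method names are taken from the sequence diagram below, while the constructor and the noop singleton's constant name are assumptions.

import javax.annotation.Nullable;

public interface ReplicationTimings {

    // Factory from the table above: returns a no-op when metrics are disabled.
    static ReplicationTimings of(@Nullable ReplicationMetrics metrics) {
        return metrics != null ? new DefaultReplicationTimings(metrics)  // constructor assumed
                               : NoopReplicationTimings.INSTANCE;        // constant name assumed
    }

    void startLockAcquisition();
    void endLockAcquisition(boolean acquired);
    void startLogReplay();
    void endLogReplay();
    void startCommandExecution();
    void endCommandExecution();
    void startLogStore();
    void endLogStore();
    void startLockRelease();
    void endLockRelease();

    void record();   // publishes the collected durations to ReplicationMetrics
    String toText(); // formats the textual summary mentioned above
}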

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant ZKExec as ZooKeeperCommandExecutor
    participant Timings as ReplicationTimings
    participant Metrics as ReplicationMetrics
    participant Lock as InterProcessMutex

    Client->>ZKExec: doExecute(command)
    ZKExec->>ZKExec: timings = newReplicationTimings(command)
    ZKExec->>Timings: startExecutorSubmit()
    ZKExec->>Timings: startExecutorExecution()
    ZKExec->>ZKExec: blockingExecute(command, timings)

    alt acquire lock
        ZKExec->>Timings: startLockAcquisition(now)
        ZKExec->>Lock: acquire()
        Lock-->>ZKExec: acquired
        ZKExec->>Timings: endLockAcquisition(true)
    else fail to acquire
        ZKExec->>Timings: endLockAcquisition(false)
    end

    ZKExec->>Timings: startLogReplay()
    ZKExec->>ZKExec: replayLogs()
    ZKExec->>Timings: endLogReplay()

    ZKExec->>Timings: startCommandExecution()
    ZKExec->>ZKExec: command.execute()
    ZKExec->>Timings: endCommandExecution()

    ZKExec->>Timings: startLogStore()
    ZKExec->>ZKExec: storeLogs()
    ZKExec->>Timings: endLogStore()

    ZKExec->>Timings: startLockRelease()
    ZKExec->>Lock: release()
    ZKExec->>Timings: endLockRelease()

    ZKExec->>Timings: record()
    Timings->>Metrics: recordTimer(duration)  %% Metrics stores to Micrometer
    Metrics-->>Timings: ok
    ZKExec->>Timings: toText()

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Areas needing extra attention:
    • DefaultReplicationTimings.record() conditional logic and unit conversions.
    • ZooKeeperCommandExecutor refactor: method signature changes and correct propagation of timings into lock/replay/store paths.
    • Test: ensure meter registry wiring per-node matches production registration.


Suggested reviewers

  • trustin
  • jrhee17

Poem

🐰
I hop through queues and mutex gates,
I note each nano as it waits,
Replay, execute, and store with cheer,
Timings logged — both far and near,
Metrics hum; the rabbit claps its ears.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 2.90%, which is insufficient; the required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title check (✅ Passed): The title accurately summarizes the main change: adding timing metrics recording to ZooKeeperCommandExecutor for monitoring command execution performance.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dd34da0 and 6a58a9f.

📒 Files selected for processing (3)
  • server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationMetrics.java (1 hunks)
  • server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimings.java (1 hunks)
  • server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (1)
server/src/main/java/com/linecorp/centraldogma/server/command/RepositoryCommand.java (1)
  • RepositoryCommand (35-95)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: build-ubuntu-latest-jdk-17-coverage
  • GitHub Check: build-ubuntu-latest-jdk-11
  • GitHub Check: build-ubuntu-latest-jdk-17-min-java-11
  • GitHub Check: build-macos-latest-jdk-21
  • GitHub Check: build-windows-latest-jdk-21
  • GitHub Check: build-ubuntu-latest-jdk-21-snapshot
  • GitHub Check: flaky-tests
  • GitHub Check: lint

Comment on lines 77 to 90
void record() {
    if (metrics == null) {
        return;
    }

    if (lockAcquired) {
        metrics.lockAcquireSuccessTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
    } else {
        metrics.lockAcquireFailureTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
    }
    metrics.commandExecutionTimer().record(commandExecutionDurationNanos, TimeUnit.NANOSECONDS);
    metrics.logReplayTimer().record(logReplayDurationNanos, TimeUnit.NANOSECONDS);
    metrics.logStoreTimer().record(logStoreDurationNanos, TimeUnit.NANOSECONDS);
}

⚠️ Potential issue | 🟠 Major

Avoid recording metrics for phases that never started.

If we fail before startCommandExecution()/startLogReplay()/startLogStore() runs (e.g., lock acquisition timeout), these timers still record(0, …), bumping the sample count and hiding that the phase never happened. Guard each record with *_StartNanos != 0 so only completed phases contribute measurements, leaving lock failures to be represented solely by the lock timer.

Apply this diff:

@@
-        if (lockAcquired) {
-            metrics.lockAcquireSuccessTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
-        } else {
-            metrics.lockAcquireFailureTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
-        }
-        metrics.commandExecutionTimer().record(commandExecutionDurationNanos, TimeUnit.NANOSECONDS);
-        metrics.logReplayTimer().record(logReplayDurationNanos, TimeUnit.NANOSECONDS);
-        metrics.logStoreTimer().record(logStoreDurationNanos, TimeUnit.NANOSECONDS);
+        if (lockAcquisitionStartNanos != 0) {
+            if (lockAcquired) {
+                metrics.lockAcquireSuccessTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
+            } else {
+                metrics.lockAcquireFailureTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
+            }
+        }
+        if (commandExecutionStartNanos != 0) {
+            metrics.commandExecutionTimer().record(commandExecutionDurationNanos, TimeUnit.NANOSECONDS);
+        }
+        if (logReplayStartNanos != 0) {
+            metrics.logReplayTimer().record(logReplayDurationNanos, TimeUnit.NANOSECONDS);
+        }
+        if (logStoreStartNanos != 0) {
+            metrics.logStoreTimer().record(logStoreDurationNanos, TimeUnit.NANOSECONDS);
+        }
📝 Committable suggestion


Suggested change (before and after):

// Before:
void record() {
    if (metrics == null) {
        return;
    }
    if (lockAcquired) {
        metrics.lockAcquireSuccessTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
    } else {
        metrics.lockAcquireFailureTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
    }
    metrics.commandExecutionTimer().record(commandExecutionDurationNanos, TimeUnit.NANOSECONDS);
    metrics.logReplayTimer().record(logReplayDurationNanos, TimeUnit.NANOSECONDS);
    metrics.logStoreTimer().record(logStoreDurationNanos, TimeUnit.NANOSECONDS);
}

// After:
void record() {
    if (metrics == null) {
        return;
    }
    if (lockAcquisitionStartNanos != 0) {
        if (lockAcquired) {
            metrics.lockAcquireSuccessTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
        } else {
            metrics.lockAcquireFailureTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
        }
    }
    if (commandExecutionStartNanos != 0) {
        metrics.commandExecutionTimer().record(commandExecutionDurationNanos, TimeUnit.NANOSECONDS);
    }
    if (logReplayStartNanos != 0) {
        metrics.logReplayTimer().record(logReplayDurationNanos, TimeUnit.NANOSECONDS);
    }
    if (logStoreStartNanos != 0) {
        metrics.logStoreTimer().record(logStoreDurationNanos, TimeUnit.NANOSECONDS);
    }
}
🤖 Prompt for AI Agents
In
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimings.java
around lines 77 to 90, the record() method currently records zero-duration
metrics for phases that never started; update the method to only call each phase
timer.record(...) if that phase's start timestamp field (e.g.,
commandExecutionStartNanos, logReplayStartNanos, logStoreStartNanos) is non-zero
so that only completed phases contribute measurements; keep the existing lock
timer behavior (it should still record based on
lockAcquired/lockAcquisitionDurationNanos) and add simple guards like if
(commandExecutionStartNanos != 0) { ... } for each of the three phase timers.

Contributor

@minwoox minwoox left a comment


👍 👍 👍

I'm not a ZooKeeper expert, but how about also collecting the lock release time, since we collect all kinds of replication metrics?

@ikhoon ikhoon modified the milestones: 0.78.0, 0.79.0 Nov 12, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
server/src/main/java/com/linecorp/centraldogma/server/internal/api/ContentServiceV1.java (1)

214-220: Clarify the intent of the TODO comment.

The TODO comment indicates this debug logging is temporary and should be removed after debugging. Should this be removed before merging this PR, or is it intentional to keep it temporarily for production debugging?

Since this is debug-level logging, it won't appear in production logs unless explicitly enabled, which may justify keeping it temporarily. However, temporary code with TODOs can accumulate if not tracked properly.

server/src/main/java/com/linecorp/centraldogma/server/internal/api/converter/ChangesRequestConverter.java (1)

56-57: Consider conditional timing capture to avoid overhead when debug is disabled.

The start time is captured unconditionally but only used when debug logging is enabled (line 87). While the overhead of System.nanoTime() is minimal, you could optimize by checking isDebugEnabled() first.

Apply this diff if you want to avoid the overhead:

-        final long startTime = System.nanoTime();
-
+        final long startTime = logger.isDebugEnabled() ? System.nanoTime() : 0;
         try {
             final JsonNode node = (JsonNode) delegate.convertRequest(ctx, request, JsonNode.class, null);

And update the finally block:

         } finally {
             if (logger.isDebugEnabled()) {
-                // TODO(ikhoon): Remove the logging after debugging.
-                logger.debug("{}: took {} to convert request to changes",
-                             ctx, TextFormatter.elapsed(startTime, System.nanoTime()));
+                final long endTime = System.nanoTime();
+                logger.debug("{}: took {} to convert request to changes",
+                             ctx, TextFormatter.elapsed(startTime, endTime));
             }
         }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6a58a9f and 6c53dc9.

📒 Files selected for processing (5)
  • server/src/main/java/com/linecorp/centraldogma/server/internal/api/ContentServiceV1.java (3 hunks)
  • server/src/main/java/com/linecorp/centraldogma/server/internal/api/converter/ChangesRequestConverter.java (2 hunks)
  • server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationMetrics.java (1 hunks)
  • server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimings.java (1 hunks)
  • server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimings.java
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (1)
server/src/main/java/com/linecorp/centraldogma/server/command/RepositoryCommand.java (1)
  • RepositoryCommand (35-95)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: build-windows-latest-jdk-21
  • GitHub Check: build-ubuntu-latest-jdk-11
  • GitHub Check: docker-build
  • GitHub Check: build-ubuntu-latest-jdk-17-coverage
  • GitHub Check: build-ubuntu-latest-jdk-17-min-java-11
  • GitHub Check: build-macos-latest-jdk-21
  • GitHub Check: flaky-tests
  • GitHub Check: build-ubuntu-latest-jdk-21-snapshot
  • GitHub Check: lint
🔇 Additional comments (13)
server/src/main/java/com/linecorp/centraldogma/server/internal/api/ContentServiceV1.java (2)

43-44: LGTM!

Standard SLF4J imports for logging infrastructure.


107-108: LGTM!

Logger field follows standard SLF4J patterns.

server/src/main/java/com/linecorp/centraldogma/server/internal/api/converter/ChangesRequestConverter.java (5)

26-28: LGTM!

The logging imports and TextFormatter are standard additions for timing instrumentation.

Also applies to: 35-35


47-48: LGTM!

Standard logger declaration following conventions.


81-85: LGTM!

The array processing logic is unchanged, just moved within the try block for timing instrumentation.


86-92: Clarify the intent of the TODO comment.

The TODO comment suggests removing this logging after debugging (line 88), but the PR's objective is to add permanent timing metrics. Please clarify:

  1. Is this debug logging temporary for initial rollout troubleshooting?
  2. Or should it be permanent instrumentation, in which case the TODO should be removed?

If it's temporary, consider tracking its removal with an issue. If it's permanent, the comment should be updated or removed.


64-79: All verification checks pass - code changes are sound and backward compatible.

The codebase already exercises single-change scenarios in tests (e.g., ZooKeeperCommandExecutorTest uses ImmutableList.of(pushChange) in multiple places), confirming the early-return optimization at line 77 is valid. The stricter validation for malformed input is a beneficial defensive improvement.

server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationMetrics.java (2)

86-98: LGTM!

Using projectName as the sole equality determinant is appropriate, as each project should have exactly one ReplicationMetrics instance. The timer fields are derived from projectName and don't need to participate in equality checks.
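
As a sketch, projectName-based equality typically looks like this (field and class access assumed from the summary above):

@Override
public boolean equals(Object o) {
    if (this == o) {
        return true;
    }
    if (!(o instanceof ReplicationMetrics)) {
        return false;
    }
    // Timers are derived from projectName, so comparing it alone suffices.
    return projectName.equals(((ReplicationMetrics) o).projectName);
}

@Override
public int hashCode() {
    return projectName.hashCode();
}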


38-56: Verify the PR description against the implementation.

The review comment asserts that the PR description specifies four metrics, but I cannot access the actual PR description to confirm this claim. The implementation in ReplicationMetrics.java is internally consistent with seven properly defined and exposed timers, but without verifying the PR description's stated scope, I cannot determine whether:

  • The implementation evolved beyond the initial PR scope (legitimate change)
  • The review comment's characterization of the PR is inaccurate
  • The PR description was updated but the review comment was not

Please confirm the PR description's intended metric set and whether the seven metrics in the current implementation align with the PR's stated objectives.

server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (4)

912-969: Verify lock release timing instrumentation.

The lock acquisition timing is correctly instrumented (lines 914, 953), but the lock release timing is unclear. The SafeCloseable returned at line 968 invokes safeRelease(mtx), yet I don't see corresponding startLockRelease/endLockRelease calls, despite ReplicationMetrics exposing a lockReleaseTimer.

If lock release timing is handled within ReplicationTimings or elsewhere, this is fine. Otherwise, consider whether release timing should be instrumented:

return () -> {
    timings.startLockRelease();
    try {
        safeRelease(mtx);
    } finally {
        timings.endLockRelease();
    }
};

1082-1092: LGTM!

Scoping metrics to RepositoryCommand is appropriate, as replication metrics are project-specific. Non-repository commands (e.g., UpdateServerStatusCommand) will receive a ReplicationTimings with null metrics, which should no-op gracefully.


1096-1118: LGTM!

The timing lifecycle is well-structured:

  1. Create per-command timings (line 1107)
  2. Instrument execution phases in blockingExecute (line 1109)
  3. Record metrics in finally block (line 1113)

The finally block ensures metrics are always recorded, even on exceptions. Debug logging (line 1114) aids troubleshooting.


1120-1183: LGTM!

The timing instrumentation comprehensively brackets all replication phases:

  • Lock acquisition (via safeLock, line 1123)
  • Log replay (lines 1131-1141)
  • Command execution (lines 1143-1149)
  • Log store (lines 1151-1171)

Each phase uses try-finally to guarantee timing endpoints are recorded. If an earlier phase fails, subsequent phases will record zero duration, which is acceptable and already acknowledged in past reviews.
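
For reference, a sketch of the try-finally bracketing described here; replayLogs and its argument are hypothetical stand-ins:

timings.startLogReplay();
try {
    replayLogs(lastKnownRevision);  // hypothetical stand-in for the real replay call
} finally {
    timings.endLogReplay();         // runs even if replay throws, so the phase is always closed
}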

@ikhoon ikhoon modified the milestones: 0.79.0, 0.78.0 Nov 13, 2025
@ikhoon ikhoon requested review from jrhee17 and minwoox November 13, 2025 07:55

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6c53dc9 and f0b6a8f.

📒 Files selected for processing (5)
  • server/src/main/java/com/linecorp/centraldogma/server/internal/replication/DefaultReplicationTimings.java (1 hunks)
  • server/src/main/java/com/linecorp/centraldogma/server/internal/replication/NoopReplicationTimings.java (1 hunks)
  • server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimings.java (1 hunks)
  • server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (7 hunks)
  • server/src/test/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimingsTest.java (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
server/src/test/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimingsTest.java (2)
server/src/main/java/com/linecorp/centraldogma/server/CentralDogma.java (1)
  • CentralDogma (200-1392)
testing-internal/src/main/java/com/linecorp/centraldogma/testing/internal/CentralDogmaReplicationExtension.java (1)
  • CentralDogmaReplicationExtension (62-266)
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (1)
server/src/main/java/com/linecorp/centraldogma/server/command/RepositoryCommand.java (1)
  • RepositoryCommand (35-95)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: build-windows-latest-jdk-21
  • GitHub Check: build-ubuntu-latest-jdk-17-min-java-11
  • GitHub Check: build-ubuntu-latest-jdk-17-coverage
  • GitHub Check: build-ubuntu-latest-jdk-21-snapshot
  • GitHub Check: build-macos-latest-jdk-21
  • GitHub Check: build-ubuntu-latest-jdk-11
  • GitHub Check: flaky-tests
  • GitHub Check: docker-build
  • GitHub Check: lint
🔇 Additional comments (15)
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimings.java (1)

21-57: LGTM! Clean interface design with appropriate factory pattern.

The interface defines a clear contract for replication timing lifecycle events, and the factory method appropriately returns either a no-op or functional implementation based on whether metrics are provided.

server/src/main/java/com/linecorp/centraldogma/server/internal/replication/NoopReplicationTimings.java (1)

19-65: LGTM! Appropriate enum singleton pattern for no-op implementation.

The no-op implementation correctly stubs all timing methods and provides a suitable text representation.

server/src/test/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimingsTest.java (3)

39-49: LGTM! Good test isolation with per-server meter registries.

Using individual SimpleMeterRegistry instances for each server enables clear verification of which servers emit replication metrics.


59-68: LGTM! Appropriate async assertion for metric population.

Using await().untilAsserted correctly handles the asynchronous nature of metric recording after replication events.
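
For reference, that pattern looks roughly like the following; the meterRegistry variable and the prefix check are illustrative rather than quoted from the test:

import static org.assertj.core.api.Assertions.assertThat;
import static org.awaitility.Awaitility.await;

// Retries the assertion until it passes or Awaitility's timeout elapses,
// absorbing the gap between the push completing and the timers being recorded.
await().untilAsserted(() -> assertThat(meterRegistry.getMeters())
        .anySatisfy(meter -> assertThat(meter.getId().getName())
                .startsWith("replication.")));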


70-79: Ignore this review comment—the assertion logic is correct.

The noneSatisfy pattern with assertThat(key).startsWith("replication.") is semantically sound. In AssertJ, when an assertion inside the consumer throws (fails), that element is not satisfied. Thus:

  • If a key starts with "replication." → assertion passes → element satisfied → test fails (correct; we want no replication metrics on non-primary servers)
  • If a key doesn't start with "replication." → assertion fails → element not satisfied → contributes to noneSatisfy pass (correct)

The test correctly verifies that non-primary servers have no replication metrics. This is the intended behavior, not a reversed logic issue.

Likely an incorrect or invalid review comment.
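
A small, hypothetical illustration of those AssertJ semantics:

import static org.assertj.core.api.Assertions.assertThat;

import java.util.Map;

final Map<String, Double> nonPrimaryMetrics = Map.of(
        "jvm.memory.used#value", 123.0,
        "executor.queue.size#value", 0.0);

// Passes: the inner startsWith assertion fails for every entry, so no entry
// "satisfies" the consumer, which is exactly what noneSatisfy requires.
assertThat(nonPrimaryMetrics).noneSatisfy(
        (key, value) -> assertThat(key).startsWith("replication."));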

server/src/main/java/com/linecorp/centraldogma/server/internal/replication/DefaultReplicationTimings.java (3)

25-48: LGTM! Well-structured timing state management.

The fields appropriately capture timing information for each replication phase, with dedicated flags to track completion status for command execution, log replay, and log store phases.


54-117: LGTM! Correct timing capture for all lifecycle phases.

Each start method captures System.nanoTime() appropriately, and end methods correctly calculate durations and update completion flags.


139-162: LGTM! Comprehensive timing summary formatting.

The toText() method appropriately returns a placeholder when incomplete and provides a detailed breakdown of all timing phases when complete, including total elapsed time.

server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (7)

95-95: LGTM! Appropriate import for repository command handling.

The import of RepositoryCommand aligns with the new timing subsystem that creates metrics per repository.


131-131: LGTM! Consolidation from lock timers to comprehensive replication metrics.

Replacing the lockAcquiredTimers map with replicationTimings provides a unified metrics structure capturing multiple replication phases per project.


912-969: LGTM! Proper lock acquisition timing instrumentation.

The method correctly instruments lock acquisition timing by calling startLockAcquisition() before the lock attempt and endLockAcquisition() after completion, passing the success/failure status.


971-980: LGTM! Lock release timing captured with proper exception handling.

The try-finally block ensures endLockRelease() is always called even if the mutex release throws an exception.


1085-1095: LGTM! Appropriate per-repository metrics creation.

Creating ReplicationMetrics per project name for RepositoryCommand instances enables granular performance tracking, while returning no-op timings for other command types avoids unnecessary overhead.


1099-1125: LGTM! Proper executor timing and metrics recording.

The timing lifecycle is correctly instrumented:

  • Executor submission and queue latency captured
  • Timings passed through to blockingExecute()
  • Metrics recorded in finally block regardless of success/failure
  • Debug logging provides visibility into timing breakdowns

1127-1190: LGTM! Comprehensive replication phase timing instrumentation.

All critical replication phases are properly instrumented with try-finally blocks:

  • Log replay (lines 1138-1148)
  • Command execution (lines 1150-1156)
  • Log store (lines 1158-1178)

This ensures timing data is captured even if exceptions occur, providing accurate performance insights.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
server/src/test/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimingsTest.java (3)

53-59: Clarify the server indexing relationship.

The test uses replica.servers().get(0) to perform operations but retrieves metrics from meterRegistryMap.get(1). While this appears to be correct (since servers() is 0-indexed and serverId is 1-based), the mixed indexing schemes can be confusing.

Consider adding a comment to clarify the relationship, or retrieve the registry using a more explicit approach:

+        // Note: servers() list is 0-indexed, but serverId in configureEach starts from 1
+        // So servers().get(0) corresponds to serverId=1
         final SimpleMeterRegistry meterRegistry0 = meterRegistryMap.get(1);

Or refactor to make the relationship more explicit by storing registries with a consistent key (e.g., using server instance reference or documenting the mapping).


63-63: Consider clarifying the incomplete tag pattern.

The metric pattern "replication.lock.acquisition.percentile#value{acquired=" intentionally omits the tag value to match both acquired=true and acquired=false variants using startsWith. While functionally correct, this pattern appears incomplete and may confuse readers.

Consider adding a comment to clarify the intent:

+            // Match both acquired=true and acquired=false tags
             assertMetricExists(metrics, "replication.lock.acquisition.percentile#value{acquired=");

Alternatively, you could check for both tag values explicitly:

assertMetricExists(metrics, "replication.lock.acquisition.percentile#value{acquired=true}");
assertMetricExists(metrics, "replication.lock.acquisition.percentile#value{acquired=false}");

71-75: Consider improving assertion error messages.

The current implementation using anySatisfy will work correctly, but when it fails, the error message may not clearly indicate which metric name was expected.

Consider enhancing the assertion for better failure diagnostics:

 private static void assertMetricExists(Map<String, Double> metrics, String metricName) {
-    assertThat(metrics).anySatisfy((name, value) -> {
-        assertThat(name).startsWith(metricName);
-    });
+    assertThat(metrics.keySet())
+        .as("Expected to find metric starting with: %s", metricName)
+        .anyMatch(name -> name.startsWith(metricName));
 }

This provides a clearer error message showing the expected metric prefix when the assertion fails.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 92174e6 and f3eb9c8.

📒 Files selected for processing (1)
  • server/src/test/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimingsTest.java (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/test/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimingsTest.java (2)
server/src/main/java/com/linecorp/centraldogma/server/CentralDogma.java (1)
  • CentralDogma (200-1392)
testing-internal/src/main/java/com/linecorp/centraldogma/testing/internal/CentralDogmaReplicationExtension.java (1)
  • CentralDogmaReplicationExtension (62-266)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: build-windows-latest-jdk-21
  • GitHub Check: build-ubuntu-latest-jdk-17-coverage
  • GitHub Check: build-macos-latest-jdk-21
  • GitHub Check: build-ubuntu-latest-jdk-17-min-java-11
  • GitHub Check: flaky-tests
  • GitHub Check: build-ubuntu-latest-jdk-11
  • GitHub Check: docker-build
  • GitHub Check: build-ubuntu-latest-jdk-21-snapshot
  • GitHub Check: lint
🔇 Additional comments (1)
server/src/test/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimingsTest.java (1)

62-67: No issues found. The test metrics align correctly with the ReplicationMetrics implementation.

All six metrics checked in lines 62-67 are properly created by ReplicationMetrics, including replication.lock.release (line 48 of ReplicationMetrics.java), which is intentionally part of the system. The test assertion at line 63 correctly uses a partial match pattern to verify both the success (acquired=true) and failure (acquired=false) variants of the replication.lock.acquisition metric.


private <T> ReplicationTimings newReplicationTimings(Command<T> command) {
    ReplicationMetrics metrics = null;
    if (command instanceof RepositoryCommand) {
Contributor


nit: If only RepositoryCommands are recorded, I don't see a reason not to also record per repository. As-is seems fine, though.

Contributor Author


We used to record metrics per repository in MetricCollectingService, but it caused excessive memory usage and frequent GC, so I took a more conservative approach. If we later find the project-level metrics are insufficient, we may consider tagging repositories.
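
For context, a sketch of the trade-off in plain Micrometer; the tag keys are illustrative. Each additional tag value combination creates a separate Timer with its own histogram state, so repository tagging multiplies memory use by the repository count:

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

final MeterRegistry registry = new SimpleMeterRegistry();

// Project-level (the approach taken in this PR): one timer per project.
final Timer perProject = Timer.builder("replication.command.execution")
                              .tag("project", "myProject")
                              .register(registry);

// Repository-level (deferred): one timer per (project, repository) pair,
// so the timer count, and the GC pressure, grows with repositories.
final Timer perRepo = Timer.builder("replication.command.execution")
                           .tag("project", "myProject")
                           .tag("repository", "myRepo")  // hypothetical tag key
                           .register(registry);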

Contributor

@minwoox minwoox left a comment


👍
