
Conversation

@everton-dematos (Contributor) commented Jan 13, 2026

Description of Changes

This PR introduces a clock-jump recovery mechanism for Ghaf logging that handles manual or abrupt real-time clock changes, which can otherwise disrupt journald ordering and Alloy log shipping. It resolves the bug described at https://jira.tii.ae/browse/SSRCSP-7772. Summary of modifications:

  • Add ghaf.logging.recovery options and clock-jump watcher + recover oneshot services.
  • Ensure alloy.service is ordered after/requires systemd-journald on client and server.
  • Implementation is centralized in modules/common/logging/common.nix and reusable across all VMs.
  • Enabled by default only for admin-vm, since it aggregates and forwards the system logs. It can be enabled for other VMs with different parameters (e.g., thresholdSeconds, intervalSeconds).
  • Server pipeline: route journald through loki.process, drop entries older than 168h, and align the WAL max_segment_age with the same 168h retention. This matches Grafana's 7-day (168h) default policy.
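The watcher's core idea can be sketched as follows. This is an illustrative sketch, not the actual Ghaf implementation: a real-time clock jump shows up as a change in the difference between the real-time and monotonic clocks, since the monotonic clock is unaffected by `date -s` or NTP step corrections. The function name and threshold value below are hypothetical.

```shell
#!/bin/sh
# Illustrative sketch (not the Ghaf implementation): a jump is flagged when
# (realtime - monotonic) changes by more than a threshold between two polls.

# jump_detected BASELINE_OFFSET CURRENT_OFFSET THRESHOLD_SECONDS
jump_detected() {
  d=$(( $2 - $1 ))
  [ "$d" -lt 0 ] && d=$(( -d ))     # absolute value of the offset change
  [ "$d" -ge "$3" ] && echo yes || echo no
}

# On a live system the offset could be sampled each interval with, e.g.:
#   offset=$(( $(date +%s) - $(cut -d. -f1 /proc/uptime) ))
jump_detected 1700000000 1700000003 30   # small NTP slew: prints "no"
jump_detected 1700000000 1605000000 30   # abrupt jump into the past: prints "yes"
```

In the PR, the watcher runs as a long-lived service and triggers the recover oneshot when a jump is detected; the threshold and polling cadence correspond to the thresholdSeconds and intervalSeconds options mentioned above.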

Performance Evaluation

The ghaf-clock-jump-watcher.service was monitored over two 30-minute windows:

  • (i) an idle scenario, with no clock jumps to detect;
  • (ii) a scenario with two clock jumps plus recovery.

The following table summarizes the CPU and memory consumption results for both scenarios:

| Scenario | Avg CPU (%) | Max CPU (%) | Avg Memory (MiB) | Max Memory (MiB) |
|----------|-------------|-------------|------------------|------------------|
| (i)      | 0.027       | 0.2         | 3.37             | 5.44             |
| (ii)     | 0.039       | 0.4         | 3.36             | 7.91             |
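Per-service figures like these can be derived from systemd accounting counters. The sketch below is a hypothetical helper (not the script used for the measurements above) that converts two `CPUUsageNSec` readings taken a known interval apart into an average CPU percentage for that window:

```shell
#!/bin/sh
# Hypothetical helper: average CPU percentage from two CPUUsageNSec samples.
# cpu_pct NSEC_BEFORE NSEC_AFTER INTERVAL_SECONDS
cpu_pct() {
  awk -v a="$1" -v b="$2" -v t="$3" \
    'BEGIN { printf "%.3f\n", (b - a) / (t * 1e9) * 100 }'
}

# On a running system the raw counters could be read with, e.g.:
#   systemctl show ghaf-clock-jump-watcher.service -p CPUUsageNSec --value
cpu_pct 0 486000000 1800   # 486 ms of CPU over 30 min -> prints "0.027"
```

486 ms of CPU time across a 1800-second window corresponds to the 0.027 % average reported for the idle scenario.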

Graph for scenario (i): [image: ghaf-clock-jump-watcher_metrics_20260113_110810]

Graph for scenario (ii): [image: ghaf-clock-jump-watcher_metrics_20260113_114403]

Type of Change

  • New Feature
  • Bug Fix
  • Improvement / Refactor

Related Issues / Tickets

https://jira.tii.ae/browse/SSRCSP-7772

Checklist

  • Clear summary in PR description
  • Detailed and meaningful commit message(s)
  • Commits are logically organized and squashed if appropriate
  • Contribution guidelines followed
  • Ghaf documentation updated with the commit - https://tiiuae.github.io/ghaf/
  • Author has run make-checks and it passes
  • All automatic GitHub Action checks pass - see actions
  • Author has added reviewers and removed PR draft status

Testing Instructions

Applicable Targets

  • Orin AGX aarch64
  • Orin NX aarch64
  • Lenovo X1 x86_64
  • Dell Latitude x86_64
  • System 76 x86_64

Installation Method

  • Requires full re-installation
  • Can be updated with nixos-rebuild ... switch
  • Other:

Test Steps To Verify:

You can perform the exact same steps as described at https://jira.tii.ae/browse/SSRCSP-7772:

  1. Boot ghaf on the laptop and connect to internet
  2. Open terminal in gui-vm and check device-id: cat /persist/common/device-id
  3. Check that there are admin-vm logs from the last few minutes in Grafana: https://ghaflogs.vedenemo.dev/explore e.g. {machine="00-f0-a4-58-bc", host="admin-vm"}
  4. Disconnect from internet
  5. Open terminal and ssh to admin-vm: ssh ghaf@admin-vm
  6. Set wrong time to admin-vm: sudo date -s '01/11/23 11:00:00 UTC'
  7. Verify time change: timedatectl -a
  8. Connect laptop again to internet
  9. In admin-vm: systemctl restart systemd-timesyncd.service
    9.1. This step is not mandatory, as the system does it by default when back online
  10. Verify time has been updated: timedatectl -a
  11. Check if admin-vm logs are still forwarded to grafana
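For step 10, the recovery check can be scripted. The helper below is a hypothetical sketch (the function name, reference epoch, and 30-second tolerance are assumptions, not part of the PR): it compares the current clock against a reference epoch and reports whether the jump has been corrected.

```shell
#!/bin/sh
# Hypothetical helper for step 10: has the clock recovered to "roughly now"?
# clock_recovered NOW_EPOCH REF_EPOCH THRESHOLD_SECONDS
clock_recovered() {
  d=$(( $1 - $2 ))
  [ "$d" -lt 0 ] && d=$(( -d ))
  [ "$d" -le "$3" ] && echo recovered || echo "still off by ${d}s"
}

# On the admin-vm, the sequence from the steps above would look like:
#   sudo date -s '01/11/23 11:00:00 UTC'          # step 6: inject the jump
#   systemctl restart systemd-timesyncd.service   # step 9: force resync
#   clock_recovered "$(date +%s)" "$REF_EPOCH" 30 # step 10: verify
clock_recovered 1700000005 1700000000 30   # prints "recovered"
clock_recovered 1673434800 1700000000 30   # large offset: prints "still off by ..."
```

After the clock reads correctly, step 11 (logs still reaching Grafana) confirms that Alloy shipping survived the jump.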

- Add ghaf.logging.recovery options and shared clock-jump watcher + recover oneshot.
- Ensure alloy.service is ordered after/requires systemd-journald on client and server.
- Server pipeline: route journald through loki.process, drop entries older than 168h, and align WAL max_segment_age.

Signed-off-by: Everton de Matos <[email protected]>
@brianmcgillion brianmcgillion merged commit 93eb002 into tiiuae:main Jan 14, 2026
32 checks passed
