docs: add architecture notes on inter-VM channels, memory wipe, and secret handling

vadika · brianmcgillion · commit 493d06e38610 · 2025-12-31T14:05:46.000+04:00
Signed-off-by: vadik likholetov <vadikas@gmail.com>
diff --git a/docs/astro.config.mjs b/docs/astro.config.mjs
@@ -44,8 +44,11 @@ export default defineConfig({
                       items: [
                         "ghaf/overview/arch",
                         "ghaf/overview/arch/system-architecture",
+                        "ghaf/overview/arch/inter-vm-communication-control",
                         "ghaf/overview/arch/variants",
                         "ghaf/overview/arch/hardening",
+                        "ghaf/overview/arch/vm-memory-wipe",
+                        "ghaf/overview/arch/prohibited-hardcoded-secrets",
                         "ghaf/overview/arch/critical-services-privilege-escalation",
                         "ghaf/overview/arch/system-logs-encryption",
                         "ghaf/overview/arch/vm-network-separation",
diff --git a/docs/src/content/docs/ghaf/overview/arch/inter-vm-communication-control.mdx b/docs/src/content/docs/ghaf/overview/arch/inter-vm-communication-control.mdx
@@ -0,0 +1,55 @@
+---
+title: Controlled, Auditable Inter-VM Communication
+description: Why Ghaf restricts inter-VM communication to tightly controlled, auditable channels
+---
+
+# Controlled, Auditable Inter-VM Communication
+
+Ghaf isolates workloads into MicroVMs and then deliberately constrains how those VMs can talk to each other. Inter-VM communication is allowed only through explicitly designed, controlled channels because the VM boundary is the primary security barrier. If that boundary is pierced by arbitrary or hidden links, the isolation model collapses. This document explains the security, operational, and auditability reasons for keeping inter-VM communication narrow and observable.
+
+## Why unrestricted inter-VM links are unacceptable
+
+### Preserve isolation and limit lateral movement
+Ghaf treats each VM as a distinct trust domain. High-risk or network-facing components run in isolated VMs, and the architecture relies on minimal, audited inter-VM interfaces to prevent lateral movement if one VM is compromised. Allowing ad-hoc or implicit communication paths would create invisible trust bridges that bypass the isolation model and increase the blast radius of a compromise.
+
+### Enforce least privilege across domains
+Ghaf applies least privilege at the VM boundary, not just within a single OS instance. A VM should only have the minimal rights and interfaces required to do its job. A controlled inter-VM channel allows Ghaf to restrict what a VM can request, who it can talk to, and what data can traverse the boundary. This fits the platform-wide principle of least privilege and maintains clear trust levels between untrusted, trusted, and system roles.
+
+### Reduce attack surface and ambiguity
+Every additional communication path is a new attack surface and a new source of ambiguity during incident response. A small, well-defined set of inter-VM channels reduces the number of protocols to defend and makes it possible to reason about cross-VM dependencies. This is especially important in a zero-trust architecture where all cross-domain flows must be explicit and justified.
+
+## How Ghaf enforces controlled communication
+
+### GIVC as the primary control channel
+Ghaf uses GIVC (Guest Inter-VM Communication) as the secure, structured control plane for VM-to-VM and host-to-VM interactions. GIVC provides authenticated message passing, VM identity verification, and resource access control, which keeps cross-VM interactions explicit and policy-driven. It also uses a registry-based model where services and applications must be registered, and only whitelisted operations are exposed.
+
+### Default-deny networking and interface separation
+Inter-VM traffic is constrained by network segmentation and strict firewalling. Production services bind to the production interface and use TLS for inter-VM or external communications, while debug tooling is confined to a separate interface and admin plane. This separation avoids mixing trust levels and prevents debug or administrative pathways from leaking into production networks.
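+
+As a rough illustration of default-deny with per-interface exceptions, the sketch below uses the standard NixOS firewall options. The interface names `prod0` and `debug0` and the port numbers are hypothetical placeholders, not Ghaf's actual configuration.
+
+```nix
+{ ... }:
+{
+  # With the firewall enabled, inbound traffic is dropped unless a port is opened.
+  networking.firewall.enable = true;
+
+  # Hypothetical production interface: expose only the TLS-protected service port.
+  networking.firewall.interfaces."prod0".allowedTCPPorts = [ 9000 ];
+
+  # Hypothetical debug interface: SSH is opened on this interface only.
+  networking.firewall.interfaces."debug0".allowedTCPPorts = [ 22 ];
+}
+```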
+
+### Minimal, auditable interfaces by design
+Ghaf avoids broad cross-VM APIs in favor of narrow, auditable interfaces. For example, GIVC serves as the control plane, while other specialized components (such as memory-based sockets for inter-VM messaging) are scoped to their specific use cases. This limits cross-domain functionality and keeps each interface reviewable.
+
+### TLS and identity-bound channels
+GIVC supports TLS for gRPC communication and defaults to secure transport. Encrypting and authenticating inter-VM traffic prevents unauthenticated or spoofed connections and ensures the integrity of commands and data that cross VM boundaries.
+
+## Why auditability matters
+
+### Verifiable security posture
+Controlled channels enable meaningful auditing because all cross-VM actions pass through known components with explicit policies. This supports the broader Ghaf practice of auditable configuration and declarative security, enabling reviewers to validate what communication paths exist and why.
+
+### Monitoring and incident response
+When communication is centralized and constrained, monitoring tools and audit rules can focus on specific channels and events. This makes it feasible to detect anomalous behavior, reconstruct cross-VM activity during investigations, and apply targeted mitigations without blanket restrictions.
+
+### Alignment with platform hardening
+Auditable inter-VM communication complements Ghaf hardening measures such as systemd sandboxing, AppArmor confinement, and strict firewalling. These layers reinforce each other by ensuring that even if a VM or service is compromised, its ability to affect other domains is limited and visible.
+
+## Practical implications for system design
+
+- Inter-VM communication should go through GIVC or other explicitly approved channels, not ad-hoc sockets or raw networking paths.
+- Services and applications must be registered and whitelisted before they can be invoked across VM boundaries.
+- Debug or maintenance traffic must remain isolated from production traffic through dedicated interfaces and policies.
+- Security reviews should treat any new inter-VM interface as a high-risk change that requires justification, authentication, and auditability.
+
+## Summary
+
+Ghaf’s isolation model is only as strong as the boundaries between VMs. By restricting inter-VM communication to controlled, auditable channels, Ghaf preserves isolation, enforces least privilege, reduces attack surface, and enables reliable monitoring and response. GIVC and network plane separation provide the foundation for this approach, while secure transport and strict registration ensure that cross-VM interactions are explicit, minimal, and verifiable.
diff --git a/docs/src/content/docs/ghaf/overview/arch/prohibited-hardcoded-secrets.mdx b/docs/src/content/docs/ghaf/overview/arch/prohibited-hardcoded-secrets.mdx
@@ -0,0 +1,62 @@
+---
+title: Prohibited Hardcoded Credentials and Cryptographic Secrets
+description: Why secrets must never be hardcoded in Ghaf code and how the policy is implemented in infrastructure
+---
+
+# Prohibited Hardcoded Credentials and Cryptographic Secrets
+
+Ghaf forbids hardcoded credentials and cryptographic secrets in code, configuration defaults, or build artifacts. Secrets must be supplied through controlled secret-management mechanisms rather than embedded in source files. This protects the integrity of the platform, the confidentiality of infrastructure credentials, and the ability to rotate keys safely.
+
+## Why hardcoded secrets are prohibited
+
+### Git history is forever
+Hardcoded credentials leak easily and persist in version control history, forks, caches, and mirrors. Even if removed later, prior commits can still expose sensitive data. This makes revocation and incident response far more difficult and risky.
+
+### Secrets must be scoped and rotated
+Hardcoded values tend to become shared and long-lived. They cannot be scoped to a host, service, or environment, and rotation becomes disruptive. Centralized secret management enables per-host scoping and controlled rotation without editing source code.
+
+### Reproducible builds should not embed private data
+Ghaf relies on reproducible, declarative configuration. Embedding secrets in Nix expressions or source code breaks that model: the private data is baked into build outputs in the Nix store, which is readable by every user on the system, and auditability suffers.
+
+### Infrastructure is a high‑value target
+CI/CD and deployment systems hold signing keys, host SSH keys, and admin credentials. Hardcoded secrets in these paths materially increase the risk of supply-chain compromise.
+
+## How the policy is implemented (ghaf-infra)
+
+Ghaf’s infrastructure repository uses an explicit secret‑management workflow designed to keep credentials out of code:
+
+### Encrypted secrets in version control
+- Secrets are stored in `secrets.yaml` files per host and encrypted using `sops`.
+- The repository documents that all configuration, including secrets, is version controlled and that secrets are encrypted rather than stored in plaintext.
+
+### SOPS + age key management
+- `.sops.yaml` defines which age keys can decrypt specific secrets files and establishes creation rules per host.
+- Admin users manage secrets with the `sops` CLI, ensuring edits are always encrypted before committing.
+
+### Secure deployment and activation
+- During deployment or system activation, `sops-nix` decrypts secrets and places them at the configured filesystem paths for services to consume.
+- Host SSH private keys and other credentials are stored as encrypted secrets and deployed during installation, avoiding hardcoded values in host configurations.
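+
+A minimal sketch of how a host configuration might consume such a secret with `sops-nix`, assuming the `sops-nix` NixOS module is imported. The secret name `example_token` and the service name `example-service` are hypothetical.
+
+```nix
+{ config, ... }:
+{
+  # Encrypted file committed to the repository; decrypted only at activation.
+  sops.defaultSopsFile = ./secrets.yaml;
+
+  # Hypothetical secret: sops-nix writes the plaintext to a runtime path.
+  sops.secrets.example_token = {
+    mode = "0400";
+  };
+
+  # The service receives a file path, never the secret value itself.
+  # (Remaining unit settings for this hypothetical service are omitted.)
+  systemd.services.example-service.serviceConfig.LoadCredential =
+    "token:${config.sops.secrets.example_token.path}";
+}
+```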
+
+### Controlled updates and rotation
+- The `update-sops-files` task re-encrypts secrets according to `.sops.yaml` rules when hosts or admins change, which enables rotation without exposing secrets in code.
+
+## Repository checks and hooks (ghaf-infra)
+
+The ghaf-infra repository includes security-oriented CI workflows that help reduce exposure risk, even though they are not dedicated secret scanners:
+
+- **GitHub Actions security analysis** (`.github/workflows/actions-security-analysis.yml`): runs `zizmor` to audit GitHub Actions workflows and uploads SARIF results.
+- **CodeQL** (`.github/workflows/codeql.yml`): static analysis of repository code (Python) with results published to code scanning.
+- **OpenSSF Scorecard** (`.github/workflows/scorecards.yml`): supply-chain security posture checks, reported to code scanning.
+- **Dependency Review** (`.github/workflows/dependency-review.yml`): blocks known-vulnerable dependency changes in PRs.
+- **Workflow change warning** (`.github/workflows/warn-on-workflow-changes.yml`): alerts on workflow modifications to reduce CI abuse risk.
+- **Build/test pipeline** (`.github/workflows/test-ghaf-infra.yml`): enforces authorization gates and hardened runner settings before executing builds.
+
+## Practical guidance for contributors
+
+- Never place passwords, private keys, tokens, or signing keys directly in source files or Nix configs.
+- Use per‑host `secrets.yaml` encrypted with `sops` and governed by `.sops.yaml` rules.
+- Treat any new secret as a security‑review item: it must be scoped, encrypted, and deployable without code changes.
+
+## Summary
+
+Hardcoded credentials and cryptographic secrets create permanent, high‑impact exposure in a Git‑based workflow. Ghaf avoids this by using encrypted secrets managed with `sops-nix` and age keys, with controlled decryption at deployment time. This keeps secrets out of code, preserves auditability, and enables safe rotation across the infrastructure.
diff --git a/docs/src/content/docs/ghaf/overview/arch/vm-memory-wipe.mdx b/docs/src/content/docs/ghaf/overview/arch/vm-memory-wipe.mdx
@@ -0,0 +1,48 @@
+---
+title: VM Memory Zeroing and Wipe on Shutdown
+description: Why Ghaf clears VM memory on shutdown and how it prevents cross-VM data leakage
+---
+
+# VM Memory Zeroing and Wipe on Shutdown
+
+Ghaf treats VM memory as sensitive because it often contains secrets (keys, tokens, decrypted content, session state). When a VM shuts down, the memory pages it used return to the host. If those pages are not cleared, a later VM or host process could observe residual data. To prevent this kind of data remanence, Ghaf ensures that VM memory is wiped when it is freed and zeroed before it is reused.
+
+## Why memory must be cleared on shutdown
+
+### Prevent cross-VM data leakage
+MicroVMs are strict trust boundaries. Reusing physical pages without clearing them can expose one VM’s data to another VM or to the host after shutdown. Wiping memory on free and zeroing it on allocation remove residual data from prior tenants and preserve the confidentiality of VM state across lifecycle events.
+
+### Reduce the blast radius after compromise
+A compromised VM should not be able to recover data from previously terminated VMs by scavenging uninitialized memory. Zeroing on allocation is a direct defense against this class of information disclosure bugs and helps maintain isolation even under adverse conditions.
+
+### Protect secrets at rest in RAM
+Many high-value secrets live only in RAM: TLS keys, authentication tokens, and decrypted files. On shutdown, these secrets should not persist in physical memory. Clearing memory on free ensures they do not remain available to a later VM or process.
+
+### Support auditable, repeatable security posture
+Ghaf’s security model depends on explicit, verifiable controls. Memory wipe on free/alloc is a build-time kernel configuration that is easy to audit and deterministic across builds, making the behavior reliable and reviewable.
+
+## How Ghaf implements memory clearing
+
+### Kernel-level zero-on-free and zero-on-alloc
+Ghaf configures the host kernel with built-in protections:
+
+- `INIT_ON_FREE_DEFAULT_ON` wipes pages when they are released back to the allocator.
+- `INIT_ON_ALLOC_DEFAULT_ON` zeroes pages before they are handed to a new consumer.
+- `PAGE_POISONING` builds in support for overwriting freed pages with a poison pattern. In mainline Linux, poisoning takes effect only when `page_poison=1` is also passed on the kernel command line.
+
+The two `INIT_ON_*` defaults are compiled into the host kernel, so they are active from boot without extra kernel parameters. When a VM shuts down and its memory is freed, the host kernel wipes those pages. When another VM (or the host) later allocates memory, the pages are zeroed again. This double-layered approach helps prevent memory remanence across VM lifecycles.
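+
+For orientation, a generic NixOS module could request the same kernel options roughly as follows. This is a sketch that assumes the stock `boot.kernelPatches` mechanism; Ghaf's own module may set them differently.
+
+```nix
+{ lib, ... }:
+{
+  boot.kernelPatches = [
+    {
+      # Illustrative name; no source patch is applied, only config changes.
+      name = "memory-wipe-sketch";
+      patch = null;
+      extraStructuredConfig = with lib.kernel; {
+        INIT_ON_FREE_DEFAULT_ON = yes;
+        INIT_ON_ALLOC_DEFAULT_ON = yes;
+        PAGE_POISONING = yes;
+      };
+    }
+  ];
+}
+```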
+
+### Default enablement on x86_64 hosts
+Memory wipe is enabled by default for the host kernel on x86_64 platforms via the `ghaf.host.kernel.memory-wipe.enable` option. This focuses on the system component that owns and recycles VM memory, which is where shutdown wiping is enforced.
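+
+In a host configuration this would look roughly like the following. The option name is the one given above; an explicit assignment is likely unnecessary on x86_64, where the text states it is already on by default.
+
+```nix
+{ ... }:
+{
+  # Already the default on x86_64 hosts; shown explicitly for clarity.
+  ghaf.host.kernel.memory-wipe.enable = true;
+}
+```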
+
+For implementation details and configuration knobs, see [Memory Wipe on Boot and Free](/ghaf/dev/ref/memory-wipe).
+
+## Operational considerations
+
+- **Performance trade-off**: Zeroing on free/alloc adds a modest overhead, but the security benefits outweigh the cost for a hardened platform.
+- **Defense in depth**: Memory wiping complements other controls (VM isolation, least privilege, audited inter-VM channels) by ensuring that even after shutdown, no residual data crosses trust boundaries.
+- **Scope**: The wipe happens in the host kernel where VM pages are managed, so it applies consistently to VM shutdown events regardless of the guest OS.
+
+## Summary
+
+VM shutdown is a sensitive moment because memory pages leave one trust domain and return to the host allocator. Ghaf mitigates the risk of residual data reuse by enabling kernel features that wipe memory on free and zero memory on allocation. This protects secrets, prevents cross-VM leakage, and keeps the VM boundary trustworthy throughout the VM lifecycle.