
Investigating repeated LMDB errors: could a MITM / malicious peers be pushing a covert fork? #10140

@danindiana

Description

Short summary

  • Symptom: repeated LMDB errors during long syncs (e.g. MDB_PAGE_NOTFOUND, MDB_BAD_TXN, MDB_CORRUPTED) that leave the DB unusable; recovery with --db-salvage often fails.
  • Hypothesis: some peers (or an on-path attacker) are providing crafted/invalid block data intended to fork or confuse the node; when the daemon attempts to apply that data, the resulting LMDB transaction failures leave the DB corrupted.
  • Goal: collect minimal, actionable evidence so we can rule in/out a network-level attack vector and recommend mitigations.

Why I suspect a network-level attack (evidence summary)

  • Errors occur while applying specific blocks: logs repeatedly show "Error adding block with hash <...> … Error adding spent key image to db transaction: MDB_PAGE_NOTFOUND / MDB_BAD_TXN".
  • Around the same time the daemon logs "Sync data returned a new top block candidate: 1591748 -> 3513147 [Your node is 1921399 blocks behind]", indicating peer(s) advertised a different top than the local one.
  • The daemon repeatedly blocks peers after the error, which suggests the node itself considers them misbehaving.
  • Hardware checks (SMART, dmesg, fsck) show no current device errors; SMART does report many unsafe shutdowns, which can cause corruption but do not fully explain errors that occur only while applying specific blocks.
  • db-salvage is unreliable and sometimes segfaults; recovery is difficult, amplifying the impact.

Artifacts attached (sanitized)

  • artifacts/bitmonero.log.snippet.txt — key log lines showing MDB errors and sync messages (PII removed / IPs sanitized)
  • artifacts/monerod-status.txt — monerod status output around the failure
  • artifacts/monerod-print_pl.txt — peer list summary (IP addresses removed; blocked peers replaced with placeholders)
  • artifacts/smartctl.txt — SMART output for /dev/nvme0n1 (health summary; SAFE/FAILED fields)
  • artifacts/fsck.txt — fsck.ext4 output run on the ext4 partition used for the blockchain

What I’ve already done

  • Captured the above artifacts at the time of failure.
  • Ran sudo smartctl -a /dev/nvme0n1 — device reported SMART overall-health: PASSED; Unsafe Shutdowns: 151; Media errors: 0.
  • Ran sudo fsck.ext4 -f after unmount — no repairable errors found.
  • Deleted LMDB and started a fresh sync; fresh start proceeds fine until the next incident (i.e., corruption can reappear).
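To make the artifact capture repeatable on the next incident, I plan to run a small snapshot script along these lines. This is a sketch for my setup: the default log path (~/.bitmonero/bitmonero.log), restricted RPC on 127.0.0.1:18081, and the device /dev/nvme0n1 are assumptions to adjust.

```shell
#!/usr/bin/env bash
# Sketch of an artifact-capture script for the next failure.
# Assumptions (adjust for your setup): default log at ~/.bitmonero/bitmonero.log,
# restricted RPC on 127.0.0.1:18081, blockchain device /dev/nvme0n1.

# Timestamped output directory so successive incidents don't overwrite each other.
artifact_dir() {
    echo "artifacts-$(date +%Y%m%d-%H%M%S)"
}

capture() {
    local dir
    dir="$(artifact_dir)"
    mkdir -p "$dir"

    # Last 500 log lines around the failure (sanitize IPs before attaching).
    tail -n 500 ~/.bitmonero/bitmonero.log > "$dir/bitmonero.log.snippet.txt"

    # Daemon's view of sync state and peer counts via the restricted RPC.
    curl -s http://127.0.0.1:18081/get_info > "$dir/monerod-get_info.json"

    # Hardware state at the time of failure.
    sudo smartctl -a /dev/nvme0n1 > "$dir/smartctl.txt"
    dmesg | tail -n 200 > "$dir/dmesg.tail.txt"
}
```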

Concrete requests for maintainers / devs

  1. Could malformed or malicious P2P payloads cause the daemon to attempt DB writes that result in MDB_PAGE_NOTFOUND? If so, what checks can detect this earlier (before DB damage)?
  2. Recommended forensics to prove a malicious peer pushed invalid data:
    • Which exact fields to compare between peers (header hashes at the same height, block header metadata, tx indexes)?
    • Which monerod RPCs or commands will fetch the authoritative block header/hash for a given height for quick comparison?
  3. Any suggestions for deterministic checks to detect a forked/modified block before it is committed to LMDB (e.g., multi-peer verification, header majority checks)?
  4. Which p2p/debug flags or trace-level formats should I capture (pcap filters, ports, timestamps) to make a minimal, useful forensic capture for devs to analyze? (I can capture traffic to/from port 18080 for the peer IPs active at the time.)
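For item 2, here is the concrete comparison I have in mind (a sketch; the RPC port 18081 and interface eth0 are assumptions for my setup): get_block_header_by_height is the monerod JSON-RPC method that returns the header hash at a given height, and a capture filtered to the P2P port keeps the pcap small.

```shell
#!/usr/bin/env bash
# Sketch: fetch the block-header hash for a given height from a node's JSON-RPC.
# Assumptions: RPC reachable on the given host:port; capture interface eth0.

# Build the JSON-RPC payload for get_block_header_by_height.
rpc_payload() {
    local height="$1"
    printf '{"jsonrpc":"2.0","id":"0","method":"get_block_header_by_height","params":{"height":%s}}' "$height"
}

# Ask a daemon for the header at the failing height.
fetch_header() {
    local node="$1" height="$2"
    curl -s "http://$node/json_rpc" \
         -H 'Content-Type: application/json' \
         -d "$(rpc_payload "$height")"
}

# Minimal forensic capture: only P2P traffic on port 18080, written to a pcap
# that can be attached (after sanitizing) for devs to analyze:
#   sudo tcpdump -i eth0 -w monero-p2p.pcap 'tcp port 18080'
```

Usage would be e.g. `fetch_header 127.0.0.1:18081 1591748`; the `result.block_header.hash` field of the response is what I would compare between nodes.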

Suggested experiments / mitigations I can run and report back

  • Controlled single-peer test: run monerod with a single trusted peer (e.g., --add-exclusive-node) and see whether corruption recurs.
  • Multi-peer comparison: when the failure happens, immediately fetch block header/hash for the failing height from multiple public nodes/explorers and compare.
  • Run with --db-sync-mode=safe to reduce the window where partial writes can corrupt DB; report whether corruption stops.
  • Run monerod behind a firewall that restricts peers and watch for the issue.
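The multi-peer comparison could be automated roughly as follows (a sketch, not a finished tool: it assumes jq is installed, and the node list is a placeholder to be replaced with public nodes you actually trust):

```shell
#!/usr/bin/env bash
# Sketch: fetch the header hash for the failing height from several nodes and
# report the majority hash. Requires curl and jq. NODES is a placeholder list;
# substitute public nodes you trust.

NODES="127.0.0.1:18081 node.example.org:18081 node2.example.org:18081"

# Print the most common value among the arguments (simple majority check).
majority_hash() {
    printf '%s\n' "$@" | sort | uniq -c | sort -rn | head -n1 | awk '{print $2}'
}

compare_height() {
    local height="$1" hashes=()
    for node in $NODES; do
        h=$(curl -s "http://$node/json_rpc" \
                 -H 'Content-Type: application/json' \
                 -d "{\"jsonrpc\":\"2.0\",\"id\":\"0\",\"method\":\"get_block_header_by_height\",\"params\":{\"height\":$height}}" \
            | jq -r '.result.block_header.hash')
        echo "$node -> $h"
        hashes+=("$h")
    done
    echo "majority: $(majority_hash "${hashes[@]}")"
}
```

A node whose hash disagrees with the majority at the failing height would be the first candidate for a deeper look (header metadata, then the pcap).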

Copies of the sanitized artifacts are attached to this issue under artifacts/.

If you need additional logs or a specific trace format, tell me the exact commands and time window to capture and I will run them on the next failure and attach the outputs.

Thank you — I'm happy to gather whatever extra data is most useful to reproduce or rule out a network attack vector.
