Title: Investigating repeated LMDB errors: could a MITM / malicious peers be pushing a covert fork?
Short summary
- Symptom: repeated LMDB errors during long syncs (examples: MDB_PAGE_NOTFOUND / MDB_BAD_TXN / MDB_CORRUPTED) that leave the DB unusable. db-salvage often fails.
- Hypothesis: some peers (or an on-path attacker) are providing crafted/invalid block data intended to fork/confuse the node; when the daemon attempts to apply that data it causes LMDB transaction failures and results in DB corruption.
- Goal: collect minimal, actionable evidence so we can rule in/out a network-level attack vector and recommend mitigations.
Why I suspect a network-level attack (evidence summary)
- Errors occur while applying specific blocks: logs repeatedly show "Error adding block with hash <...> … Error adding spent key image to db transaction: MDB_PAGE_NOTFOUND / MDB_BAD_TXN".
- Around the same time the daemon logs "Sync data returned a new top block candidate: 1591748 -> 3513147 [Your node is 1921399 blocks behind]", indicating peer(s) advertised a different top than the local one.
- The daemon repeatedly blocks peers after the error, which suggests the node considers them misbehaving.
- Hardware checks (SMART, dmesg, fsck) show no current device errors; SMART reports many unsafe shutdowns, which can cause corruption but do not fully explain the block-application errors.
- db-salvage is unreliable and sometimes segfaults; recovery is difficult, amplifying the impact.
Artifacts attached (sanitized)
- artifacts/bitmonero.log.snippet.txt — key log lines showing MDB errors and sync messages (PII removed / IPs sanitized)
- artifacts/monerod-status.txt — `monerod` `status` output around the failure
- artifacts/monerod-print_pl.txt — peer list summary (IP addresses removed; blocked peers replaced with placeholders)
- artifacts/smartctl.txt — SMART output for /dev/nvme0n1 (health summary; SAFE/FAILED fields)
- artifacts/fsck.txt — `fsck.ext4` output run on the ext4 partition used for the blockchain
What I’ve already done
- Captured the above artifacts at the time of failure.
- Ran `sudo smartctl -a /dev/nvme0n1` — device reported SMART overall-health: PASSED; Unsafe Shutdowns: 151; Media errors: 0.
- Ran `sudo fsck.ext4 -f` after unmount — no repairable errors found.
- Deleted LMDB and started a fresh sync; the fresh sync proceeds fine until the next incident (i.e., corruption can reappear).
Concrete requests for maintainers / devs
- Could malformed or malicious P2P payloads cause the daemon to attempt DB writes that result in MDB_PAGE_NOTFOUND? If so, what checks can detect this earlier (before DB damage)?
- Recommended forensics to prove a malicious peer pushed invalid data:
- Which exact fields to compare between peers (header hashes at the same height, block header metadata, tx indexes)?
- Which monerod RPCs or commands will fetch the authoritative block header/hash for a given height for quick comparison?
- Any suggestions for deterministic checks to detect a forked/modified block before it is committed to LMDB (e.g., multi-peer verification, header majority checks)?
- Which p2p/debug flags or trace-level formats should I capture (pcap filters, ports, timestamps) to make a minimal, useful forensic capture for devs to analyze? (I can capture traffic to/from port 18080 for the peer IPs active at the time.)
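For the capture mentioned above, here is a minimal POSIX-sh sketch that builds a `tcpdump` filter restricted to port 18080 and the peers active at failure time. The IP addresses are placeholders from the documentation range; substitute the addresses listed by `print_pl` when the incident occurs.

```shell
# Hypothetical peer IPs -- replace with the peers connected at failure time.
PEERS="203.0.113.5 203.0.113.7"

# Build a capture filter: port 18080 and any of the listed hosts.
FILTER="port 18080 and ("
first=1
for ip in $PEERS; do
  [ $first -eq 1 ] || FILTER="$FILTER or "
  FILTER="${FILTER}host $ip"
  first=0
done
FILTER="$FILTER)"
echo "$FILTER"

# Then capture with full packet bodies (requires root):
#   sudo tcpdump -i any -s 0 -w monero-incident.pcap "$FILTER"
```

This keeps the pcap small enough to attach while still covering every suspect peer.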
Suggested experiments / mitigations I can run and report back
- Controlled single-peer test: run monerod with a single trusted peer (e.g., `--add-exclusive-node`) and see whether corruption recurs.
- Multi-peer comparison: when the failure happens, immediately fetch the block header/hash for the failing height from multiple public nodes/explorers and compare.
- Run with `--db-sync-mode=safe` to reduce the window in which partial writes can corrupt the DB; report whether corruption stops.
- Run monerod behind a firewall that restricts peers and watch for the issue.
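The multi-peer comparison can be sketched as follows. `get_block_header_by_height` is the standard monerod JSON-RPC method for this; the height, the second node's URL, and the hash values below are placeholders for illustration, so the curl calls are shown commented out.

```shell
#!/bin/sh
# Compare the block hash at the failing height between the local daemon and an
# independent public node. Height and remote node URL are examples.
HEIGHT=1591748
REQ='{"jsonrpc":"2.0","id":"0","method":"get_block_header_by_height","params":{"height":'"$HEIGHT"'}}'

# With a daemon reachable (requires curl + jq):
#   LOCAL=$(curl -s -d "$REQ" -H 'Content-Type: application/json' \
#       http://127.0.0.1:18081/json_rpc | jq -r .result.block_header.hash)
#   REMOTE=$(curl -s -d "$REQ" -H 'Content-Type: application/json' \
#       http://node.example.org:18081/json_rpc | jq -r .result.block_header.hash)
LOCAL="placeholder_hash"    # stand-ins so the sketch runs without a node
REMOTE="placeholder_hash"

if [ "$LOCAL" = "$REMOTE" ]; then
  echo "MATCH at height $HEIGHT"
else
  echo "MISMATCH at height $HEIGHT: local=$LOCAL remote=$REMOTE"
fi
```

A persistent MISMATCH against several independent nodes at the same height would be concrete evidence of a forked or fabricated block being served to this node.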
Copies of the sanitized artifacts are attached to this issue under `artifacts/`.
If you need additional logs or a specific trace format, tell me the exact commands and time window to capture and I will run them on the next failure and attach the outputs.
Thank you — I'm happy to gather whatever extra data is most useful to reproduce or rule out a network attack vector.