Skip to content

LedgerDB V2: handles aren't closed in the common case #1515

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
amesgen opened this issue May 19, 2025 · 0 comments · Fixed by #1516
Closed

LedgerDB V2: handles aren't closed in the common case #1515

amesgen opened this issue May 19, 2025 · 0 comments · Fixed by #1516
Assignees

Comments

@amesgen
Copy link
Member

amesgen commented May 19, 2025

The V2 LedgerDB (introduced as part of UTxO HD in #1412) maintains a LedgerSeq at its core:

newtype LedgerSeq m l = LedgerSeq {
getLedgerSeq :: AnchoredSeq (WithOrigin SlotNo) (StateRef m l) (StateRef m l)
} deriving (Generic)

data StateRef m l = StateRef {
state :: !(l EmptyMK)
, tables :: !(LedgerTablesHandle m l)
} deriving (Generic)

Now, when we prune a LedgerSeq, we want to close the LedgerTablesHandles from the prefix that we are removing. The current code tries to do this, it it is not effective. Namely, prune returns the LedgerSeq prefix containing states to prune:

prune (LedgerDbPruneKeeping (SecurityParam k)) (LedgerSeq ldb) =
if toEnum nvol <= unNonZero k
then (LedgerSeq $ Empty (AS.anchor ldb), LedgerSeq ldb)
else
-- We remove the new anchor from the @fst@ component so that its handle is
-- not closed.
B.bimap (LedgerSeq . dropNewest 1) LedgerSeq $ AS.splitAt (nvol - fromEnum (unNonZero k)) ldb
where
nvol = AS.length ldb

And then in other places, we try to close the handles:

let (l, s) = prune (LedgerDbPruneKeeping (foeSecurityParam env)) (LedgerSeq newdb)
pure ((toClose, l), s)

mapM_ (close . tables) $ AS.toOldestFirst discardedBySelection ++ AS.toOldestFirst discardedByPruning

closeLedgerSeq :: Monad m => LedgerSeq m l -> m ()
closeLedgerSeq = mapM_ (close . tables) . toOldestFirst . getLedgerSeq

The crucial observation is that toOldestFirst does not return the anchor of the LedgerSeq/AnchoredSeq, so we do not close this. In particular, in the very common case of adopting one block (such that the immutable tip also advances by one block), which is the common case both while syncing and caught-up, we do not close anything.

The fix is simply to also consider the anchor of the sequence of handles to close.


Currently, this isn't problematic as the only V2 backend is the in-memory backend at the moment, and for it, closing handles doesn't matter (as the GC will still claim the ledger state once it is no longer referenced). However, in the upcoming LSM tree backend, this will be crucial.


Concrete "demonstration" of the bug:

Apply this diff on d582af5 (main at time of writing):

Show diff
diff --git a/ouroboros-consensus/src/ouroboros-consensus/Ouroboros/Consensus/Storage/LedgerDB/V2/InMemory.hs b/ouroboros-consensus/src/ouroboros-consensus/Ouroboros/Consensus/Storage/LedgerDB/V2/InMemory.hs
index 770a85862d..c4e2ac3115 100644
--- a/ouroboros-consensus/src/ouroboros-consensus/Ouroboros/Consensus/Storage/LedgerDB/V2/InMemory.hs
+++ b/ouroboros-consensus/src/ouroboros-consensus/Ouroboros/Consensus/Storage/LedgerDB/V2/InMemory.hs
@@ -90,7 +90,8 @@ newInMemoryLedgerTablesHandle ::
 newInMemoryLedgerTablesHandle someFS@(SomeHasFS hasFS) l = do
   !tv <- newTVarIO (LedgerTablesHandleOpen l)
   pure LedgerTablesHandle {
-      close =
+      close = do
+        !_ <- error "foo"
         atomically $ writeTVar tv LedgerTablesHandleClosed
     , duplicate = do
         hs <- readTVarIO tv
diff --git a/ouroboros-consensus/test/storage-test/Test/Ouroboros/Storage/LedgerDB/StateMachine.hs b/ouroboros-consensus/test/storage-test/Test/Ouroboros/Storage/LedgerDB/StateMachine.hs
index 5015d44106..33f842b615 100644
--- a/ouroboros-consensus/test/storage-test/Test/Ouroboros/Storage/LedgerDB/StateMachine.hs
+++ b/ouroboros-consensus/test/storage-test/Test/Ouroboros/Storage/LedgerDB/StateMachine.hs
@@ -270,14 +270,14 @@ instance StateModel Model where
   arbitraryAction _ model@(Model chain secParam) =
     frequency $ [ (2, pure $ Some GetState)
                 , (2, pure $ Some ForceTakeSnapshot)
-                , (1, Some . DropAndRestore <$> QC.choose (0, fromIntegral $ AS.length chain))
+                , (0, Some . DropAndRestore <$> QC.choose (0, fromIntegral $ AS.length chain))
                 , (4, Some <$> do
                       let maxRollback =
                             min
                               (fromIntegral . AS.length $ chain)
                               (BT.unNonZero $ maxRollbacks secParam)
-                      numRollback  <- QC.choose (0, maxRollback)
-                      numNewBlocks <- QC.choose (numRollback, numRollback + 2)
+                      numRollback  <- pure 0
+                      numNewBlocks <- pure 1
                       let
                         chain' = case modelRollback numRollback model of
                           UnInit     -> error "Impossible"

Then, running cabal run storage-test -- -p InMemV2 succeeds, even though we would expect it to fail if we would actually close the handles.

Concretely, this patch

  • modifies the close logic of the LedgerTablesHandle to always fail with an imprecise exception, and
  • modifies the LedgerDB to only generate ValidateAndCommit commands that add exactly one block, and disables DropAndRestore.

Then, if V2 were correctly closing pruned LedgerTablesHandles, then we would expect the test to fail; however, it passes just fine. Note that it does fail when keeping the test unmodified, showing that we at least sometimes close LedgerTablesHandles.

@amesgen amesgen self-assigned this May 19, 2025
@amesgen amesgen moved this to 👀 In review in Consensus Team Backlog May 19, 2025
@amesgen amesgen moved this to 👀 In review in Consensus Team Backlog May 19, 2025
github-merge-queue bot pushed a commit that referenced this issue Jun 5, 2025
Closes #1515

The first three commits add a regression test by enrichting the existing
LedgerDB state machine test, making sure that
```
  total number of all opened handles
- total number of closed handles
= states on the LedgerSeq in the end
```
The fourth commit then fixes the bug, causing the regression test to
succeed.

Note that this bug fix reveals another bug #1551 (causing the
`mock-test` failure for `LeaderSchedule.simple convergence`), which we
silence by temporarily making the closing of an in-memory handle a no-op
(as it is not *necessary* anyways, the GC will still collect the
in-memory ledger state).
@github-project-automation github-project-automation bot moved this from 👀 In review to ✅ Done in Consensus Team Backlog Jun 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: ✅ Done
Development

Successfully merging a pull request may close this issue.

1 participant