Conversation

quantumish (Contributor)

No description provided.

// We *shouldn't* add them to the free list (since that's used to write new
// entries), but we should still remember them so that we can add them back to
// the free list when a later growth occurs. This has the same scalability
// flaws as the free list itself.
hole_list: Mutex<Vec<CacheBlock>>,
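For context, a minimal sketch of how this is meant to feed back into the free list on growth. Only the `hole_list` field above comes from the diff; `grow` and the `free_list` field name are assumptions:

```rust
// Hypothetical sketch: on growth, holes punched during earlier shrinks
// become writable blocks again, so drain them back into the free list.
fn grow(&self, new_num_blocks: usize) {
    // ... extend the hashmap / LFC file to new_num_blocks ...
    let mut free_list = self.free_list.lock().unwrap();
    free_list.extend(self.hole_list.lock().unwrap().drain(..));
}
```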
Contributor

For later: the filesystem already tracks the holes (SEEK_HOLE), can we get away with relying on that? Okay to leave that as a TODO.

Contributor Author

Oh, good catch! Yeah, I can try to modify it to rely on that. I was mostly going off the original strategy in file_cache.c, which did something similar by tracking holes in the hashmap.

Contributor Author

One minor caveat is that SEEK_HOLE isn't guaranteed to find the next actual hole in the filesystem:

However, a filesystem is not obliged to report holes [...] In the simplest implementation, a filesystem can support the operations by making SEEK_HOLE always return the offset of the end of the file

Most popular filesystems do support it, though, so it's probably safe to rely on regardless? I'll document it in the code, but there may need to be some conditional compilation at some point.
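For the record, a minimal sketch of enumerating holes via lseek (using the libc crate). This assumes a Linux filesystem that actually reports holes; on one that doesn't, SEEK_HOLE just returns EOF and we find nothing:

```rust
use std::os::unix::io::AsRawFd;

/// Enumerate (start, end) offsets of holes in `file`. Sketch only; ignores
/// concurrent writers. Costs two lseek calls per hole found.
fn holes(file: &std::fs::File) -> std::io::Result<Vec<(u64, u64)>> {
    let fd = file.as_raw_fd();
    let len = file.metadata()?.len() as libc::off_t;
    let mut out = Vec::new();
    let mut pos: libc::off_t = 0;
    while pos < len {
        // Find the start of the next hole at or after `pos`. Every file has
        // an implicit hole at EOF, so this succeeds whenever pos < len.
        let hole = unsafe { libc::lseek(fd, pos, libc::SEEK_HOLE) };
        if hole < 0 {
            return Err(std::io::Error::last_os_error());
        }
        if hole >= len {
            break; // only the implicit EOF hole remains
        }
        // Find where data resumes; ENXIO means the hole runs to EOF.
        let data = unsafe { libc::lseek(fd, hole, libc::SEEK_DATA) };
        let end = if data < 0 {
            let err = std::io::Error::last_os_error();
            if err.raw_os_error() == Some(libc::ENXIO) {
                len
            } else {
                return Err(err);
            }
        } else {
            data
        };
        out.push((hole as u64, end as u64));
        pos = end;
    }
    Ok(out)
}
```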

Contributor

We currently only care about ext4, but it's worth a note and test case.

It's also possible that this is significantly slower, idk. It's not a big deal, just figured it might be nice to avoid the manual tracking.

Contributor Author

Yep, was just gonna comment about performance. Not entirely sure if it's worse though, will experiment!

Contributor Author (quantumish, Jul 25, 2025)

It also seems like this could be accomplished in far fewer syscalls (possibly just one) with FIEMAP? One issue with the lseek approach is that if there are many holes it could lead to some nontrivial overhead, and we almost certainly will have far too many holes, since our current way of shrinking is to punch single-block holes randomly spread across the LFC file.
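For reference, a hedged sketch of the FIEMAP approach, with the structs hand-rolled per <linux/fiemap.h> (Linux/glibc assumed, untested; a real version would loop, advancing fm_start past the last extent until it sees FIEMAP_EXTENT_LAST, instead of using a single fixed batch):

```rust
use std::os::unix::io::AsRawFd;

const FS_IOC_FIEMAP: libc::c_ulong = 0xC020660B; // _IOWR('f', 11, struct fiemap)
const FIEMAP_FLAG_SYNC: u32 = 0x0001;

#[repr(C)]
#[derive(Clone, Copy, Default)]
struct FiemapExtent {
    fe_logical: u64,
    fe_physical: u64,
    fe_length: u64,
    fe_reserved64: [u64; 2],
    fe_flags: u32,
    fe_reserved: [u32; 3],
}

#[repr(C)]
struct FiemapReq {
    fm_start: u64,
    fm_length: u64,
    fm_flags: u32,
    fm_mapped_extents: u32,
    fm_extent_count: u32,
    fm_reserved: u32,
    fm_extents: [FiemapExtent; 64], // fixed batch for the sketch
}

/// Fetch up to 64 data extents in one ioctl; the holes are the gaps
/// between consecutive extents.
fn data_extents(file: &std::fs::File) -> std::io::Result<Vec<(u64, u64)>> {
    let mut req = FiemapReq {
        fm_start: 0,
        fm_length: u64::MAX,
        fm_flags: FIEMAP_FLAG_SYNC,
        fm_mapped_extents: 0,
        fm_extent_count: 64,
        fm_reserved: 0,
        fm_extents: [FiemapExtent::default(); 64],
    };
    let ret = unsafe { libc::ioctl(file.as_raw_fd(), FS_IOC_FIEMAP, &mut req) };
    if ret != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(req.fm_extents[..req.fm_mapped_extents as usize]
        .iter()
        .map(|e| (e.fe_logical, e.fe_logical + e.fe_length))
        .collect())
}
```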

Comment on lines +145 to +147
// TODO(quantumish): possibly implement some batching? lots of syscalls...
// Unfortunately batching is at odds with our access pattern, as entries in the
// hashmap have no correlation with the location of blocks in the actual LFC file.
Contributor

The existing LFC does the same, but it may be getting away with it because of the 128-page chunking which amortizes the cost. It's probably okay as long as we do this async after the actual hashmap has shrunk and allow concurrent access.
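For concreteness, the per-block syscall in question looks roughly like this (a sketch; `BLOCK_SZ` and the function shape are assumptions). Batching would mean coalescing adjacent block numbers into a single larger punch:

```rust
use std::os::unix::io::AsRawFd;

const BLOCK_SZ: libc::off_t = 8192; // assumed Postgres-style block size

/// Punch one block out of the LFC file, returning its space to the
/// filesystem while keeping the file length unchanged.
fn punch_block(file: &std::fs::File, block_no: u64) -> std::io::Result<()> {
    let ret = unsafe {
        libc::fallocate(
            file.as_raw_fd(),
            libc::FALLOC_FL_PUNCH_HOLE | libc::FALLOC_FL_KEEP_SIZE,
            block_no as libc::off_t * BLOCK_SZ,
            BLOCK_SZ,
        )
    };
    if ret != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}
```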

Comment on lines +714 to +719
// Try again at evicting entries in the to-be-shrunk region, except don't give up this time.
// Not a great solution all around: unnecessary scanning, spinning, and code duplication.
// Not sure what a good alternative is, though, as there may be enough of these entries that
// we can't store a Vec of them, and we ultimately can't proceed until they're evicted.
// Maybe a notification system for unpinning somehow? This also makes me think that pinning
// of entries should be a first-class concept within the hashmap implementation...
Contributor

How long do we expect entries to remain pinned for? To avoid starvation under heavy reads, can we mark the entry as soft-removed such that new accesses incur a miss and will allocate a new bucket? Can we do a best-effort 1ms sleep for each pinned entry as we find them under the assumption that the reader will be done soon?
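To make the suggestion concrete, something shaped like this inside the shrink path (entirely hypothetical names; `entry_at`, `is_pinned`, and `evict` are not the actual hashmap API):

```rust
// Hypothetical retry loop: evict what we can in the to-be-shrunk region,
// then briefly sleep on the assumption that pinned readers finish soon.
loop {
    let mut still_pinned = 0;
    for idx in new_num_blocks..old_num_blocks {
        match block_map.entry_at(idx) {
            Some(entry) if entry.is_pinned() => still_pinned += 1,
            Some(entry) => block_map.evict(entry),
            None => {}
        }
    }
    if still_pinned == 0 {
        break;
    }
    // Best-effort backoff instead of hard spinning.
    std::thread::sleep(std::time::Duration::from_millis(1));
}
```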

Contributor Author

I have to experiment a bit more to get a good grasp on how long entries remain pinned (and whether re-pinning is the main issue or just long pin times), but in theory it depends on the workload. I wrote this code with pessimistic assumptions.

Soft-removal sounds like a good strategy, though, and the sleep can definitely be optimized. Will update as I get more information!

{
    let mut clock_hand = self.clock_hand.lock().unwrap();

    block_map.begin_shrink(num_blocks);
Contributor

Just to make sure I understand the overall behavior here:

  1. Update alloc_limit. No new bucket allocations will be made beyond the new size.

    • Cheap O(1) under write lock.
    • Dictionary remains unchanged, as does the hashing.
    • Existing dictionary entries may still point beyond alloc_limit.
    • New dictionary entries will allocate buckets below alloc_limit.
  2. Evict buckets beyond alloc_limit.

    • O(removed_blocks) operations, each O(1) under write lock.
    • Existing buckets beyond alloc_limit can still see both reads and writes.
    • After a bucket has been evicted, new writes will allocate a bucket below alloc_limit.
  3. Resize dictionary.

    • O(remaining_blocks) operations under write lock.
    • Rehashes dictionary, keeps existing buckets.
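
As a sketch, the sequence described above in code (only `begin_shrink` appears in the diff; `shrink`, `evict_beyond`, and `finish_shrink` are assumed names):

```rust
fn shrink(&self, new_num_blocks: usize) {
    // Phase 1: O(1) under the write lock. Caps alloc_limit; existing
    // entries may still point beyond it, new ones won't.
    self.block_map.begin_shrink(new_num_blocks);

    // Phase 2: one O(1) eviction per removed block; buckets beyond the
    // limit can still serve reads and writes until they're evicted.
    self.evict_beyond(new_num_blocks);

    // Phase 3: rehash down to the new capacity, keeping surviving buckets.
    self.block_map.finish_shrink(new_num_blocks);
}
```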

Contributor Author

Yep, this is mostly accurate, although it excludes the retry logic and the LFC file behavior. Step 1 also changes the clock hand and the "logical" number of buckets, so that the eviction algorithm treats the hashmap as if it were already shrunk (we're going to manually evict things in the to-be-shrunk region anyway). Step 3 is technically O(capacity) if you include the rehash; once incremental rehashing is merged it'll still be O(capacity), but at that point only for checking that invariants are maintained, so that pass could be removed from release builds.


github-actions bot commented Aug 3, 2025

If this PR added a GUC in the Postgres fork or neon extension,
please regenerate the Postgres settings in the cloud repo:

make NEON_WORKDIR=path/to/neon/checkout \
  -C goapp/internal/shareddomain/postgres generate

If you're an external contributor, a Neon employee will assist in
making sure this step is done.


github-actions bot commented Aug 4, 2025

No tests were run or test report is not available

Test coverage report is not available

The comment gets automatically updated with the latest test results
5f5d512 at 2025-08-04T01:24:10.697Z :recycle:
