Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Universal test for RangeTombstoneSnapshotMigrateFromLast, refactor #13244

Closed

Conversation

pdillinger
Copy link
Contributor

@pdillinger pdillinger commented Dec 23, 2024

Summary:

  • Expand RangeTombstoneSnapshotMigrateFromLast in tiered_compaction_test (originally from Steps toward preserve/preclude options mutable #13124) to reproduce a failure in universal compaciton (as well as leveled), when a specific part of the test is uncommented.
  • Small refactoring to eliminate unnecessary fields in SubcompactionState. Adding a bool parameter to SubcompactionState::AddToOutput here will make more sense in the next PR (which I'm trying to keep
    from getting too big).
  • Improve debuggability and performance of some other tests
  • Remove accidentally committed test "BlahPrecludeLastLevel" which was a temporary copy of CompactionServiceTest.PrecludeLastLevel

Test Plan: existing tests, updated/expanded tests

Summary:
* Expand RangeTombstoneSnapshotMigrateFromLast in tiered_compaction_test
  (originally from facebook#13124) to reproduce a failure in universal
  compaciton (as well as leveled), when a specific part of the test is
  uncommented.
* Small refactoring to eliminate unnecessary fields in
  SubcompactionState. Adding a bool parameter to
  SubcompactionState::AddToOutput will make more sense in the next PR.
* Improve debuggability and performance of some other tests

Test Plan: existing tests, updated/expanded tests
Copy link
Member

@cbi42 cbi42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -1651,7 +1651,8 @@ TEST_P(PrecludeLastLevelTest, CheckInternalKeyRange) {

ASSERT_EQ("0,0,0,0,0,2,2", FilesPerLevel());

auto VerifyLogicalState = [&]() {
auto VerifyLogicalState = [&](int line) {
SCOPED_TRACE("Called from line " + std::to_string(line));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

if (!universal) {
MoveFilesToLevel(5);
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to wait for compaction here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, about to submit that.

@facebook-github-bot
Copy link
Contributor

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@pdillinger merged this pull request in 30d5162.

pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Dec 27, 2024
Summary: The primary goal of this change was to support full dynamic
mutability of options `preclude_last_level_data_seconds` and
`preserve_internal_time_seconds`, which was challenging because of
subtle design holes referenced from facebook#13124.

The fix is, in a sense, "doubling down" on the idea of write-time-based
tiering, by simplifying the output level decision with a single sequence
number threshold. This approach has some advantages:
* Allows option mutability in presence of long snapshots (or UDT)
* Simpler to believe correct because there's no special treatment for
  range tombstones, and output level assignment does not affect sequence
  number assignment to the entries (which takes some care to avoid
  circular dependency; see CompactionIterator stuff below).
* Avoids extra key comparisons, in `WithinPenultimateLevelOutputRange()`,
  in relevant compactions (more CPU efficient, though untested).

There are two big pieces/changes to enable this simplification to a
single `penultimate_after_seqno_` threshold:
* Allow range tombstones to be sent to either output level, based on
  sequence number.
* Use sequence numbers instead of range checks to avoid data in the last
  level from moving to penultimate level outside of the permissable
  range on that level (due to compaction selecting wider range in the
  later input level, which is the normal output level). With this
  change, data can only move "back up the LSM" when entire sorted runs
  are selected for comapction.

Possible disadvantages:
* Extra CPU to iterate over range tombstones in relevant compactions
  *twice* instead of once. However, work loads with lots of range tombstones
  relative to other entries should be rare.
* Data might not migrate back up the LSM tree on option changes as
  aggressively or consistently. This should a a rare concern, however,
  especially for universal compaction where selecting full sorted runs
  is normal compaction.
* This approach is arguably "further away from" a design that allows for
  other kinds of output level placement decisions, such as range-based
  input data hotness. However, properly handling range tombstones with
  such policies will likely require flexible placement into outputs,
  as this change introduces.

Additional details:
* For good code abstraction, separate CompactionIterator from the
  concern of where to place compaction outputs. CompactionIterator is
  supposed to provide a stream of entries, including the "best" sequence
  number we can assign to those entries. If it's safe and proper to zero
  out a sequence number, the placement of entries to outputs should deal
  with that safely rather than having complex inter-dependency between
  sequence number assignment and placement. To achieve this, we migrate
  all the compaction output placement logic that was in CompactionIterator
  to CompactionJob and similar. This unfortunately renders some unit
  tests (PerKeyPlacementCompIteratorTest) depending on the bad
  abstraction as obsolete, but tiered_compaction_test has pretty good
  coverage overall, catching many issues during this development.

Intended follow-up:
* See FIXME items in tiered_compaction_test
* More testing / validation / support for tiering + UDT
* Consider generalizing this work to split results at other levels as
  appropriate based on stats (auto-tuning essentially). Allowing only the
  last level to be cold is limiting.

Test Plan: tests were added in previous changes (facebook#13244 facebook#13124), and
updated here to reflect correct operation (with some known problems for
leveled compaction)
facebook-github-bot pushed a commit that referenced this pull request Jan 3, 2025
Summary:
The primary goal of this change was to support full dynamic mutability of options `preclude_last_level_data_seconds` and `preserve_internal_time_seconds`, which was challenging because of subtle design holes referenced from #13124.

The fix is, in a sense, "doubling down" on the idea of write-time-based tiering, by simplifying the output level decision with a single sequence number threshold. This approach has some advantages:
* Allows option mutability in presence of long snapshots (or UDT)
* Simpler to believe correct because there's no special treatment for range tombstones, and output level assignment does not affect sequence number assignment to the entries (which takes some care to avoid circular dependency; see CompactionIterator stuff below).
* Avoids extra key comparisons, in `WithinPenultimateLevelOutputRange()`, in relevant compactions (more CPU efficient, though untested).

There are two big pieces/changes to enable this simplification to a single `penultimate_after_seqno_` threshold:
* Allow range tombstones to be sent to either output level, based on sequence number.
* Use sequence numbers instead of range checks to avoid data in the last level from moving to penultimate level outside of the permissable range on that level (due to compaction selecting wider range in the later input level, which is the normal output level). With this change, data can only move "back up the LSM" when entire sorted runs are selected for comapction.

Possible disadvantages:
* Extra CPU to iterate over range tombstones in relevant compactions *twice* instead of once. However, work loads with lots of range tombstones relative to other entries should be rare.
* Data might not migrate back up the LSM tree on option changes as aggressively or consistently. This should a a rare concern, however, especially for universal compaction where selecting full sorted runs is normal compaction.
* This approach is arguably "further away from" a design that allows for other kinds of output level placement decisions, such as range-based input data hotness. However, properly handling range tombstones with such policies will likely require flexible placement into outputs, as this change introduces.

Additional details:
* For good code abstraction, separate CompactionIterator from the concern of where to place compaction outputs. CompactionIterator is supposed to provide a stream of entries, including the "best" sequence number we can assign to those entries. If it's safe and proper to zero out a sequence number, the placement of entries to outputs should deal with that safely rather than having complex inter-dependency between sequence number assignment and placement. To achieve this, we migrate all the compaction output placement logic that was in CompactionIterator to CompactionJob and similar. This unfortunately renders some unit tests (PerKeyPlacementCompIteratorTest) depending on the bad abstraction as obsolete, but tiered_compaction_test has pretty good coverage overall, catching many issues during this development.

Intended follow-up:
* See FIXME items in tiered_compaction_test
* More testing / validation / support for tiering + UDT
* Consider generalizing this work to split results at other levels as appropriate based on stats (auto-tuning essentially). Allowing only the last level to be cold is limiting.

Pull Request resolved: #13256

Test Plan: tests were added in previous changes (#13244 #13124), and updated here to reflect correct operation (with some known problems for leveled compaction)

Reviewed By: cbi42

Differential Revision: D67683210

Pulled By: pdillinger

fbshipit-source-id: ca3f2bbc2fcc6891516a2a4220f1b0da09af5ade
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants