Contributor

@zilm13 zilm13 commented Dec 17, 2025

PR Description

Because a user could quit at any moment during forward sync, this PR has the following prerequisites:

  • enabling the reworked DasCustodyBackfiller by default
  • revisiting DasCustodyBackfiller's first-run checks, since slots that are not fully completed could span a large range

It appears to work:

2025-12-17 23:29:30.882 INFO  - Check is finished for 13176947(0x062c7b2808cd05112d6b507efd95e5bc84ee4464612e74408530032706868b9e), missing 64
2025-12-17 23:29:41.104 INFO  - Block 13176947(0x062c7b2808cd05112d6b507efd95e5bc84ee4464612e74408530032706868b9e) has been imported
2025-12-17 23:29:43.708 INFO  - Full check is finished for 13176947(0x062c7b2808cd05112d6b507efd95e5bc84ee4464612e74408530032706868b9e), missing 0
2025-12-17 23:29:47.049 INFO  - Removed fetch completed 13176947(0x062c7b2808cd05112d6b507efd95e5bc84ee4464612e74408530032706868b9e)
2025-12-17 23:29:41.407 INFO  - Check is finished for 13176949(0x01eeaa1de04a3bdc2c6a0ca080db0479aefff7b409f2a3e1c759b4a49ee9839f), missing 63
2025-12-17 23:29:43.495 INFO  - Block 13176949(0x01eeaa1de04a3bdc2c6a0ca080db0479aefff7b409f2a3e1c759b4a49ee9839f) has been imported
2025-12-17 23:29:43.749 INFO  - Full check is finished for 13176949(0x01eeaa1de04a3bdc2c6a0ca080db0479aefff7b409f2a3e1c759b4a49ee9839f), missing 0
2025-12-17 23:29:47.049 INFO  - Removed fetch completed 13176949(0x01eeaa1de04a3bdc2c6a0ca080db0479aefff7b409f2a3e1c759b4a49ee9839f)

Fixed Issue(s)

Main part of #9938

Documentation

  • I thought about documentation and added the doc-change-required label to this PR if updates are required.

Changelog

  • I thought about adding a changelog entry, and added one if I deemed necessary.

Note

Introduces optional early completion of data availability after 50% of columns, controlled by a new P2P config and CLI flag, with sampler/tracker logic and tests updated.

  • Data Availability Sampling (DAS):
    • Add halfColumnsSamplingCompletionEnabled to DasSamplerBasic and pass to DataColumnSamplingTracker to allow early completion when half of columns are sampled (Fulu-derived column count/2).
    • Replace rpcFetchScheduled with rpcFetchInProgress; adjust scheduling/reset logic and first-seen handling.
    • Update slot-pruning to only remove fully sampled trackers; incomplete ones complete exceptionally when outdated/imported.
    • Reference tests construct DasSamplerBasic with early-completion disabled.
  • Tracker Enhancements (DataColumnSamplingTracker):
    • Track fullySampled and optional earlyCompletionRequirementCount; implement partial completion on threshold; expose getMissingColumnIdentifiers().
  • Config/CLI Wiring:
    • Add columnsDataAvailabilityHalfCheckEnabled to P2PConfig (default false) and builder method; expose getter.
    • Add hidden CLI flag --Xcolumns-data-availability-half-check-enabled; plumb into P2PConfig.
    • BeaconChainController passes the flag to DasSamplerBasic.
  • Tests:
    • Extend/adjust DAS sampler and tracker unit tests for early completion, in-progress flag, and pruning behavior.
    • Add CLI option tests for default and toggle behavior.

Written by Cursor Bugbot for commit 0aa89ba.
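
The config wiring summarized above (a boolean on the config, off by default, set through a builder and read through a getter) can be pictured roughly as follows. This is only a minimal sketch: `P2PConfigSketch`, the constant name, and the getter name are hypothetical stand-ins, not the actual P2PConfig code; only the field name `columnsDataAvailabilityHalfCheckEnabled` and the default of false come from the summary.

```java
// Hypothetical stand-in for P2PConfig; only the field name and its default come from the PR summary.
final class P2PConfigSketch {
  // Assumed constant name; the summary only states that the default is false.
  static final boolean DEFAULT_COLUMNS_DATA_AVAILABILITY_HALF_CHECK_ENABLED = false;

  private final boolean columnsDataAvailabilityHalfCheckEnabled;

  private P2PConfigSketch(final boolean columnsDataAvailabilityHalfCheckEnabled) {
    this.columnsDataAvailabilityHalfCheckEnabled = columnsDataAvailabilityHalfCheckEnabled;
  }

  // Getter name is an assumption for this sketch.
  boolean isColumnsDataAvailabilityHalfCheckEnabled() {
    return columnsDataAvailabilityHalfCheckEnabled;
  }

  static Builder builder() {
    return new Builder();
  }

  static final class Builder {
    // The feature stays off unless the builder method is called with true.
    private boolean columnsDataAvailabilityHalfCheckEnabled =
        DEFAULT_COLUMNS_DATA_AVAILABILITY_HALF_CHECK_ENABLED;

    Builder columnsDataAvailabilityHalfCheckEnabled(final boolean enabled) {
      this.columnsDataAvailabilityHalfCheckEnabled = enabled;
      return this;
    }

    P2PConfigSketch build() {
      return new P2PConfigSketch(columnsDataAvailabilityHalfCheckEnabled);
    }
  }
}
```

In this picture, the hidden CLI flag would simply feed its parsed boolean into the builder method, and the controller would read the getter when constructing the sampler.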

@zilm13 zilm13 added the "blocked by another PR/issue 💔" label Dec 17, 2025
@zilm13 zilm13 marked this pull request as ready for review December 18, 2025 14:50
@zilm13 zilm13 removed the "blocked by another PR/issue 💔" label Dec 18, 2025
Contributor Author

zilm13 commented Dec 18, 2025

The decision, instead of blocking, is to:

  • add a hidden command-line flag to turn the feature on
  • disable it by default
  • merge today, open an issue for the blockers, and fix them later

Contributor

@tbenr tbenr left a comment

some questions first

if (!removed) {
  LOG.debug("Column {} was already marked as received, origin: {}", columnIdentifier, origin);
  return false;
} else {

Contributor

Remove the else, since we return in the main branch.

Comment on lines 79 to 82
if (completionColumnCount().isPresent()
    && !completionFuture.isDone()
    && (samplingRequirement.size() - missingColumns().size())
        >= completionColumnCount.get()) {

Contributor

too much noise here, can we have a dedicated method to check this condition?
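
One way to read this request is to pull the three-part condition into a single predicate on the tracker. The sketch below is a simplified, hypothetical stand-in for DataColumnSamplingTracker: it uses plain `int` column indices and `CompletableFuture` instead of `UInt64` and `SafeFuture`, and completing the future with the whole sampling requirement is an assumption made only for illustration. The predicate name `isCompletedEarly()` matches the name mentioned later in this thread.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical, simplified stand-in for DataColumnSamplingTracker.
final class SamplingTrackerSketch {
  private final List<Integer> samplingRequirement;
  private final Set<Integer> missingColumns = ConcurrentHashMap.newKeySet();
  private final Optional<Integer> completionColumnCount;
  private final CompletableFuture<List<Integer>> completionFuture = new CompletableFuture<>();
  private final AtomicBoolean fullySampled = new AtomicBoolean(false);

  SamplingTrackerSketch(
      final List<Integer> samplingRequirement, final Optional<Integer> completionColumnCount) {
    this.samplingRequirement = List.copyOf(samplingRequirement);
    this.missingColumns.addAll(samplingRequirement);
    this.completionColumnCount = completionColumnCount;
  }

  /** Marks a column as received; returns false if it was not missing. */
  boolean add(final int column) {
    if (!missingColumns.remove(column)) {
      return false;
    }
    if (missingColumns.isEmpty()) {
      // Every required column has arrived: full completion.
      fullySampled.set(true);
      completionFuture.complete(samplingRequirement);
    } else if (isCompletedEarly()) {
      // Threshold reached: complete the future but leave fullySampled unset.
      completionFuture.complete(samplingRequirement);
    }
    return true;
  }

  // The dedicated check: early completion is configured, the future is still
  // pending, and enough columns have been received.
  private boolean isCompletedEarly() {
    return completionColumnCount.isPresent()
        && !completionFuture.isDone()
        && samplingRequirement.size() - missingColumns.size() >= completionColumnCount.get();
  }

  CompletableFuture<List<Integer>> completionFuture() {
    return completionFuture;
  }

  boolean isFullySampled() {
    return fullySampled.get();
  }
}
```

With the condition named, the call site reads as a single branch, and the early return in `add` needs no `else`.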

  if (missingColumns.isEmpty()) {
    LOG.debug(
-       "Sampling complete for slot {} root {} via column {} received via {}",
+       "Fetching complete for slot {} root {} via column {} received via {}",

Contributor

I disagree with the renaming here.
We are still in the context of sampling. We may be in a "partial sampling" case or a "full sampling" case.

  final Set<UInt64> missingColumns = ConcurrentHashMap.newKeySet(samplingRequirement.size());
  missingColumns.addAll(samplingRequirement);
  final SafeFuture<List<UInt64>> completionFuture = new SafeFuture<>();
+ final SafeFuture<List<UInt64>> fetchCompletionFuture = new SafeFuture<>();

Contributor

Why do we need this new future? Which flow will be attached to it?

Contributor Author

Let's talk about it; I don't understand what you mean.

tracker
    .completionFuture()
    .completeExceptionally(new RuntimeException("DAS sampling expired"));
return true;

Contributor

I'm worried about the possibility that we discard the tracker while fetchCompletionFuture is still not completed, possibly with some requests still pending to complete the full sampling.

Contributor Author

I've added completion of fetchCompletionFuture(). I'm not sure it's needed because, unlike completionFuture(), it has no consumers. Anyway, let's have it.

Contributor

Well, if we want to represent it with a SafeFuture we must maintain it correctly; otherwise we can make it an AtomicBoolean and the problem is solved.
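
The AtomicBoolean alternative can be sketched as follows. `FetchStateSketch` and its method names are hypothetical and column indices are plain `int`s; the point is only that a flag with no consumers has nothing left dangling when a tracker is discarded, whereas a future would need to be completed on every exit path.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: "all columns fetched" is tracked as a flag instead of a second future.
final class FetchStateSketch {
  private final Set<Integer> missingColumns = ConcurrentHashMap.newKeySet();
  private final AtomicBoolean fullySampled = new AtomicBoolean(false);

  FetchStateSketch(final Set<Integer> requiredColumns) {
    missingColumns.addAll(requiredColumns);
  }

  /** Records a received column; returns true only for the call that flips the flag. */
  boolean onColumnReceived(final int column) {
    missingColumns.remove(column);
    // compareAndSet ensures exactly one caller observes the transition.
    return missingColumns.isEmpty() && fullySampled.compareAndSet(false, true);
  }

  boolean isFullySampled() {
    return fullySampled.get();
  }
}
```

Discarding such a tracker requires no extra cleanup, because nothing is waiting on the flag.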


- return false;
+ // cleanup only if fully sampled
+ return tracker.fullySampled().get();

Memory leak for orphaned blocks with early completion

The new cleanup logic in onSlot doesn't remove trackers for orphaned blocks that received early completion. When a block's slot becomes finalized but the block itself is not in the chain (orphaned due to reorg), and the tracker has completionFuture done from early completion but fullySampled is false, the tracker will never be cleaned up. The old logic removed any tracker with a done completionFuture when the slot was finalized or block imported. The new logic falls through to tracker.fullySampled().get() which returns false, keeping the tracker forever. These orphaned trackers accumulate because the DasCustodyBackfiller won't complete sampling for blocks not in the canonical chain.

Contributor

Until we complete all flows around this feature, this is actually true. The feature is turned off by default because of this (and because we can't currently guarantee that we complete the sampling and thus set fullySampled to true).

tracker
    .completionFuture()
    .completeExceptionally(
        new RuntimeException("DAS sampling expired while slot finalized"));

Race condition can prematurely remove partially completed trackers

There's a race condition between checking completionFuture().isDone() and calling completeExceptionally(). If another thread completes the future with partial completion (via isCompletedEarly()) between these two operations, completeExceptionally() becomes a no-op but the code still returns true, removing the tracker. For partial completions, fullySampled is not set to true, so the tracker should remain in the map to allow background downloading of remaining columns. The return value of completeExceptionally() could be used to detect this race.
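
A minimal sketch of that suggestion, assuming the removal decision is factored into a helper: `PruneSketch`, `Tracker`, and `shouldRemoveExpired` are hypothetical names, and the standard `CompletableFuture` stands in for `SafeFuture`. `CompletableFuture.completeExceptionally` returns true only when this call moved the future to a completed state, so a false result means another thread completed it first.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of using the completeExceptionally() return value during pruning.
final class PruneSketch {

  // Simplified stand-in for the tracker state relevant to pruning.
  static final class Tracker {
    final CompletableFuture<Void> completionFuture = new CompletableFuture<>();
    final AtomicBoolean fullySampled = new AtomicBoolean(false);
  }

  /** Returns true when the tracker may be removed from the map. */
  static boolean shouldRemoveExpired(final Tracker tracker) {
    final boolean expiredByThisCall =
        tracker.completionFuture.completeExceptionally(
            new RuntimeException("DAS sampling expired while slot finalized"));
    if (expiredByThisCall) {
      // Nobody completed the future before us, so the tracker is safe to drop.
      return true;
    }
    // Another thread completed the future first. Keep the tracker if that was an
    // early (partial) completion, so the remaining columns can still be fetched.
    return tracker.completionFuture.isCompletedExceptionally() || tracker.fullySampled.get();
  }
}
```

This only addresses the race described here; trackers for orphaned blocks that completed early would still need a separate cleanup path.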

Contributor

@tbenr tbenr left a comment

LGTM, but one of Cursor's comments is important.

    .completionFuture()
    .completeExceptionally(
        new RuntimeException("DAS sampling expired while slot finalized"));
return true;

Contributor

I think we can even remove this return true.
