rough-out new BackfillSync class structure #8353
Draft
vedant-asati wants to merge 364 commits into ChainSafe:backfill from vedant-asati:va/roughout-new-backfill-class
Conversation
**Motivation** - `onlyConnect*` flags are not (and should not be) used anymore, so we should remove them to reduce maintenance cost cc @matthewkeil Co-authored-by: Tuyen Nguyen <[email protected]>
**Motivation** - this is a precondition for ChainSafe#7860 - when publishing a block, we need to ensure we have enough peers on all column subnets so that we can publish all blob data to the network - this is not a concern for a node without validators, but since it is very easy to achieve we handle this case anyway **Description** - track `samplingGroups` on PeerData - for sampling groups, ensure at least 6 peers per subnet - for non-sampling groups, ensure at least 4 peers per subnet **Metrics on fusaka-devnet-2** we have 25-35 peers on all subnets, so this is more than enough <img width="1537" height="619" alt="Screenshot 2025-07-14 at 11 13 01" src="https://github.com/user-attachments/assets/34c55997-e437-4c53-a829-a9c202a58d1d" /> --------- Co-authored-by: Tuyen Nguyen <[email protected]>
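A minimal sketch of the per-subnet peer targets this commit describes, assuming the 6/4 targets from the description; the names (`SubnetInfo`, `peersNeeded`) are illustrative and not the actual Lodestar peer manager API.

```ts
// Hypothetical illustration of the per-subnet peer targets described above.
interface SubnetInfo {
  subnetId: number;
  isSamplingGroup: boolean; // whether we sample from this column subnet
}

// Targets taken from the commit description: 6 peers on sampling subnets, 4 otherwise.
const TARGET_PEERS_SAMPLING = 6;
const TARGET_PEERS_NON_SAMPLING = 4;

function peersNeeded(subnet: SubnetInfo, connectedPeers: number): number {
  const target = subnet.isSamplingGroup ? TARGET_PEERS_SAMPLING : TARGET_PEERS_NON_SAMPLING;
  return Math.max(0, target - connectedPeers);
}
```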
**Motivation** Remove idea of "advertised custody" completely. Only use target custody and let `earliestAvailableSlot` gate what requests can be made **This is a duplicate of ChainSafe#8027 to merge it into `peerDAS` branch**
**Motivation** fixes ChainSafe#8007 peers that survive across the fork boundary have an old status in the peer manager, and ssz.toJson fails if the value is undefined **Duplicate PR of ChainSafe#8021 but merges to the `peerDAS` branch** Co-authored-by: Nico Flaig <[email protected]>
**Motivation** - there are hiccups when syncing - it shows a weird log: we download some columns but then they appear again in the same batch, see ChainSafe#8036 (comment)
**Description** - improve the chain log id by tracking a global `nextChainId` - increase `MAX_BATCH_DOWNLOAD_ATTEMPTS` to 20 as we didn't implement rate limit prevention on the client side; this should help us sync faster on devnets - ensure `pendingDataColumns` is decreased over time, this is a nice-to-have as we should download a single batch at a time Closes ChainSafe#8036
**Testing** - was able to sync with the `te/fusaka_devnet_2_increase_MAX_BATCH_PROCESSING_ATTEMPT` branch <img width="1561" alt="Screenshot 2025-07-07 at 16 00 31" src="https://github.com/user-attachments/assets/d356de29-9282-4eb2-b607-154f0457e3dc" /> - log is improved with the new id:
```
verbose: SyncChain Error id=Head-10, code=BATCH_ERROR_MAX_DOWNLOAD_ATTEMPTS, startEpoch=2463, status=Downloading
```
--------- Co-authored-by: Tuyen Nguyen <[email protected]>
**Motivation** The new Grafana PeerDAS dashboard `Lodestar - PeerDAS` helps to monitor PeerDAS. Any internal Lodestar metrics regarding PeerDAS can be added to this dashboard. Originally it contained the metrics from the [beacon metrics specs ChainSafe#14](ethereum/beacon-metrics#14), adjusted to Lodestar style and variables.
**Description** Beacon metrics added:
- Data column gossip:
  - Data column sidecars submitted for gossip verification;
  - Data column sidecars verified for gossip;
  - Column processing failures;
  - Runtime of data column sidecars gossip verification.
- engine_getBlobsV2:
  - engine_getBlobsV2 requests;
  - engine_getBlobsV2 responses;
  - Unsuccessful engine_getBlobsV2 requests rate (%);
  - Duration of engine_getBlobsV2 requests.
- Computation, verification, custody:
  - Time taken to compute data column sidecar;
  - Time taken to verify data column sidecar inclusion proof;
  - Runtime of batched data column kzg verification;
  - Custody groups;
  - Backfilled custody groups.
- Reconstruction:
  - Total count of reconstructed columns;
  - Time taken to reconstruct columns.

Lodestar metrics added:
- Total number of received data columns by source;
- Time elapsed between block slot time and the time data column received.

Lodestar reconstruction metrics were updated and added to the dashboard:
- Time elapsed between block slot time and the time data column sidecar recovered;
- Total number of partial columns cached before recovery;
- Total count of recover result of data column sidecars.

`lodestar_recover_data_column_sidecar_recover_time_seconds` was transformed into `beacon_data_availability_reconstruction_time_seconds` from the beacon metrics specs and moved to `metrics/beacon.ts`.
**Steps to test or reproduce** Check the `Lodestar - PeerDAS` dashboard in Lodestar Grafana.
**Motivation** There was an [issue on fusaka-devnet-3 builder flow](https://discord.com/channels/595666850260713488/1399324773758140426) due to failed validator registrations. The problem was that we already sent validator registrations pre-genesis, and unless the registration data (fee recipient and gas limit) changes we will keep sending the cached registration data (as per ChainSafe#4447), which reuses the previous timestamp. This is problematic because mev-boost-relay rejects those registrations if the [timestamp is before genesis time](https://github.com/flashbots/mev-boost-relay/blob/344d5b67d5940b2bd4e462d790858dd05a445fdc/services/api/service.go#L2903), so it will keep rejecting them until we restart the validator client or change the registration data to clear the cache. **Description** Don't send validator registrations pre-genesis, i.e. only start sending and caching them once we are in epoch 0. This matches the behavior of Prysm (see OffchainLabs/prysm#12847), which has been there for a while. The only downside I can see is that in epoch 0 we might not be able to produce builder blocks in the first few slots, but that is really only relevant for devnets or local kurtosis runs; either way, since mev-relay rejects the registrations anyway, there is not much we can do in that case.
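As a rough illustration of the rule above (not the actual validator client code; `shouldSubmitRegistration` is a hypothetical helper), the gate boils down to a genesis-time check:

```ts
// Only start sending (and caching) validator registrations once we are past genesis,
// since mev-boost-relay rejects registrations whose timestamp is before genesis time.
function shouldSubmitRegistration(nowSec: number, genesisTimeSec: number): boolean {
  return nowSec >= genesisTimeSec;
}
```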
**Motivation** Based on ChainSafe#8079 our proposer boost reorg implementation works as expected. To get more test coverage and observe it on actual networks, it makes sense to enable it by default. **Description** Enable proposer boost reorg by default
**Motivation** - ChainSafe#7280 **Description** - update chainsafe/bls dependency (used in light client) to use the newest version of chainsafe/blst
**Motivation** - ethereum/beacon-APIs#546 **Description** Add `GET /eth/v1/beacon/blobs/{block_id}` endpoint to retrieve blobs - pre-fulu we can just return data stored in db - post-fulu we need to reconstruct the blobs from stored data column sidecars
**Motivation** The custody group count (cgc) should only increase; per spec it should not go down
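A minimal sketch of the "only increase" rule, assuming a hypothetical helper name; the real update lives in Lodestar's custody logic:

```ts
// The advertised custody group count (cgc) must never decrease, so clamp the
// newly computed target against the current value.
function nextCustodyGroupCount(current: number, computedTarget: number): number {
  return Math.max(current, computedTarget);
}
```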
**Motivation** Fixes ChainSafe#8040. During fusaka-devnet-2 an issue was identified: there is a bug in Nimbus where they disconnect us for bad protocol negotiation due to the lack of StatusV2 support on their end. We get disconnected and then immediately redial the same peer. See issue ChainSafe#8040. This adds the idea of a "cool down" period. The implementation builds on how we do normal score decay: set the wait time for decay to the desired cool down time. Discovery will check for this before attempting to dial a peer, even upon rediscovery.
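A sketch of the cool-down gate under stated assumptions (the duration and names below are placeholders, not the values or APIs used in this PR):

```ts
// Assumed cool-down duration for illustration only.
const DIAL_COOLDOWN_MS = 5 * 60 * 1000;

// peerId -> time of the last disconnect
const lastDisconnectMs = new Map<string, number>();

function onDisconnect(peerId: string, nowMs: number): void {
  lastDisconnectMs.set(peerId, nowMs);
}

// Discovery checks this before attempting to dial a peer, even upon rediscovery.
function canDial(peerId: string, nowMs: number): boolean {
  const last = lastDisconnectMs.get(peerId);
  return last === undefined || nowMs - last >= DIAL_COOLDOWN_MS;
}
```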
This removes the `NODE_CUSTODY_REQUIREMENT` config value as it doesn't seem to be part of the spec. It seems like the purpose of this config right now is to be able to pass down supernode requirements (i.e. setting the requirement to 128) and potentially allow overriding the custody requirement for testing, but this is a misuse of how config should be handled. As discussed with @matthewkeil we don't need to manually set the node custody requirement or disable validator custody anymore and can rather just rely on the `--supernode` flag to custody all groups or on dynamic validator custody.
…trics (ChainSafe#8102) **Motivation** During sync we might not yet be in electra, hence `pendingDeposits` is undefined, and we mistakenly use the clock slot to determine if we should collect electra metrics when it should be based on the state slot.
```
$ curl localhost:8008/metrics
TypeError: Cannot read properties of undefined (reading 'length')
    at BeaconChain.onScrapeMetrics (file:///usr/app/packages/beacon-node/src/chain/chain.ts:1123:68)
    at file:///usr/app/packages/beacon-node/src/chain/chain.ts:395:47
    at GaugeExtra.collect (file:///usr/app/packages/beacon-node/src/metrics/utils/gauge.ts:19:7)
    at GaugeExtra.get (/usr/app/node_modules/prom-client/lib/gauge.js:110:19)
    at RegistryMetricCreator.getMetricsAsString (/usr/app/node_modules/prom-client/lib/registry.js:35:21)
    at /usr/app/node_modules/prom-client/lib/registry.js:91:16
    at Array.map (<anonymous>)
    at RegistryMetricCreator.metrics (/usr/app/node_modules/prom-client/lib/registry.js:87:45)
    at Server.onRequest (file:///usr/app/packages/beacon-node/src/metrics/server/http.ts:47:64)
```
**Description** Use the state slot instead of the clock slot before gathering electra metrics
**Motivation** Swallowed errors are bad....
**Motivation** ~~as found in ChainSafe#8082, there is a cyclic dependency between `afterProcessEpoch` and `upgradeState*` like `upgradeStateToElectra`~~ this PR improves a couple of things: - to prepare for the future: future contributors can just decide where to put logic and it's straightforward for them. The cyclic dependency does not happen here because the logic of computing proposers is unchanged for fulu, but it may happen for other fields in the future - at the fork boundary (epoch 0 of fulu), we implicitly run pre-fork logic and only run post-fork logic from epoch 1, which is not great. In epoch transition we should make everything explicit to make it easier to maintain - performance: we compute proposers for the current epoch and next epoch twice at epoch 0 of fulu **Description** - implement finalProcessEpoch(), which computes epoch data that depends on `upgradeState*` - call this after `upgradeState*()`, as sketched below --------- Co-authored-by: Tuyen Nguyen <[email protected]> Co-authored-by: Nico Flaig <[email protected]>
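The intended call order can be sketched as follows; the state shape and upgrade step are placeholders, and only the ordering (fork-agnostic work, then the state upgrade, then `finalProcessEpoch`-style work) reflects the description above:

```ts
// Illustrative ordering only, not the actual state-transition API.
type StateLike = {epoch: number; fork: string; proposers?: number[]};

function epochTransition(state: StateLike, isForkBoundary: boolean): StateLike {
  // afterProcessEpoch(): fork-agnostic epoch bookkeeping
  let post: StateLike = {...state, epoch: state.epoch + 1};
  if (isForkBoundary) {
    // upgradeState*(): stand-in for e.g. upgradeStateToFulu
    post = {...post, fork: "fulu"};
  }
  // finalProcessEpoch(): compute data that depends on the upgraded state,
  // e.g. proposers, so epoch 0 of the new fork explicitly uses post-fork logic.
  post = {...post, proposers: []};
  return post;
}
```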
) **Motivation** Addressing PR comment for peerDAS branch ChainSafe#6353 (comment)
**Motivation** Address peerDAS PR comment ChainSafe#6353 (comment) Fields are legacy and no longer used anywhere
**Motivation** Addressing peerDAS PR comment ChainSafe#6353 (comment) Changes order of publish to match comment and likely the best order for a sharded DA network. Also adds additional information to the comment
**Motivation** Some logs in the PeerDAS branch are inappropriate. **Description** This PR changes inappropriate PeerDAS log levels from `warn` and `error` to `debug` level. Closes ChainSafe#8107 **Steps to test or reproduce** Run beacon node locally using `./scripts/dev/node*.sh` with --logLevel `info` and `debug` and check the logs. --------- Co-authored-by: Cayman <[email protected]>
**Motivation** - ChainSafe#6353 (comment) **Description** - make `NetworkConfig` a simple object, fix consumers - fix stray import breaking `yarn check-types`
**Motivation** - ChainSafe#6353 (comment) **Description** remove stray event handler on network close
**Motivation** - ChainSafe#6353 (comment) **Description** - add alias to old (pre-peerDAS) cli flag --------- Co-authored-by: Nico Flaig <[email protected]>
…afe#8491) **Motivation** Using `*` as a CLI flag value is not bash-friendly **Description** Add backward-compatible support for the bash-friendly value `all` as an alternative to `*` for the following CLI flags: - `--rest.namespace` - now accepts both `*` and `all` to enable all API namespaces - `--rest.cors` - now accepts both `*` and `all` to allow all origins - `--keymanager.cors` - now accepts both `*` and `all` to allow all origins Closes ChainSafe#5963
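A small sketch of the backward-compatible parsing (the helper name is illustrative, not the actual CLI code):

```ts
// Treat the bash-friendly "all" as an alias for "*", keeping "*" as the
// internal wildcard representation so existing behavior is unchanged.
function parseWildcardFlag(values: string[]): string[] {
  if (values.some((v) => v === "*" || v.toLowerCase() === "all")) {
    return ["*"];
  }
  return values;
}
```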
… decimal (ChainSafe#8495) - Closes ChainSafe#7710 --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
**Motivation** Recently, we have been experiencing many instances of AI-generated slop submitted by first-time contributors, and a policy rework is needed to ensure our time is respected whilst still allowing external contributors to freely contribute to further enhancing this project. This proposal ensures that new contributors self-disclose any use of AI technologies as part of their submission, to minimize time waste and allow issues to be resolved with a thorough understanding of the output at hand. **Description** This PR: - Adds an AI Assistance Notice to contribution.md - Moves important context for first time contributors to the top of the document - Corrects minor grammar Additional credit to @nflaig and [Ghostty's policy](https://github.com/ghostty-org/ghostty/blob/main/CONTRIBUTING.md#ai-assistance-notice) for the approach to minimizing this problem. --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Default `port6` to the `--port` (ipv4 port) if set explicitly and `--port6` is not set.
**Motivation** - Review of ChainSafe#8468 metrics - In ChainSafe#8449, use of `datastore-level` was unilaterally removed in favor of the bun-supported `datastore-fs` - This caused a regression **Description** - use `datastore-level` by default, only use `datastore-fs` in bun --------- Co-authored-by: Nico Flaig <[email protected]>
**Motivation** To make transition to `getBlobs` api smoother and make Lodestar work with v0.29.0 release of checkpointz ethpandaops/checkpointz#215, we should be serving blob sidecars even post fulu from `getBlobSidecars` api. **Description** Instead of throwing an error, we do the following now post fulu - fetch data column sidecars for block from db (if we custody at least 64 columns) - reconstruct blobs from data column sidecars - reconstruct blob sidecars from blobs (recompute kzg proof and inclusion proof)
**Motivation** - got "heap size limit too low" warn in our Bun instance but has no idea what's the exact value of it **Description** - include `heapSizeLimit` in the log --------- Co-authored-by: Tuyen Nguyen <[email protected]>
**Motivation** - while investigating ChainSafe#8519 I found a performance issue with `computeDeltas()` where we call `Set.has()` for every item in the big validator index loop
**Description** - sort the `equivocatingIndices` set then track equivocatingValidatorIndex to avoid `Set.has()` there - fix the perf test to include some equivocating indices
**Benchmark on my local environment**
- NodeJS: it's 1.53x faster

before
```
computeDeltas
✔ computeDeltas 1400000 validators 300 proto nodes 73.65370 ops/s 13.57705 ms/op - 931 runs 30.1 s
✔ computeDeltas 1400000 validators 1200 proto nodes 73.44709 ops/s 13.61524 ms/op - 922 runs 30.0 s
✔ computeDeltas 1400000 validators 7200 proto nodes 73.59195 ops/s 13.58844 ms/op - 937 runs 30.0 s
✔ computeDeltas 2100000 validators 300 proto nodes 49.27426 ops/s 20.29457 ms/op - 623 runs 30.1 s
✔ computeDeltas 2100000 validators 1200 proto nodes 49.11422 ops/s 20.36070 ms/op - 614 runs 30.1 s
✔ computeDeltas 2100000 validators 7200 proto nodes 48.75805 ops/s 20.50943 ms/op - 619 runs 30.1 s
```
after
```
computeDeltas
✔ computeDeltas 1400000 validators 300 proto nodes 113.6256 ops/s 8.800830 ms/op - 1076 runs 30.1 s
✔ computeDeltas 1400000 validators 1200 proto nodes 112.0909 ops/s 8.921329 ms/op - 1079 runs 30.0 s
✔ computeDeltas 1400000 validators 7200 proto nodes 111.5792 ops/s 8.962247 ms/op - 1068 runs 30.1 s
✔ computeDeltas 2100000 validators 300 proto nodes 75.48259 ops/s 13.24809 ms/op - 727 runs 30.1 s
✔ computeDeltas 2100000 validators 1200 proto nodes 74.93052 ops/s 13.34570 ms/op - 707 runs 30.1 s
✔ computeDeltas 2100000 validators 7200 proto nodes 74.82280 ops/s 13.36491 ms/op - 751 runs 30.0 s
```
- Bun: it's 3.88x faster

before
```
computeDeltas
✔ computeDeltas 1400000 validators 300 proto nodes 103.6817 ops/s 9.644905 ms/op x1.578 1791 runs 30.0 s
✔ computeDeltas 1400000 validators 1200 proto nodes 103.4132 ops/s 9.669949 ms/op x1.580 1800 runs 30.1 s
✔ computeDeltas 1400000 validators 7200 proto nodes 103.7312 ops/s 9.640297 ms/op x1.578 1745 runs 30.1 s
✔ computeDeltas 2100000 validators 300 proto nodes 68.86443 ops/s 14.52128 ms/op x1.583 1188 runs 30.0 s
✔ computeDeltas 2100000 validators 1200 proto nodes 68.66082 ops/s 14.56435 ms/op x1.585 1195 runs 30.1 s
✔ computeDeltas 2100000 validators 7200 proto nodes 68.49115 ops/s 14.60043 ms/op x1.592 1194 runs 30.1 s
```
after
```
computeDeltas
✔ computeDeltas 1400000 validators 300 proto nodes 407.0697 ops/s 2.456582 ms/op x0.255 3117 runs 30.1 s
✔ computeDeltas 1400000 validators 1200 proto nodes 402.2402 ops/s 2.486077 ms/op x0.257 2838 runs 30.0 s
✔ computeDeltas 1400000 validators 7200 proto nodes 401.5803 ops/s 2.490162 ms/op x0.258 2852 runs 30.0 s
✔ computeDeltas 2100000 validators 300 proto nodes 265.5509 ops/s 3.765757 ms/op x0.259 1988 runs 30.1 s
✔ computeDeltas 2100000 validators 1200 proto nodes 267.6306 ops/s 3.736494 ms/op x0.257 2026 runs 30.0 s
✔ computeDeltas 2100000 validators 7200 proto nodes 266.0949 ops/s 3.758058 ms/op x0.257 2035 runs 30.1 s
```
Co-authored-by: Tuyen Nguyen <[email protected]>
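The core of the optimization can be sketched as below; this is not the actual `computeDeltas()` signature, just an illustration of replacing a per-index `Set.has()` with a cursor over the sorted equivocating indices:

```ts
// Mark equivocating validators without calling Set.has() inside the big loop:
// sort the indices once, then advance a cursor as the loop catches up to them.
function markEquivocating(numValidators: number, equivocatingIndices: Set<number>): boolean[] {
  const sorted = Array.from(equivocatingIndices).sort((a, b) => a - b);
  const isEquivocating = new Array<boolean>(numValidators).fill(false);
  let cursor = 0;
  for (let i = 0; i < numValidators; i++) {
    if (cursor < sorted.length && sorted[cursor] === i) {
      isEquivocating[i] = true;
      cursor++;
    }
  }
  return isEquivocating;
}
```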
**Motivation** Conform to CL spec **Description** - move networking constants to config - remove constants no longer part of spec - enable "p2p-interface.md" in config test Closes ChainSafe#6351 Closes ChainSafe#7529
…e#8486) **Motivation** Voluntary exit validation previously returned only a boolean, which gives vague error codes and makes debugging harder. This PR aims to improve debuggability by providing clearer error messages and feedback during validator exits. **Description** This PR introduces the VoluntaryExitValidity enum to provide granular reasons for voluntary exit validation failures. It refactors processVoluntaryExit and getVoluntaryExitValidity to return specific validity states, rather than a simple boolean. Beacon node validation logic now maps these validity results to error codes (VoluntaryExitErrorCode) for clearer gossip and API handling. This improves debuggability and aligns exit validation with consensus spec requirements. Closes ChainSafe#6330 --------- Co-authored-by: Nico Flaig <[email protected]>
…fe#8532) **Motivation** If we don't have the proposer index in cache when having to build a block, we create a default entry [here](https://github.com/ChainSafe/lodestar/blob/d9cc6b90f70c4740e9d28b50f01d90d1a25b620e/packages/beacon-node/src/chain/produceBlock/produceBlockBody.ts#L196). This shouldn't happen in normal circumstances as proposers are registered beforehand; however, if you produce a block each slot for testing purposes this affects the custody of the node, as it will have up to 32 validators in the proposer cache (assuming a block each slot) and since we never reduce the cgc it will stay at that value. Logs from Ethpandaops from a node without attached validators that is producing a block each slot:
```
Oct-14 09:12:00.005[chain] verbose: Updated target custody group count finalizedEpoch=272653, validatorCount=32, targetCustodyGroupCount=33
```
```
Oct-14 09:12:00.008[network] debug: Updated cgc field in ENR custodyGroupCount=33
```
**Description** Do not create a default cache entry for unknown proposers by using a normal `Map` and just fall back to `suggestedFeeRecipient` if there isn't any value. The behavior from a caller perspective stays the same but we no longer create a proposer cache entry for unknown proposers.
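The lookup-with-fallback can be sketched as follows (hypothetical names; the actual change is in the beacon node's proposer cache):

```ts
// A plain Map means an unknown proposer no longer creates a cache entry,
// so it no longer inflates the validator count used for custody.
const feeRecipientByProposer = new Map<number, string>();

function getFeeRecipient(proposerIndex: number, suggestedFeeRecipient: string): string {
  // Read-only lookup; fall back to the configured suggestedFeeRecipient.
  return feeRecipientByProposer.get(proposerIndex) ?? suggestedFeeRecipient;
}
```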
**Motivation** - investigate and maintain the performance of `processFinalizedCheckpoint()` - this is part of ChainSafe#8526 **Description** - track the duration of `processFinalizedCheckpoint()` by task; the result on a hoodi node shows that `FrequencyStateArchiveStrategy` takes the most time <img width="941" height="297" alt="Screenshot 2025-10-14 at 13 45 38" src="https://github.com/user-attachments/assets/ef440399-538b-4a4a-a63c-e775745b25e6" /> - track different steps of `FrequencyStateArchiveStrategy`; the result shows that the main thread is blocked by different db queries cc @wemeetagain <img width="1291" height="657" alt="Screenshot 2025-10-14 at 13 46 36" src="https://github.com/user-attachments/assets/3b19f008-c7d8-49a4-9dc5-e68b1a5ba2a5" /> part of ChainSafe#8526 --------- Co-authored-by: Tuyen Nguyen <[email protected]>
**Motivation** - track/maintain computeDeltas() metrics **Description** - track computeDeltas() duration, number of 0 deltas, equivocatingValidators, oldInactiveValidators, newInactiveValidators, unchangedVoteValidators, newVoteValidators - as part of the investigation of ChainSafe#8519 we want to make sure metrics are the same with Bun part of ChainSafe#8519 **Metrics collected** - hoodi <img width="1020" height="609" alt="Screenshot 2025-10-14 at 15 08 29" src="https://github.com/user-attachments/assets/4b0ab871-ce04-43c9-8b79-73c13a843f4a" /> - mainnet <img width="958" height="617" alt="Screenshot 2025-10-14 at 15 08 54" src="https://github.com/user-attachments/assets/bc264eb1-905d-4f90-bd17-39fb58068608" /> - updateHead() actually shows a ~5ms increase <img width="842" height="308" alt="Screenshot 2025-10-14 at 15 09 50" src="https://github.com/user-attachments/assets/ec082257-c7c0-4ca3-9908-cd006d23e1de" /> - but it's worth having more details of `computeDeltas`; we saved ~25ms after ChainSafe#8525 <img width="1461" height="358" alt="Screenshot 2025-10-14 at 15 11 29" src="https://github.com/user-attachments/assets/4cffee0e-d0c0-4f74-a647-59f69af0fd99" /> --------- Co-authored-by: Tuyen Nguyen <[email protected]>
**Motivation** See ethereum/consensus-specs#4650 **Description** Ensure data column sidecars respect blob limit by checking that `kzgCommitments.length` of each data column sidecar does not exceed `getMaxBlobsPerBlock(epoch)`.
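A minimal sketch of the check, assuming a simplified sidecar shape and passing the per-epoch blob limit in as a number (the real code reads it from the chain config):

```ts
interface DataColumnSidecarLike {
  kzgCommitments: Uint8Array[];
}

// Reject data column sidecars whose commitment count exceeds the blob limit
// for the sidecar's epoch.
function validateBlobLimit(sidecar: DataColumnSidecarLike, maxBlobsPerBlock: number): void {
  if (sidecar.kzgCommitments.length > maxBlobsPerBlock) {
    throw new Error(
      `too many kzg commitments: ${sidecar.kzgCommitments.length} > ${maxBlobsPerBlock}`
    );
  }
}
```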
This simplifies the calculation of `startEpoch` in head chain range sync. We check the remote finalized epoch against ours here https://github.com/ChainSafe/lodestar/blob/d9cc6b90f70c4740e9d28b50f01d90d1a25b620e/packages/beacon-node/src/sync/utils/remoteSyncType.ts#L33 This means `remote.finalizedEpoch == local.finalizedEpoch` in this case, and the local head epoch should always be >= the finalized epoch, which means we can simply use the epoch of `local.headSlot` here. cc @twoeths
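In sketch form (using the mainnet `SLOTS_PER_EPOCH` of 32 for illustration), the simplified computation is just the epoch of our own head slot:

```ts
const SLOTS_PER_EPOCH = 32;

function computeEpochAtSlot(slot: number): number {
  return Math.floor(slot / SLOTS_PER_EPOCH);
}

// Head-sync case: remote.finalizedEpoch == local.finalizedEpoch, so start
// range sync from the epoch of our own head slot.
function headSyncStartEpoch(localHeadSlot: number): number {
  return computeEpochAtSlot(localHeadSlot);
}
```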
Motivation
The goal of this PR is to restructure and rough out a new BackfillSync class structure.
Backfill issue: #7753
Description
WIP
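Since the description is still WIP, the following is only a speculative sketch of what a roughed-out BackfillSync class could look like; the names, states, and methods are assumptions for discussion, not the structure implemented in this PR:

```ts
type RootHex = string;

enum BackfillSyncStatus {
  Idle = "Idle",
  Syncing = "Syncing",
  Completed = "Completed",
  Aborted = "Aborted",
}

class BackfillSyncSketch {
  private status = BackfillSyncStatus.Idle;

  constructor(
    private readonly anchorRoot: RootHex,
    private readonly targetSlot: number
  ) {}

  /** Walk backwards from the anchor block towards targetSlot, batch by batch. */
  async sync(): Promise<void> {
    this.status = BackfillSyncStatus.Syncing;
    // ...fetch block batches by range/root from peers, verify parent-root linkage
    // back from this.anchorRoot, persist to the archive, and repeat until
    // this.targetSlot is reached.
    this.status = BackfillSyncStatus.Completed;
  }

  close(): void {
    if (this.status === BackfillSyncStatus.Syncing) {
      this.status = BackfillSyncStatus.Aborted;
    }
  }
}
```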