A Bloom filter to speed up all 404s #776

@roberth

Description

Is your feature request related to a problem? Please describe.

Binary caches impose a hidden cost on virtually all uncached builds: we wait for 404 responses before starting the build (see alternative 1).

Describe the solution you'd like

Publish a Bloom filter in the binary cache. This could take the form of a regularly updated snapshot, together with a sequence of "patch" files, so that all copies can be kept up to date on a frequent basis.
Depending on the parameters, the Bloom filter would fit comfortably in 2 GB, given the size of the bucket. For the updates, it's more efficient to send the store path hashes than the indexes of the Bloom filter bits, if I understand correctly; but since I may be wrong about that, I'll just say that the upper bound for the updates is 20 bytes per store path.
This suggests that a Bloom filter could be synced world-wide onto

  • ideally, nodes in our current CDN
  • alternatively, a set of our own hosts in various locations
  • anyone who wants to have a very fast cache, e.g. a local reverse proxy or a caching proxy on the local network
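To make the size claim concrete, the standard Bloom filter sizing formulas can be applied. A small sketch (the element count and false-positive rate below are illustrative assumptions, not measured bucket statistics):

```python
import math

def bloom_params(n, p):
    """Classic Bloom filter sizing: total bits m and hash count k
    for n elements at target false-positive rate p."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = max(1, round(m / n * math.log(2)))
    return m, k

# Assumed, for illustration: ~1 billion store paths at a 0.1% FPR.
m, k = bloom_params(1_000_000_000, 0.001)
print(f"{m / 8 / 1e9:.2f} GB, {k} hash functions")  # ~1.80 GB, 10
```

Under those assumptions the filter lands just under 2 GB; a looser false-positive rate shrinks it further, since size grows only with ln(1/p).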

Note that the Bloom filter will tell us that either

  • the store object certainly does not exist
  • the store object may exist (with a high probability, assuming suitable parameters)

This means that the probabilistic nature of it does not pose a problem for returning 404s instantly.
When the Bloom filter does "fail" (a collision), it means that we query whatever we query today (e.g. the S3 bucket) and still get a 404. With suitable Bloom filter parameters, this is rare.
This situation can also be created intentionally, by removing a store object that is already recorded in the Bloom filter.
Alternatively, this can be treated as an acceptable cost: e.g. after a garbage collection, the Bloom filter is not replaced if the number of deletions is insignificant.
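The lookup flow and patch replay described above could look roughly like this. A minimal sketch, assuming a toy filter layout, a double-hashing scheme over SHA-256, and a hypothetical `query_backend` hook standing in for today's narinfo request; none of this is an existing Nix format:

```python
import hashlib

class Bloom:
    """Toy Bloom filter; parameters and layout are illustrative only."""

    def __init__(self, m_bits: int, k: int):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _indexes(self, key: bytes):
        # Derive k bit positions from one SHA-256 digest (double hashing).
        d = hashlib.sha256(key).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1  # force odd
        return ((h1 + i * h2) % self.m for i in range(self.k))

    def add(self, key: bytes) -> None:
        for i in self._indexes(key):
            self.bits[i // 8] |= 1 << (i % 8)

    def maybe_contains(self, key: bytes) -> bool:
        return all(self.bits[i // 8] >> (i % 8) & 1
                   for i in self._indexes(key))

def apply_patch(bloom: Bloom, patch: list[bytes]) -> None:
    """A 'patch' file is just the store path hashes added since the last
    snapshot (<= 20 bytes each); replaying it brings a copy up to date."""
    for store_path_hash in patch:
        bloom.add(store_path_hash)

def lookup(bloom: Bloom, store_path_hash: bytes, query_backend):
    """Fast 404: if the filter says 'certainly absent', skip the backend.
    query_backend (hypothetical) may still return a 404 on a collision."""
    if not bloom.maybe_contains(store_path_hash):
        return None  # certainly not cached: answer instantly
    return query_backend(store_path_hash)
```

On a collision the fallthrough behaves exactly like today's query, so correctness is unaffected; only the latency of true 404s improves.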

Describe alternatives you've considered

  1. We could start builds speculatively, but only if all dependencies are already available. Otherwise we're making things way worse for the 200 OK case, where we want to avoid fetching derivation inputs that aren't part of the output.

  2. Instead of a Bloom filter, or in addition, we could publish the list of cached store paths for download. This is far less efficient, but possibly useful for other purposes.

  3. Alternatives exist for the Bloom filter data structure. I have not analyzed the differences. Note that incremental updates are important for delivering new store paths quickly and efficiently.

Additional context

  • I'm probably not the first to advocate for this idea, but I have no idea who to credit besides @edef1c, with whom I chatted about this some time ago.
  • This could replace or supplement the easier partial solution "cache 404s and purge cache on upload" (#769)
