enhancement(sinks): Add support for max_bytes for memory buffers #23330

Open
wants to merge 18 commits into
base: master
from

Conversation

graphcareful
Contributor

@graphcareful graphcareful commented Jul 2, 2025

Summary

This PR adds support for bounding memory buffers by the number of bytes allocated. The feature is opt-in and disabled by default, meaning that the current implementation and its defaults are still selected unless a value for max_bytes is supplied explicitly.

At the core of this change is a new interface that allows the selection of a different lock-free queue. This queue is at the center of the implementation of memory buffers. Today that queue is crossbeam_queue::ArrayQueue, which is a fixed-size lock-free data structure. The fixed size of this queue is the reason that #8679 could not easily be implemented. The new interface makes it possible to drop in a queue that is not fixed in size. crossbeam_queue::SegQueue was chosen because it proved performant in initial testing and did not require any new dependencies.

The main resource (the queue) is already guarded by a semaphore. This semaphore currently bounds the queue by number of elements, but there is no reason it could not bound it by bytes allocated instead. Much of the existing code therefore remains the same, which is a positive since it is already battle tested and appears relatively stable as is.
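
As an illustration of the idea above, here is a minimal sketch of byte-based permit accounting. It is not the code in this PR: `ByteBoundedQueue` and its methods are hypothetical names, a `Mutex<VecDeque>` stands in for the lock-free `SegQueue`, and a counter plus `Condvar` stands in for the semaphore. Clamping an item's cost to the limit, so that one oversized item can still pass through an empty queue, is an assumption of this sketch.

```rust
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};

/// Hypothetical byte-bounded queue. `Mutex<VecDeque>` stands in for the
/// lock-free queue, and `bytes_in_use` stands in for semaphore permits.
pub struct ByteBoundedQueue {
    state: Mutex<State>,
    space_freed: Condvar,
    max_bytes: usize,
}

struct State {
    items: VecDeque<Vec<u8>>,
    bytes_in_use: usize,
}

impl ByteBoundedQueue {
    pub fn new(max_bytes: usize) -> Self {
        Self {
            state: Mutex::new(State { items: VecDeque::new(), bytes_in_use: 0 }),
            space_freed: Condvar::new(),
            // A zero limit would admit nothing, so treat it as one byte.
            max_bytes: max_bytes.max(1),
        }
    }

    /// One permit per byte, clamped to [1, max_bytes] (sketch assumption).
    fn cost(&self, item: &[u8]) -> usize {
        item.len().clamp(1, self.max_bytes)
    }

    /// Non-blocking send: hands the item back when there is not enough room.
    pub fn try_send(&self, item: Vec<u8>) -> Result<(), Vec<u8>> {
        let cost = self.cost(&item);
        let mut state = self.state.lock().unwrap();
        if state.bytes_in_use + cost > self.max_bytes {
            return Err(item);
        }
        state.bytes_in_use += cost;
        state.items.push_back(item);
        Ok(())
    }

    /// Blocking send: waits until receivers have released enough bytes.
    pub fn send(&self, item: Vec<u8>) {
        let cost = self.cost(&item);
        let mut state = self.state.lock().unwrap();
        while state.bytes_in_use + cost > self.max_bytes {
            state = self.space_freed.wait(state).unwrap();
        }
        state.bytes_in_use += cost;
        state.items.push_back(item);
    }

    /// Receiving releases the item's permits and wakes blocked senders.
    pub fn try_recv(&self) -> Option<Vec<u8>> {
        let mut state = self.state.lock().unwrap();
        let item = state.items.pop_front()?;
        state.bytes_in_use -= self.cost(&item);
        self.space_freed.notify_all();
        Some(item)
    }

    /// Bytes currently accounted for by queued items.
    pub fn bytes_in_use(&self) -> usize {
        self.state.lock().unwrap().bytes_in_use
    }
}
```

With a limit of 10 bytes, a 6-byte item is accepted, a second 6-byte item is rejected until the first is received, and a 4-byte item fits alongside the first. Only the cost function differs from event-count bounding, where every item costs one permit.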

Finally, a new unit test was added, and new benchmarks were included in vector-buffers/benches.

Vector configuration

Add the following to any sink configuration:

buffer:
  type: memory
  max_bytes: 123456
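
To make the opt-in behavior concrete, the following sketch shows one way the two optional settings could be resolved into a single sizing mode. The names `resolve_size` and `ASSUMED_DEFAULT_MAX_EVENTS` are hypothetical, the default value of 500 is a placeholder rather than a value taken from this PR, and rejecting a config that sets both keys is an assumption of the sketch. The enum mirrors the `MemoryBufferSize` variants named in the review discussion.

```rust
use std::num::NonZeroUsize;

/// Hypothetical mirror of the PR's `MemoryBufferSize` enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MemoryBufferSize {
    /// Bound the buffer by number of events (the existing behavior).
    MaxEvents(NonZeroUsize),
    /// Bound the buffer by total bytes allocated (the new opt-in mode).
    MaxSize(NonZeroUsize),
}

/// Placeholder default for illustration only; not taken from the PR.
pub const ASSUMED_DEFAULT_MAX_EVENTS: usize = 500;

/// Resolve the two flat, optional config keys into one sizing mode.
pub fn resolve_size(
    max_events: Option<NonZeroUsize>,
    max_bytes: Option<NonZeroUsize>,
) -> Result<MemoryBufferSize, String> {
    match (max_events, max_bytes) {
        // Assumption: setting both keys is treated as a config error.
        (Some(_), Some(_)) => {
            Err("only one of `max_events` or `max_bytes` may be set".to_string())
        }
        (Some(events), None) => Ok(MemoryBufferSize::MaxEvents(events)),
        (None, Some(bytes)) => Ok(MemoryBufferSize::MaxSize(bytes)),
        // Neither key present: fall back to the event-count default.
        (None, None) => Ok(MemoryBufferSize::MaxEvents(
            NonZeroUsize::new(ASSUMED_DEFAULT_MAX_EVENTS).expect("default is nonzero"),
        )),
    }
}
```

Under these assumptions, the configuration above resolves to `MaxSize(123456)`, while a memory buffer with no size keys keeps the event-count behavior.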

How did you test this PR?

Via the existing unit test and the newly developed benchmarks.

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

Notes

  • Please read our Vector contributor resources.
  • Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
  • Some CI checks run only after we manually approve them.
    • We recommend adding a pre-push hook, please see this template.
    • Alternatively, we recommend running the following locally before pushing to the remote branch:
      • cargo fmt --all
      • cargo clippy --workspace --all-targets -- -D warnings
      • cargo nextest run --workspace (alternatively, you can run cargo test --all)
  • After a review is requested, please avoid force pushes to help us review incrementally.
    • Feel free to push as many commits as you want. They will be squashed into one before merging.
    • For example, you can run git merge origin master and git push.
  • If this PR introduces changes to Vector dependencies (modifies Cargo.lock), please
    run cargo vdev build licenses to regenerate the license inventory and commit the changes (if any). More details here.

@graphcareful graphcareful requested review from a team as code owners July 2, 2025 17:56
@graphcareful graphcareful requested review from bruceg and removed request for a team July 2, 2025 17:56
@github-actions github-actions bot added domain: topology Anything related to Vector's topology code domain: external docs Anything related to Vector's external, public documentation labels Jul 2, 2025
@graphcareful graphcareful force-pushed the rob/buffer-size-bytes branch from e3ef011 to bdc9f1d Compare July 2, 2025 17:59
@pront pront added the domain: buffers Anything related to Vector's memory/disk buffers label Jul 2, 2025
#[serde(default = "memory_buffer_default_max_events")]
max_events: NonZeroUsize,
/// The terms around how to express buffering limits, can be in size or bytes_size.
size: MemoryBufferSize,
Contributor Author

I believe this might introduce a little friction, at least in the docs. Currently the memory buffering options are flat, but introducing an enum here will open up another nested set of options.

Confusingly, there is a custom deserializer above that will allow a user to just use max_bytes and things will work.

Should we just have two optional values here? I went with the enum because it's essentially the same type that will be passed on to the method that chooses the implementation, but after seeing how the docs look I'm thinking about changing this.

Member

Noting that Rob and I discussed this and he's planning to rework the config handling.

Member

Cool, thank you for sharing. We will take a look after all the open comments are addressed.

@pront pront requested a review from Copilot July 7, 2025 18:34
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

Adds support for byte-based limits for in-memory buffers by introducing a unified MemoryBufferSize enum and dynamically selecting between element-count and byte-size queues.

  • Replace standalone max_events with a size object backed by MemoryBufferSize across configs and APIs
  • Implement QueueImpl to choose between ArrayQueue (by events) and SegQueue (by bytes) using a semaphore guard
  • Update all tests, examples, benchmarks, and documentation to use the new byte/event buffer sizing model
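
The queue selection described in the bullets above can be pictured with a small enum that dispatches to one of two backing queues. This is only a sketch and not the PR's `QueueImpl`: the name `QueueSketch` and its methods are hypothetical, and `VecDeque` stands in for both crossbeam queues so the example needs nothing outside the standard library. The fixed variant rejects pushes when full, as `ArrayQueue` does, while the growable variant always accepts and relies on an external guard (the semaphore) for its bound, as `SegQueue` would.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for an enum that selects the backing queue.
pub enum QueueSketch<T> {
    /// Fixed capacity, analogous to `ArrayQueue` (bounded by events).
    Fixed { items: VecDeque<T>, capacity: usize },
    /// Growable, analogous to `SegQueue`; the bound is enforced elsewhere.
    Growable(VecDeque<T>),
}

impl<T> QueueSketch<T> {
    /// Queue used when the buffer is bounded by number of events.
    pub fn by_events(capacity: usize) -> Self {
        QueueSketch::Fixed {
            items: VecDeque::with_capacity(capacity),
            capacity,
        }
    }

    /// Queue used when the buffer is bounded by bytes allocated.
    pub fn by_bytes() -> Self {
        QueueSketch::Growable(VecDeque::new())
    }

    /// Push an item; the fixed variant hands it back when full.
    pub fn push(&mut self, item: T) -> Result<(), T> {
        match self {
            QueueSketch::Fixed { items, capacity } => {
                if items.len() >= *capacity {
                    return Err(item);
                }
                items.push_back(item);
                Ok(())
            }
            QueueSketch::Growable(items) => {
                items.push_back(item);
                Ok(())
            }
        }
    }

    /// Pop the oldest item from whichever queue is active.
    pub fn pop(&mut self) -> Option<T> {
        match self {
            QueueSketch::Fixed { items, .. } | QueueSketch::Growable(items) => {
                items.pop_front()
            }
        }
    }
}
```

Matching on an enum keeps the choice of queue in one place, so the surrounding send and receive logic does not need to know which sizing mode is active.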

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated no comments.

Show a summary per file
| File | Description |
| --- | --- |
| website/cue/reference/components/base/sinks.cue | Update CUE schema to use new size object with max_bytes & max_size |
| src/topology/test/backpressure.rs | Adapt backpressure tests to MemoryBufferSize |
| src/test_util/mock/sources/basic.rs | Use MemoryBufferSize in mock sources |
| src/source_sender/mod.rs | Initialize limited channel with MemoryBufferSize |
| lib/vector-buffers/src/variants/in_memory.rs | Refactor MemoryBuffer to accept MemoryBufferSize |
| lib/vector-buffers/src/topology/test_util.rs | Enhance Sample to account for heap-allocated bytes |
| lib/vector-buffers/src/topology/channel/limited_queue.rs | Add QueueImpl, SizeTerms, & dynamic queue selection |
| lib/vector-buffers/src/topology/builder.rs | Update topology builder to use MemoryBufferSize |
| lib/vector-buffers/src/test/variant.rs | Update variant tests to use MemoryBufferSize |
| lib/vector-buffers/src/lib.rs | Export MemoryBufferSize |
| lib/vector-buffers/src/config.rs | Implement serde de/serialization for MemoryBufferSize |
| lib/vector-buffers/examples/buffer_perf.rs | Adjust example to use new buffer size API |
| lib/vector-buffers/benches/sized_records.rs | Add benchmarks for byte-based buffers via a BoundBy helper |
| lib/vector-buffers/benches/common.rs | Extend Message to simulate heap allocation for size-based tests |
| changelog.d/8679_add_support_max_bytes_memory_buffers.feature.md | Add changelog entry for the new max_bytes feature |
Comments suppressed due to low confidence (3)

lib/vector-buffers/src/config.rs:102

  • [nitpick] The error message for invalid max_bytes is unclear and grammatically awkward; consider rephrasing to something like "max_bytes must fit within the platform's usize range" and include the actual bounds.
                            &"For memory buffers max_bytes expects an integer within the range of 268435488 and your architecture dependent usize",

lib/vector-buffers/src/config.rs:430

  • Add a unit test in the config module to verify that a YAML or JSON config using max_bytes correctly deserializes into MemoryBufferSize::MaxSize under BufferType::Memory.
        for stage in self.stages() {

lib/vector-buffers/src/config.rs:207

  • [nitpick] The variant name MaxSize is ambiguous; consider renaming it to MaxBytes to clearly indicate it represents a byte-based limit.
    MaxSize {

- Also removing its configurable_component tag as it is no longer
officially part of the configuration
@graphcareful graphcareful requested review from bruceg and tobz July 8, 2025 20:15
@@ -186,11 +244,18 @@ pub enum BufferType {
/// This is more performant, but less durable. Data will be lost if Vector is restarted
/// forcefully or crashes.
#[configurable(title = "Events are buffered in memory.")]
-#[serde(rename = "memory")]
+#[serde(rename = "memory", serialize_with = "serialize_memory_config")]
Member

How does this serialize differently than what serde natively generates for you?

Contributor Author

Previously max_events was not optional. Now it is, so if it's not present in the schema, the native serialization will generate max_bytes:null, which causes the custom deserializer to fail because it expects an integer if this key exists.

I don't believe this is an issue other than in tests, because that is the only case where a schema is serialized by Vector. Correct me if I'm wrong, though.

Comment on lines +249 to +258
// Two size options instead of a variant had been chosen to denote respective sizes so
// that the generated documentation will output what the parser will expect, a flat
// non-nested layout where exactly one of the two must be provided or else a default for
// max_events will be chosen.
/// The maximum number of events allowed in the buffer.
-#[serde(default = "memory_buffer_default_max_events")]
-max_events: NonZeroUsize,
+max_events: Option<NonZeroUsize>,

/// The maximum size across all events allowed in the buffer.
#[configurable(metadata(docs::type_unit = "bytes"))]
max_size: Option<NonZeroUsize>,
Member

I thought you could still accomplish the flattening with an enum:

        #[serde(default, flatten)]
        size: EventsOrSize,

#[serde(rename_all = "snake_case")]
enum EventsOrSize {
    MaxEvents(NonZeroUsize),
    MaxSize(NonZeroUsize),
}

Oh look, that's actually MemoryBufferSize which just needs the serde annotations.

Contributor Author

I gave that a shot, but it did not generate the documentation as intended.

@@ -273,7 +275,7 @@ mod tests {
async fn single_stage_topology_block() {
let mut builder = TopologyBuilder::<Sample>::default();
builder.stage(
-MemoryBuffer::new(NonZeroUsize::new(1).unwrap()),
+MemoryBuffer::with_max_events(NonZeroUsize::new(1).unwrap()),
Member

I wonder if some of the tests shouldn't also exercise the max_bytes mode. Not sure…

- Replace if let chain with match expression
- Replace map/sum with just calls to +
- Replace function pointer in limited_queue.rs with enum + variant check
@graphcareful graphcareful requested a review from bruceg July 10, 2025 15:34
@pront pront force-pushed the master branch 4 times, most recently from 1720078 to ffe54be Compare July 10, 2025 15:43
@pront pront added the meta: awaiting author Pull requests that are awaiting their author. label Jul 10, 2025
@pront
Member

pront commented Jul 10, 2025

Note: the base dir was renamed to generated. The easiest fix is to rebase on master (ignore branch changes) and regenerate the docs.

Labels
domain: buffers Anything related to Vector's memory/disk buffers domain: external docs Anything related to Vector's external, public documentation domain: topology Anything related to Vector's topology code meta: awaiting author Pull requests that are awaiting their author.
Development

Successfully merging this pull request may close these issues.

Support max_bytes for memory buffers
4 participants