@mergify bot commented Dec 3, 2025


# Lazy Initialization of the Cache Processor's File Store

## The Problem

The core problem is that many processors use `paths.Resolve` to locate directories such as "data" or "logs". That function reads a package-level global for the base path, which works when a Beat runs as a standalone process.

But when a Beat is embedded as a receiver (e.g., `fbreceiver` in the OTel Collector), this global becomes a problem: each receiver needs its own isolated state directory, and a single process-wide path cannot provide that.

The `cache` processor currently tries to set up its file-based store in its `New` function, which is too early. It only has access to the global path, not the receiver-specific path that gets configured later.
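To make the conflict concrete, here is a minimal sketch of the failure mode. The variable and function names (`basePath`, `resolve`) are illustrative stand-ins for the package-level global behind `paths.Resolve`, not the actual libbeat identifiers:

```go
package main

import "fmt"

// basePath stands in for the package-level global behind paths.Resolve:
// the last caller to set it wins, so two embedded receivers cannot each
// keep their own data directory.
var basePath = "/usr/share/filebeat"

func resolve(sub string) string { return basePath + "/" + sub }

func main() {
	// receiver A configures its state dir...
	basePath = "/var/lib/otelcol/receiver-a"
	a := resolve("data")
	// ...then receiver B overwrites the same global
	basePath = "/var/lib/otelcol/receiver-b"
	b := resolve("data")
	fmt.Println(a) // resolved before the overwrite
	fmt.Println(b)
	// any later resolve in receiver A's code now sees B's path
	fmt.Println(resolve("data"))
}
```

Any component in receiver A that resolves a path after receiver B starts will silently write into B's directory.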

## The Solution

My solution is to initialize the cache's file store lazily.

Instead of creating the store in `cache.New`, I've added a `SetPaths(*paths.Path)` method to the processor. This method creates the file store and is guarded by a `sync.Once` so it runs at most once. The processor's internal store object stays `nil` until `SetPaths` is called during pipeline construction.
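The lazy-initialization shape can be sketched as below. This is a simplified model, not the PR's actual code: `Path`, `fileStore`, and the field names are illustrative, and the real `SetPaths` takes libbeat's `*paths.Path`:

```go
package main

import (
	"fmt"
	"sync"
)

// Path stands in for libbeat's paths.Path, which carries the resolved
// per-instance directories.
type Path struct{ Data string }

type fileStore struct{ dir string }

// cache mirrors the processor described above: store is nil after New
// and only created when SetPaths delivers the receiver-specific path.
type cache struct {
	initOnce sync.Once
	store    *fileStore
}

// SetPaths lazily creates the file store. sync.Once guarantees it is
// built at most once, even if several pipeline clients connect
// concurrently or SetPaths is called repeatedly.
func (c *cache) SetPaths(p *Path) {
	c.initOnce.Do(func() {
		c.store = &fileStore{dir: p.Data}
	})
}

func main() {
	c := &cache{}
	fmt.Println(c.store == nil) // true: no store until SetPaths runs
	c.SetPaths(&Path{Data: "/receiver-a/data"})
	c.SetPaths(&Path{Data: "/receiver-b/data"}) // no-op: already initialized
	fmt.Println(c.store.dir)
}
```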

## How it Works

The path info gets passed down when a client connects to the pipeline. Here's the flow:

1.  **`x-pack/filebeat/fbreceiver`**: `createReceiver` instantiates the processors (including `cache` with a `nil` store) and calls `instance.NewBeatForReceiver`.
2.  **`x-pack/libbeat/cmd/instance`**: `NewBeatForReceiver` creates the `paths.Path` object from the receiver's specific configuration.
3.  **`libbeat/publisher/pipeline`**: This `paths.Path` object is passed into the pipeline. When a client connects, the path is added to the `beat.ProcessingConfig`.
4.  **`libbeat/publisher/processing`**: The processing builder gets this config and calls `group.SetPaths`, which passes the path down to each processor.
5.  **`libbeat/processors/cache`**: `SetPaths` is finally called on the cache processor instance, and the `sync.Once` guard ensures the file store is created with the correct path.
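Step 4 relies on an optional-interface check: the group forwards the path only to processors that implement a `SetPaths` method. A minimal sketch of that dispatch (interface and type names here are illustrative; the real group walks `beat.Processor` values):

```go
package main

import "fmt"

type Path struct{ Data string }

// Processor is a stand-in for libbeat's processor interface.
type Processor interface{ String() string }

// pathSetter is the optional interface: only processors that need a
// per-receiver path (like cache) implement it.
type pathSetter interface{ SetPaths(*Path) }

type group struct{ list []Processor }

// SetPaths walks the group and forwards the path to any processor that
// implements pathSetter; all others are left untouched.
func (g *group) SetPaths(p *Path) {
	for _, proc := range g.list {
		if ps, ok := proc.(pathSetter); ok {
			ps.SetPaths(p)
		}
	}
}

type cacheProc struct{ dir string }

func (c *cacheProc) String() string   { return "cache" }
func (c *cacheProc) SetPaths(p *Path) { c.dir = p.Data }

type dropProc struct{}

func (dropProc) String() string { return "drop_fields" }

func main() {
	c := &cacheProc{}
	g := &group{list: []Processor{dropProc{}, c}}
	g.SetPaths(&Path{Data: "/tmp/receiver/data"})
	fmt.Println(c.dir)
}
```

This is the runtime type assertion discussed under "Cons" below: processors opt in by implementing the interface, with no change to the shared `Processor` contract.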

## Diagram
```mermaid
graph TD
    subgraph "libbeat/processors/cache (init)"
        A["init()"]
    end
    subgraph "libbeat/processors"
        B["processors.RegisterPlugin"]
        C{"registry"}
    end
    A --> B;
    B -- "Save factory" --> C;

    subgraph "x-pack/filebeat/fbreceiver"
        D["createReceiver"]
    end

    subgraph "libbeat/processors"
         E["processors.New(config)"]
         C -. "Lookup 'cache'" .-> E;
    end
    D --> E;
    D --> I;
    E --> G;

    subgraph "libbeat/processors/cache"
        G["cache.New()"] -- store=nil --> H{"cache"};
    end

    subgraph "x-pack/libbeat/cmd/instance"
        I["instance.NewBeatForReceiver"];
        I --> J{"paths.Path object"};
    end

    subgraph "libbeat/publisher/pipeline"
        J --> K["pipeline.New"];
        K --> L["ConnectWith"];
    end

    subgraph "libbeat/publisher/processing"
        L -- "Config w/ paths" --> N["builder.Create"];
        N --> O["group.SetPaths"];
    end

    subgraph "libbeat/processors/cache"
        O --> P["cache.SetPaths"];
        P --> Q["sync.Once"];
        Q -- "initialize store" --> H;
    end
```

## Pros and Cons of This Approach

*   **Pros**:
    *   It's a minimal, targeted change that solves the immediate problem.
    *   It avoids a large-scale, breaking refactoring of all processors.
    *   It maintains backward compatibility for existing processors and downstream consumers of `libbeat`.
*   **Cons**:
    *   Using a type assertion for the `setPaths` interface feels a bit like magic, since the behavior changes at runtime depending on whether a processor implements it.

## Alternatives Considered

### Option 1: Add a `paths` argument to all processor constructors

*   **Pros**:
    *   Simple and direct.
*   **Cons**:
    *   Requires a global refactoring of all processors.
    *   Breaks external downstream libbeat importers like Cloudbeat.
    *   The `paths` argument is not needed in many processors, so adding a rarely used option to the function signature is verbose.

### Option 2: Refactor `processors` to introduce a "V2" interface

*   **Pros**:
    *   Allows for a new, backwards-compatible signature (e.g., using a config struct).
    *   It can still be done later as a follow-up, independent of this change.
    *   We can support both V1 processors and gradually move processors to V2.
*   **Cons**:
    *   Needs a significant refactoring effort.
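For illustration, a V2 shape might bundle dependencies in a struct so new fields can be added without breaking signatures, with an adapter keeping V1 constructors working during migration. Everything here is hypothetical — this PR deliberately does not introduce such an interface:

```go
package main

import "fmt"

type Path struct{ Data string }
type Processor interface{ String() string }

// ConstructorV1 is today's zero-dependency constructor shape (simplified).
type ConstructorV1 func() (Processor, error)

// Deps bundles dependencies; new fields (paths, loggers, ...) can be
// added later without changing any constructor signature.
type Deps struct {
	Paths *Path
}

type ConstructorV2 func(Deps) (Processor, error)

// wrapV1 adapts a V1 constructor to the V2 shape, ignoring the deps,
// so one registry could hold both kinds during a gradual migration.
func wrapV1(c ConstructorV1) ConstructorV2 {
	return func(Deps) (Processor, error) { return c() }
}

type noop struct{}

func (noop) String() string { return "noop" }

func main() {
	reg := map[string]ConstructorV2{
		"noop": wrapV1(func() (Processor, error) { return noop{}, nil }),
	}
	p, _ := reg["noop"](Deps{Paths: &Path{Data: "/tmp/data"}})
	fmt.Println(p)
}
```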

## Checklist


- [x] My code follows the style guidelines of this project
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] ~~I have made corresponding changes to the documentation~~
- [ ] ~~I have made corresponding changes to the default configuration files~~
- [x] I have added tests that prove my fix is effective or that my feature works. Where relevant, I have used the [`stresstest.sh`](https://github.com/elastic/beats/blob/main/script/stresstest.sh) script to run them under stress conditions and race detector to verify their stability.
- [ ] ~~I have added an entry in `./changelog/fragments` using the [changelog tool](https://github.com/elastic/elastic-agent-changelog-tool/blob/main/docs/usage.md).~~

## How to test this PR locally
### Configuration

`filebeat-cache-mwe.yml`:

```yaml
path.data: /tmp/data

filebeat.inputs:
  - type: filestream
    id: filestream-input
    enabled: true
    paths:
      - /tmp/logs/*.log
    parsers:
      - ndjson:
          target: ""

processors:
  # PUT: Store metadata when event.type is "source"
  - if:
      equals:
        event.type: "source"
    then:
      - cache:
          backend:
            file:
              id: test_cache
              write_interval: 5s
          put:
            key_field: event.id
            value_field: event.metadata
            ttl: 1h

  # GET: Retrieve metadata when event.type is "target"
  - if:
      equals:
        event.type: "target"
    then:
      - cache:
          backend:
            file:
              id: test_cache
          get:
            key_field: event.id
            target_field: cached_metadata

output.console:
  enabled: true
```

### Setup

```bash
# Create the log directory (uncomment the rm line to start from a clean state)
# rm -rf /tmp/data /tmp/logs
mkdir -p /tmp/logs

# Create test data
cat > /tmp/logs/test.log <<'EOF'
{"event":{"type":"source","id":"001","metadata":{"user":"user-1","role":"admin","sequence":1,"data":{"ip":"192.168.1.1","session":"session-001"}}},"message":"source event 1"}
{"event":{"type":"source","id":"002","metadata":{"user":"user-2","role":"admin","sequence":2,"data":{"ip":"192.168.1.2","session":"session-002"}}},"message":"source event 2"}
{"event":{"type":"source","id":"003","metadata":{"user":"user-3","role":"admin","sequence":3,"data":{"ip":"192.168.1.3","session":"session-003"}}},"message":"source event 3"}
{"event":{"type":"source","id":"004","metadata":{"user":"user-4","role":"admin","sequence":4,"data":{"ip":"192.168.1.4","session":"session-004"}}},"message":"source event 4"}
{"event":{"type":"source","id":"005","metadata":{"user":"user-5","role":"admin","sequence":5,"data":{"ip":"192.168.1.5","session":"session-005"}}},"message":"source event 5"}
{"event":{"type":"target","id":"001"},"message":"target event 1"}
{"event":{"type":"target","id":"002"},"message":"target event 2"}
{"event":{"type":"target","id":"003"},"message":"target event 3"}
{"event":{"type":"target","id":"004"},"message":"target event 4"}
{"event":{"type":"target","id":"005"},"message":"target event 5"}
EOF

# Run filebeat
./x-pack/filebeat/filebeat -e -c filebeat-cache-mwe.yml
```

### Expected Output

Target events should have `cached_metadata` field populated:

```json
{
  "event": {
    "type": "target",
    "id": "001"
  },
  "message": "target event 1",
  "cached_metadata": {
    "user": "user-1",
    "role": "admin",
    "sequence": 1,
    "data": {
      "ip": "192.168.1.1",
      "session": "session-001"
    }
  }
}
```

### Cache Files

After running Filebeat, inspect the persisted cache file:

```bash
cat /tmp/data/cache_processor/test_cache
```
Example output:
```json
{"key":"001","val":{"data":{"ip":"192.168.1.1","session":"session-001"},"role":"admin","sequence":1,"user":"user-1"},"expires":"2025-11-20T15:02:32.865896537+01:00"}
{"key":"002","val":{"data":{"ip":"192.168.1.2","session":"session-002"},"role":"admin","sequence":2,"user":"user-2"},"expires":"2025-11-20T15:02:32.865950973+01:00"}
{"key":"003","val":{"data":{"ip":"192.168.1.3","session":"session-003"},"role":"admin","sequence":3,"user":"user-3"},"expires":"2025-11-20T15:02:32.865972408+01:00"}
{"key":"004","val":{"data":{"ip":"192.168.1.4","session":"session-004"},"role":"admin","sequence":4,"user":"user-4"},"expires":"2025-11-20T15:02:32.865988843+01:00"}
{"key":"005","val":{"data":{"ip":"192.168.1.5","session":"session-005"},"role":"admin","sequence":5,"user":"user-5"},"expires":"2025-11-20T15:02:32.866006958+01:00"}
```

## Related issues

- Closes #46985

(cherry picked from commit 28222c4)