Optimize file scanning in Filebeat's filestream

**Describe the enhancement:**

It looks like we exclude files in the [wrong place](https://github.com/elastic/beats/blob/2898137255fa4f4943aa48b38bbf16afa8ffdbd2/filebeat/input/filestream/fswatch.go#L552-L558).

When we match the [glob expression](https://github.com/elastic/beats/blob/2898137255fa4f4943aa48b38bbf16afa8ffdbd2/filebeat/input/filestream/fswatch.go#L485), we allocate memory for every file path that matches the glob and then iterate over all of them to apply the "excluded files" filter. I think excluded paths should never be added to this list; they should be dropped during glob resolution.

Perhaps we should introduce our own optimized glob implementation that never lists excluded files in the first place.

Another alternative would be to resolve the glob with an iterator pattern (one that accepts a function called for each match). This never allocates the entire list in memory and works file by file.

**Describe a specific use case for the enhancement or feature:**

Some users set a very broad glob expression in the `paths` config. This glob expression may match hundreds of thousands of files. Users expect that setting a few patterns in `exclude_files` would eliminate all unnecessary files. It does, but in a very inefficient way.

Perhaps we can reuse an existing implementation or borrow some of its principles.

https://burntsushi.net/ripgrep/ has the best overview, and links to some competing tools in Go like https://github.com/monochromegane/the_platinum_searcher/tree/master

Optimize file scanning in Filebeat's filestream #48686