Skip to content

Conversation

@orestisfl
Copy link
Contributor

Proposed commit message

The two log messages are logged from different goroutines that race:

  • "Closing reader of filestream" is logged from a background goroutine spawned by ctxtool.WithFunc when streamCancel() triggers cancellation
  • "Stopped harvester for file" is logged from the main harvester goroutine via a defer in harvester.go

When streamCancel() executes, it closes a channel that wakes the background goroutine to log "Closing reader", while the main goroutine continues and logs "Stopped harvester". These two goroutines race to write to the log file, making the order non-deterministic.

The original test used sequential WaitLogsContains calls which track file offset - when messages appeared in the "wrong" order, the first check would advance the offset past both messages, causing the second to fail.

Changed to WaitLogsContainsAnyOrder which checks for both messages without relying on order.

Closes #47784

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works. Where relevant, I have used the stresstest.sh script to run them under stress conditions and race detector to verify their stability.
  • I have added an entry in ./changelog/fragments using the changelog tool.

How to test this PR locally

  1. Start the integration test containers:

    cd filebeat && mage docker:composeUp
  2. Build the test binary:

    mage buildSystemTestBinary
  3. Run the specific test with FIPS mode (reproduces original failure without fix):

    GODEBUG=fips140=only \
    ES_HOST=localhost ES_USER=beats ES_PASS=testing \
    ES_SUPERUSER_USER=admin ES_SUPERUSER_PASS=testing \
    go test -v -failfast -tags "integration,requirefips" \
      -run "TestFilestreamDelete/Inactive_resource_not_finished_and_data_added_during_grace_period" \
      ./tests/integration/ -count=20
  4. Clean up:

    mage docker:composeDown

Related issues

The two log messages are logged from different goroutines that race:

- "Closing reader of filestream" is logged from a background goroutine
  spawned by ctxtool.WithFunc when streamCancel() triggers cancellation
- "Stopped harvester for file" is logged from the main harvester goroutine
  via a defer in harvester.go

When streamCancel() executes, it closes a channel that wakes the background
goroutine to log "Closing reader", while the main goroutine continues and
logs "Stopped harvester". These two goroutines race to write to the log
file, making the order non-deterministic.

The original test used sequential WaitLogsContains calls which track file
offset - when messages appeared in the "wrong" order, the first check
would advance the offset past both messages, causing the second to fail.

Changed to WaitLogsContainsAnyOrder which checks for both messages without
relying on order.

Closes elastic#47784
@orestisfl orestisfl self-assigned this Dec 1, 2025
@orestisfl orestisfl requested a review from a team as a code owner December 1, 2025 13:55
@orestisfl orestisfl added bug flaky-test Unstable or unreliable test cases. Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team labels Dec 1, 2025
@orestisfl orestisfl added the backport-active-all Automated backport with mergify to all the active branches label Dec 1, 2025
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Dec 1, 2025
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Dec 1, 2025
@github-actions
Copy link
Contributor

github-actions bot commented Dec 1, 2025

🤖 GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@orestisfl orestisfl requested a review from belimawr December 1, 2025 13:55
@orestisfl orestisfl added release-note:skip The PR should be ignored when processing the changelog skip-changelog and removed release-note:skip The PR should be ignored when processing the changelog labels Dec 1, 2025
@orestisfl orestisfl merged commit 5ce630a into elastic:main Dec 3, 2025
35 of 43 checks passed
@orestisfl orestisfl deleted the fix-TestFilestreamDelete branch December 3, 2025 18:06
@github-actions
Copy link
Contributor

github-actions bot commented Dec 3, 2025

@Mergifyio backport 8.19 9.1 9.2

@mergify
Copy link
Contributor

mergify bot commented Dec 3, 2025

backport 8.19 9.1 9.2

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Dec 3, 2025
## Proposed commit message

The two log messages are logged from different goroutines that race:

- "Closing reader of filestream" is logged from a background goroutine spawned by ctxtool.WithFunc when streamCancel() triggers cancellation
- "Stopped harvester for file" is logged from the main harvester goroutine via a defer in harvester.go

When streamCancel() executes, it closes a channel that wakes the background goroutine to log "Closing reader", while the main goroutine continues and logs "Stopped harvester". These two goroutines race to write to the log file, making the order non-deterministic.

The original test used sequential WaitLogsContains calls which track file offset - when messages appeared in the "wrong" order, the first check would advance the offset past both messages, causing the second to fail.

Changed to WaitLogsContainsAnyOrder which checks for both messages without relying on order.

Closes #47784

## How to test this PR locally

1. Start the integration test containers:
   ```bash
   cd filebeat && mage docker:composeUp
   ```

2. Build the test binary:
   ```bash
   mage buildSystemTestBinary
   ```

3. Run the specific test with FIPS mode (reproduces original failure without fix):
   ```bash
   GODEBUG=fips140=only \
   ES_HOST=localhost ES_USER=beats ES_PASS=testing \
   ES_SUPERUSER_USER=admin ES_SUPERUSER_PASS=testing \
   go test -v -failfast -tags "integration,requirefips" \
     -run "TestFilestreamDelete/Inactive_resource_not_finished_and_data_added_during_grace_period" \
     ./tests/integration/ -count=20
   ```

4. Clean up:
   ```bash
   mage docker:composeDown
   ```

(cherry picked from commit 5ce630a)

# Conflicts:
#	filebeat/tests/integration/filestream_delete_test.go
mergify bot pushed a commit that referenced this pull request Dec 3, 2025
## Proposed commit message

The two log messages are logged from different goroutines that race:

- "Closing reader of filestream" is logged from a background goroutine spawned by ctxtool.WithFunc when streamCancel() triggers cancellation
- "Stopped harvester for file" is logged from the main harvester goroutine via a defer in harvester.go

When streamCancel() executes, it closes a channel that wakes the background goroutine to log "Closing reader", while the main goroutine continues and logs "Stopped harvester". These two goroutines race to write to the log file, making the order non-deterministic.

The original test used sequential WaitLogsContains calls which track file offset - when messages appeared in the "wrong" order, the first check would advance the offset past both messages, causing the second to fail.

Changed to WaitLogsContainsAnyOrder which checks for both messages without relying on order.

Closes #47784

## How to test this PR locally

1. Start the integration test containers:
   ```bash
   cd filebeat && mage docker:composeUp
   ```

2. Build the test binary:
   ```bash
   mage buildSystemTestBinary
   ```

3. Run the specific test with FIPS mode (reproduces original failure without fix):
   ```bash
   GODEBUG=fips140=only \
   ES_HOST=localhost ES_USER=beats ES_PASS=testing \
   ES_SUPERUSER_USER=admin ES_SUPERUSER_PASS=testing \
   go test -v -failfast -tags "integration,requirefips" \
     -run "TestFilestreamDelete/Inactive_resource_not_finished_and_data_added_during_grace_period" \
     ./tests/integration/ -count=20
   ```

4. Clean up:
   ```bash
   mage docker:composeDown
   ```

(cherry picked from commit 5ce630a)

# Conflicts:
#	filebeat/tests/integration/filestream_delete_test.go
mergify bot pushed a commit that referenced this pull request Dec 3, 2025
## Proposed commit message

The two log messages are logged from different goroutines that race:

- "Closing reader of filestream" is logged from a background goroutine spawned by ctxtool.WithFunc when streamCancel() triggers cancellation
- "Stopped harvester for file" is logged from the main harvester goroutine via a defer in harvester.go

When streamCancel() executes, it closes a channel that wakes the background goroutine to log "Closing reader", while the main goroutine continues and logs "Stopped harvester". These two goroutines race to write to the log file, making the order non-deterministic.

The original test used sequential WaitLogsContains calls which track file offset - when messages appeared in the "wrong" order, the first check would advance the offset past both messages, causing the second to fail.

Changed to WaitLogsContainsAnyOrder which checks for both messages without relying on order.

Closes #47784

## How to test this PR locally

1. Start the integration test containers:
   ```bash
   cd filebeat && mage docker:composeUp
   ```

2. Build the test binary:
   ```bash
   mage buildSystemTestBinary
   ```

3. Run the specific test with FIPS mode (reproduces original failure without fix):
   ```bash
   GODEBUG=fips140=only \
   ES_HOST=localhost ES_USER=beats ES_PASS=testing \
   ES_SUPERUSER_USER=admin ES_SUPERUSER_PASS=testing \
   go test -v -failfast -tags "integration,requirefips" \
     -run "TestFilestreamDelete/Inactive_resource_not_finished_and_data_added_during_grace_period" \
     ./tests/integration/ -count=20
   ```

4. Clean up:
   ```bash
   mage docker:composeDown
   ```

(cherry picked from commit 5ce630a)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-active-all Automated backport with mergify to all the active branches bug flaky-test Unstable or unreliable test cases. skip-changelog Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test: TestFilestreamDelete/Inactive_resource_not_finished_and_data_added_during_grace_period/grace_period_is_interrupted_when_file_changes

3 participants