-
Notifications
You must be signed in to change notification settings - Fork 127
OCPBUGS-62269 - Fix race condition causing missing audit log entries for rapid commands #3052
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: BhargaviGudi The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @BhargaviGudi. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #3052 +/- ##
===========================================
- Coverage 45.50% 24.17% -21.33%
===========================================
Files 79 125 +46
Lines 7782 17822 +10040
===========================================
+ Hits 3541 4309 +768
- Misses 4099 13229 +9130
- Partials 142 284 +142 🚀 New features to boost your workflow:
|
|
/ok-to-test |
96a2136 to
69ba88b
Compare
|
/retest-required |
|
@BhargaviGudi do you mind a rebase? |
| e.retryBpfRequestUID(logBucket) | ||
|
|
||
| // Retry container info lookup if it's still missing | ||
| e.retryContainerInfo(ctx, logBucket, processID, nodeName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BhargaviGudi Thanks for the PR. Can you try to test if e.retryContainerInfo(ctx, logBucket, processID, nodeName) alone can fix the issue without the retryBpfCmdLine and retryBpfRequestUID? I feel missing container info is the cause of the issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested with only retryContainerInfo as suggested, but it doesn't fix the issue for short-lived processes.
- The short-lived processes exit so fast that by the time the LogBucket expires (30 seconds later), their /proc/ is gone and are missing from audit log
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ngopalak-redhat Could you please help me review the PR? Thanks
69ba88b to
613d628
Compare
613d628 to
99e4cbb
Compare
|
/test pull-security-profiles-operator-verify |
|
@BhargaviGudi thank you for the PR! Please rebase to fix the CI. |
…for rapid commands OCPBUGS-62269 - Fix GitHub Actions linting issues OCPBUGS-62269 - Fix GitHub Actions linting issues
99e4cbb to
b75d4d8
Compare
|
@ngopalak-redhat Could you please help me review the PR? Thanks |
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR fixes a race condition in the JSON enricher that causes audit log entries to be missing or incomplete when commands are executed in rapid succession (e.g.,
touch test1 && touch test2 && touch test3).Problem:
When short-lived processes execute quickly, audit events arrive at the JSON enricher before the corresponding BPF events have been processed from the ring buffer. This results in:
cmdLinefieldsrequestUIDresourceinformation (pod/namespace/container)Solution:
Implements a retry mechanism in
dispatchSeccompLine()that attempts to fetch missing information multiple times with progressive delays (0ms, 10ms, 50ms):Since LogBuckets remain in cache for 60 seconds before being written to the audit log, there's ample time for retries to succeed while BPF events are processed asynchronously.
Testing Results:
Which issue(s) this PR fixes:
Fixes https://issues.redhat.com/browse/OCPBUGS-62269
Does this PR have test?
Yes - The fix was validated through:
oc execandoc rsh)No new automated tests added as this is a timing-dependent race condition that's difficult to reliably reproduce in unit tests.
Special notes for your reviewer:
jsonenricher.go- no modifications to BPF components or ring buffer configurationDoes this PR introduce a user-facing change?