exception-policy: add 'reject-both' option #14060
base: main
Conversation
Allow rejecting both sides of a connection. Has the same support as regular reject (which is essentially rejectsrc). Ticket: OISF#5974.
-#define EXCEPTION_POLICY_MAX (EXCEPTION_POLICY_REJECT + 1)
+#define EXCEPTION_POLICY_MAX (EXCEPTION_POLICY_REJECT_BOTH + 1)
This looks prone to error if we forget to update it. Should we make MAX a part of the enum above?
Maybe, but that then has the consequence that all the runtime code paths need to handle the size value in switch statements.
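For illustration, a minimal compilable sketch of the enum-sentinel approach and the cost it brings; the member list is abbreviated and the helper function is invented, so this is not Suricata's actual code:

    #include <stdio.h>

    /* Sketch only: with the sentinel as the last member, adding a new policy
     * such as EXCEPTION_POLICY_REJECT_BOTH keeps MAX correct automatically. */
    enum ExceptionPolicy {
        EXCEPTION_POLICY_NOT_SET = 0,
        EXCEPTION_POLICY_REJECT,
        EXCEPTION_POLICY_REJECT_BOTH,
        EXCEPTION_POLICY_MAX, /* sentinel, must stay last */
    };

    /* The trade-off: under -Wswitch/-Werror, every switch over the enum now
     * has to mention the sentinel even though it is never a real policy. */
    static const char *PolicyName(enum ExceptionPolicy p)
    {
        switch (p) {
            case EXCEPTION_POLICY_NOT_SET:
                return "not_set";
            case EXCEPTION_POLICY_REJECT:
                return "reject";
            case EXCEPTION_POLICY_REJECT_BOTH:
                return "reject_both";
            case EXCEPTION_POLICY_MAX: /* sentinel, not a policy */
                break;
        }
        return "unknown";
    }

    int main(void)
    {
        printf("%s\n", PolicyName(EXCEPTION_POLICY_REJECT_BOTH));
        return 0;
    }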
 } else if (strcmp(value_str, "reject") == 0) {
     policy = EXCEPTION_POLICY_REJECT;
+} else if (strcmp(value_str, "reject-both") == 0) {
+    policy = EXCEPTION_POLICY_REJECT_BOTH;
GitHub won't let me comment on the unchanged parts of this file, but a few lines below this, on line 248, the FatalErrorOnInit message should be updated to mention reject-both as a valid value now.
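As a rough sketch of what that update could look like (the exact message text and the surrounding code are assumptions, not the actual contents of line 248):

    /* Illustrative only: the existing message enumerates the accepted values,
     * so "reject-both" would simply be appended to that list. */
    FatalErrorOnInit("\"%s\" is not a valid exception policy value. "
            "Valid options include: auto, pass-flow, pass-packet, drop-flow, "
            "drop-packet, bypass, reject and reject-both.", value_str);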
Note to also update docs to mention this (including the suricata.yaml bits)
 case EXCEPTION_POLICY_BYPASS_FLOW:
     return "bypass";
+case EXCEPTION_POLICY_REJECT_BOTH:
+    return "reject_both";
Should that be reject-both (instead of reject_both) to be consistent with the value that gets set in the yaml, or does it need to be conditional on is_json like the others below?
By convention, in the YAML config we use '-', but for JSON field names we use '_'.
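A minimal sketch of the is_json-conditional variant being suggested, assuming the function receives a bool is_json flag as the neighbouring cases apparently do (the signature here is an assumption):

    #include <stdbool.h>

    /* Sketch only: pick the separator style based on the output target,
     * following the convention above ('-' for YAML values, '_' for JSON
     * field names). */
    static const char *RejectBothName(bool is_json)
    {
        return is_json ? "reject_both" : "reject-both";
    }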
}
if (!EngineModeIsIPS()) {
    break;
}
Potentially controversial suggestion below, but it is a requirement for our environment, so we would still need to patch this in if not accepted.
The fallthrough to drop-flow below results in an edge case that can cause this not to work when packet loss is present in the network. When the midstream reject-both is triggered specifically by a retransmitted packet from the server side, the reset that gets sent to the client will not be accepted by the client host because its sequence number is no longer current. This is the scenario:
1. Server sends a data packet to client with sequence number A and B bytes of data.
2. Client sends an ACK to server for A+B.
3. That ACK gets lost in the network before reaching Suricata.
4. Suricata flow state gets lost.
5. Server times out and retransmits the last data packet with sequence A and B bytes of data to client.
6. Suricata receives it and does not have matching flow state, so it resets both sides.
7. The reset sent to the server side is accepted and resets the server end of the connection.
8. The reset sent to the client side is not accepted because it uses sequence number A, which is no longer in the client's window because the client has already ACK'ed A+B (see the worked numbers after this list).
9. Client application is stuck hanging as it retries, because all of its sent packets now get dropped quietly by the drop-flow behavior.
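To make step 8 concrete with purely illustrative numbers: say the server's retransmitted segment starts at sequence A = 1000 and carries B = 500 bytes. The client has already ACKed 1500, so its receive window begins at 1500; the reset Suricata derives from the retransmission carries sequence 1000, which falls before that window, so the client discards it (a modern stack will at most answer with a challenge ACK) and its end of the connection stays open.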
At very large scale this unlikely sequence of events happens enough to cause a major problem for us. Our solution is effectively to disable the drop-flow behavior by making line 164 below look like this:
if (policy == EXCEPTION_POLICY_DROP_FLOW && p->flow) {
This avoids setting the FLOW_ACTION_DROP flag on these flows, which causes the first client retry (step 9 above) to also trigger a reset, which resets the client so it no longer hangs. I looked at this situation for a while and concluded that it's not possible for Suricata to always use the right sequence number in the presence of packet loss in the network, so the only other option would be for reject-both to send one reset to the src (which is always correct) and two to the dst, one of which would be accepted depending on whether the last ACK happened to be lost.
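A rough sketch of the control flow being described, with the proposed guard in place; the structure is simplified and the surrounding case labels are illustrative rather than a copy of the real function:

    /* Sketch only: reject(-both) sends its resets and then falls through
     * into the drop-flow handling. With the extra policy check, the flow is
     * only marked for dropping when drop-flow itself was the policy, so a
     * later client retry can still trigger a fresh reset instead of being
     * silently dropped. */
    case EXCEPTION_POLICY_REJECT_BOTH:
        /* resets to both endpoints are sent here */
        /* fall through */
    case EXCEPTION_POLICY_DROP_FLOW:
        if (policy == EXCEPTION_POLICY_DROP_FLOW && p->flow) {
            p->flow->flags |= FLOW_ACTION_DROP;
        }
        break;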
Note that there is a potential to create something like a DDoS reflection/amplification vector with the solution above, if Suricata responds to every unknown packet by generating two resets. The details of our infrastructure make that not possible for us, but in general it is possible when Suricata is directly reachable from untrusted sources. So you would probably want to do something like disable that "reset everything" behavior from the untrusted side of the network (presumably EXTERNAL_NET), or just simply make it configurable via a YAML option.
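A hypothetical sketch of the YAML-option route, assuming Suricata's long-standing ConfGetBool() helper; the option name and the global are made up for illustration and do not exist today:

    /* Hypothetical: gate the "reset unknown packets" behavior behind a
     * (made-up) configuration option so it can be disabled toward untrusted
     * networks. */
    static int g_reject_both_reset_unknown = 0;

    static void RejectBothConfigInit(void)
    {
        int enabled = 0;
        if (ConfGetBool("exception-policy.reject-both-reset-unknown", &enabled) == 1) {
            g_reject_both_reset_unknown = enabled;
        }
    }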
- Suricata flow state gets lost
How does this happen if the flow is presumably in the established state? Is the configured timeout very low?
For us it's a failover operation where we see this happen - a host-level hardware failure causes existing flows to be redirected to other redundant Suricata hosts, and when the existing connections land on those new hosts the Suricata process running there doesn't know about them. It's actually the failover operation that makes it likely for one of the packets to be lost too - so while the sequence of events described above is possible in general for other reasons too (a process crashing at the right time, etc.), it goes from "extremely unlikely" to "likely enough to cause an active problem" in our environment.
We do have the ability to migrate connections with their state between hosts in controlled circumstances, but for uncontrolled ones (unplanned host failure) resetting those connections is the next best choice which allows the endpoints to fail fast and establish new connections.
Would you expect "persistent TCP resets" (https://redmine.openinfosecfoundation.org/issues/960) to help with this? The idea being that we'll keep sending RST packets as long as we keep receiving more packets. There should perhaps be some kind of backoff mechanism to it, but it could help address the race condition?
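A hedged sketch of the kind of backoff that could accompany persistent resets; the struct, constant, and function here are invented for illustration and are not part of Suricata:

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative per-flow state for a capped, exponentially backed-off
     * resend of RSTs: allow another RST only after 1, 2, 4, ... trigger
     * packets, up to a hard cap, so a packet flood cannot be amplified. */
    typedef struct RstBackoff_ {
        uint32_t packets_seen;   /* trigger packets since the last RST   */
        uint32_t next_threshold; /* packets required before the next RST */
        uint32_t rsts_sent;      /* total RSTs emitted for this flow     */
    } RstBackoff;

    #define RST_BACKOFF_MAX_RSTS 5

    static bool RstBackoffShouldSend(RstBackoff *b)
    {
        if (b->rsts_sent >= RST_BACKOFF_MAX_RSTS)
            return false;
        if (++b->packets_seen < b->next_threshold)
            return false;
        b->packets_seen = 0;
        b->next_threshold = (b->next_threshold == 0) ? 2 : b->next_threshold * 2;
        b->rsts_sent++;
        return true;
    }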
Codecov Report

@@            Coverage Diff             @@
##             main   #14060      +/-   ##
==========================================
- Coverage   84.44%   84.43%    -0.02%
==========================================
  Files        1011     1011
  Lines      272282   272292       +10
==========================================
- Hits       229923   229898       -25
- Misses      42359    42394       +35
WARNING:
Pipeline = 28014
Allow rejecting both sides of a connection. Has the same support as regular reject (which is essentially rejectsrc).
Ticket: #5974.
https://redmine.openinfosecfoundation.org/issues/5974