
Conversation

victorjulien (Member)

Allow rejecting both sides of a connection. Has the same support as regular reject (which is essentially rejectsrc).

Ticket: OISF#5974
https://redmine.openinfosecfoundation.org/issues/5974
victorjulien requested a review from a team as a code owner on October 17, 2025 at 11:27.
Comment on lines -36 to +37

    -#define EXCEPTION_POLICY_MAX (EXCEPTION_POLICY_REJECT + 1)
    +#define EXCEPTION_POLICY_MAX (EXCEPTION_POLICY_REJECT_BOTH + 1)
Member

This looks prone to error if we forget to update it. Should we make MAX a part of the enum above?

Member Author

Maybe, but a consequence of that is that all the runtime code paths would then need to handle the size value in their switch statements.
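
For illustration, a minimal sketch of the trade-off, assuming an enum shaped roughly like the one in this file (the member list here is illustrative, not the actual definition):

    enum ExceptionPolicy {
        EXCEPTION_POLICY_NOT_SET = 0,
        EXCEPTION_POLICY_REJECT,
        EXCEPTION_POLICY_REJECT_BOTH,
        EXCEPTION_POLICY_MAX, /* stays in sync automatically, but is now a real enum value */
    };

    static const char *ExceptionPolicyName(enum ExceptionPolicy p)
    {
        switch (p) {
            case EXCEPTION_POLICY_NOT_SET:
                return "ignore";
            case EXCEPTION_POLICY_REJECT:
                return "reject";
            case EXCEPTION_POLICY_REJECT_BOTH:
                return "reject-both";
            /* with MAX inside the enum, every switch that relies on
             * -Wswitch for exhaustiveness must now carry this case: */
            case EXCEPTION_POLICY_MAX:
                break;
        }
        return "unknown";
    }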

    } else if (strcmp(value_str, "reject") == 0) {
        policy = EXCEPTION_POLICY_REJECT;
    } else if (strcmp(value_str, "reject-both") == 0) {
        policy = EXCEPTION_POLICY_REJECT_BOTH;


GitHub won't let me comment on the unchanged parts of this file, but a few lines below this, on line 248, the FatalErrorOnInit message should be updated to mention reject-both as a valid value.
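
For illustration, the updated error might read something like this (the wording and option list here are guesses, not the actual message in the source):

    FatalErrorOnInit("\"%s\" is not a valid exception policy value. Valid options are "
            "drop-flow, pass-flow, bypass, drop-packet, pass-packet, reject, "
            "reject-both or ignore.", value_str);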

Contributor

Note to also update docs to mention this (including the suricata.yaml bits)
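
A sketch of what the suricata.yaml side might look like, assuming reject-both becomes valid anywhere an exception policy is configured (the keys shown are examples, not an exhaustive list):

    # master exception policy switch
    exception-policy: reject-both

    stream:
      # per-subsystem overrides
      memcap-policy: reject-both
      midstream-policy: reject-both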

        case EXCEPTION_POLICY_BYPASS_FLOW:
            return "bypass";
        case EXCEPTION_POLICY_REJECT_BOTH:
            return "reject_both";


Should that be reject-both (instead of reject_both) to be consistent with the value that gets set in the yaml, or does it need to be conditional on is_json like the others below?

Member

By convention, in the YAML config we use -, but for JSON field names we use _.
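
So the conditional form asked about above would look something like this sketch (assuming an is_json flag like the neighboring cases; not the actual patch):

    case EXCEPTION_POLICY_REJECT_BOTH:
        return is_json ? "reject_both" : "reject-both";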

            }
            if (!EngineModeIsIPS()) {
                break;
            }


Potentially controversial suggestion below, but it is a requirement for our environment, so we would still need to patch this in even if it's not accepted.

The fallthrough to drop-flow below results in an edge case that can cause this not to work when packet loss is present in the network. When the midstream reject-both is triggered specifically by a retransmitted packet from the server side, the reset that gets sent to the client will not be accepted by the client host because its sequence number is no longer current. This is the scenario:

  1. Server sends a data packet to client with sequence number A and B bytes of data
  2. Client sends an ACK to server for A+B
  3. That ACK gets lost in the network before reaching Suricata
  4. Suricata flow state gets lost
  5. Server times out and retransmits the last data packet with sequence A and B bytes of data to client
  6. Suricata receives it and does not have matching flow state, so it resets both sides.
  7. The reset sent to the server side is accepted and resets the server end of the connection
  8. The reset sent to the client side is not accepted because it uses sequence number A, which is no longer in its window because the client has already ACK'ed A+B
  9. Client application hangs as it retries, because all of its sent packets are now quietly dropped by the drop-flow behavior.

At very large scale, this unlikely sequence of events happens often enough to cause a major problem for us. Our solution is effectively to disable the drop-flow behavior by making line 164 below look like this:

            if (policy == EXCEPTION_POLICY_DROP_FLOW && p->flow) {

This avoids setting FLOW_ACTION_DROP on these flows, which causes the first client retry (step 9 above) to also trigger a reset, so the client no longer hangs. I looked at this situation for a while and concluded that it's not possible for Suricata to always use the right sequence number in the presence of packet loss in the network. The only other option would be for reject-both to send one reset to the src (which is always correct) and two to the dst, one of which would be accepted depending on whether the last ACK happened to be lost.
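
A rough sketch of that "one reset to src, two to dst" idea, with send_rst() as a hypothetical placeholder for the actual packet-injection path (seq/ack/length come from the packet that triggered the midstream reject):

    #include <stdint.h>
    #include <stdio.h>

    /* hypothetical stand-in for crafting and injecting a RST */
    static void send_rst(const char *dir, uint32_t seq)
    {
        printf("RST to %s with seq %u\n", dir, seq);
    }

    static void reject_both_sketch(uint32_t pkt_seq, uint32_t pkt_ack, uint32_t payload_len)
    {
        /* toward the packet's sender, the ACK we just saw is always a
         * usable sequence number for the RST */
        send_rst("src", pkt_ack);

        /* toward the receiver we can't know whether its last ACK was
         * lost, so cover both candidates; one should land in-window */
        send_rst("dst", pkt_seq);
        send_rst("dst", pkt_seq + payload_len);
    }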

Note that there is a potential to create something like a DDoS reflection/amplification vector with the solution above, if Suricata responds to every unknown packet by generating two resets. The details of our infrastructure make that impossible for us, but in general it is possible when Suricata is directly reachable from untrusted sources. So you would probably want to disable that "reset everything" behavior on the untrusted side of the network (presumably EXTERNAL_NET), or simply make it configurable via a YAML option.

Member Author

> Suricata flow state gets lost

How does this happen if the flow is presumably in the established state? Is the configured timeout very low?


For us it's a failover operation where we see this happen: a host-level hardware failure causes existing flows to be redirected to other redundant Suricata hosts, and when those existing connections land on the new hosts, the Suricata process running there doesn't know about them. It's also the failover operation that makes it likely for one of the packets to be lost. So while the sequence of events described above is possible in general for other reasons too (a process crashing at the right time, etc.), it goes from "extremely unlikely" to "likely enough to cause an active problem" in our environment.

We do have the ability to migrate connections with their state between hosts in controlled circumstances, but for uncontrolled ones (unplanned host failure) resetting those connections is the next best choice which allows the endpoints to fail fast and establish new connections.

Member Author

Would you expect "persistent TCP resets" (https://redmine.openinfosecfoundation.org/issues/960) to help with this? The idea being that we'll keep sending RST packets as long as we keep receiving more packets. There should perhaps be some kind of backoff mechanism to it, but it could help address the race condition?
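
One possible shape for that backoff, sketched with a hypothetical per-flow counter (not an existing Suricata field): re-send the RST only on exponentially spaced trigger packets.

    #include <stdint.h>

    /* re-send a RST only for the 1st, 2nd, 4th, 8th, ... packet seen
     * since the last RST, so persistent resets can't be used as an
     * amplifier by a flood of trigger packets */
    static int should_resend_rst(uint64_t pkts_seen)
    {
        return pkts_seen != 0 && (pkts_seen & (pkts_seen - 1)) == 0;
    }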

codecov bot commented Oct 17, 2025

Codecov Report

❌ Patch coverage is 69.23077% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.43%. Comparing base (2f633be) to head (54f52e3).

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14060      +/-   ##
==========================================
- Coverage   84.44%   84.43%   -0.02%     
==========================================
  Files        1011     1011              
  Lines      272282   272292      +10     
==========================================
- Hits       229923   229898      -25     
- Misses      42359    42394      +35     
Flag              Coverage          Δ
fuzzcorpus        63.39% <0.00%>    (-0.01%) ⬇️
livemode          19.35% <30.76%>   (-0.01%) ⬇️
pcap              44.76% <7.69%>    (-0.03%) ⬇️
suricata-verify   65.20% <69.23%>   (-0.01%) ⬇️
unittests         59.44% <7.69%>    (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.


suricata-qa

WARNING:

field                          baseline   test   %
SURI_TLPR1_stats_chk .uptime   654        675    103.21%

Pipeline = 28014

