Ban socket addresses not sending a valid connection ID #1096

josecelano · 2024-11-20T11:33:22Z

From: torrust/torrust-demo#14
Relates to: #1033

We are having problems with the tracker demo. The logs contain many connection ID validation errors. The client does not seem to implement the protocol correctly, because it does not send the connection ID it received in the connect response. Since the client makes many requests, this produces a large number of new ERROR records in the logs, which ultimately degrades tracker performance.

Solution Overview:

Hierarchical Counting Bloom Filters (CBFs):
- Individual IP Layer: Use one CBF to track individual IP addresses. This will allow for fine-grained detection of misbehaving IPs.
- Subnet Layer: Implement another level of CBFs where each filter covers a subnet range. This allows for detecting patterns of misbehavior across a subnet without penalizing neighboring IPs inappropriately.

Implementation Details:

Individual IP CBF:
- Each IP address is hashed into this filter.
- When an error occurs, increment the count associated with that IP's hash in the CBF.
- If the count exceeds a threshold, you can take action against that specific IP (e.g., rate limiting or temporary banning).
Subnet CBF:
- Instead of hashing individual IPs, hash the subnet address (e.g., the network part of an IP address).
- When multiple IPs within a subnet misbehave, their errors are aggregated into this subnet's count.
- If the count for a subnet exceeds a different, higher threshold, you can apply measures at the subnet level, like rate limiting traffic from that subnet.
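As a rough illustration of the counter described above, here is a minimal counting Bloom filter written against the Rust standard library only, so the example is self-contained. `CountingBloomFilter`, `increment`, `estimate`, `clear` and `exceeds_threshold` are hypothetical names, not the API of any crate discussed in this thread, and deriving the `k` indexes by double hashing over `DefaultHasher` is just one common choice. The same type can back both layers: key it with an `IpAddr` for the individual IP CBF, or with a subnet prefix for the subnet CBF.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A minimal counting Bloom filter: `m` saturating 8-bit counters and `k` hash functions.
pub struct CountingBloomFilter {
    counters: Vec<u8>,
    num_hashes: u64,
}

impl CountingBloomFilter {
    pub fn new(num_counters: usize, num_hashes: u64) -> Self {
        assert!(num_counters > 0 && num_hashes > 0, "need at least one counter and one hash");
        Self { counters: vec![0; num_counters], num_hashes }
    }

    /// Derives `k` counter indexes with double hashing: (h1 + i * h2) mod m.
    fn indexes<T: Hash>(&self, item: &T) -> Vec<usize> {
        let mut hasher = DefaultHasher::new();
        item.hash(&mut hasher);
        let h1 = hasher.finish();
        // Feed a salt into the same hasher state to obtain a second, different hash.
        let salt: u8 = 0xA5;
        salt.hash(&mut hasher);
        let h2 = hasher.finish() | 1; // odd step
        let m = self.counters.len() as u64;
        (0..self.num_hashes)
            .map(|i| (h1.wrapping_add(i.wrapping_mul(h2)) % m) as usize)
            .collect()
    }

    /// Records one error for `item` (an IP address, or a subnet prefix for the subnet layer).
    pub fn increment<T: Hash>(&mut self, item: &T) {
        for i in self.indexes(item) {
            self.counters[i] = self.counters[i].saturating_add(1);
        }
    }

    /// Estimated error count: the minimum of the item's counters. Collisions can only
    /// inflate it (false positives); it never undercounts until a counter saturates at 255.
    pub fn estimate<T: Hash>(&self, item: &T) -> u8 {
        self.indexes(item)
            .into_iter()
            .map(|i| self.counters[i])
            .min()
            .unwrap_or(0)
    }

    /// Resets every counter, e.g. to unban everyone periodically.
    pub fn clear(&mut self) {
        self.counters.iter_mut().for_each(|c| *c = 0);
    }
}

/// Hypothetical policy check: take action once the estimate exceeds `threshold`.
pub fn exceeds_threshold<T: Hash>(filter: &CountingBloomFilter, item: &T, threshold: u8) -> bool {
    filter.estimate(item) > threshold
}
```

Keying the IP layer by `IpAddr` (e.g. `socket_addr.ip()`) rather than by `SocketAddr` means a client cannot reset its count just by switching source ports.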

Advantages:

Granularity: This approach gives you both detailed control over individual IPs and the ability to detect broader patterns of misbehavior within subnets.
Performance: CBFs are efficient in terms of memory usage and speed, allowing for quick lookups and updates even with large datasets.
Flexibility: You can adjust the thresholds for individual IPs and subnets separately, allowing for different levels of tolerance based on your policy or observed behavior patterns.
False Positives: While CBFs have a small chance of false positives, by using multiple levels (individual and subnet), you can mitigate the impact. For example, if an IP is flagged at both levels, it's more likely to be a true positive.

Challenges:

Configuration: You need to decide on the size of the CBFs, the number of hash functions, and the error thresholds for both individual IPs and subnets. This requires some experimentation or simulation to find the right balance between false positives, memory usage, and effectiveness.
Complexity: Managing two layers of CBFs introduces additional complexity in terms of implementation and maintenance.
False Positives at Subnet Level: If a subnet contains both misbehaving and well-behaving IPs, the well-behaving ones might suffer from the actions taken against the subnet.

Implementation Steps:

Decide on the Subnet Size: Determine what constitutes a subnet for your purposes (e.g., /24, /16, etc.).
Initialize CBFs:
- Create one CBF for individual IPs.
- Create another CBF for subnets, where the keys hashed into it are subnet prefixes rather than individual IPs.
Error Handling Logic:
- When an error occurs:
  - Hash the IP to update the individual IP CBF.
  - Extract the subnet from the IP and hash it to update the subnet CBF.
Action Protocol:
- If an individual IP's count exceeds a threshold, apply rate limiting or other measures to that IP.
- If a subnet's count exceeds a higher threshold, consider similar measures but at the subnet level.
Decay Mechanism: Implement a decay or aging process for counts to ensure that past behavior doesn't indefinitely affect current interactions unless the behavior persists.
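The steps above could be sketched as follows. This assumes a /24 prefix for IPv4 and a /64 prefix for IPv6 (the text leaves the subnet size open), thresholds chosen by the caller (e.g. a hypothetical 10 errors per IP and 50 per subnet), and a decay step that simply halves every counter. `Cbf`, `subnet_key`, `Action` and `HierarchicalTracker` are illustrative names only.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::net::IpAddr;

/// Compact counting Bloom filter with `k` saturating counters per key.
pub struct Cbf {
    counters: Vec<u8>,
    k: u64,
}

impl Cbf {
    pub fn new(m: usize, k: u64) -> Self {
        Self { counters: vec![0; m], k }
    }

    fn indexes<T: Hash>(&self, item: &T) -> Vec<usize> {
        let mut hasher = DefaultHasher::new();
        item.hash(&mut hasher);
        let h1 = hasher.finish();
        let salt: u8 = 0xA5;
        salt.hash(&mut hasher);
        let h2 = hasher.finish() | 1;
        let m = self.counters.len() as u64;
        (0..self.k).map(|i| (h1.wrapping_add(i.wrapping_mul(h2)) % m) as usize).collect()
    }

    pub fn increment<T: Hash>(&mut self, item: &T) {
        for i in self.indexes(item) {
            self.counters[i] = self.counters[i].saturating_add(1);
        }
    }

    pub fn estimate<T: Hash>(&self, item: &T) -> u8 {
        self.indexes(item).into_iter().map(|i| self.counters[i]).min().unwrap_or(0)
    }

    /// Aging: halve every counter so past errors fade unless the behaviour continues.
    pub fn decay(&mut self) {
        self.counters.iter_mut().for_each(|c| *c /= 2);
    }
}

/// Subnet key: the /24 prefix of an IPv4 address or the /64 prefix of an IPv6 address.
/// The version tag keeps IPv4 and IPv6 prefixes from sharing a key.
pub fn subnet_key(ip: IpAddr) -> (u8, u128) {
    match ip {
        IpAddr::V4(v4) => (4, u128::from(u32::from(v4) & 0xFFFF_FF00)),
        IpAddr::V6(v6) => (6, u128::from(v6) & !((1u128 << 64) - 1)),
    }
}

#[derive(Debug, PartialEq)]
pub enum Action {
    Allow,
    LimitIp,
    LimitSubnet,
}

pub struct HierarchicalTracker {
    pub ips: Cbf,
    pub subnets: Cbf,
    pub ip_threshold: u8,
    pub subnet_threshold: u8,
}

impl HierarchicalTracker {
    /// On each error, update both layers.
    pub fn record_error(&mut self, ip: IpAddr) {
        self.ips.increment(&ip);
        self.subnets.increment(&subnet_key(ip));
    }

    /// Subnet-level measures take precedence because they are the broader action.
    pub fn action_for(&self, ip: IpAddr) -> Action {
        if self.subnets.estimate(&subnet_key(ip)) > self.subnet_threshold {
            Action::LimitSubnet
        } else if self.ips.estimate(&ip) > self.ip_threshold {
            Action::LimitIp
        } else {
            Action::Allow
        }
    }

    pub fn decay(&mut self) {
        self.ips.decay();
        self.subnets.decay();
    }
}
```

Calling `decay()` from a periodic task implements the aging step: a client that stops misbehaving drops back under both thresholds after a few rounds, while one that keeps sending bad requests is pushed back over them.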

By employing this hierarchical approach with Counting Bloom Filters, you can effectively manage IP-based errors at different levels of granularity, protecting your network's performance while minimizing the impact on innocent IPs.

Originally posted by @da2ce7 in torrust/torrust-demo#14 (comment)


josecelano · 2024-12-05T18:33:05Z

Hi @da2ce7, I guess you proposed a Counting Bloom Filter mainly because:

- We could have a very large number of misbehaving clients.
- The time needed either to add an item or to check whether an item is in the set is constant, O(k), where k is the number of hash functions.
- They consume less memory than other alternatives.

Notes

There are many implementations in Rust: https://www.reddit.com/r/rust/comments/10y9t9v/there_are_87_bloom_filter_crates_strategies_for/
Xor Filters can't be used because they don't allow adding new entries without rebuilding.
Cuckoo Filters don't support counting natively.

Preliminary research

Crates containing Counting Bloom Filters:

Crates apparently not containing Counting Bloom Filters:

Some explanations:

Papers:

josecelano · 2024-12-05T18:50:12Z

Here's a comparison of three Rust crates that implement Counting Bloom Filters:

| Crate Name | GitHub Stars | Number of Contributors | Initial Release Date | Latest Commit Date | Crates.io Downloads | Used by | Notable Users |
|---|---|---|---|---|---|---|---|
| fastbloom | 86 | 2 | 2 years ago | December 2023 (crate updated 1 year ago) | 66,454 | N/A | N/A |
| bloom | 26 | 1 | 10 years ago | Sep 2016 (crate updated 8 years ago) | 540,063 | N/A | N/A |
| bloom-filters | 7 | 4 | 6 years ago | Jun 2021 (crate updated over 3 years ago) | 203,647 | 292 | Nervos Network |

Notes:

fastbloom: A fast Bloom filter implemented in Rust, with Python bindings available. It supports both standard and counting Bloom filters.
bloom (repository: bloom-rs): Provides standard and counting Bloom filters. It was last updated in 2016, which suggests it may no longer be actively maintained.
bloom-filters: A fast Bloom filter implementation in Rust, primarily maintained by the Nervos Network team.

Data is based on available information as of December 2024.

josecelano · 2024-12-07T13:14:25Z

Hi @da2ce7, I think I'm going to implement it in two phases: first the IPs and then the subnets. ~~I will probably use the bloom-filters crate~~.

NOTES/QUESTIONS:

I think it will be hard to find a good subnet size for the subnet filter. I guess you wanted to introduce this level to prevent DoS attacks, not only to deal with bad client implementations. Maybe you are assuming all those bad implementations are actually attacks. Is there a reason why an attacker would use many IPs in the same subnet? I'm trying to understand how this would be effective without producing many false positives.
Should we use socket addresses instead of IPs? I don't think so.

…imit If the client sends a wrong connection ID more than 10 times, it's banned. In this first implementation, banned clients are only unbanned when the tracker is restarted.

josecelano · 2024-12-09T12:19:00Z

Hi @da2ce7, I've implemented the minimal solution here.

When should we unban an IP?

I think we can unban all IPs every 24 hours (cbf.clear();). ~~We can wrap the filter with a type that resets the inner CBF (deletes it and creates a new one) every 24 hours~~. What do you think?

I have more questions in my previous comment ☝🏼.
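One way to unban everyone on a schedule without adding a clock check to the request-handling loop is to run the reset on its own thread; a later commit in this thread takes a similar route after finding that running the cleaner check on each iteration slowed the UDP tracker down. The sketch below uses `std::thread` and a `Mutex` and is not the tracker's code: `BanList` and `spawn_cleaner` are hypothetical, and in the async tracker a tokio task driven by `tokio::time::interval` would be the more natural fit.

```rust
use std::collections::HashSet;
use std::net::IpAddr;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical ban list shared between the request loop and the cleaner.
#[derive(Default)]
pub struct BanList {
    banned: HashSet<IpAddr>,
}

impl BanList {
    pub fn ban(&mut self, ip: IpAddr) {
        self.banned.insert(ip);
    }

    pub fn is_banned(&self, ip: &IpAddr) -> bool {
        self.banned.contains(ip)
    }

    /// Unbans everyone (the `cbf.clear()` idea applied to the whole list).
    pub fn clear(&mut self) {
        self.banned.clear();
    }
}

/// Clears the list every `interval` on a dedicated thread, so the hot request loop
/// only takes the lock to check or add bans and never checks the clock itself.
pub fn spawn_cleaner(bans: Arc<Mutex<BanList>>, interval: Duration) -> thread::JoinHandle<()> {
    thread::spawn(move || loop {
        thread::sleep(interval);
        bans.lock().expect("ban list mutex poisoned").clear();
    })
}
```

The 24-hour period proposed above would be passed as `Duration::from_secs(24 * 60 * 60)`.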

…tured The error is captured before it is converted into a UDP error response, instead of capturing the mapped error in the caller function. This avoids having to parse the error message to filter out the error we are interested in.

Running the cleaner check on each iteration decreased the UDP tracker performance.

josecelano · 2024-12-12T16:15:00Z

Today, we discussed this issue in our weekly meeting.

@da2ce7 said the false positive rate is too high. @da2ce7, I forgot to mention that you can set the rate:

The frequency of false positives can be precisely bounded by setting the size of the filter, and is called the False Positive Rate.

See https://docs.rs/bloom/latest/bloom/index.html#bloom-filters.
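For a plain Bloom filter, the link between size and false-positive rate is given by the textbook formulas m = ceil(-n ln(p) / (ln 2)^2) cells and k = round((m / n) ln 2) hash functions, with an expected rate of (1 - e^(-kn/m))^k after n insertions. The hypothetical helpers below just evaluate those formulas; they are not part of the `bloom` crate. The caveat they make visible is that the bound only holds for the expected number of items n: for n = 100 and p = 0.01 they give 959 cells and 7 hash functions, and inserting 1,000 items into that same filter pushes the expected rate above 0.9. If, as we read the crate docs, the last argument of `CountingBloomFilter::with_rate(4, 0.01, 100)` is the expected number of items, the filter in the PR would be tuned for about 100 IPs. For a counting filter queried with a threshold (more than 10 errors), the chance of a wrongful ban differs from this membership rate and is not modeled here.

```rust
use std::f64::consts::LN_2;

/// Textbook sizing for a Bloom filter holding `n` items with target false-positive rate `p`:
/// m = ceil(-n * ln(p) / (ln 2)^2) cells and k = round((m / n) * ln 2) hash functions.
pub fn optimal_size(n: f64, p: f64) -> (u64, u32) {
    let m = (-n * p.ln() / (LN_2 * LN_2)).ceil();
    let k = ((m / n) * LN_2).round().max(1.0);
    (m as u64, k as u32)
}

/// Expected false-positive rate after inserting `n` items into `m` cells with `k` hashes:
/// (1 - e^(-k * n / m))^k.
pub fn expected_false_positive_rate(m: u64, k: u32, n: f64) -> f64 {
    let fill = 1.0 - (-(k as f64) * n / (m as f64)).exp();
    fill.powi(k as i32)
}
```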

@da2ce7 also commented that there has been an open issue about the high false-positive rate for a long time:

False-positive rate much higher due to broken HashIter implementation nicklan/bloom-rs#2

We were also considering another crate, which I discarded because it does not implement a Counting Bloom Filter. The crate:

https://github.com/tomtomwombat/fastbloom

NOTE: It has the same name as the other one I was analyzing, but they are actually two different packages:

I've opened a new issue asking about their plans to add a Counting Bloom Filter feature.

We also discussed alternative implementations to remove false positives. I will describe the solution in a new comment.

josecelano · 2024-12-12T16:54:28Z

1. Alternative Implementation: Two-Tiered Approach

The basic idea is to use the CBF just as a fast filter to detect potential bad actors. If the counter for an IP goes over a threshold, we don't ban the IP directly. Instead, we add the IP to a reliable secondary structure: a HashMap.

Counting Bloom Filter as a Fast Filter:

Use the CBF to estimate potential misbehaving IPs quickly.
Increment the counter in the CBF for each bad request from an IP.
When the counter for an IP exceeds 10 in the CBF, move that IP to a more reliable structure for precise counting.

Reliable Backend Structure:

Use a HashMap (or another reliable key-value store) for precise counting of IPs that are flagged as misbehaving by the CBF.
In the HashMap, count up to 10 precise errors and only ban the IP when the count reaches 10.
Once an IP is banned, you no longer need to query it in the HashMap (or the CBF), which helps reduce the overhead.

False Positive Handling:

False positives in the CBF will only lead to lookups in the HashMap but will not result in incorrect bans.
This ensures no incorrect bans (false positives), because the actual banning decision is always based on the precise counts in the HashMap.

Unban Handling:

We can remove IPs from the HashMap either periodically or once the ban period for that specific IP has elapsed. For the latter, we can store a timestamp recording when the ban started.
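A minimal sketch of this two-tiered approach, under stated assumptions: an IP gets an exact HashMap counter only once its CBF estimate exceeds 10, it is banned when that exact counter reaches 10, and a ban lasts 120 seconds (the window mentioned in the later PR description). `BanService` borrows the name of the service extracted in the commits, but its fields and methods here are hypothetical, and the small `Cbf` type stands in for whichever counting filter crate is actually used.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::net::IpAddr;
use std::time::{Duration, Instant};

/// First tier: a small counting Bloom filter (cheap, but collisions can inflate counts).
struct Cbf {
    counters: Vec<u8>,
    k: u64,
}

impl Cbf {
    fn indexes(&self, ip: &IpAddr) -> Vec<usize> {
        let mut hasher = DefaultHasher::new();
        ip.hash(&mut hasher);
        let h1 = hasher.finish();
        let salt: u8 = 0xA5;
        salt.hash(&mut hasher);
        let h2 = hasher.finish() | 1;
        let m = self.counters.len() as u64;
        (0..self.k).map(|i| (h1.wrapping_add(i.wrapping_mul(h2)) % m) as usize).collect()
    }

    fn increment(&mut self, ip: &IpAddr) {
        for i in self.indexes(ip) {
            self.counters[i] = self.counters[i].saturating_add(1);
        }
    }

    fn estimate(&self, ip: &IpAddr) -> u8 {
        self.indexes(ip).into_iter().map(|i| self.counters[i]).min().unwrap_or(0)
    }
}

/// Second tier: exact counts for IPs the CBF flagged, plus ban start times.
pub struct BanService {
    cbf: Cbf,
    exact_errors: HashMap<IpAddr, u32>,
    banned_since: HashMap<IpAddr, Instant>,
    max_errors: u32,
    ban_duration: Duration,
}

impl BanService {
    pub fn new(max_errors: u32, ban_duration: Duration) -> Self {
        Self {
            cbf: Cbf { counters: vec![0; 1024], k: 4 },
            exact_errors: HashMap::new(),
            banned_since: HashMap::new(),
            max_errors,
            ban_duration,
        }
    }

    /// Called for each request from `ip` that carries an invalid connection ID.
    pub fn record_error(&mut self, ip: IpAddr, now: Instant) {
        if self.is_banned(&ip, now) {
            return; // already banned: no need to touch either structure
        }
        self.cbf.increment(&ip);
        // Cheap check first: only IPs the CBF flags get an exact counter.
        if u32::from(self.cbf.estimate(&ip)) <= self.max_errors {
            return;
        }
        let count = {
            let entry = self.exact_errors.entry(ip).or_insert(0);
            *entry += 1;
            *entry
        };
        // The ban decision uses only the exact count, so collisions alone cannot ban an IP.
        if count >= self.max_errors {
            self.exact_errors.remove(&ip);
            self.banned_since.insert(ip, now);
        }
    }

    pub fn is_banned(&self, ip: &IpAddr, now: Instant) -> bool {
        self.banned_since
            .get(ip)
            .map_or(false, |since| now.duration_since(*since) < self.ban_duration)
    }

    /// Periodic cleanup: forget bans older than `ban_duration`.
    pub fn purge_expired(&mut self, now: Instant) {
        let ban_duration = self.ban_duration;
        self.banned_since.retain(|_, since| now.duration_since(*since) < ban_duration);
    }

    /// Could feed a "number of banned IPs" metric.
    pub fn banned_count(&self) -> usize {
        self.banned_since.len()
    }
}
```

With these settings a misbehaving IP is banned on its 20th bad request: the 11th pushes the CBF estimate over 10 and starts the exact count, and the 20th brings that count to 10. That is the trade-off mentioned later in this thread: clients may send more wrong connection IDs before being banned, but a collision alone never bans anyone.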

Pros

No false positives. No client is banned accidentally.

Cons

For potentially misbehaving IPs, we have to double-check them by accessing two data structures. I wonder whether that would be more costly than just replying with the error message. After all, we don't even need to get data from the main torrent repository, which is, I think, one of the main bottlenecks.
If there are many bad actors, that can lead to another type of attack: memory consumption. But that's a problem we have anyway for normal requests.

Questions

When should we clean the CBF? I think @da2ce7 proposed not to clean it because the bucket might contain more than one IP.
@da2ce7 I think this was not exactly your idea because you mentioned something about the IP hash. Could you correct this description of the implementation?

josecelano · 2024-12-13T18:42:34Z

In the new implementation, we should add a new metric to the tracker stats for the number of banned IPs.

We are using a Counting Bloom Filter to count IPs sending wrong connection IDs. IPs are banned after sending 10 wrong connection IDs. CBFs are fast and use little memory, but they are also inaccurate. They have false positives, meaning some IPs would be banned only because of bucket collisions (IPs sharing the same counter). To avoid banning IPs incorrectly, we decided to introduce a second counter: a HashMap that counts errors exactly. IPs are only banned when this counter reaches the limit. We keep the CBF as a first-level filter; it's a fast check that filters IPs without affecting the tracker's performance. When an IP would be banned according to the first filter, we start a counter for that IP in the second, exact counter. This solution should work well if the number of IPs is low. We will have to find another solution anyway for IPv6, where it is cheaper to own a range of IPs.

Since the new solution with a HashMap consumes more memory, we should keep the banning list short. The drawback is that clients will be allowed to send more wrong connection IDs. However, sending 10 requests with wrong connection IDs every 2 minutes should not affect performance much, unless we have many IPs, and in that case we would have a memory problem anyway. In the future, sysadmins could set this via a configuration value.

…limit The live demo tracker is receiving many UDP requests with wrong connection IDs. Errors are logged (written to disk), and that decreases the tracker's performance. This counts errors and bans IPs after 10 errors for 2 minutes. We use two levels of counters:

1. First level: a Counting Bloom Filter. Fast and low memory consumption, but inaccurate (false positives).
2. Second level: a HashMap. Exact counter for IPs.

CBFs are fast and use little memory, but they are also inaccurate. They have false positives, meaning some IPs would be banned only because of bucket collisions (IPs sharing the same counter). To avoid banning IPs incorrectly, we decided to introduce a second counter, a HashMap that counts errors precisely. IPs are only banned when this counter reaches the limit (over 10 errors). We keep the CBF as a first-level filter: a fast IP check that doesn't affect the tracker's performance. When an IP would be banned according to the first filter, we double-check it in the HashMap. Checking the CBF first is faster than always checking for banned IPs against the HashMap. This solution should work well if the number of IPs is low. We will have to find another solution anyway for IPv6, where it is cheaper to own a range of IPs.

29e506d feat: use default aquatic udp port for benchmarking (Jose Celano)
10f9bda feat: [#1096] ban client IP when exceeds connection ID errors limit (Jose Celano)
87401e8 chore(deps): add dependency bloom (Jose Celano)

Pull request description:

This PR uses a [Counting Bloom Filter](https://docs.rs/bloom/latest/bloom/#counting-bloom-filters) to count IPs sending UDP requests with wrong connection IDs. The IP is banned when the tracker receives more than 10 requests with a bad connection ID from a given IP. Bad connection IDs are cookie values that have expired or are from the future.

With the current `CountingBloomFilter` configuration (0.01 rate), we would have a **False Positive** for every 10000 IPs, meaning that when two IPs collide and one of them is misbehaving, the other one would also be banned. To avoid false positives, we introduced a second counter with a HashMap. This consumes more memory, but it's reset every 120 seconds. The HashMap is only used when the CBF detects a potential bad client.

### TODO

- [x] Straightforward implementation
- [x] Benchmarking (how much this new feature affects performance)
- [x] Add an E2E test
- [x] Remove IPs from the banned list every hour
- [x] Review filter settings `CountingBloomFilter::with_rate(4, 0.01, 100)`
- [x] Refactor: extract the IP ban service from the main loop
- [x] Benchmarking after extracting `BanService`

### Questions

- [ ] Should we add a configuration option for the maximum number of errors allowed?

### Future PR

- [ ] Add a metric to tracker stats for the number of banned IPs.
- [ ] Ban subnets

ACKs for top commit:
  josecelano:
    ACK 29e506d

Tree-SHA512: 004959e00eced1b9c1de39de81f8f9f1d8da1b46f5ee38b3b0679e77cc40448525ac197145ace5dd62017c39a72f7175b06f556e6a7eb8cffbdc57f67052a856

josecelano added Enhancement / Feature Request Something New - Admin - Enjoyable to Install and Setup our Software Optimization Make it Faster labels Nov 20, 2024

josecelano self-assigned this Dec 5, 2024

josecelano linked a pull request Dec 9, 2024 that will close this issue

Feat: ban IPs not sending a valid connection ID #1124 (Merged, 10 tasks)

josecelano added a commit to josecelano/torrust-tracker that referenced this issue Dec 9, 2024

test: [torrust#1096] add E2E test for banning IP sending bad connection IDs

b94359f

josecelano added a commit to josecelano/torrust-tracker that referenced this issue Dec 9, 2024

test: [torrust#1096] add E2E test for banning IPs sending bad connection IDs

446a906

josecelano added a commit to josecelano/torrust-tracker that referenced this issue Dec 9, 2024

feat: [torrust#1096] reset IP banning list for connections errors every hour

2690bcd

josecelano added a commit to josecelano/torrust-tracker that referenced this issue Dec 9, 2024

feat: [torrust#1096] reset IP banning list for connections errors every hour

4801edf

josecelano added a commit to josecelano/torrust-tracker that referenced this issue Dec 9, 2024

refactor: [torrust#1096] extract BanService

befeee9

josecelano added a commit to josecelano/torrust-tracker that referenced this issue Dec 10, 2024

refactor: [torrust#1096] run IP bans cleaner to a new thread

010a2e5

Running the cleaner check on each iteration decreased the UDP tracker performance.

josecelano added a commit to josecelano/torrust-tracker that referenced this issue Dec 10, 2024

refactor: [torrust#1096] run IP bans cleaner to a new thread

77cf089

Running the cleaner check on each iteration decreased the UDP tracker performance.

josecelano added a commit to josecelano/torrust-tracker that referenced this issue Dec 10, 2024

docs: [torrust#1096] add mod doc the banning mod

d539959

josecelano added a commit to josecelano/torrust-tracker that referenced this issue Dec 10, 2024

docs: [torrust#1096] add mod doc the banning mod

7b4ec75

josecelano added a commit to josecelano/torrust-tracker that referenced this issue Dec 11, 2024

test: [torrust#1096] add E2E test for banning IPs sending bad connection IDs

18f6e71

josecelano added a commit to josecelano/torrust-tracker that referenced this issue Dec 11, 2024

feat: [torrust#1096] reset IP banning list for connections errors every hour

8d9ae4c

josecelano added a commit to josecelano/torrust-tracker that referenced this issue Dec 11, 2024

refactor: [torrust#1096] extract BanService

b63eb2b

josecelano added a commit to josecelano/torrust-tracker that referenced this issue Dec 11, 2024

refactor: [torrust#1096] run IP bans cleaner to a new thread

a848f40

Running the cleaner check on each iteration decreased the UDP tracker performance.

josecelano added a commit to josecelano/torrust-tracker that referenced this issue Dec 11, 2024

docs: [torrust#1096] add mod doc the banning mod

88b447f

josecelano mentioned this issue Dec 12, 2024

Counting Bloom Filter? tomtomwombat/fastbloom#10 (Open)

josecelano mentioned this issue Dec 12, 2024

Feat: ban IPs not sending a valid connection ID #1124 (Merged, 10 tasks)

josecelano added a commit to josecelano/torrust-tracker that referenced this issue Dec 16, 2024

test: [torrust#1096] add E2E test for banning IPs sending bad connection IDs

842e36c

josecelano added a commit to josecelano/torrust-tracker that referenced this issue Dec 16, 2024

feat: [torrust#1096] reset IP banning list for connections errors every hour

cde6e26

josecelano added a commit to josecelano/torrust-tracker that referenced this issue Dec 16, 2024

refactor: [torrust#1096] extract BanService

c5ca079

josecelano added a commit to josecelano/torrust-tracker that referenced this issue Dec 16, 2024

refactor: [torrust#1096] run IP bans cleaner to a new thread

a3cd856

Running the cleaner check on each iteration decreased the UDP tracker performance.

josecelano added a commit to josecelano/torrust-tracker that referenced this issue Dec 16, 2024

docs: [torrust#1096] add mod doc the banning mod

26c05e5

josecelano closed this as completed in #1124 Dec 17, 2024

josecelano mentioned this issue Dec 17, 2024

Consider config option to ignore connection ID expiration in UDP tracker #1136 (Open)
