Ban socket addresses not sending a valid connection ID #1096

Closed
josecelano opened this issue Nov 20, 2024 · 7 comments · Fixed by #1124
Labels: - Admin - (Enjoyable to Install and Setup our Software), Enhancement / Feature Request (Something New), Optimization (Make it Faster)

Comments


josecelano commented Nov 20, 2024

From: torrust/torrust-demo#14
Relates to: #1033

We are having problems with the tracker demo. The logs contain many errors validating the connection ID. It looks like the client doesn't implement the protocol correctly because it's not sending the connection ID received from the connect request. Since the client is making many requests, this produces a lot of new ERROR records in the logs, ultimately degrading tracker performance.

Solution Overview:

  1. Hierarchical Counting Bloom Filters:

    • Individual IP Layer: Use one CBF to track individual IP addresses. This will allow for fine-grained detection of misbehaving IPs.

    • Subnet Layer: Implement another level of CBFs where each filter covers a subnet range. This allows for detecting patterns of misbehavior across a subnet without penalizing neighboring IPs inappropriately.

Implementation Details:

  • Individual IP CBF:

    • Each IP address is hashed into this filter.
    • When an error occurs, increment the count associated with that IP's hash in the CBF.
    • If the count exceeds a threshold, you can take action against that specific IP (e.g., rate limiting or temporary banning).
  • Subnet CBF:

    • Instead of hashing individual IPs, hash the subnet address (e.g., the network part of an IP address).
    • When multiple IPs within a subnet misbehave, their errors are aggregated into this subnet's count.
    • If the count for a subnet exceeds a different, higher threshold, you can apply measures at the subnet level, like rate limiting traffic from that subnet.
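
To make the per-IP layer concrete, here is a minimal sketch of a counting Bloom filter using only the Rust standard library. It is not the crate used in the tracker: the type `CountingBloomFilter`, the helper `record_error`, and the choice of seeded `DefaultHasher` hashes with 8-bit counters are all assumptions made for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::net::IpAddr;

/// Illustrative counting Bloom filter: `num_hashes` seeded hashes over
/// a vector of saturating 8-bit counters.
pub struct CountingBloomFilter {
    counters: Vec<u8>,
    num_hashes: u64,
}

impl CountingBloomFilter {
    pub fn new(num_counters: usize, num_hashes: u64) -> Self {
        assert!(num_counters > 0 && num_hashes > 0);
        Self { counters: vec![0; num_counters], num_hashes }
    }

    /// Index of the counter selected by the `seed`-th hash function.
    fn index<T: Hash>(&self, item: &T, seed: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        seed.hash(&mut hasher);
        item.hash(&mut hasher);
        (hasher.finish() % self.counters.len() as u64) as usize
    }

    /// Record one event for `item` by incrementing each of its counters.
    pub fn increment<T: Hash>(&mut self, item: &T) {
        for seed in 0..self.num_hashes {
            let i = self.index(item, seed);
            self.counters[i] = self.counters[i].saturating_add(1);
        }
    }

    /// Estimated count: the minimum over the item's counters.
    /// Collisions can make it too high, but it never undercounts
    /// (until a counter saturates at 255).
    pub fn estimate<T: Hash>(&self, item: &T) -> u8 {
        (0..self.num_hashes)
            .map(|seed| self.counters[self.index(item, seed)])
            .min()
            .unwrap_or(0)
    }

    /// True when the estimated count is strictly above `threshold`.
    pub fn exceeds<T: Hash>(&self, item: &T, threshold: u8) -> bool {
        self.estimate(item) > threshold
    }
}

/// Hypothetical helper: count one error for `ip` and report whether
/// the IP is now over the threshold.
pub fn record_error(filter: &mut CountingBloomFilter, ip: IpAddr, threshold: u8) -> bool {
    filter.increment(&ip);
    filter.exceeds(&ip, threshold)
}
```

The subnet layer would use the same structure, keyed by the network part of the address instead of the full IP.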

Advantages:

  • Granularity: This approach gives you both detailed control over individual IPs and the ability to detect broader patterns of misbehavior within subnets.

  • Performance: CBFs are efficient in terms of memory usage and speed, allowing for quick lookups and updates even with large datasets.

  • Flexibility: You can adjust the thresholds for individual IPs and subnets separately, allowing for different levels of tolerance based on your policy or observed behavior patterns.

  • False Positives: While CBFs have a small chance of false positives, by using multiple levels (individual and subnet), you can mitigate the impact. For example, if an IP is flagged at both levels, it's more likely to be a true positive.

Challenges:

  • Configuration: You need to decide on the size of the CBFs, the number of hash functions, and the error thresholds for both individual IPs and subnets. This requires some experimentation or simulation to find the right balance between false positives, memory usage, and effectiveness.

  • Complexity: Managing two layers of CBFs introduces additional complexity in terms of implementation and maintenance.

  • False Positives at Subnet Level: If a subnet contains both misbehaving and well-behaving IPs, the well-behaving ones might suffer from the actions taken against the subnet.

Implementation Steps:

  1. Decide on the Subnet Size: Determine what constitutes a subnet for your purposes (e.g., /24, /16, etc.).

  2. Initialize CBFs:

    • Create one CBF for individual IPs.
    • Create another CBF for subnets, where each bucket represents a subnet.
  3. Error Handling Logic:

    • When an error occurs:
      • Hash the IP to update the individual IP CBF.
      • Extract the subnet from the IP and hash it to update the subnet CBF.
  4. Action Protocol:

    • If an individual IP's count exceeds a threshold, apply rate limiting or other measures to that IP.
    • If a subnet's count exceeds a higher threshold, consider similar measures but at the subnet level.
  5. Decay Mechanism: Implement a decay or aging process for counts to ensure that past behavior doesn't indefinitely affect current interactions unless the behavior persists.
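
Two of the steps above, extracting the subnet and decaying the counts, can be sketched with the standard library alone. The function names `subnet_of` and `decay`, the example prefix lengths, and the halving strategy are assumptions for illustration; the proposal does not prescribe them.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// Network part of an address: keeps the leading prefix bits and zeroes
/// the rest. Prefix lengths larger than the address width are clamped.
pub fn subnet_of(ip: IpAddr, v4_prefix: u32, v6_prefix: u32) -> IpAddr {
    match ip {
        IpAddr::V4(addr) => {
            let prefix = v4_prefix.min(32);
            let mask = if prefix == 0 { 0 } else { u32::MAX << (32 - prefix) };
            IpAddr::V4(Ipv4Addr::from(u32::from(addr) & mask))
        }
        IpAddr::V6(addr) => {
            let prefix = v6_prefix.min(128);
            let mask = if prefix == 0 { 0 } else { u128::MAX << (128 - prefix) };
            IpAddr::V6(Ipv6Addr::from(u128::from(addr) & mask))
        }
    }
}

/// One possible decay step: halve every counter so that old errors fade
/// unless the misbehavior keeps recurring.
pub fn decay(counters: &mut [u8]) {
    for counter in counters.iter_mut() {
        *counter /= 2;
    }
}
```

The value returned by `subnet_of` is what would be hashed into the subnet filter, while the unmodified address goes into the individual IP filter.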

By employing this hierarchical approach with Counting Bloom Filters, you can effectively manage IP-based errors at different levels of granularity, protecting your network's performance while minimizing the impact on innocent IPs.

Originally posted by @da2ce7 in torrust/torrust-demo#14 (comment)

josecelano added the Enhancement / Feature Request (Something New), - Admin - (Enjoyable to Install and Setup our Software), and Optimization (Make it Faster) labels Nov 20, 2024
josecelano self-assigned this Dec 5, 2024

josecelano commented Dec 5, 2024

Hi @da2ce7, I guess you proposed Counting Bloom Filters mainly because:

  • We can have too many misbehaving clients.
  • The time needed to add an item or to check whether an item is in the set is a fixed constant, O(k), where k is the number of hash functions.
  • They consume less memory than other alternatives.

Notes

Preliminary research

Crates containing Counting Bloom Filters:

Crates apparently not containing Counting Bloom Filters:

Some explanations:

Papers:


josecelano commented Dec 5, 2024

Here's a comparison of three Rust crates that implement Counting Bloom Filters:

| Crate Name | GitHub Stars | Number of Contributors | Initial Release Date | Latest Commit Date | Crates.io Downloads | Used by | Notable Users |
|---|---|---|---|---|---|---|---|
| fastbloom | 86 | 2 | 2 years ago | December 2023 (crate updated 1 year ago) | 66,454 | N/A | N/A |
| bloom | 26 | 1 | 10 years ago | Sep 2016 (crate updated 8 years ago) | 540,063 | N/A | N/A |
| bloom-filters | 7 | 4 | 6 years ago | Jun 2021 (crate updated over 3 years ago) | 203,647 | 292 | Nervos Network |

Notes:

  • fastbloom: A fast Bloom filter implemented in Rust, with Python bindings available. It supports both standard and counting Bloom filters.

  • bloom-rs: Provides standard and counting Bloom filters. Last updated in 2016, indicating potential lack of recent maintenance.

  • bloom-filters: A fast Bloom filter implementation in Rust, primarily maintained by the Nervos Network team.

Data is based on available information as of December 2024.


josecelano commented Dec 7, 2024

Hi @da2ce7 I think I'm going to implement it in two phases, first the IPs and then the subnets. I will probably use the bloom-filters crate.

NOTES/QUESTIONS:

  • I think it will be hard to find a good size for the subnet in the subnet filter. I guess you wanted to introduce this level to avoid DoS attacks and not only bad client implementations. Maybe you are assuming all those bad implementations are actually attacks. Is there a reason why an attacker could use many IPs in the same subnet? I'm trying to understand why this would be effective without banning many false positives.
  • Should we use socket addresses instead of IPs? I don't think so.

josecelano added a commit to josecelano/torrust-tracker that referenced this issue Dec 9, 2024
…imit

If the client sends a wrong connection ID more than 10 times, it is
banned. In this first implementation, banned clients are only unbanned
when the tracker is restarted.
@josecelano josecelano linked a pull request Dec 9, 2024 that will close this issue
10 tasks

josecelano commented Dec 9, 2024

Hi @da2ce7, I've implemented the minimal solution here.

When should we unban an IP?

I think we can unban all IPs every 24 hours (cbf.clear();). We can wrap the filter with a type that resets the inner CBF (deletes it and creates a new one) every 24 hours. What do you think?
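
A minimal sketch of such a wrapper is shown below. It is generic over the inner filter so that it stays self-contained; `PeriodicReset` and its methods are hypothetical names, and passing `now` explicitly is only a choice that makes the behavior easy to test. Note that this sketch checks the elapsed time on every access, and a later commit in this thread reports that running the cleaner check on each iteration decreased performance, so a separate timer task may be the better fit in practice.

```rust
use std::time::{Duration, Instant};

/// Wraps a value and replaces it with a fresh one once `period` has elapsed.
pub struct PeriodicReset<T> {
    inner: T,
    make: fn() -> T,
    period: Duration,
    last_reset: Instant,
}

impl<T> PeriodicReset<T> {
    pub fn new(make: fn() -> T, period: Duration, now: Instant) -> Self {
        Self { inner: make(), make, period, last_reset: now }
    }

    /// Returns the inner value, recreating it first if the period has elapsed.
    pub fn get(&mut self, now: Instant) -> &mut T {
        if now.duration_since(self.last_reset) >= self.period {
            self.inner = (self.make)();
            self.last_reset = now;
        }
        &mut self.inner
    }
}
```

With a 24-hour period, every entry recorded before the reset is dropped at once, which unbans all IPs together rather than each IP after its own ban duration.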

I have more questions in my previous comment ☝🏼.

josecelano added a commit to josecelano/torrust-tracker that referenced this issue Dec 9, 2024
…tured

Instead of capturing the mapped error in the caller function, when the
error has already been converted into a UDP error response.

This avoids parsing the error message to filter the error we are
interested in.
josecelano added a commit to josecelano/torrust-tracker that referenced this issue Dec 9, 2024
josecelano added a commit to josecelano/torrust-tracker that referenced this issue Dec 10, 2024
Running the cleaner check on each iteration decreased the UDP tracker
performance.

josecelano commented Dec 12, 2024

Today, we discussed this issue in our weekly meeting.

@da2ce7 said the false positive rate is too high. @da2ce7, I forgot to mention that you can set the rate:

The frequency of false positives can be precisely bounded by setting the size of the filter, and is called the False Positive Rate.

See https://docs.rs/bloom/latest/bloom/index.html#bloom-filters.
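
For reference, the textbook approximation behind that statement is given below. The symbols are not from the thread: m is the number of counters, k the number of hash functions, and n the number of distinct items inserted, and the formula assumes idealized, independent, uniform hash functions. Whether a given crate sizes its filter exactly this way is an assumption to verify against its source.

$$
p \approx \left(1 - e^{-kn/m}\right)^{k}, \qquad k_{\text{opt}} = \frac{m}{n}\ln 2
$$

For a fixed n, increasing m lowers p, which is why the rate can be bounded by choosing the filter size.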

@da2ce7 also mentioned that there has long been an open issue about the high false positive rate:

We were also considering another crate, which I discarded because it does not have an implementation of a Counting Bloom Filter. The crate:

https://github.com/tomtomwombat/fastbloom

NOTE: It has the same name as the other one I was analyzing, but they are actually two different packages:

I've opened a new issue to ask them for their plans to add a Counting Bloom Filter feature.

We also discussed alternative implementations to remove false positives. I will describe the solution in a new comment.


josecelano commented Dec 12, 2024

1. Alternative Implementation: Two-Tiered Approach

The basic idea is to use the CBF just as a fast filter to detect potential bad actors. If the counter for an IP goes over a threshold, we don't ban the IP directly. Instead, we add the IP to a reliable secondary list with a HashMap.

Counting Bloom Filter as a Fast Filter:

  • Use the CBF to estimate potential misbehaving IPs quickly.
  • Increment the counter in the CBF for each bad request from an IP.
  • When the counter for an IP exceeds 10 in the CBF, move that IP to a more reliable structure for precise counting.

Reliable Backend Structure:

  • Use a HashMap (or another reliable key-value store) for precise counting of IPs that are flagged as misbehaving by the CBF.
  • In the HashMap, count up to 10 precise errors and only ban the IP when the count reaches 10.
  • Once an IP is banned, you no longer need to query it in the HashMap (or the CBF), which helps reduce the overhead.

False Positive Handling:

  • False positives in the CBF will only lead to lookups in the HashMap but will not result in incorrect bans.
  • This ensures no false positives because the actual banning decision is always based on the precise counts in the HashMap.

Unban Handling:

  • We can remove IPs from the HashMap periodically or after a period for that concrete IP. We can include a timestamp for when the ban started.
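
The two-tiered idea can be sketched as follows, using only the standard library. This is not the merged implementation: `TwoTierBanList`, the inlined counting filter with three seeded hashes, and the single `reset` method standing in for unban handling are assumptions made for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::net::IpAddr;

const NUM_HASHES: u64 = 3; // illustrative value

pub struct TwoTierBanList {
    coarse: Vec<u8>,             // counting filter counters, may overcount
    exact: HashMap<IpAddr, u32>, // precise counts for IPs flagged by the filter
    limit: u32,                  // must stay below 255 for the u8 counters
}

impl TwoTierBanList {
    pub fn new(num_counters: usize, limit: u32) -> Self {
        assert!(num_counters > 0);
        Self { coarse: vec![0; num_counters], exact: HashMap::new(), limit }
    }

    fn slot(&self, ip: &IpAddr, seed: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        seed.hash(&mut hasher);
        ip.hash(&mut hasher);
        (hasher.finish() % self.coarse.len() as u64) as usize
    }

    /// Minimum over the IP's counters: never below the true count.
    fn coarse_estimate(&self, ip: &IpAddr) -> u32 {
        (0..NUM_HASHES)
            .map(|seed| u32::from(self.coarse[self.slot(ip, seed)]))
            .min()
            .unwrap_or(0)
    }

    /// Record one bad request. Errors go to the coarse filter until its
    /// estimate exceeds the limit; after that they are counted exactly.
    pub fn record_error(&mut self, ip: IpAddr) {
        if self.coarse_estimate(&ip) > self.limit {
            *self.exact.entry(ip).or_insert(0) += 1;
        } else {
            for seed in 0..NUM_HASHES {
                let i = self.slot(&ip, seed);
                self.coarse[i] = self.coarse[i].saturating_add(1);
            }
        }
    }

    /// Only the exact counter decides a ban, so a filter collision alone
    /// cannot ban an innocent IP.
    pub fn is_banned(&self, ip: &IpAddr) -> bool {
        self.exact.get(ip).map_or(false, |&count| count >= self.limit)
    }

    /// Unban everything by clearing both tiers.
    pub fn reset(&mut self) {
        self.coarse.iter_mut().for_each(|counter| *counter = 0);
        self.exact.clear();
    }
}
```

An IP that collides with a misbehaving one in the coarse filter only gets an entry in the HashMap once it sends bad requests of its own, and it is banned only after reaching the limit there.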

Pros

  • No false positives. No client is banned accidentally.

Cons

  • For potentially misbehaving IPs, we have to double-check by accessing two data structures. I wonder whether that would be more costly than just replying with the error message. After all, we don't even need to get data from the main torrent repository, which is, I think, one of the main bottlenecks.
  • If there are many bad actors, that can lead to another type of attack: memory consumption. But that's a problem we have anyway for normal requests.

Questions

  • When should we clean the CBF? I think @da2ce7 proposed not to clean it because the bucket might contain more than one IP.
  • @da2ce7 I think this was not exactly your idea because you mentioned something about the IP hash. Could you correct this description of the implementation?

@josecelano

In the new implementation, add a new metric to the tracker stats for the number of banned IPs.

josecelano added a commit to josecelano/torrust-tracker that referenced this issue Dec 16, 2024
We are using a Counting Bloom Filter to count IPs sending wrong
connection IDs. IPs are banned after sending 10 wrong connection IDs.

CBFs are fast and use little memory, but they are also inaccurate. They
have false positives, meaning some IPs would be banned only because
there are bucket collisions (IPs sharing the same counter).

To avoid banning IPs incorrectly, we decided to introduce a second
counter, a HashMap that counts errors exactly. IPs are only banned when
this counter reaches the limit.

We keep the CBF as a first-level filter. It's a fast check to filter IPs
without affecting the tracker's performance. When the IP is banned
according to the first filter, we start a counter for that IP in the
second, exact counter.

This solution should be good if the number of IPs is low. We have to
find another solution anyway for IPv6, where it is cheaper to own a
range of IPs.
josecelano added a commit to josecelano/torrust-tracker that referenced this issue Dec 16, 2024
Since the new solution with a HashMap consumes more memory, we should
keep the banning list short. The drawback is that clients will be
allowed to send more wrong connection IDs. However, sending 10 requests
with wrong connection IDs every 2 minutes should not affect performance
much, unless we have many IPs, and in that case we would have a problem
with memory anyway.

In the future, a sysadmin could inject this via a setting value.
josecelano added a commit to josecelano/torrust-tracker that referenced this issue Dec 16, 2024
…limit

The live demo tracker is receiving many UDP requests with wrong
connection IDs. Errors are logged (written to disk), and that decreases
the tracker performance.

This counts errors and bans IPs for 2 minutes after 10 errors.

We use two levels of counters.

1. First level: a Counting Bloom Filter, fast and with low memory
   consumption but inaccurate (false positives).
2. Second level: a HashMap, an exact counter for IPs.

CBFs are fast and use little memory, but they are also inaccurate. They
have false positives, meaning some IPs would be banned only because
there are bucket collisions (IPs sharing the same counter).

To avoid banning IPs incorrectly, we decided to introduce a second
counter, a HashMap that counts errors precisely. IPs are only banned
when this counter reaches the limit (over 10 errors).

We keep the CBF as a first-level filter. It's a fast IP check that does
not affect the tracker's performance. When the IP is banned according
to the first filter, we double-check in the HashMap.

Checking the CBF is faster than always checking for banned IPs against
the HashMap.

This solution should be good if the number of IPs is low. We have to
find another solution anyway for IPv6, where it is cheaper to own a
range of IPs.
josecelano added a commit that referenced this issue Dec 17, 2024
29e506d feat: use default aquatic udp port for benchmarking (Jose Celano)
10f9bda feat: [#1096] ban client IP when exceeds connection ID errors limit (Jose Celano)
87401e8 chore(deps): add dependency bloom (Jose Celano)

Pull request description:

  This PR uses a [Counting Bloom Filter](https://docs.rs/bloom/latest/bloom/#counting-bloom-filters) to count IP sending UDP requests with wrong connection IDs.

  The IP is banned when the tracker receives more than 10 requests from a given IP with a bad connection ID. Bad connection IDs are cookie values that have expired or are from the future.

  With the current `CountingBloomFilter` configuration (0.01 rate), we would expect roughly one **False Positive** per 100 lookups once the filter holds its expected number of items. When two IPs collide and one of them is misbehaving, the other one would also be banned.

  To avoid false positives, we introduced a second counter with a HashMap. This consumes more memory, but it's reset every 120 seconds. The HashMap is only used when the CBF detects a potential bad client.
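
As a rough cross-check of the filter settings, the standard Bloom filter sizing formulas can be evaluated for a 0.01 rate and 100 expected items, the values that appear in `CountingBloomFilter::with_rate(4, 0.01, 100)`. The functions below are hypothetical helpers, not part of the `bloom` crate, and whether the crate sizes its filter exactly this way is an assumption.

```rust
use std::f64::consts::LN_2;

/// Counters needed to hold `n` items at false positive rate `p`.
pub fn optimal_counters(n: usize, p: f64) -> usize {
    (-(n as f64) * p.ln() / (LN_2 * LN_2)).ceil() as usize
}

/// Number of hash functions that minimizes the rate for `m` counters and `n` items.
pub fn optimal_hashes(m: usize, n: usize) -> u32 {
    ((m as f64 / n as f64) * LN_2).round().max(1.0) as u32
}

/// Expected false positive rate for `m` counters, `k` hashes and `n` items.
pub fn expected_fp_rate(m: usize, k: u32, n: usize) -> f64 {
    (1.0 - (-(k as f64) * n as f64 / m as f64).exp()).powi(k as i32)
}
```

Under these formulas, 100 items at a 0.01 rate need 959 counters and 7 hash functions. Once the filter tracks many more than 100 distinct IPs, the effective rate rises above 0.01, which is one more reason to keep the exact second counter.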

  ### TODO

  - [x] Straightforward implementation
  - [x] Benchmarking (how much this new feature affects performance)
  - [x] Add an E2E test
  - [x] Remove IPs from the banned list every hour
  - [x] Review filter settings `CountingBloomFilter::with_rate(4, 0.01, 100)`
  - [x] Refactor: extract the IP ban service from the main loop
  - [x] Benchmarking after extracting `BanService`

  ### Questions

  - [ ] Should we add a configuration option for the maximum number of errors allowed?

  ### Future PR

  - [ ] Add a metric to tracker stats for the number of banned IPs.
  - [ ] Ban subnets

ACKs for top commit:
  josecelano:
    ACK 29e506d

Tree-SHA512: 004959e00eced1b9c1de39de81f8f9f1d8da1b46f5ee38b3b0679e77cc40448525ac197145ace5dd62017c39a72f7175b06f556e6a7eb8cffbdc57f67052a856