Filter IPv6 addresses #4

Open
bjeanes opened this issue Dec 17, 2018 · 8 comments

Comments


bjeanes commented Dec 17, 2018

I am looking through the code and noticed the regex only seems to match IPv4 addresses.

Should IPv6 also be matched?

I'm hoping to do this so that I can add a custom scrubber that uses IpAnonymizer and leans on the gem-maintained regex.

Thoughts?

Author

bjeanes commented Dec 17, 2018

Thinking on this more, could the IP regex be delegated to stdlib's Resolv::AddressRegex?

[7] pry(main)> Resolv::AddressRegex
=> /(?:(?-mix:\A((?x-mi:0
               |1(?:[0-9][0-9]?)?
               |2(?:[0-4][0-9]?|5[0-5]?|[6-9])?
               |[3-9][0-9]?))\.((?x-mi:0
               |1(?:[0-9][0-9]?)?
               |2(?:[0-4][0-9]?|5[0-5]?|[6-9])?
               |[3-9][0-9]?))\.((?x-mi:0
               |1(?:[0-9][0-9]?)?
               |2(?:[0-4][0-9]?|5[0-5]?|[6-9])?
               |[3-9][0-9]?))\.((?x-mi:0
               |1(?:[0-9][0-9]?)?
               |2(?:[0-4][0-9]?|5[0-5]?|[6-9])?
               |[3-9][0-9]?))\z))|(?:(?x-mi:
      (?:(?x-mi:\A
      (?:[0-9A-Fa-f]{1,4}:){7}
         [0-9A-Fa-f]{1,4}
      \z)) |
      (?:(?x-mi:\A
      ((?:[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4})*)?) ::
      ((?:[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4})*)?)
      \z)) |
      (?:(?x-mi:\A
      ((?:[0-9A-Fa-f]{1,4}:){6,6})
      (\d+)\.(\d+)\.(\d+)\.(\d+)
      \z)) |
      (?:(?x-mi:\A
      ((?:[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4})*)?) ::
      ((?:[0-9A-Fa-f]{1,4}:)*)
      (\d+)\.(\d+)\.(\d+)\.(\d+)
      \z))))/

EDIT: Yes, there is a good reason. The regex is anchored with \A and \z, so it wouldn't match an address embedded in a longer log line.
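
To illustrate the anchoring point, here is a minimal sketch using only the stdlib constant quoted above. The addresses and the log line are made-up examples from documentation ranges.

```ruby
require "resolv"

# A bare address matches, because the whole string is the address.
p Resolv::AddressRegex.match?("203.0.113.7")            # => true
p Resolv::AddressRegex.match?("2001:db8::1")            # => true

# Inside a longer line, \A and \z cannot both be satisfied.
p Resolv::AddressRegex.match?("GET / from 203.0.113.7") # => false
```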

Owner

ankane commented Dec 17, 2018

Hey @bjeanes, great suggestion. I created an ipv6 branch for this. Unfortunately, there's a pretty big performance hit (slows down log throughput by around 33%), so need to decide what to do. Also need to make it work with URL encoding. Anyway, you should be able to use the regexp from the branch for your use case for now.

https://github.com/ankane/logstop/compare/ipv6

Author

bjeanes commented Dec 17, 2018

Hah!

Regexp.new(Resolv::AddressRegex.source.gsub('\A', '\b').gsub('\z', '\b'))

^^ that is exactly what I have just deployed to production already!
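
As a rough sketch of how that one-liner behaves on log lines: the constant name `UNANCHORED_IP`, the `[FILTERED]` placeholder, and the sample lines below are assumptions for illustration, not anything taken from the gem or the branch.

```ruby
require "resolv"

# Same expression as in the comment above: swap the string anchors for
# word boundaries so the pattern can match inside a longer line.
UNANCHORED_IP = Regexp.new(Resolv::AddressRegex.source.gsub('\A', '\b').gsub('\z', '\b'))

# Illustrative log lines using documentation-range addresses.
puts "ip=192.168.1.1 path=/".gsub(UNANCHORED_IP, "[FILTERED]")
puts "from 2001:db8::1 GET /".gsub(UNANCHORED_IP, "[FILTERED]")
```

Because `\b` needs a word character on one side, forms that begin or end with a colon (such as `::1` on its own) may not be caught by this variant.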

Author

bjeanes commented Dec 17, 2018

> slows down log throughput by around 33%

I've deployed this and the slowdown is noticeable in web RPMs too. 🤔

> so need to decide what to do

I run on Heroku so my logs go via STDOUT. I do wonder if I'm better off with a solution that filters the logs outside of the core Ruby process, even if backed by this gem. That should, at least in theory, allow for some better use of multiple cores...
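
One way to picture filtering outside the app process is a small pipe filter that reads lines from one IO and writes scrubbed lines to another. This is only a sketch: `scrub_stream` and `STREAM_IP_PATTERN` are hypothetical names, and how such a filter would be wired into a Heroku log pipeline is not covered here.

```ruby
require "resolv"
require "stringio"

# Unanchored variant of the stdlib pattern, as discussed earlier in the thread.
STREAM_IP_PATTERN = Regexp.new(Resolv::AddressRegex.source.gsub('\A', '\b').gsub('\z', '\b'))

# Hypothetical helper: copy each line from `input` to `output`,
# replacing anything the pattern matches with a placeholder.
def scrub_stream(input, output, pattern: STREAM_IP_PATTERN)
  input.each_line do |line|
    output.write(line.gsub(pattern, "[FILTERED]"))
  end
end

# In a standalone script this could be called as scrub_stream($stdin, $stdout).
# Here it runs against in-memory IO objects instead.
out = StringIO.new
scrub_stream(StringIO.new("GET / from 203.0.113.7\nok\n"), out)
puts out.string
```

Running the scrubbing in a separate process moves the regex cost off the web workers, at the price of an extra hop for every log line.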

Owner

ankane commented Dec 18, 2018

I'm surprised it's noticeable at the RPM level. What difference are you seeing there?

Author

bjeanes commented Dec 18, 2018 via email

Owner

ankane commented Jan 3, 2020

Just ran benchmarks with Ruby 2.7 and latest code:

Warming up --------------------------------------
             no ipv6     3.126k i/100ms
                ipv6     1.938k i/100ms
Calculating -------------------------------------
             no ipv6     34.599k (± 3.5%) i/s -    175.056k in   5.065833s
                ipv6     19.874k (± 3.4%) i/s -    100.776k in   5.076570s

It still appears to reduce throughput significantly, but 20k iterations per second is still pretty fast and most of the time spent in an application is not in logging. Will plan to merge once IPv6 has more adoption.
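
For anyone wanting a rough local comparison, here is a stdlib-only sketch. It is not the benchmark that produced the numbers above: `IPV4_ONLY` is an assumed stand-in for an IPv4-only pattern (the gem's actual pattern may differ), the sample line is made up, and it times only the regex substitution rather than the whole logging path.

```ruby
require "benchmark"
require "resolv"

# Assumed IPv4-only pattern for comparison; not necessarily what the gem uses.
IPV4_ONLY = /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/
# Unanchored stdlib pattern covering IPv4 and IPv6.
WITH_IPV6 = Regexp.new(Resolv::AddressRegex.source.gsub('\A', '\b').gsub('\z', '\b'))

# Made-up log line for illustration.
SAMPLE_LINE = 'Started GET "/users/1" for 203.0.113.7 at 2020-01-03 10:00:00 +0000'

# Returns substitutions per second for `pattern` over `iterations` runs.
def scrubs_per_second(pattern, iterations: 20_000)
  elapsed = Benchmark.realtime do
    iterations.times { SAMPLE_LINE.gsub(pattern, "[FILTERED]") }
  end
  iterations / elapsed
end

{ "no ipv6" => IPV4_ONLY, "ipv6" => WITH_IPV6 }.each do |label, pattern|
  puts format("%-8s %10.0f i/s", label, scrubs_per_second(pattern))
end
```

Absolute numbers will vary by machine and Ruby version; only the ratio between the two lines is meaningful.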

Edit: another approach could be to use a less complex regex if common sources of IPs (like Rack::Request) use a specific format.

Author

bjeanes commented Jan 5, 2020

> Edit: another approach could be to use a less complex regex if common sources of IPs (like Rack::Request) use a specific format.

Yeah, or use a less complex regex regardless. From a GDPR etc. standpoint, it is probably better to accidentally filter some non-IPs but catch all actual IPs than to catch only IPs and take a large performance hit. This could even be a configuration option that swaps in a more accurate regex when the user opts into the trade-off of slower performance.
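
A minimal sketch of what such an option could look like. Everything here is hypothetical: the `IpScrubber` class, the `accurate:` flag, and the `LOOSE` pattern are illustrations, not the gem's API. The loose pattern deliberately over-matches (for example, it also filters `12:34:56`-style timestamps) and has not been checked against every valid IPv6 form or measured for speed.

```ruby
require "resolv"

# Hypothetical scrubber; `accurate: true` opts into the stricter pattern.
class IpScrubber
  # Permissive: any dotted quad of 1-3 digit groups, or 2-7 colon-terminated
  # groups of up to four hex digits followed by an optional final group.
  LOOSE = /
    \b(?:\d{1,3}\.){3}\d{1,3}\b
    |
    (?<![0-9A-Fa-f:])(?:[0-9A-Fa-f]{0,4}:){2,7}[0-9A-Fa-f]{0,4}(?![0-9A-Fa-f:])
  /x

  # Stricter: the stdlib pattern with its anchors swapped for word boundaries.
  STRICT = Regexp.new(Resolv::AddressRegex.source.gsub('\A', '\b').gsub('\z', '\b'))

  def initialize(accurate: false)
    @pattern = accurate ? STRICT : LOOSE
  end

  # Returns a copy of `line` with every match replaced by a placeholder.
  def call(line)
    line.gsub(@pattern, "[FILTERED]")
  end
end

loose  = IpScrubber.new
strict = IpScrubber.new(accurate: true)

puts loose.call("peer 2001:db8::1 ok")   # filtered by both modes
puts loose.call("v 999.999.999.999")     # not a valid IP, filtered anyway
puts strict.call("v 999.999.999.999")    # left untouched by the strict mode
```

Defaulting to the loose mode reflects the preference stated above for over-filtering; a project could just as well flip the default.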
