Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

2nd Iter: Ensure that hostnames without dots are excluded. #2400

Merged
merged 19 commits into from
Sep 6, 2023

Conversation

funilrys
Copy link
Contributor

@funilrys funilrys commented Aug 5, 2023

This patch fixes #2347.
This patch touches #2391.

Indeed, my previous patch was missing domains with dashes (-).

This patch fixes StevenBlack#2347.

Indeed, my previous patch was missing domains with dashes (-).
@funilrys funilrys marked this pull request as draft August 5, 2023 18:53
@funilrys
Copy link
Contributor Author

funilrys commented Aug 5, 2023

Converting to draft as I caught more errors.

Indeed, even with the original regex, the following test doesn't
pass. Therefore, it is necessary to add it into the tests.

	www.example-3045.foobar.com
@funilrys
Copy link
Contributor Author

funilrys commented Aug 6, 2023

Hey Steve @StevenBlack! Hope you are doing well.

I have a question because I caught some "discrepancies" while trying to implement this right.

The question: What is your position on IPs?

I'm asking because, on one side, we have:

for rule in [
"128.0.0.1",
"0.0.0 google",
"0.1.2.3.4 foo/bar",
"0.0.0.0 https"
]:
self.assertEqual(normalize_rule(rule, **kwargs), (None, None))
output = sys.stdout.getvalue()
sys.stdout = StringIO()
expected = "==>" + rule + "<=="
self.assertIn(expected, output)

which says: we don't want any "raw" IP.

On the other side, we have:

for target_ip in ("0.0.0.0", "127.0.0.1", "8.8.8.8"):
rule = "127.0.0.1 11.22.33.44 foo"
expected = ("11.22.33.44", str(target_ip) + " 11.22.33.44\n")
actual = normalize_rule(
rule, target_ip=target_ip, keep_domain_comments=False
)
self.assertEqual(actual, expected)
# Nothing gets printed if there's a match.
output = sys.stdout.getvalue()
self.assertEqual(output, "")
sys.stdout = StringIO()

which says: if an IP is given as part of a host entry, keep it.

Finally, on one other side, we have:

hosts/updateHostsFile.py

Lines 1091 to 1092 in 8695605

# deny any potential IPv6 address here.
if ":" not in rule:

which says: just ignore IPv6.

That's okay, but we both know, it doesn't make sense at all to have 2 IPv4s in a host's file entry.

The reason I'm asking is while analyzing the outputs of the coming commits, I discovered, that we have to change the way we catch domains. Indeed, the only records that were not being scrapped correctly, were those where the Punycode of the TLD has a number and dashes in it. That means that to stay as correct as possible, we have to relax the regex - somehow.
While I did so locally, I now have the problem that IPs pass the regex which was meant for domains. So now I have to change not only the regex but some parts of the code.

Therefore, we need to decide what we want.

.... To keep (IPv4s) or not to keep ? That is the question! 😅

Have a nice day/night!
Nissar

@StevenBlack
Copy link
Owner

Thank you Nissar @funilrys I'll look into the history of that test code now. I agree it doesn't seem to make sense.


# WARNING:
# [a-zA-Z0-9\-]+ is NOT an issue. (e.g., xn--p1ai TLD - and others).
regex = r"^\s*(\d{1,3}\.){3}\d{1,3}\s+((?:[\w\-\.]+\.)+[a-zA-Z0-9\-]+)(.*)"

Check failure

Code scanning / CodeQL

Inefficient regular expression High

This part of the regular expression may cause exponential backtracking on strings containing many repetitions of '..'.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regex = r"^\s*([\w\.-]+[a-zA-Z])(.*)"
# WARNING:
# [a-zA-Z0-9\-]+ is NOT an issue. (e.g., xn--p1ai TLD - and others).
regex = r"^\s*((?:[\w\-\.]+\.)+[a-zA-Z0-9\-]+)(.*)"

Check failure

Code scanning / CodeQL

Inefficient regular expression High

This part of the regular expression may cause exponential backtracking on strings containing many repetitions of '..'.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay @namka279 , please look at the actual code and what has been merged ... This commit didn't land into production ... You are too late to the party ...

@StevenBlack Please close #2634 ...

@StevenBlack
Copy link
Owner

Nissar @funilrys I'm seeing a failing test here. I'm still puzzled over the test — what is it actually testing? — it appears to be a test that combines the --ip flag with an invalid domain (an ip-address) for input?

Here's what I'm seeing

 python3 testUpdateHostsFile.py
...........................................F.........................................................................
======================================================================
FAIL: test_no_match (__main__.TestNormalizeRule)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/Steve/Dropbox/dev/hosts/testUpdateHostsFile.py", line 850, in test_no_match
    self.assertEqual(normalize_rule(rule, **kwargs), (None, None))
AssertionError: Tuples differ: ('128.0.0.1', '0.0.0.0 128.0.0.1\n') != (None, None)

First differing element 0:
'128.0.0.1'
None

- ('128.0.0.1', '0.0.0.0 128.0.0.1\n')
+ (None, None)

----------------------------------------------------------------------
Ran 117 tests in 0.091s

FAILED (failures=1)

ALSO check out some of the warnings above ⬆️ that read like this:

2023-08-11_21-50-10

It feels like we are thisclose 🤏🏻 but still not right.

I've taken several stabs at this regex in the past. I'm currently thinking, regex may not be the best tool anymore at this juncture.

@buzzingwires
Copy link

buzzingwires commented Sep 2, 2023

So, as I understand, 128.0.0.1 gets matched by ^\s*((?:[\w\-\.]+\.)+[a-zA-Z0-9\-]+)(.*), which checks for raw domains not preceded by a destination. It boils down to the question of whether we want to allow IP addresses in the host field, as mentioned by @funilrys

In my opinion, I'm inclined to say no, since mapping IP addresses to other IP addresses is not something the hosts file should be doing.

As for the inefficient regex, I do believe many parsing issues are prevented by not using regexes, but a couple potential optimizations are:

^\s*(\d{1,3}\.){3}\d{1,3}\s+((?:[\w\-\.]+\.)+[a-zA-Z0-9\-]+)(.*) -> ^\s*\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s+([\w\-\.]+\.[a-zA-Z0-9\-]+)(.*)

Writing out each octet of the IP address seems to reduce the number of steps, at the cost of verbosity. I also removed the first match group (\d{1,3}\.){3}, since it didn't seem to be used? I have a strong hunch that (?:[\w\-\.]+\.)+ is what could potentially be causing the exponential backtracking issue. At the very least, removing the non-capturing group seems to reduce the number of steps. My rationale for safely removing it is that as long as the sequence ends with ., preceeding groups should be matched by [\w\-\.]+. The effect should be the same as explicitly checking for . at the end of each group.

^\s*((?:[\w\-\.]+\.)+[a-zA-Z0-9\-]+)(.*) -> ^\s*([\w\-\.]+\.[a-zA-Z0-9\-]+)(.*)

follows the same ideas. With the 128.0.0.1 test removed, the others fail because the (.*) at the end matches the space character and everything after it. Otherwise, might be able to fix this with a lookahead assertion? Assertions could also be used to block IP addresses, if/when relevant.

The output of the hosts file is unchanged, though, save for something that actually appears to be fixed. 0.0.0.0 pgl.example #0101 -> 0.0.0.0 pgl.example0101

This "regexless" alternative just try to implement a more
"generic" solution to the actual "problem".

Please note that this commit will fail tests, because this commit
assume that IPs are not correct rules.

Please also note that the following test will also fail because
the new implementation assume that is actually a parsable
rule. @StevenBlack need to take a decision regarding that one rule.

	0.0.0 google

Also:
  * My editor "blacked" the file.
@funilrys
Copy link
Contributor Author

funilrys commented Sep 3, 2023

Hey @StevenBlack ,

I implemented a "regexless" generic solution that should solve all our problems.
Please look at my last commit and proceed with the tests.

My last commit assumes we want to get rid of IPs as part of the rule. Therefore, I will need a decision regarding the failing test rules.

Stay safe and healthy!
Nissar

- Anything that looks like an IP will be ignored.
- Anything that doesn't containt dots will be ignored.
Indeed, from on:
  1. We strip out IPs.
  2. We strip out "potential" INVALID that:
     - doesn't contains dots
     - contains at least 2 consecutive dots
     - looks like an IP.

From now on an acceptable subject shall:
  1. have at least 1 dot.
  2. NOT be an IPv4 or IPv6
  3. NOT look like an IP. (Example: 258.300.10.3)
@funilrys
Copy link
Contributor Author

funilrys commented Sep 6, 2023

Following Steve's @StevenBlack comment (CF: funilrys@b3f93f1#commitcomment-126334510), I submitted the final changes.

Please review.

Stay safe & healthy!

@funilrys funilrys marked this pull request as ready for review September 6, 2023 20:01
@StevenBlack
Copy link
Owner

Nissar THANK YOU SO MUCH ❤️

Tests work great, updating works great. This is awesome!

MERGING!

@StevenBlack StevenBlack merged commit 7ad2811 into StevenBlack:master Sep 6, 2023
18 checks passed
@StevenBlack
Copy link
Owner

Thank you for your input as well @buzzingwires that was very valuable too!

@StevenBlack
Copy link
Owner

Nissar @funilrys something I just noticed: domains are no longer all converted to lowercase.

For example, in the base hosts file line 219361: PopCash.com

Also the list jumped by 10k domains with this new version, which is about 8,000 more than I expected.

funilrys added a commit to funilrys/hosts that referenced this pull request Sep 7, 2023
As mentioned by @StevenBlack in StevenBlack#2400, hostnames should be
converted to lowercase.
Repository owner deleted a comment from namka279 Apr 19, 2024
Repository owner deleted a comment from namka279 Apr 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Invalid domain
5 participants