2nd Iter: Ensure that hostnames without dots are excluded. #2400
This patch fixes StevenBlack#2347. Indeed, my previous patch was missing domains with dashes (-).
Converting to draft as I caught more errors.
Indeed, even with the original regex, the following test doesn't pass. Therefore, it is necessary to add it into the tests. www.example-3045.foobar.com
Hey Steve @StevenBlack! Hope you are doing well. I have a question because I caught some "discrepancies" while trying to implement this right. The question: What is your position on IPs? I'm asking because, on one side, we have: Lines 841 to 853 in 8695605
which says: we don't want any "raw" IP. On the other side, we have: Lines 892 to 905 in 8695605
which says: if an IP is given as part of a host entry, keep it. Finally, on yet another side, we have: Lines 1091 to 1092 in 8695605
which says: just ignore IPv6. That's okay, but we both know it doesn't make sense at all to have 2 IPv4s in a hosts file entry. The reason I'm asking is that, while analyzing the outputs of the coming commits, I discovered that we have to change the way we catch domains. Indeed, the only records that were not being scraped correctly were those where the Punycode of the TLD has a number and dashes in it. That means that, to stay as correct as possible, we have to relax the regex somehow. Therefore, we need to decide what we want. To keep (IPv4s) or not to keep? That is the question! 😅 Have a nice day/night!
Thank you Nissar @funilrys I'll look into the history of that test code now. I agree it doesn't seem to make sense.
Indeed, before this patch, we were not supporting TLDs that contain digits and dashes (-) when "puny-encoded".
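A minimal sketch of the difference, for illustration only. `RELAXED` is the pattern quoted in this PR's diff; `LETTERS_ONLY` and `extract_hostname` are hypothetical names introduced here for contrast and are not taken from updateHostsFile.py.

```python
import re

# Pattern quoted in this PR's diff: the TLD part accepts letters, digits and dashes.
RELAXED = re.compile(r"^\s*((?:[\w\-\.]+\.)+[a-zA-Z0-9\-]+)(.*)")

# Hypothetical stricter variant (letters-only TLD), shown for contrast only.
LETTERS_ONLY = re.compile(r"^\s*((?:[\w\-\.]+\.)+[a-zA-Z]+)(.*)")


def extract_hostname(pattern, line):
    """Return the hostname captured by the first group, or None if no match."""
    match = pattern.match(line)
    return match.group(1) if match else None


# The puny-encoded TLD xn--p1ai contains both a digit and dashes.
print(extract_hostname(RELAXED, "example.xn--p1ai"))       # example.xn--p1ai
print(extract_hostname(LETTERS_ONLY, "example.xn--p1ai"))  # example.xn
```

With a letters-only TLD the match still succeeds, but the captured hostname is silently cut at the first dash, which is harder to notice than an outright failure.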
updateHostsFile.py
Fixed
# WARNING:
# [a-zA-Z0-9\-]+ is NOT an issue. (e.g., xn--p1ai TLD - and others).
regex = r"^\s*(\d{1,3}\.){3}\d{1,3}\s+((?:[\w\-\.]+\.)+[a-zA-Z0-9\-]+)(.*)"
Check failure
Code scanning / CodeQL
Inefficient regular expression High
updateHostsFile.py
Fixed
regex = r"^\s*([\w\.-]+[a-zA-Z])(.*)"
# WARNING:
# [a-zA-Z0-9\-]+ is NOT an issue. (e.g., xn--p1ai TLD - and others).
regex = r"^\s*((?:[\w\-\.]+\.)+[a-zA-Z0-9\-]+)(.*)"
Check failure
Code scanning / CodeQL
Inefficient regular expression High
Okay @namka279 , please look at the actual code and what has been merged ... This commit didn't land into production ... You are too late to the party ...
@StevenBlack Please close #2634 ...
Nissar @funilrys I'm seeing a failing test here. I'm still puzzled over the test — what is it actually testing? — it appears to be a test that combines the Here's what I'm seeing python3 testUpdateHostsFile.py
...........................................F.........................................................................
======================================================================
FAIL: test_no_match (__main__.TestNormalizeRule)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/Steve/Dropbox/dev/hosts/testUpdateHostsFile.py", line 850, in test_no_match
self.assertEqual(normalize_rule(rule, **kwargs), (None, None))
AssertionError: Tuples differ: ('128.0.0.1', '0.0.0.0 128.0.0.1\n') != (None, None)
First differing element 0:
'128.0.0.1'
None
- ('128.0.0.1', '0.0.0.0 128.0.0.1\n')
+ (None, None)
----------------------------------------------------------------------
Ran 117 tests in 0.091s
FAILED (failures=1)
ALSO check out some of the warnings above ⬆️ that read like this: It feels like we are thisclose 🤏🏻 but still not right. I've taken several stabs at this regex in the past. I'm currently thinking, regex may not be the best tool anymore at this juncture.
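The tuple in the assertion error can be reproduced in isolation. This is a sketch that assumes the IP-prefixed pattern quoted in this PR's diff is the one in effect; `captured_host` is a hypothetical helper, not the project's `normalize_rule`.

```python
import re

# IP-prefixed pattern as quoted in this PR's diff (assumed to be the one in effect).
PATTERN = re.compile(
    r"^\s*(\d{1,3}\.){3}\d{1,3}\s+((?:[\w\-\.]+\.)+[a-zA-Z0-9\-]+)(.*)"
)


def captured_host(line):
    """Return what the hostname group captures, or None when the line does not match."""
    match = PATTERN.match(line)
    return match.group(2) if match else None


# Once the TLD part accepts digits, a raw IPv4 also satisfies the hostname group.
print(captured_host("0.0.0.0 128.0.0.1"))    # 128.0.0.1
print(captured_host("0.0.0.0 example.com"))  # example.com
print(captured_host("0.0.0.0 google"))       # None (no dot in the hostname)
```

Relaxing the TLD character class is what lets `128.0.0.1` through as a hostname, so a pattern of this shape needs a separate IP check to return `(None, None)` for that rule.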
So, as I understand, In my opinion, I'm inclined to say no, since mapping IP addresses to other IP addresses is not something the hosts file should be doing. As for the inefficient regex, I do believe many parsing issues are prevented by not using regexes, but a couple potential optimizations are:
Writing out each octet of the IP address seems to reduce the number of steps, at the cost of verbosity. I also removed the first match group
follows the same ideas. With the The output of the hosts file is unchanged, though, save for something that actually appears to be fixed.
This "regexless" alternative just tries to implement a more "generic" solution to the actual "problem". Please note that this commit will fail tests, because it assumes that IPs are not correct rules. Please also note that the following test will also fail, because the new implementation assumes that it is actually a parsable rule. @StevenBlack needs to make a decision regarding that one rule: 0.0.0 google
Also: my editor "blacked" the file.
Hey @StevenBlack, I implemented a "regexless" generic solution that should solve all our problems. My last commit assumes we want to get rid of IPs as part of the rule. Therefore, I will need a decision regarding the failing test rules. Stay safe and healthy!
- Anything that looks like an IP will be ignored.
- Anything that doesn't contain dots will be ignored.
Indeed, from now on:
1. We strip out IPs.
2. We strip out "potentially" INVALID subjects that:
   - don't contain dots
   - contain at least 2 consecutive dots
   - look like an IP.
From now on, an acceptable subject shall:
1. have at least 1 dot.
2. NOT be an IPv4 or IPv6 address.
3. NOT look like an IP. (Example: 258.300.10.3)
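The rules above could be sketched without any regex roughly as follows. This is an illustration under assumptions, not the merged implementation: `is_ip`, `looks_like_ip` and `is_acceptable_subject` are hypothetical names, and "looks like an IP" is interpreted here as four dot-separated all-digit groups.

```python
import ipaddress


def is_ip(candidate):
    """True when the standard library parses the candidate as IPv4 or IPv6."""
    try:
        ipaddress.ip_address(candidate)
        return True
    except ValueError:
        return False


def looks_like_ip(candidate):
    """True for four dot-separated all-digit groups, e.g. 258.300.10.3."""
    parts = candidate.split(".")
    return len(parts) == 4 and all(part.isdigit() for part in parts)


def is_acceptable_subject(candidate):
    """Apply the listed rules to a single hostname candidate."""
    if "." not in candidate:   # must have at least one dot
        return False
    if ".." in candidate:      # must not contain consecutive dots
        return False
    if is_ip(candidate) or looks_like_ip(candidate):
        return False
    return True


for subject in ("www.example-3045.foobar.com", "google", "128.0.0.1", "258.300.10.3"):
    print(subject, is_acceptable_subject(subject))
```

How partial forms such as `0.0.0` should be treated is left open in this sketch; that depends on the decision requested in the thread.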
Following Steve's @StevenBlack comment (CF: funilrys@b3f93f1#commitcomment-126334510), I submitted the final changes. Please review. Stay safe & healthy!
Nissar THANK YOU SO MUCH ❤️ Tests work great, updating works great. This is awesome! MERGING!
Thank you for your input as well @buzzingwires that was very valuable too!
Nissar @funilrys something I just noticed: domains are no longer all converted to lowercase. For example, in the base hosts file line Also the list jumped by 10k domains with this new version, which is about 8,000 more than I expected.
As mentioned by @StevenBlack in StevenBlack#2400, hostnames should be converted to lowercase.
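A small sketch of why lowercasing matters, assuming hostnames are compared as plain strings. `normalize_hostname` is a hypothetical name used only for this illustration.

```python
def normalize_hostname(hostname):
    """Lowercase a hostname so that case variants compare equal."""
    return hostname.strip().lower()


# DNS names are case-insensitive, so these three entries name only two hosts.
hosts = ["Example.COM", "example.com", "WWW.Example.com"]
unique = sorted({normalize_hostname(host) for host in hosts})
print(unique)  # ['example.com', 'www.example.com']
```

Without the lowercasing step, `Example.COM` and `example.com` would be kept as two distinct entries.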
This patch fixes #2347.
This patch touches #2391.
Indeed, my previous patch was missing domains with dashes (-).