2nd Iter: Ensure that hostnames without dots are excluded. #2400

funilrys · 2023-08-05T18:49:13Z

This patch fixes #2347.
This patch touches #2391.

Indeed, my previous patch was missing domains with dashes (-).

This patch fixes StevenBlack#2347. Indeed, my previous patch was missing domains with dashes (-).

funilrys · 2023-08-05T18:53:27Z

Converting to draft as I caught more errors.

Indeed, even with the original regex, the following test doesn't pass. Therefore, it is necessary to add it into the tests. www.example-3045.foobar.com

funilrys · 2023-08-06T15:19:06Z

Hey Steve @StevenBlack! Hope you are doing well.

I have a question because I caught some "discrepancies" while trying to implement this right.

The question: What is your position on IPs?

I'm asking because, on one side, we have:

hosts/testUpdateHostsFile.py

Lines 841 to 853 in 8695605

    
           for rule in [ 
        
               "128.0.0.1", 
        
               "0.0.0 google", 
        
               "0.1.2.3.4 foo/bar", 
        
               "0.0.0.0 https" 
        
           ]: 
        
               self.assertEqual(normalize_rule(rule, **kwargs), (None, None)) 
        
               output = sys.stdout.getvalue() 
        
               sys.stdout = StringIO() 
        
               expected = "==>" + rule + "<==" 
        
               self.assertIn(expected, output)

which says: we don't want any "raw" IP.

On the other side, we have:

hosts/testUpdateHostsFile.py

Lines 892 to 905 in 8695605

    
           for target_ip in ("0.0.0.0", "127.0.0.1", "8.8.8.8"): 
        
               rule = "127.0.0.1 11.22.33.44 foo" 
        
               expected = ("11.22.33.44", str(target_ip) + " 11.22.33.44\n") 
        
               actual = normalize_rule( 
        
                   rule, target_ip=target_ip, keep_domain_comments=False 
        
               ) 
        
               self.assertEqual(actual, expected) 
        
               # Nothing gets printed if there's a match. 
        
               output = sys.stdout.getvalue() 
        
               self.assertEqual(output, "") 
        
               sys.stdout = StringIO()

which says: if an IP is given as part of a host entry, keep it.

Finally, on one other side, we have:

hosts/updateHostsFile.py

Lines 1091 to 1092 in 8695605

    
           # deny any potential IPv6 address here. 
        
           if ":" not in rule:

which says: just ignore IPv6.

That's okay, but we both know, it doesn't make sense at all to have 2 IPv4s in a host's file entry.

The reason I'm asking is while analyzing the outputs of the coming commits, I discovered, that we have to change the way we catch domains. Indeed, the only records that were not being scrapped correctly, were those where the Punycode of the TLD has a number and dashes in it. That means that to stay as correct as possible, we have to relax the regex - somehow.
While I did so locally, I now have the problem that IPs pass the regex which was meant for domains. So now I have to change not only the regex but some parts of the code.

Therefore, we need to decide what we want.

.... To keep (IPv4s) or not to keep ? That is the question! 😅

Have a nice day/night!
Nissar

StevenBlack · 2023-08-08T15:04:01Z

Thank you Nissar @funilrys I'll look into the history of that test code now. I agree it doesn't seem to make sense.

Indeed, before this patch, we were not supporting TLD which contains digits and dashes (-) when "puny-encoded".

namka279 · 2024-04-19T17:19:07Z

updateHostsFile.py

+
+    # WARNING:
+    #   [a-zA-Z0-9\-]+ is NOT an issue. (e.g., xn--p1ai TLD - and others).
+    regex = r"^\s*(\d{1,3}\.){3}\d{1,3}\s+((?:[\w\-\.]+\.)+[a-zA-Z0-9\-]+)(.*)"


namka279 · 2024-04-19T17:19:48Z

updateHostsFile.py

-        regex = r"^\s*([\w\.-]+[a-zA-Z])(.*)"
+        # WARNING:
+        #   [a-zA-Z0-9\-]+ is NOT an issue. (e.g., xn--p1ai TLD - and others).
+        regex = r"^\s*((?:[\w\-\.]+\.)+[a-zA-Z0-9\-]+)(.*)"


Okay @namka279 , please look at the actual code and what has been merged ... This commit didn't land into production ... You are too late to the party ...

@StevenBlack Please close #2634 ...

StevenBlack · 2023-08-12T01:53:12Z

Nissar @funilrys I'm seeing a failing test here. I'm still puzzled over the test — what is it actually testing? — it appears to be a test that combines the --ip flag with an invalid domain (an ip-address) for input?

Here's what I'm seeing

 python3 testUpdateHostsFile.py
...........................................F.........................................................................
======================================================================
FAIL: test_no_match (__main__.TestNormalizeRule)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/Steve/Dropbox/dev/hosts/testUpdateHostsFile.py", line 850, in test_no_match
    self.assertEqual(normalize_rule(rule, **kwargs), (None, None))
AssertionError: Tuples differ: ('128.0.0.1', '0.0.0.0 128.0.0.1\n') != (None, None)

First differing element 0:
'128.0.0.1'
None

- ('128.0.0.1', '0.0.0.0 128.0.0.1\n')
+ (None, None)

----------------------------------------------------------------------
Ran 117 tests in 0.091s

FAILED (failures=1)

ALSO check out some of the warnings above ⬆️ that read like this:

It feels like we are thisclose 🤏🏻 but still not right.

I've taken several stabs at this regex in the past. I'm currently thinking, regex may not be the best tool anymore at this juncture.

buzzingwires · 2023-09-02T09:13:36Z

So, as I understand, 128.0.0.1 gets matched by ^\s*((?:[\w\-\.]+\.)+[a-zA-Z0-9\-]+)(.*), which checks for raw domains not preceded by a destination. It boils down to the question of whether we want to allow IP addresses in the host field, as mentioned by @funilrys

In my opinion, I'm inclined to say no, since mapping IP addresses to other IP addresses is not something the hosts file should be doing.

As for the inefficient regex, I do believe many parsing issues are prevented by not using regexes, but a couple potential optimizations are:

^\s*(\d{1,3}\.){3}\d{1,3}\s+((?:[\w\-\.]+\.)+[a-zA-Z0-9\-]+)(.*) -> ^\s*\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s+([\w\-\.]+\.[a-zA-Z0-9\-]+)(.*)

Writing out each octet of the IP address seems to reduce the number of steps, at the cost of verbosity. I also removed the first match group (\d{1,3}\.){3}, since it didn't seem to be used? I have a strong hunch that (?:[\w\-\.]+\.)+ is what could potentially be causing the exponential backtracking issue. At the very least, removing the non-capturing group seems to reduce the number of steps. My rationale for safely removing it is that as long as the sequence ends with ., preceeding groups should be matched by [\w\-\.]+. The effect should be the same as explicitly checking for . at the end of each group.

^\s*((?:[\w\-\.]+\.)+[a-zA-Z0-9\-]+)(.*) -> ^\s*([\w\-\.]+\.[a-zA-Z0-9\-]+)(.*)

follows the same ideas. With the 128.0.0.1 test removed, the others fail because the (.*) at the end matches the space character and everything after it. Otherwise, might be able to fix this with a lookahead assertion? Assertions could also be used to block IP addresses, if/when relevant.

The output of the hosts file is unchanged, though, save for something that actually appears to be fixed. 0.0.0.0 pgl.example #0101 -> 0.0.0.0 pgl.example0101

@StevenBlack

This "regexless" alternative just try to implement a more "generic" solution to the actual "problem". Please note that this commit will fail tests, because this commit assume that IPs are not correct rules. Please also note that the following test will also fail because the new implementation assume that is actually a parsable rule. @StevenBlack need to take a decision regarding that one rule. 0.0.0 google Also: * My editor "blacked" the file.

funilrys · 2023-09-03T09:30:13Z

Hey @StevenBlack ,

I implemented a "regexless" generic solution that should solve all our problems.
Please look at my last commit and proceed with the tests.

My last commit assumes we want to get rid of IPs as part of the rule. Therefore, I will need a decision regarding the failing test rules.

Stay safe and healthy!
Nissar

- Anything that looks like an IP will be ignored. - Anything that doesn't containt dots will be ignored.

Indeed, from on: 1. We strip out IPs. 2. We strip out "potential" INVALID that: - doesn't contains dots - contains at least 2 consecutive dots - looks like an IP. From now on an acceptable subject shall: 1. have at least 1 dot. 2. NOT be an IPv4 or IPv6 3. NOT look like an IP. (Example: 258.300.10.3)

funilrys · 2023-09-06T20:01:56Z

Following Steve's @StevenBlack comment (CF: funilrys@b3f93f1#commitcomment-126334510), I submitted the final changes.

Please review.

Stay safe & healthy!

StevenBlack · 2023-09-06T21:25:04Z

Nissar THANK YOU SO MUCH ❤️

Tests work great, updating works great. This is awesome!

MERGING!

StevenBlack · 2023-09-06T21:26:03Z

Thank you for your input as well @buzzingwires that was very valuable too!

StevenBlack · 2023-09-06T22:06:28Z

Nissar @funilrys something I just noticed: domains are no longer all converted to lowercase.

For example, in the base hosts file line 219361: PopCash.com

Also the list jumped by 10k domains with this new version, which is about 8,000 more than I expected.

@StevenBlack

As mentioned by @StevenBlack in StevenBlack#2400, hostnames should be converted to lowercase.

2nd Iter: Ensure that hostnames without dots are excluded.

1f3f323

This patch fixes StevenBlack#2347. Indeed, my previous patch was missing domains with dashes (-).

funilrys marked this pull request as draft August 5, 2023 18:53

Add new test case.

3b2c9dd

Indeed, even with the original regex, the following test doesn't pass. Therefore, it is necessary to add it into the tests. www.example-3045.foobar.com

StevenBlack and others added 9 commits August 10, 2023 16:35

Updates from URLHaus and KADhosts.

b52e6db

Update graph.

df85659

Release 3.13.19

0bf24bb

Issue StevenBlack#2398: add ms4.applvn.com.

369b483

Issue StevenBlack#2402: add five new domains.

f0b6120

Updates from Sinfonietta, URLHaus, and KADhosts.

6164086

Release 3.13.20

4574a4a

Add s2.bl-1.com to hosts file

2415078

Add support for exotic TLDs.

560615d

Indeed, before this patch, we were not supporting TLD which contains digits and dashes (-) when "puny-encoded".

github-advanced-security bot found potential problems Aug 10, 2023

View reviewed changes

funilrys added 7 commits September 3, 2023 11:31

Merge remote-tracking branch 'upstream/master' into issue-2347

15da4c8

Remove any potential invalid entries.

8fc7f94

- Anything that looks like an IP will be ignored. - Anything that doesn't containt dots will be ignored.

Fix linting issue-s

33bd0df

Fix minor logical issue.

fed5590

Fix logical issue.

2726591

Merge remote-tracking branch 'upstream/master' into issue-2347

6a3a329

funilrys marked this pull request as ready for review September 6, 2023 20:01

StevenBlack merged commit 7ad2811 into StevenBlack:master Sep 6, 2023

funilrys added a commit to funilrys/hosts that referenced this pull request Sep 7, 2023

Correct 2nd Iter: Hostnames to lowercase.

3b4709f

As mentioned by @StevenBlack in StevenBlack#2400, hostnames should be converted to lowercase.

funilrys mentioned this pull request Sep 7, 2023

Correct 2nd Iter: Hostnames to lowercase. #2433

Merged

Repository owner deleted a comment from namka279 Apr 19, 2024

Uh oh!

2nd Iter: Ensure that hostnames without dots are excluded. #2400

2nd Iter: Ensure that hostnames without dots are excluded. #2400

Uh oh!

Conversation

funilrys commented Aug 5, 2023

Uh oh!

funilrys commented Aug 5, 2023

Uh oh!

funilrys commented Aug 6, 2023

Uh oh!

StevenBlack commented Aug 8, 2023

Uh oh!

Check failure

namka279 Apr 19, 2024

Choose a reason for hiding this comment

Uh oh!

namka279 Apr 19, 2024

Choose a reason for hiding this comment

Uh oh!

Check failure

namka279 Apr 19, 2024

Choose a reason for hiding this comment

Uh oh!

funilrys Apr 20, 2024

Choose a reason for hiding this comment

Uh oh!

StevenBlack commented Aug 12, 2023

Uh oh!

buzzingwires commented Sep 2, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

funilrys commented Sep 3, 2023

Uh oh!

funilrys commented Sep 6, 2023

Uh oh!

StevenBlack commented Sep 6, 2023

Uh oh!

StevenBlack commented Sep 6, 2023

Uh oh!

StevenBlack commented Sep 6, 2023

Uh oh!

Uh oh!

buzzingwires commented Sep 2, 2023 •

edited

Loading