Invalid output #3

Open
smed79 opened this issue Apr 28, 2023 · 9 comments

@smed79

smed79 commented Apr 28, 2023

Testing the list below:

||com/*?adver=123.456
||aaa.*/ads1/
||aaa.*/ads2/
||bbb.com^*ads3
||ccc.com^*ads4
||ddd.com*.ads.com
||eee.com/*$image,domain=fff.com
||ggg.hhh-*
||ggg.hhh-*
||ggg.hhh*iii-jjj
|http://kkk.com/ads/*
|https://kkk.lll/*
|https://ads.mmm.nnn^
||ooo.com/*.ppp
||qqq.com/img1*.ads8
||qqq.com/img2/*.ads9
||qqq.com/img3/.*./ads0
||qqq.com/img4/*.*/
||qqq.com/img5/*..*/
||rrr.com/*.php?123
||sss.com^*/img/
||ttt.com/*/ban.js
||uuu.com/*$script

Output

?adver=123.456
aaa.
aaa.
bbb.com
ccc.com
.ads.com
ddd.com
eee.com
ggg.hhh-
ggg.hhh-
ggg.hhh
.ppp
ooo.com
.ads8
qqq.com
.ads9
qqq.com
.
qqq.com
.
qqq.com
..
qqq.com
.php
rrr.com
sss.com
ttt.com
uuu.com

The expected output should be

kkk.lll
ads.mmm.nnn
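
One way to read this expectation, following the reasoning given later in the thread, is that only rules covering a whole host should be decoded. The following is a minimal Python sketch under that assumption. The function name `whole_host_domains` and both regular expressions are illustrative and are not part of PyFunceble or adblock-decoder.

```python
import re

# Assumption: a rule covers a whole host only if it is `||host^`
# (optionally with `$third-party`) or `|scheme://host` followed by `^` or `/*`.
DOUBLE_PIPE = re.compile(r"^\|\|([a-z0-9.-]+)\^(?:\$third-party)?$")
SINGLE_PIPE = re.compile(r"^\|https?://([a-z0-9.-]+)(?:\^|/\*)$")


def whole_host_domains(rules):
    """Return the hosts of rules that block an entire host, in input order."""
    found = []
    for rule in rules:
        rule = rule.strip()
        match = DOUBLE_PIPE.match(rule) or SINGLE_PIPE.match(rule)
        if match:
            found.append(match.group(1))
    return found


sample = [
    "||bbb.com^*ads3",        # wildcard after the separator: skipped
    "|http://kkk.com/ads/*",  # targets a folder: skipped
    "|https://kkk.lll/*",
    "|https://ads.mmm.nnn^",
    "||ttt.com/*/ban.js",     # targets a file: skipped
]
print(whole_host_domains(sample))  # ['kkk.lll', 'ads.mmm.nnn']
```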
smed79 changed the title from "Invalid input" to "Invalid output" on Apr 28, 2023
funilrys added a commit to funilrys/PyFunceble that referenced this issue May 1, 2023
Indeed, before this patch we weren't decoding the following cases:

  * |http://example.com/*
  * |http://example.org^

This patch fixes PyFunceble/adblock-decoder#3.

Contributors:
  * @smed79
@funilrys
Member

funilrys commented May 1, 2023

@smed79 please review the testcases before I deploy/release my change: funilrys/PyFunceble@d32914b#diff-6fbb548d14d904b48cdaa09ea8c1ca04249d69cef0763217ac957605c50548a6R278-R326

Let me know if I missed a test case.

Stay safe and healthy!
Thank you for your patience.

@smed79
Author

smed79 commented May 3, 2023

1st,

  • |http://example.com,https://example.de$script,image,domain=example.org|foo.example.net

There is no such case in Adblock (Plus) syntax.

Blocked requests (files or domains) cannot be separated by commas, so the only correct syntax is

  • |http://example.com$script,image,domain=example.org|foo.example.net
  • |https://example.de$script,image,domain=example.org|foo.example.net

or

  • ||example.com$script,image,domain=example.org|foo.example.net
  • ||example.de$script,image,domain=example.org|foo.example.net

or we have to use a regex rule, as below

  • /^https?:\/\/(example\.com|example\.de)\//$script,image,domain=example.org|foo.example.net
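
As an illustration of the correction above, here is a minimal Python sketch that rewrites the invalid comma-joined rule as one rule per target and checks the regex alternative against sample URLs. The helper `split_comma_rule` is hypothetical and is not part of PyFunceble or any adblock tool.

```python
import re


def split_comma_rule(rule):
    """Rewrite an invalid comma-joined blocking rule as one rule per target."""
    pattern, sep, options = rule.partition("$")
    # Keep the single-pipe start anchor, if any, for every generated rule.
    anchor = "|" if pattern.startswith("|") and not pattern.startswith("||") else ""
    targets = [part.strip().lstrip("|") for part in pattern.split(",")]
    return [f"{anchor}{target}{sep}{options}" for target in targets if target]


invalid = "|http://example.com,https://example.de$script,image,domain=example.org|foo.example.net"
for fixed in split_comma_rule(invalid):
    print(fixed)

# The regex alternative given above, checked against sample URLs.
either_site = re.compile(r"^https?:\/\/(example\.com|example\.de)\/")
print(bool(either_site.match("https://example.de/ads.js")))  # True
print(bool(either_site.match("http://example.org/ads.js")))  # False
```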

2nd,

Excuse my ignorance, I have a question...

A set of tools for the decoding and conversion of AdBlock and filter lists.
(https://github.com/PyFunceble/adblock-decoder#adblock-filter-list-decoder)

What is the intended behavior?

Extracting all domains for testing purposes (ACTIVE, INACTIVE or INVALID) <-- Case 1

or

extracting only domains that are safe to be blocked? <-- Case 2

For the second case (safe), the tool has to extract only domains that are flagged with the third-party option or limited with the symbol ^ at the end.

||axample.com^$third-party
||ads.example.net^

I mean

$ grep -E "^\|\|[a-z0-9.-]+\^([\$]third-party)?$" adblock.list

More aggressive, including popup filters:

||axample.com^$popup,third-party
||axample.com^$popup
grep -E "^\|\|[a-z0-9.-]+\^([$](popup([,](third-party)?)?|third-party))?$" adblock.list

clean output (with 0.0.0.0)

grep -E "^\|\|[a-z0-9.-]+\^([$](popup([,](third-party)?)?|third-party))?$" adblock.list | sed 's/\^.*//' | sed 's/||/0.0.0.0 /' > hosts.list
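
For readers who prefer Python to a shell pipeline, the same idea can be sketched as follows. This is only an illustration of the grep and sed commands above, not how PyFunceble decodes rules. `SAFE_RULE` and `to_hosts_lines` are names made up for this example.

```python
import re

# Assumption: roughly the shapes accepted by the grep pattern above:
# `||host^`, optionally followed by `$third-party`, `$popup` or `$popup,third-party`.
SAFE_RULE = re.compile(
    r"^\|\|([a-z0-9.-]+)\^(?:\$(?:popup(?:,third-party)?|third-party))?$"
)


def to_hosts_lines(rules):
    """Convert whole-domain blocking rules to `0.0.0.0 host` lines."""
    lines = []
    for rule in rules:
        match = SAFE_RULE.match(rule.strip())
        if match:
            lines.append(f"0.0.0.0 {match.group(1)}")
    return lines


rules = [
    "||axample.com^$popup,third-party",
    "||ads.example.net^",
    "||www.google.com/ads/*",   # targets a folder: skipped
    "||ad.example.fr^$image",   # targets a request type: skipped
]
print("\n".join(to_hosts_lines(rules)))
```

Note that, like the grep version, this does not deduplicate hosts. Piping the result through a set or `sort -u` would be a natural next step.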

@funilrys
Member

funilrys commented May 7, 2023

@smed79, I don't create such complex lists on my own, so I'm happy to have input from the community.

1st,

  • |http://example.com,https://example.de$script,image,domain=example.org|foo.example.net

There is no such case in adblock (plus) syntax.

That's good to know. Will be fixed.

what is the intended behavior ?

Actually both. But I'm willing to make some changes. Please keep in mind that the adblock-decoder actually is a wrapper around the functionalities of PyFunceble.

What you describe as Case 1 is the behavior of the aggressive mode, whereas Case 2 should be the default behavior of PyFunceble.

for the second case (safe), the tool has to extract only domains that are flagged with the third-party option or limited with the symbol ^ at the end.

That's interesting. If everyone (cc: @Yuki2718 | @ryanbr | please flag others) agrees on that, I can only see improvement.

I (and probably the community too) would be grateful if you could take the time to check the test cases and let me know:

  • what is wrong
  • what should be changed
  • what should be ignored
  • what is missing

I'll then follow up with a complete rewrite of the decoder module.

@Yuki2718

Yuki2718 commented May 8, 2023

TBH I don't understand what the issue is. I see that ?adver=123.456 and .ppp are invalid, but I don't know why the expected result is only

kkk.lll
ads.mmm.nnn

What's wrong with extracting bbb.com? Looking at the test cases, github.com and hello.world should be extracted from ~github.com,hello.world##.wrapper by default and be checked for their status. I have long assumed that this is the default behavior of PF, but it isn't? Sure, comments should be skipped, and ##[href^="https://funceble.funilrys.com/"] is something in between: personally I want this to be scanned as well, but it should probably be optional, so aggressive-only makes sense.
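
To make the cosmetic-filter case concrete, here is a minimal Python sketch that pulls the domain list out of the part before `##` or `#@#`. The function name `cosmetic_rule_domains` is hypothetical, and the sketch only handles the two separators that appear in this thread.

```python
import re

# Assumption: only the `##` and `#@#` separators seen in this thread are handled.
COSMETIC_SEPARATOR = re.compile(r"#@?#")


def cosmetic_rule_domains(rule):
    """Return the domains listed before the cosmetic separator, without `~`."""
    rule = rule.strip()
    if rule.startswith("!"):  # comment line
        return []
    parts = COSMETIC_SEPARATOR.split(rule, maxsplit=1)
    if len(parts) != 2:  # not a cosmetic rule
        return []
    return [d.strip().lstrip("~") for d in parts[0].split(",") if d.strip()]


print(cosmetic_rule_domains("~github.com,hello.world##.wrapper"))
# ['github.com', 'hello.world']
```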

@Yuki2718

Yuki2718 commented May 16, 2023

So I changed adblock_aggressive to true and scanned, then found that it returns more domains than before, all of which come from cosmetic filters.
@funilrys As said above, github.com and hello.world should be extracted from ~github.com,hello.world##.wrapper by default. Also, can you add a command-line argument --adblock_aggressive so that the aggressive mode can be used without editing YAML files?

@smed79
Author

smed79 commented May 17, 2023

but don't know why the expected result is

kkk.lll
ads.mmm.nnn

In the other cases we are targeting a specific file¹, folder² or request type³ (image, script ...), or applying the filter to a specific website⁴.

||ttt.com/*/ban.js <--¹
||sss.com^*/img/ <--²
||uuu.com/*$script <--³
||eee.com/*$image,domain=fff.com <--⁴

So in the above example, the output should not include ttt.com, sss.com, uuu.com or eee.com in the default behavior.

If an adblock list has the filter ||www.google.com/ads/*, we will not block www.google.com in our hosts file.
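
The four cases above can be sketched as a small Python classifier that reports why a rule does not cover a whole host. This is an illustration of the argument only. The names `narrowing_reasons`, `HOST_RULE` and `BROAD_OPTIONS` are made up here, and treating `third-party` and `all` as non-narrowing options is an assumption.

```python
import re

HOST_RULE = re.compile(r"^\|{1,2}(?:[a-z]+://)?([a-z0-9.-]+)(.*)$")
# Assumption: these options do not restrict the rule to a request type.
BROAD_OPTIONS = {"third-party", "all"}


def narrowing_reasons(rule):
    """List the reasons a rule is narrower than 'block the whole host'."""
    pattern, _, options = rule.strip().partition("$")
    match = HOST_RULE.match(pattern)
    if match is None:
        return ["not a host-anchored rule"]
    reasons = []
    # Anything after the host other than a bare terminator is a path or wildcard.
    if match.group(2) not in ("", "^", "/", "/*"):
        reasons.append("file/folder")
    for option in filter(None, options.split(",")):
        if option.startswith("domain="):
            reasons.append("specific website")
        elif option not in BROAD_OPTIONS:
            reasons.append("request type")
    return reasons


for rule in ("||ttt.com/*/ban.js", "||uuu.com/*$script",
             "||eee.com/*$image,domain=fff.com", "||ads.example.net^"):
    print(rule, "->", narrowing_reasons(rule) or "whole host")
```

An empty list means the rule would be safe to convert to a hosts entry under Case 2.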

please flag others

@mapx- @okiehsch @Alex-302 @AdamWr @Khrin any test/comment will be appreciated (if you have some free time, of course).

@smed79
Author

smed79 commented May 17, 2023

@funilrys see my comments after the # sign

        {
            "subject": '##[href^="https://funceble.funilrys.com/"]',
            "expected": {
                "aggressive": ["funceble.funilrys.com"],
                "standard": [],
            },
        },
        {
            "subject": "||test.hello.world^$domain=hello.world",
            "expected": {
                "aggressive": ["hello.world", "test.hello.world"],
                "standard": ["test.hello.world"], # should be null because the filter is applied to a specific website
            },
        },
        {
            "subject": '##div[href^="http://funilrys.com/"]',
            "expected": {"aggressive": ["funilrys.com"], "standard": []},
        },
        {
            "subject": 'com##[href^="ftp://funceble.funilrys-funceble.com/"]',
            "expected": {
                "aggressive": ["funceble.funilrys-funceble.com"],
                "standard": [],
            },
        },
        {
            "subject": "!@@||funceble.world/js",
            "expected": {"aggressive": [], "standard": []},
        },
        {
            "subject": "!||world.hello/*ad.xml",
            "expected": {"aggressive": [], "standard": []},
        },
        {
            "subject": "!funilrys.com##body",
            "expected": {"aggressive": [], "standard": []},
        },
        {
            "subject": "[AdBlock Plus 2.0]",
            "expected": {"aggressive": [], "standard": []},
        },
        {
            "subject": "@@||ads.example.com/notbanner^$~script",
            "expected": {"aggressive": ["ads.example.com"], "standard": []},
        },
        {"subject": "/banner/*/img^", "expected": {"aggressive": [], "standard": []}},
        {
            "subject": "||ad.example.co.uk^",
            "expected": {
                "aggressive": ["ad.example.co.uk"],
                "standard": ["ad.example.co.uk"],
            },
        },
        {
            "subject": "||ad.example.fr^$image,test",
            "expected": {
                "aggressive": ["ad.example.fr"],
                "standard": ["ad.example.fr"], # should be null because we are targeting a specific request type
            },
        },
        {
            "subject": "||api.funilrys.com/widget/$",
            "expected": {
                "aggressive": ["api.funilrys.com"],
                "standard": ["api.funilrys.com"], # should be null because we are targeting a specific file/folder
            },
        },
        {
            "subject": "||api.example.com/papi/action$popup",
            "expected": {
                "aggressive": ["api.example.com"],
                "standard": ["api.example.com"], # should be null because we are targeting a specific request type
            },
        },
        {
            "subject": "||funilrys.github.io$script,image",
            "expected": {
                "aggressive": ["funilrys.github.io"],
                "standard": ["funilrys.github.io"], # should be null because we are targeting a specific request type
            },
        },
        {
            "subject": "||example.net^$script,image",
            "expected": {
                "aggressive": ["example.net"],
                "standard": ["example.net"], # should be null because we are targeting a specific request type
            },
        },
        {
            "subject": "||static.hello.world.examoke.org/*/exit-banner.js",
            "expected": {
                "aggressive": ["static.hello.world.examoke.org"],
                "standard": ["static.hello.world.examoke.org"], # should be null because we are targeting a specific file
            },
        },
        {
            "subject": "$domain=exam.pl|elpmaxe.pl|example.pl",
            "expected": {
                "aggressive": ["elpmaxe.pl", "exam.pl", "example.pl"],
                "standard": [],
            },
        },
        {
            "subject": "||example.de^helloworld.com", # unlikely scenario to have a similar filter case
            "expected": {
                "aggressive": ["example.de"],
                "standard": ["example.de"],
            },
        },
        {
            "subject": "|github.io|", # unlikely scenario
            "expected": {"aggressive": ["github.io"], "standard": ["github.io"]},
        },
        {
            "subject": "~github.com,hello.world##.wrapper",
            "expected": {"aggressive": ["github.com", "hello.world"], "standard": []},
        },
        {
            "subject": "bing.com,bingo.com#@##adBanner",
            "expected": {"aggressive": ["bing.com", "bingo.com"], "standard": []},
        },
        {
            "subject": "example.org#@##test",
            "expected": {"aggressive": ["example.org"], "standard": []},
        },
        {
            "subject": "hubgit.com|oohay.com|ipa.elloh.dlorw#@#awesomeWorld", # incorrect filter (for element hiding rules, domains are separated with commas)
            "expected": {
                "aggressive": ["hubgit.com|oohay.com|ipa.elloh.dlorw"],
                "standard": [],
            },
        },
        {"subject": ".com", "expected": {"aggressive": [], "standard": []}},
        {
            "subject": "||ggggggggggg.gq^$all",
            "expected": {
                "aggressive": ["ggggggggggg.gq"],
                "standard": ["ggggggggggg.gq"],
            },
        },
        {
            "subject": "facebook.com##.search",
            "expected": {"aggressive": ["facebook.com"], "standard": []},
        },
        {
            "subject": "||test.hello.world^$domain=hello.world",
            "expected": {
                "aggressive": ["hello.world", "test.hello.world"],
                "standard": ["test.hello.world"], # should be null because the filter is applied to a specific website
            },
        },
        {
            "subject": "||examplae.com",
            "expected": {"aggressive": ["examplae.com"], "standard": ["examplae.com"]},
        },
        {
            "subject": "||examplbe.com^",
            "expected": {"aggressive": ["examplbe.com"], "standard": ["examplbe.com"]},
        },
        {
            "subject": "||examplce.com$third-party",
            "expected": {"aggressive": ["examplce.com"], "standard": ["examplce.com"]},
        },
        {
            "subject": "||examplde.com^$third-party",
            "expected": {"aggressive": ["examplde.com"], "standard": ["examplde.com"]},
        },
        {
            "subject": '##[href^="https://examplee.com/"]',
            "expected": {"aggressive": ["examplee.com"], "standard": []},
        },
        {
            "subject": "||examplfe.com^examplge.com", # same as the case in the line 103
            "expected": {"aggressive": ["examplfe.com"], "standard": ["examplfe.com"]},
        },
        {
            "subject": "||examplhe.com$script,image", # same as the case in the line 56 and 84
            "expected": {"aggressive": ["examplhe.com"], "standard": ["examplhe.com"]},
        },
        {
            "subject": "||examplie.com^$domain=domain1.com|domain2.com",
            "expected": {
                "aggressive": [
                    "domain1.com",
                    "domain2.com",
                    "examplie.com",
                ],
                "standard": ["examplie.com"], # should be null because the filter is applied to specific websites
            },
        },
        {
            "subject": 'examlple.com##[href^="http://hello.world."], '
            '[href^="http://example.net/"]',
            "expected": {
                "aggressive": ["examlple.com", "example.net", "hello.world."],
                "standard": [],
            },
        },
        {"subject": "##.ad-href1", "expected": {"aggressive": [], "standard": []}},
        {
            "subject": "^hello^$domain=example.com",
            "expected": {"standard": [], "aggressive": ["example.com"]},
        },
        {
            "subject": "hello$domain=example.net|example.com",
            "expected": {"standard": [], "aggressive": ["example.com", "example.net"]},
        },
        {
            "subject": "hello^$domain=example.org|example.com|example.net",
            "expected": {
                "standard": [],
                "aggressive": ["example.com", "example.net", "example.org"],
            },
        },
        {
            "subject": "|http://example.org/hello-world^$scripts,image",
            "expected": {
                "aggressive": ["example.org"],
                "standard": ["example.org"], # should be null because we are targeting a specific file/folder for a specific request type
            },
        },
        {
            "subject": "|http://example.org/*",
            "expected": {"aggressive": ["example.org"], "standard": ["example.org"]},
        },
        {
            "subject": "|http://example.org^",
            "expected": {"aggressive": ["example.org"], "standard": ["example.org"]},
        },
        {
            "subject": "|http://example.org",
            "expected": {"aggressive": ["example.org"], "standard": ["example.org"]},
        },
        {
            "subject": "|https://example.org/^$domain=example.com",
            "expected": {
                "aggressive": ["example.com", "example.org"],
                "standard": ["example.org"], # should be null because the filter is applied to a specific website
            },
        },
        {
            "subject": "|ftp://example.org$domain=example.com|example.net",
            "expected": {
                "aggressive": ["example.com", "example.net", "example.org"],
                "standard": ["example.org"], # should be null because the filter is applied to specific websites
            },
        },
        {
            "subject": "|http://example.com$script,image,domain=example.org|foo.example.net",
            "expected": {
                "aggressive": ["example.com", "example.org", "foo.example.net"],
                "standard": ["example.com"], # should be null because the filter is applied to specific websites
            },
        },
        {
            "subject": "|http://example.com,https://example.de$script,image,domain=example.org|foo.example.net", # incorrect filter (not possible to block many sites in the same filter or we have to use a regex rule)
            "expected": {
                "aggressive": [
                    "example.com",
                    "example.de",
                    "example.org",
                    "foo.example.net",
                ],
                "standard": ["example.com", "example.de"],
            },
        },
    ]

@Yuki2718

If it only scans for rules that block the entire domain, the --adblock option does not make much sense. What we expect from PF with --adblock is to check the status of domains in all ABP (and AG, uBO, etc.) rules. I found that even the ##[href^="https://funceble.funilrys.com/"] case helps to pick up potentially obsolete rules (sure, an href being dead does not generally mean the rule is obsolete, though).
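
As a rough illustration of the href case, the following Python sketch extracts hosts from `[href...="scheme://host/..."]` attribute selectors. It is not PyFunceble's decoder. `HREF_HOST` and `href_hosts` are hypothetical names, and the pattern only covers the selector shapes quoted in this thread.

```python
import re

# Assumption: the selector looks like [href^="scheme://host/..."], with an
# optional ^, * or $ operator and optional quotes around the URL.
HREF_HOST = re.compile(r"""\[href[\^*$]?=["']?[a-z][a-z0-9+.-]*://([^/"'\]\s]+)""")


def href_hosts(rule):
    """Return every host found in href attribute selectors of a cosmetic rule."""
    return HREF_HOST.findall(rule)


print(href_hosts('##[href^="https://funceble.funilrys.com/"]'))
# ['funceble.funilrys.com']
```

A dead host found this way is only a hint that the rule may be obsolete, as noted above, so any such result would still need a manual check.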

@smed79
Author

smed79 commented May 17, 2023

If so, then I misunderstood the tool. That is why I asked above what the intended behavior is (2nd).

😕 I thought we could use the tool to convert an adblock list to a blacklist hosts file (Case 2).
