using Cisco's OpenDNS/Umbrella 1 Million to gauge #128

Open
jawz101 opened this issue Oct 14, 2020 · 16 comments

Comments

@jawz101

jawz101 commented Oct 14, 2020

http://s3-us-west-1.amazonaws.com/umbrella-static/index.html

Take a look at my rationale on this Adaway commit:

AdAway/adaway.github.io@fe5414a

In addition to registration lookups, I also find it helpful to see if a DNS lookup is actively being used. If it's not in circulation then there is not much reason to block

@spirillen
Contributor

step 1 - http://s3-us-west-1.amazonaws.com/umbrella-static/index.html

Downloaded the daily logs from August 1st, 2020 to October 13th, 2020 top 1 million DNS lookups for all of Cisco's OpenDNS/Umbrella users.

If an entry showed on any one of those days I kept it. Otherwise it was scrapped. The premise being if a DNS lookup is not popular enough to show on a massive DNS log file of that many user devices it is not worth trying to block it.
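The "keep it if it showed up on any day" step quoted above could be sketched roughly as follows. This is only an illustration, not the procedure used in the AdAway commit: it assumes the daily archives are already unzipped as CSV files with `rank,domain` rows, that the blocklist holds one domain per line, and the function names `union_domains` and `keep_seen` are made up for this example.

```shell
#!/bin/sh
# Sketch only: function names and file layout are assumptions.

# union_domains FILE...
# Print the sorted, de-duplicated set of domains seen in any of the daily
# CSV files ("rank,domain" per line; stray CR characters are removed).
union_domains() {
    cat "$@" | tr -d '\r' | cut -d, -f2 | sort -u
}

# keep_seen UNION_FILE BLOCKLIST
# Print the blocklist entries that appear in the union file.
# -F: fixed strings, -x: whole-line match, -f: read patterns from a file.
keep_seen() {
    grep -F -x -f "$1" "$2"
}
```

Usage would be something like `union_domains days/*.csv > union.txt` followed by `keep_seen union.txt blocklist.txt > trimmed.txt`, where all three file names are placeholders.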

I would say, to make this really useful you should know if any at all have had a lookup for any of the domains. Some of them can/will be hidden lookups, i.e. through proxying or CNAME (DNAME).

What would be more interesting would be to collect Newly Observed Domains (NOD) for classification.

However, I do follow your thought of cutting the list(s) down to what is relevant for users.

As for limiting it to the most used domains, well, personally I disagree, as the worst stuff is outside the top 1m.

@jawz101
Author

jawz101 commented Oct 15, 2020

I would say, to make this really useful you should know if any at all have had a lookup for any of the domains.

I'm not sure I understand. The 1 million list is the most popular 1 million DNS lookups by all of Cisco's Umbrella customers.

I do like the idea of newly observed domain tracking. I wish I had the knowledge or tools to pull down their list each day and track the changes over time. If newly added domains appear for more than, say, 7 days, put them in a separate bucket for review.
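A rough sketch of that "review bucket" idea, assuming the daily lists have already been downloaded and unzipped as `rank,domain` CSV files. The function names `domains_of` and `review_bucket` are hypothetical, and the threshold counts days seen in total rather than consecutive days.

```shell
#!/bin/sh
# Sketch only: names and file layout are assumptions.

# domains_of FILE: print the unique, sorted domains of a "rank,domain" CSV.
domains_of() {
    tr -d '\r' < "$1" | cut -d, -f2 | sort -u
}

# review_bucket MIN_DAYS BASELINE DAILY...
# Print domains that are absent from BASELINE but appear in at least
# MIN_DAYS of the DAILY files.
review_bucket() {
    min_days=$1
    baseline=$2
    shift 2
    base_sorted=$(mktemp)
    frequent=$(mktemp)
    domains_of "$baseline" > "$base_sorted"
    # Count in how many daily files each domain shows up.
    for f in "$@"; do domains_of "$f"; done | sort | uniq -c |
        awk -v n="$min_days" '$1 >= n { print $2 }' | sort > "$frequent"
    # comm -13 keeps lines that are only in the second (frequent) file.
    comm -13 "$base_sorted" "$frequent"
    rm -f "$base_sorted" "$frequent"
}
```

For example, `review_bucket 7 baseline.csv new-days/*.csv > review.txt` would list candidates for manual review; the paths are placeholders.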

Also, I wish I had some way to do statistical analysis of other characteristics of the domains, such as their naming (e.g. phrases such as loca*, *lytic, pixel, tag) and their whois registration info (country, registrar, etc.), to weight those that bear a resemblance to tracking.
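As a very small first step in that direction, the naming part could be approximated by counting how many of the example substrings each domain contains. This is a crude sketch, not a statistical model: the keyword list is just the four phrases mentioned above, `flag_names` is a made-up name, whois data is not considered, and plain substring matching will produce false positives (e.g. "tag" inside "stage").

```shell
#!/bin/sh
# Sketch only: keyword list and function name are assumptions.

# flag_names: read domains (one per line) from stdin and print
# "<hits> <domain>" for every name containing at least one keyword,
# highest hit count first.
flag_names() {
    awk '
        BEGIN { n = split("loca lytic pixel tag", kw, " ") }
        {
            hits = 0
            name = tolower($0)
            for (i = 1; i <= n; i++)
                if (index(name, kw[i]) > 0) hits++
            if (hits > 0) print hits, $0
        }
    ' | sort -k1,1nr -k2,2
}
```

It could be fed from a daily list with something like `cut -d, -f2 top-1m.csv | flag_names | head`.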

@spirillen
Contributor

I would say, to make this really useful you should know if any at all have had a lookup for any of the domains.

I'm not sure I understand. The 1 million list is the most popular 1 million DNS lookups by all of Cisco's Umbrella customers.

If you would like to clear out data that has not been used for x time, you would need some way to collect data about which rules have been applied for any users. Just clearing out rules based on a third party's service is, to me, a bad idea. (MY view on why)

This would be data collection, so it would have to be anonymized, and it would require you to build a tool and not only a list.

I do like the idea of newly observed domain tracking

This can easily be set up, but it would require a bigger number of users using my DNS servers for it to be a useful collector. I posted a couple of examples here: https://github.com/mypdns/NOD

Also, if I had some way to do statistical analysis.

What are you using to collect/find the information you are writing your rules from? At first glance it sounds like some regex that needs to be written.

@jawz101
Author

jawz101 commented Oct 15, 2020

What are you using to collect/find the information you are writing your rules from? At first glance it sounds like some regex that needs to be written.

I collect DNS lookups mainly from my devices' NetGuard logs. Download crappy app, run it a while, pick and choose the questionable entries, block, see if it breaks the app, add to the list if it still works. There were a few times when I tried pulling in entries from other sources but I want it to be a list of things I've actually seen and know is just from mobile apps (that is, I try to limit including purely web page domains. There are a million blocklists for those).

I use quite a bit of regex but I'm talking about something a bit fancier. Something along the lines of taking a list with columns for domain name, registrar, certificate info (owner, etc.) and manually marking those I know are ads and trackers. Then, have some sort of statistical/machine learning model that characterizes the ads/trackers I've picked and makes its own decision on how it would classify a larger list of domains as good or bad.

I'm trying to avoid saying "machine learning" :P

@spirillen
Contributor

I'm trying to avoid saying "machine learning" :P

And how did that one go 😆

Sidenote:
I follow you. You might be interested in the Central SQL we are building, and also in the yet hidden project https://www.mypdns.org/project/view/15/ | https://github.com/matrix-rocks, which is all about categorizing domains for others to extract. And your thoughts on how to approach this sound good to my ears.

@funilrys
Owner

funilrys commented Apr 28, 2021

Reopening, because including such a dataset as a testing mode or comparison mode may be interesting in the future.

@funilrys funilrys reopened this Apr 28, 2021
@spirillen
Contributor

Hey @jawz101

As I'm re-reading this issue, I'm thinking, Could you use UHBW: https://github.com/Ultimate-Hosts-Blacklist/whitelist/tree/script in conjunction with the top 1mil list?

@jawz101
Author

jawz101 commented May 3, 2021

How so? Like to see if a blocklist is overblocking?

@spirillen
Contributor

spirillen commented May 3, 2021

Reverse engineering 😏 If you like, I can help get it spinning; it will likely be a bit clumsy and end up using grep in a loop.

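#!/usr/bin/env bash

HOSTSFILE="/etc/hosts"
TOP1MILL="top1mil"
TESTFILE="test.file"

# Read the top-1m file line by line; iterating over "$TOP1MILL" itself
# would only loop once, over the file name.
tr -d '\r' < "$TOP1MILL" | while IFS= read -r line
do
    domain="${line##*,}"    # drop a leading "rank," column if present
    # Fixed-string match of the domain against the hosts file
    [ -n "$domain" ] && grep -F -- "$domain" "$HOSTSFILE" >> "$TESTFILE"
done

conda activate pyfunceble4

# -a: display all statuses, -f: file to test
pyfunceble -a -f "$TESTFILE"

rm "$TESTFILE"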

Or something like that

@jawz101
Author

jawz101 commented May 4, 2021

I do not understand

@spirillen
Contributor

spirillen commented Jun 1, 2021

You are extracting the matches between the top1mill and the source file, and generating a new file to test and distribute.

# A loop to read the domains from the top1million file
tr -d '\r' < "$TOP1MILL" | while IFS= read -r line
do
    domain="${line##*,}"    # drop a leading "rank," column if present
    [ -n "$domain" ] && grep -F -- "$domain" "$HOSTSFILE" >> "$TESTFILE"
done

With the grep command you are taking each line and matching it against your own source (hosts) file; every match between the source (hosts) file and the top1million file is written to a new file. That file can now be tested with PyFunceble and only contains records from the top1mill list. This also means you shouldn't need to do a test with PyFunceble, as the top1mill list only contains ACTIVE & VALID records.
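Running grep once per line of a one-million-line file can be slow. As a possible alternative, the same matching could be done in a single pass with awk. This is a sketch under assumptions: the top-1m file has `rank,domain` rows (or bare domains), the hosts file uses the usual `IP hostname` layout, only the first hostname on each line is checked, and `match_hosts` is a made-up name.

```shell
#!/bin/sh
# Sketch only: function name and file layouts are assumptions.

# match_hosts TOP1M HOSTSFILE
# Print the hosts lines whose hostname (second field) is in the top-1m list.
match_hosts() {
    awk '
        NR == FNR {                     # first file: the top-1m list
            sub(/\r$/, "")              # tolerate CRLF line endings
            n = split($0, parts, ",")
            seen[parts[n]] = 1          # last column is the domain
            next
        }
        /^[ \t]*#/ || NF < 2 { next }   # skip comments and blank lines
        ($2 in seen)                    # second file: print matching lines
    ' "$1" "$2"
}
```

Unlike the substring match done by grep, this only keeps exact hostname matches, so `facebook.com` in the list would not pull in `facebook.com.tendawifi.com`. Note that the `NR == FNR` idiom assumes the first file is not empty.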

@jawz101
Author

jawz101 commented Jun 1, 2021

ah. gotcha. So it's like checking domains with PyFunceble offline... basically. Since we now have a list of DNS records the world already looks up.

I love those files. I will say I try to download like "the past 30 days" of them because what if, say, a mobile app uploads a file of data it collects on you every 15 days. So maybe a DNS record doesn't get hit every day of the month. That was my thought process at least.

I wish I had a database of all of the daily top 1 million files and could just query it all of the time. I've put several in a sqlite database to do something like that and it's still pretty resource intensive. I've also noticed the files contain a lot of crap because some devices seem to do some sketchy things. Ex: there are a bunch of entries for tendawifi and totolink that will be something like facebook.com.tendawifi.com. That tells me there are some weird Chinese routers that do some caching thing, and it sounds kinda sketchy. Like it MITMs your traffic. Also, after the top 100,000 or so the list starts to be alphabetized. Have you noticed that? That tells me the rankings aren't entirely accurate.
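One way to strip that kind of router noise before loading the files anywhere might look like the sketch below. It assumes `rank,domain` rows; `drop_suffixes` is a made-up name, and the suffixes to drop are passed in by the caller (only `tendawifi.com` is taken from the example above).

```shell
#!/bin/sh
# Sketch only: function name and file layout are assumptions.

# drop_suffixes CSV SUFFIX...
# Print the rows of a "rank,domain" CSV whose domain is NOT a subdomain
# of any of the given suffixes. The bare suffix itself is kept.
drop_suffixes() {
    csv=$1
    shift
    tr -d '\r' < "$csv" | awk -F, -v list="$*" '
        BEGIN { n = split(list, suf, " ") }
        {
            d = $NF
            for (i = 1; i <= n; i++) {
                tail = "." suf[i]
                if (length(d) > length(tail) &&
                    substr(d, length(d) - length(tail) + 1) == tail)
                    next
            }
            print
        }
    '
}
```

For example, `drop_suffixes top-1m.csv tendawifi.com > cleaned.csv` would drop `facebook.com.tendawifi.com` but keep `facebook.com`.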

@spirillen
Contributor

spirillen commented Jun 1, 2021

SI, noticed that

https://github.com/spirillen/adaway.github.io.top1mill.hosts/blob/11ae4e93d1a7a001d5637f325eb3c16fd0ad3805/compare.sh#L55

PS, the script can be run from a terminal.... the GHA are teasing me....

Just remember to comment out

https://github.com/spirillen/adaway.github.io.top1mill.hosts/blob/11ae4e93d1a7a001d5637f325eb3c16fd0ad3805/compare.sh#L77

I've put several in a sqlite database to do something like that and it's still pretty resource intense

Can't you just extract from the DB (Array) and compare to a new top-1mil? with either comm, diff or grep -Ff and then read-in new values?

Also, after the top 100,000 or so the list starts to be alphabetized

I've noticed you aren't using the right top-1mil 😏 you only use the top-496.xxx 😃

The right top-1mill is here: http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip vs the one I've found you are linking to.

@jawz101
Author

jawz101 commented Jun 1, 2021

that's the same one I link to. http://s3-us-west-1.amazonaws.com/umbrella-static/index.html . the one on that page

Can't you just extract from the DB (Array) and compare to a new top-1mil? with either comm, diff or grep -Ff and then read-in new values?

I could but I like to see how many times a domain appeared on each of the past 30 days. Things like that.
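Without a database, a per-domain "days seen" count over a window of daily files could be sketched with sort and uniq. This assumes one unzipped `rank,domain` CSV per day; `days_seen` is a made-up name.

```shell
#!/bin/sh
# Sketch only: function name and file layout are assumptions.

# days_seen FILE...
# Print "<days> <domain>" for every domain in the given daily files,
# where <days> is the number of files it appears in, most frequent first.
days_seen() {
    for f in "$@"; do
        # De-duplicate within a day so each file counts at most once.
        tr -d '\r' < "$f" | cut -d, -f2 | sort -u
    done | sort | uniq -c | awk '{ print $1, $2 }' | sort -k1,1nr -k2,2
}
```

Something like `days_seen last30/*.csv | grep -F ' tracker.example'` would then show on how many of the days a given name appeared; the directory and domain are placeholders.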

@spirillen
Contributor

that's the same one I link to. http://s3-us-west-1.amazonaws.com/umbrella-static/index.html . the one on that page

Weird, I only had about 495.xxx lines from the one I found you linked to, while the other one from http://s3-us-west-1.amazonaws.com/umbrella-static/index.html provided the 1mill records.

I could but I like to see how many times a domain appeared on each of the past 30 days. Things like that.

I see, that's a lot of stats. What are your table layout(s)? Maybe I can come up with an idea for something.

@spirillen
Contributor

OT

@jawz101 the script is now running as it is supposed to, though not the way I thought it should, but it is running....

https://github.com/spirillen/adaway.github.io.top1mill.hosts If you like I can transfer the repo to you.
