This project extracts features from URLs to build a dataset for machine learning. The goal is to find a machine learning model that predicts phishing URLs targeting the Brazilian population.
$ sudo apt-get update && sudo apt-get upgrade
$ sudo apt-get install virtualenv python3 python3-dev python-dev gcc libpq-dev libssl-dev libffi-dev build-essential
$ virtualenv -p /usr/bin/python3 .env
$ source .env/bin/activate
$ pip install -r requirements.txt
Before running the software, add your API keys for Google Safe Browsing, PhishTank, and MyWOT to the config.ini file.
Now, run:
$ python run.py <input-urls> <output-dataset>
LEXICAL | |||
---|---|---|---|
Count (.) in URL | Count (-) in URL | Count (_) in URL | Count (/) in URL |
Count (?) in URL | Count (=) in URL | Count (@) in URL | Count (&) in URL |
Count (!) in URL | Count ( ) in URL | Count (~) in URL | Count (,) in URL |
Count (+) in URL | Count (*) in URL | Count (#) in URL | Count ($) in URL |
Count (%) in URL | URL Length | TLD amount in URL | Count (.) in Domain |
Count (-) in Domain | Count (_) in Domain | Count (/) in Domain | Count (?) in Domain |
Count (=) in Domain | Count (@) in Domain | Count (&) in Domain | Count (!) in Domain |
Count ( ) in Domain | Count (~) in Domain | Count (,) in Domain | Count (+) in Domain |
Count (*) in Domain | Count (#) in Domain | Count ($) in Domain | Count (%) in Domain |
Domain Length | Number of vowels in Domain | URL domain in IP address format | Domain contains the keywords "server" or "client" |
Count (.) in Directory | Count (-) in Directory | Count (_) in Directory | Count (/) in Directory |
Count (?) in Directory | Count (=) in Directory | Count (@) in Directory | Count (&) in Directory |
Count (!) in Directory | Count ( ) in Directory | Count (~) in Directory | Count (,) in Directory |
Count (+) in Directory | Count (*) in Directory | Count (#) in Directory | Count ($) in Directory |
Count (%) in Directory | Directory Length | Count (.) in file | Count (-) in file |
Count (_) in file | Count (/) in file | Count (?) in file | Count (=) in file |
Count (@) in file | Count (&) in file | Count (!) in file | Count ( ) in file |
Count (~) in file | Count (,) in file | Count (+) in file | Count (*) in file |
Count (#) in file | Count ($) in file | Count (%) in file | File length |
Count (.) in parameters | Count (-) in parameters | Count (_) in parameters | Count (/) in parameters |
Count (?) in parameters | Count (=) in parameters | Count (@) in parameters | Count (&) in parameters |
Count (!) in parameters | Count ( ) in parameters | Count (~) in parameters | Count (,) in parameters |
Count (+) in parameters | Count (*) in parameters | Count (#) in parameters | Count ($) in parameters |
Count (%) in parameters | Length of parameters | TLD presence in arguments | Number of parameters |
Email present in URL | File extension |
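
As an illustration of how the lexical features above can be computed, here is a minimal sketch using only the Python standard library. It is not the project's actual implementation: the helper names (`split_url`, `is_ip_address`, `lexical_features`) and the feature key names are hypothetical, and the sketch assumes the URL includes a scheme such as `http://` so that the domain can be separated from the path.

```python
import ipaddress
from urllib.parse import parse_qsl, urlsplit

# Characters counted in every URL section, taken from the LEXICAL table.
SPECIAL_CHARS = ".-_/?=@&! ~,+*#$%"


def split_url(url):
    """Split a URL into the sections the table refers to (hypothetical helper)."""
    parts = urlsplit(url)
    # Everything up to and including the last "/" is the directory; the rest is the file.
    directory, slash, filename = parts.path.rpartition("/")
    return {
        "url": url,
        "domain": parts.hostname or "",
        "directory": directory + slash,
        "file": filename,
        "parameters": parts.query,
    }


def is_ip_address(host):
    """Return True when the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True


def lexical_features(url):
    """Return per-section character counts and lengths as a flat dict."""
    sections = split_url(url)
    features = {}
    for name, text in sections.items():
        for ch in SPECIAL_CHARS:
            features[f"{name}_count_{ch}"] = text.count(ch)
        features[f"{name}_length"] = len(text)
    features["domain_vowels"] = sum(c in "aeiou" for c in sections["domain"])
    features["domain_is_ip"] = is_ip_address(sections["domain"])
    # Number of key=value pairs in the query string.
    features["parameters_total"] = len(
        parse_qsl(sections["parameters"], keep_blank_values=True)
    )
    return features


features = lexical_features("http://www.example.com.br/login/index.php?user=a&id=1")
print(features["url_count_."], features["domain_length"], features["parameters_total"])
```

Keeping every feature in one flat dictionary makes it straightforward to write each URL as one row of the output dataset.
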
BLACKLIST | |||
---|---|---|---|
Presence of the URL in blacklists | Presence of the IP Address in blacklists | Presence of the domain in Blacklists |
HOST | |||
---|---|---|---|
Presence of the domain in RBL (Real-time Blackhole List) | Domain lookup response time | Domain has an SPF record | Geographical location of the IP |
AS Number (or ASN) | PTR record of the IP | Time (in days) since domain activation | Time (in days) until domain expiration |
Number of resolved IPs | Number of resolved name servers (NameServers - NS) | Number of MX Servers | Time-to-live (TTL) value associated with hostname |
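
The two domain-age features in the table can be derived from a domain's registration dates. The sketch below assumes the creation and expiration dates have already been obtained elsewhere (for example from a WHOIS lookup, which is not shown), and it assumes the features mean days elapsed since activation and days remaining until expiration. The function name `domain_age_features` and the key names are hypothetical.

```python
from datetime import date


def domain_age_features(created, expires, today=None):
    """Compute day-based age features from registration dates (hypothetical helper)."""
    # Allow the reference date to be injected so results are reproducible.
    today = today or date.today()
    return {
        "days_since_activation": (today - created).days,
        "days_until_expiration": (expires - today).days,
    }


example = domain_age_features(
    created=date(2020, 1, 1),
    expires=date(2021, 1, 1),
    today=date(2020, 1, 31),
)
print(example)
```

A negative `days_until_expiration` would indicate that the registration has already lapsed relative to the reference date.
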
OTHERS | |||
---|---|---|---|
Valid TLS / SSL Certificate | Number of redirects | Check if URL is indexed on Google | Check if domain is indexed on Google |
Uses URL shortener service |
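
One simple way to approximate the "Uses URL shortener service" feature is to compare the URL's host against a list of known shortener domains. The sketch below is an assumption about how such a check could look, not the project's own code: the function name `uses_shortener` is hypothetical, and the domain list is a small illustrative sample rather than a complete one.

```python
from urllib.parse import urlsplit

# Illustrative sample only; a real list would be much longer.
KNOWN_SHORTENERS = {"bit.ly", "goo.gl", "tinyurl.com", "t.co", "ow.ly", "is.gd"}


def uses_shortener(url, shorteners=KNOWN_SHORTENERS):
    """Return True when the URL's host is a known shortener domain."""
    host = urlsplit(url).hostname or ""
    # Treat "www.bit.ly" the same as "bit.ly".
    if host.startswith("www."):
        host = host[4:]
    return host in shorteners


print(uses_shortener("https://bit.ly/abc"))
print(uses_shortener("https://example.com/bit.ly"))
```

Matching on the parsed host instead of searching the whole URL string avoids false positives when a shortener domain merely appears in the path or query.
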
Any contribution is appreciated.
- Clone the project:
$ git clone https://github.com/lucasayres/url-feature-extractor.git
- Make your changes in a new git branch:
$ git checkout -b my-branch master
- Add your changes.
- Push your branch to GitHub.
- Create a PR to master.