Extracts features from URLs to build a dataset for machine learning. The goal is to find a machine learning model that predicts phishing URLs targeting the Brazilian population.
$ sudo apt-get update && sudo apt-get upgrade
$ sudo apt-get install virtualenv python3 python3-dev python-dev gcc libpq-dev libssl-dev libffi-dev build-essential
$ virtualenv -p /usr/bin/python3 .env
$ source .env/bin/activate
$ pip install -r requirements.txt

Before running the software, add your API keys for Google Safe Browsing, PhishTank, and MyWOT to the config.ini file.
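
Before a long extraction run, it can help to check that no API key was left blank. The following is a minimal sketch using Python's standard `configparser`. The section name `API_KEYS`, the option names, and the helper `missing_keys` are hypothetical; they are assumptions for illustration and may not match the layout of this project's config.ini.

```python
import configparser

# Hypothetical layout: the real config.ini may use other section/option names.
SAMPLE_CONFIG = """
[API_KEYS]
google_safe_browsing = my-gsb-key
phishtank =
mywot = my-wot-key
"""

REQUIRED = ("google_safe_browsing", "phishtank", "mywot")


def missing_keys(config_text, section="API_KEYS", required=REQUIRED):
    """Return the required option names that are absent or empty."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if not parser.has_section(section):
        return list(required)
    return [name for name in required
            if not parser.get(section, name, fallback="").strip()]


if __name__ == "__main__":
    print(missing_keys(SAMPLE_CONFIG))
```

With the sample text above, only the PhishTank entry is reported as missing, because its value is empty.
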
Now, run:
$ python run.py <input-urls> <output-dataset>

| LEXICAL | | | |
|---|---|---|---|
| Count (.) in URL | Count (-) in URL | Count (_) in URL | Count (/) in URL | 
| Count (?) in URL | Count (=) in URL | Count (@) in URL | Count (&) in URL | 
| Count (!) in URL | Count ( ) in URL | Count (~) in URL | Count (,) in URL | 
| Count (+) in URL | Count (*) in URL | Count (#) in URL | Count ($) in URL | 
| Count (%) in URL | URL Length | Number of TLDs in URL | Count (.) in Domain | 
| Count (-) in Domain | Count (_) in Domain | Count (/) in Domain | Count (?) in Domain | 
| Count (=) in Domain | Count (@) in Domain | Count (&) in Domain | Count (!) in Domain | 
| Count ( ) in Domain | Count (~) in Domain | Count (,) in Domain | Count (+) in Domain | 
| Count (*) in Domain | Count (#) in Domain | Count ($) in Domain | Count (%) in Domain | 
| Domain Length | Number of vowels in Domain | URL domain in IP address format | Domain contains the keywords "server" or "client" | 
| Count (.) in Directory | Count (-) in Directory | Count (_) in Directory | Count (/) in Directory | 
| Count (?) in Directory | Count (=) in Directory | Count (@) in Directory | Count (&) in Directory | 
| Count (!) in Directory | Count ( ) in Directory | Count (~) in Directory | Count (,) in Directory | 
| Count (+) in Directory | Count (*) in Directory | Count (#) in Directory | Count ($) in Directory | 
| Count (%) in Directory | Directory Length | Count (.) in file | Count (-) in file | 
| Count (_) in file | Count (/) in file | Count (?) in file | Count (=) in file | 
| Count (@) in file | Count (&) in file | Count (!) in file | Count ( ) in file | 
| Count (~) in file | Count (,) in file | Count (+) in file | Count (*) in file | 
| Count (#) in file | Count ($) in file | Count (%) in file | File length | 
| Count (.) in parameters | Count (-) in parameters | Count (_) in parameters | Count (/) in parameters | 
| Count (?) in parameters | Count (=) in parameters | Count (@) in parameters | Count (&) in parameters | 
| Count (!) in parameters | Count ( ) in parameters | Count (~) in parameters | Count (,) in parameters | 
| Count (+) in parameters | Count (*) in parameters | Count (#) in parameters | Count ($) in parameters | 
| Count (%) in parameters | Length of parameters | TLD presence in arguments | Number of parameters | 
| Email present at URL | File extension | ||
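
Most of the lexical features above are character counts and lengths taken over the whole URL and over its domain, directory, file, and parameters. The sketch below shows one way to compute them with the standard library. It is not the project's implementation: the function names, the feature key names, and the way the path is split into directory and file are assumptions made for illustration.

```python
import ipaddress
import posixpath
from urllib.parse import parse_qsl, urlsplit

# Characters counted per component, as listed in the LEXICAL table.
SPECIAL_CHARS = ".-_/?=@&! ~,+*#$%"


def char_counts(text, prefix):
    """Count each special character in `text` and record its length."""
    features = {f"{prefix}_count_{ch}": text.count(ch) for ch in SPECIAL_CHARS}
    features[f"{prefix}_length"] = len(text)
    return features


def is_ip_address(host):
    """Return True when the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def lexical_features(url):
    """Build a flat dict of lexical features (key names are hypothetical)."""
    parts = urlsplit(url)
    domain = parts.hostname or ""
    # Assumption: the last path segment is the file, the rest the directory.
    directory, filename = posixpath.split(parts.path)
    features = {}
    features.update(char_counts(url, "url"))
    features.update(char_counts(domain, "domain"))
    features.update(char_counts(directory, "directory"))
    features.update(char_counts(filename, "file"))
    features.update(char_counts(parts.query, "params"))
    features["domain_vowels"] = sum(ch in "aeiou" for ch in domain)
    features["domain_is_ip"] = is_ip_address(domain)
    features["params_count"] = len(parse_qsl(parts.query, keep_blank_values=True))
    return features


if __name__ == "__main__":
    sample = lexical_features("http://example.com/a/b/file.php?x=1&y=2")
    print(sample["url_count_."], sample["domain_length"], sample["params_count"])
```

The URL must include a scheme such as `http://`; without one, `urlsplit` treats the whole string as a path and the domain comes back empty.
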

| BLACKLIST | | | |
|---|---|---|---|
| Presence of the URL in blacklists | Presence of the IP address in blacklists | Presence of the domain in blacklists | |

| HOST | | | |
|---|---|---|---|
| Presence of the domain in RBL (Real-time Blackhole List) | Domain lookup response time | Domain has SPF? | Geographical location of IP | 
| AS Number (or ASN) | PTR of IP | Time (in days) since domain activation | Time (in days) until domain expiration | 
| Number of resolved IPs | Number of resolved name servers (NameServers - NS) | Number of MX Servers | Time-to-live (TTL) value associated with hostname | 
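
The two domain-age features in the table above reduce to date arithmetic once the registration and expiration dates are known. The sketch below assumes those dates have already been obtained elsewhere (for example from a WHOIS record) and are passed in as `datetime.date` values; the function names are hypothetical and not taken from the project.

```python
from datetime import date


def days_since_activation(created, today):
    """Days elapsed between domain registration and `today`."""
    return (today - created).days


def days_until_expiration(expires, today):
    """Days remaining until the domain expires; negative if already expired."""
    return (expires - today).days


if __name__ == "__main__":
    today = date(2020, 1, 31)
    print(days_since_activation(date(2020, 1, 1), today))
    print(days_until_expiration(date(2020, 3, 1), today))
```

Passing `today` explicitly, rather than calling `date.today()` inside the functions, keeps the values reproducible when a dataset is rebuilt later.
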

| OTHERS | | | |
|---|---|---|---|
| Valid TLS / SSL Certificate | Number of redirects | Check if URL is indexed on Google | Check if domain is indexed on Google | 
| Uses URL shortener service | |||
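
One simple way to detect a URL shortener is to compare the host against a list of known shortener domains. The sketch below takes that approach. The domain list is a short illustrative sample, not the list used by this project, and `uses_shortener` is a hypothetical helper name.

```python
from urllib.parse import urlsplit

# Illustrative sample only; a real list would be longer and kept up to date.
SHORTENER_DOMAINS = {"bit.ly", "tinyurl.com", "t.co", "goo.gl", "ow.ly", "is.gd"}


def uses_shortener(url):
    """Return True when the URL's host is a known shortener domain."""
    host = urlsplit(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]  # treat www.bit.ly the same as bit.ly
    return host in SHORTENER_DOMAINS


if __name__ == "__main__":
    print(uses_shortener("https://bit.ly/abc"))
    print(uses_shortener("https://example.com/bit.ly"))
```

Only the host is compared, so a shortener name that appears in the path or query string does not trigger a match.
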
Any contribution is appreciated.
- Clone the project:
$ git clone https://github.com/lucasayres/url-feature-extractor.git
- Make your changes in a new git branch:
$ git checkout -b my-branch master
- Add your changes.
- Push your branch to GitHub.
- Create a PR to master.