URL Feature Extractor

Extracts features from URLs to build a dataset for machine learning. The goal is to train a model that predicts phishing URLs targeting the Brazilian population.

Install

$ sudo apt-get update && sudo apt-get upgrade
$ sudo apt-get install virtualenv python3 python3-dev python-dev gcc libpq-dev libssl-dev libffi-dev build-essential
$ virtualenv -p /usr/bin/python3 .env
$ source .env/bin/activate
$ pip install -r requirements.txt

How to use

Before running the software, add your API keys for Google Safe Browsing, PhishTank, and MyWOT to the config.ini file.

Now, run:

$ python run.py <input-urls> <output-dataset>

Features implemented

LEXICAL
Count (.) in URL	Count (-) in URL	Count (_) in URL	Count (/) in URL
Count (?) in URL	Count (=) in URL	Count (@) in URL	Count (&) in URL
Count (!) in URL	Count ( ) in URL	Count (~) in URL	Count (,) in URL
Count (+) in URL	Count (*) in URL	Count (#) in URL	Count ($) in URL
Count (%) in URL	URL Length	TLD amount in URL	Count (.) in Domain
Count (-) in Domain	Count (_) in Domain	Count (/) in Domain	Count (?) in Domain
Count (=) in Domain	Count (@) in Domain	Count (&) in Domain	Count (!) in Domain
Count ( ) in Domain	Count (~) in Domain	Count (,) in Domain	Count (+) in Domain
Count (*) in Domain	Count (#) in Domain	Count ($) in Domain	Count (%) in Domain
Domain Length	Number of vowels in Domain	URL domain in IP address format	Domain contains the keywords "server" or "client"
Count (.) in Directory	Count (-) in Directory	Count (_) in Directory	Count (/) in Directory
Count (?) in Directory	Count (=) in Directory	Count (@) in Directory	Count (&) in Directory
Count (!) in Directory	Count ( ) in Directory	Count (~) in Directory	Count (,) in Directory
Count (+) in Directory	Count (*) in Directory	Count (#) in Directory	Count ($) in Directory
Count (%) in Directory	Directory Length	Count (.) in file	Count (-) in file
Count (_) in file	Count (/) in file	Count (?) in file	Count (=) in file
Count (@) in file	Count (&) in file	Count (!) in file	Count ( ) in file
Count (~) in file	Count (,) in file	Count (+) in file	Count (*) in file
Count (#) in file	Count ($) in file	Count (%) in file	File length
Count (.) in parameters	Count (-) in parameters	Count (_) in parameters	Count (/) in parameters
Count (?) in parameters	Count (=) in parameters	Count (@) in parameters	Count (&) in parameters
Count (!) in parameters	Count ( ) in parameters	Count (~) in parameters	Count (,) in parameters
Count (+) in parameters	Count (*) in parameters	Count (#) in parameters	Count ($) in parameters
Count (%) in parameters	Length of parameters	TLD presence in arguments	Number of parameters
Email present in URL	File extension
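
As an illustration of the lexical features listed above, here is a minimal sketch in Python. It is not the code from this repository: the function names, the feature key format, and the way the path is split into "directory" and "file" (at the last slash) are all assumptions made for the example.

```python
import ipaddress
from urllib.parse import parse_qsl, urlsplit

# Characters counted in each part of the URL, taken from the LEXICAL list.
SPECIAL_CHARS = ".-_/?=@&! ~,+*#$%"


def is_ip(host):
    """Return True if host parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def lexical_features(url):
    """Hypothetical helper: count characters and lengths per URL part."""
    parts = urlsplit(url)
    domain = parts.hostname or ""
    # Assumption: the directory is the path up to and including the last
    # slash, and the file is whatever follows it.
    head, sep, filename = parts.path.rpartition("/")
    sections = {
        "url": url,
        "domain": domain,
        "directory": head + sep,
        "file": filename,
        "parameters": parts.query,
    }
    features = {}
    for name, text in sections.items():
        for ch in SPECIAL_CHARS:
            features[f"count({ch})_{name}"] = text.count(ch)
        features[f"length_{name}"] = len(text)
    features["vowels_domain"] = sum(ch in "aeiou" for ch in domain.lower())
    features["domain_is_ip"] = is_ip(domain)
    features["params_count"] = len(parse_qsl(parts.query, keep_blank_values=True))
    return features


if __name__ == "__main__":
    f = lexical_features("http://example.com/a/b/login.php?user=1&id=2")
    print(f["count(.)_url"], f["length_domain"], f["params_count"])
```

Keeping one flat dictionary per URL makes it straightforward to write each result as a row of the output dataset.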

BLACKLIST
Presence of the URL in blacklists	Presence of the IP Address in blacklists	Presence of the domain in Blacklists

HOST
Presence of the domain in RBL (Real-time Blackhole List)	Domain lookup response time	Domain has SPF?	Geographical location of IP
AS Number (or ASN)	PTR of IP	Time (in days) of domain activation	Time (in days) of domain expiration
Number of resolved IPs	Number of resolved name servers (NameServers - NS)	Number of MX Servers	Time-to-live (TTL) value associated with hostname
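
Two of the host features above can be sketched without any network access, assuming the WHOIS dates and DNS TXT records have already been retrieved by other means. The function names and the reading of "time of domain expiration" as days remaining until expiry are assumptions for this example, not the repository's implementation.

```python
from datetime import date


def domain_age_features(created, expires, today):
    """Hypothetical helper: days since activation and days until expiration."""
    return {
        "days_since_activation": (today - created).days,
        "days_until_expiration": (expires - today).days,
    }


def has_spf(txt_records):
    """Return True if any TXT record is an SPF policy (begins with v=spf1)."""
    return any(
        record.strip().strip('"').lower().startswith("v=spf1")
        for record in txt_records
    )


if __name__ == "__main__":
    print(domain_age_features(date(2020, 1, 1), date(2021, 1, 1), date(2020, 1, 31)))
    print(has_spf(['"v=spf1 include:_spf.example.com ~all"']))
```

Passing `today` explicitly, rather than reading the clock inside the function, keeps the feature reproducible when a dataset is regenerated later.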

OTHERS
Valid TLS / SSL Certificate	Number of redirects	Check if URL is indexed on Google	Check if domain is indexed on Google
Uses URL shortener service
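
The shortener check can be approximated by comparing the URL's host against a list of known services. The sketch below is an assumption about how such a check might look; the `SHORTENERS` set is a small illustrative sample, not the list used by this project.

```python
from urllib.parse import urlsplit

# Illustrative, incomplete sample of shortener domains (an assumption).
SHORTENERS = {"bit.ly", "tinyurl.com", "goo.gl", "t.co", "ow.ly", "is.gd"}


def uses_shortener(url):
    """Return True if the URL's host is a known shortener domain."""
    host = (urlsplit(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host in SHORTENERS


if __name__ == "__main__":
    print(uses_shortener("https://bit.ly/abc"))
    print(uses_shortener("https://example.com/bit.ly"))
```

Matching on the parsed host rather than on the raw string avoids false positives when a shortener name merely appears in the path.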

Contributing

Any contribution is appreciated.

Submitting a Pull Request (PR)

Clone the project:

$ git clone https://github.com/lucasayres/url-feature-extractor.git

Make your changes in a new git branch:

$ git checkout -b my-branch master

Add your changes.
Push your branch to GitHub.
Create a PR to master.
