In this repository the two variants of the phishing dataset are presented.
To preview the dataset interactively and/or tailor it to your needs, please visit a dedicated web application.
Short description of the full variant dataset:
- Total number of instances: 88,647
- Number of legitimate website instances (labeled as 0): 58,000
- Number of phishing website instances (labeled as 1): 30,647
 
- Total number of features: 111 (without target)
Short description of the small variant dataset:
- Total number of instances: 58,645
- Number of legitimate website instances (labeled as 0): 27,998
- Number of phishing website instances (labeled as 1): 30,647
 
- Total number of features: 111 (without target)
| Feature | Description | 
|---|---|
| qty_dot_url | count (.) in URL | 
| qty_hyphen_url | count (-) in URL | 
| qty_underline_url | count (_) in URL | 
| qty_slash_url | count (/) in URL | 
| qty_questionmark_url | count (?) in URL | 
| qty_equal_url | count (=) in URL | 
| qty_at_url | count (@) in URL | 
| qty_and_url | count (&) in URL | 
| qty_exclamation_url | count (!) in URL | 
| qty_space_url | count ( ) in URL | 
| qty_tilde_url | count (~) in URL | 
| qty_comma_url | count (,) in URL | 
| qty_plus_url | count (+) in URL | 
| qty_asterisk_url | count (*) in URL | 
| qty_hashtag_url | count (#) in URL | 
| qty_dollar_url | count ($) in URL | 
| qty_percent_url | count (%) in URL | 
| qty_tld_url | top-level-domain length | 
| length_url | URL length | 
| qty_dot_domain | count (.) in domain | 
| qty_hyphen_domain | count (-) in domain | 
| qty_underline_domain | count (_) in domain | 
| qty_slash_domain | count (/) in domain | 
| qty_questionmark_domain | count (?) in domain | 
| qty_equal_domain | count (=) in domain | 
| qty_at_domain | count (@) in domain | 
| qty_and_domain | count (&) in domain | 
| qty_exclamation_domain | count (!) in domain | 
| qty_space_domain | count ( ) in domain | 
| qty_tilde_domain | count (~) in domain | 
| qty_comma_domain | count (,) in domain | 
| qty_plus_domain | count (+) in domain | 
| qty_asterisk_domain | count (*) in domain | 
| qty_hashtag_domain | count (#) in domain | 
| qty_dollar_domain | count ($) in domain | 
| qty_percent_domain | count (%) in domain | 
| qty_vowels_domain | count vowels in domain | 
| domain_length | domain length | 
| domain_in_ip | URL domain in IP address format | 
| server_client_domain | domain contains the keywords "server" or "client" | 
| qty_dot_directory | count (.) in directory | 
| qty_hyphen_directory | count (-) in directory | 
| qty_underline_directory | count (_) in directory | 
| qty_slash_directory | count (/) in directory | 
| qty_questionmark_directory | count (?) in directory | 
| qty_equal_directory | count (=) in directory | 
| qty_at_directory | count (@) in directory | 
| qty_and_directory | count (&) in directory | 
| qty_exclamation_directory | count (!) in directory | 
| qty_space_directory | count ( ) in directory | 
| qty_tilde_directory | count (~) in directory | 
| qty_comma_directory | count (,) in directory | 
| qty_plus_directory | count (+) in directory | 
| qty_asterisk_directory | count (*) in directory | 
| qty_hashtag_directory | count (#) in directory | 
| qty_dollar_directory | count ($) in directory | 
| qty_percent_directory | count (%) in directory | 
| directory_length | directory length | 
| qty_dot_file | count (.) in file | 
| qty_hyphen_file | count (-) in file | 
| qty_underline_file | count (_) in file | 
| qty_slash_file | count (/) in file | 
| qty_questionmark_file | count (?) in file | 
| qty_equal_file | count (=) in file | 
| qty_at_file | count (@) in file | 
| qty_and_file | count (&) in file | 
| qty_exclamation_file | count (!) in file | 
| qty_space_file | count ( ) in file | 
| qty_tilde_file | count (~) in file | 
| qty_comma_file | count (,) in file | 
| qty_plus_file | count (+) in file | 
| qty_asterisk_file | count (*) in file | 
| qty_hashtag_file | count (#) in file | 
| qty_dollar_file | count ($) in file | 
| qty_percent_file | count (%) in file | 
| file_length | file length | 
| qty_dot_params | count (.) in parameters | 
| qty_hyphen_params | count (-) in parameters | 
| qty_underline_params | count (_) in parameters | 
| qty_slash_params | count (/) in parameters | 
| qty_questionmark_params | count (?) in parameters | 
| qty_equal_params | count (=) in parameters | 
| qty_at_params | count (@) in parameters | 
| qty_and_params | count (&) in parameters | 
| qty_exclamation_params | count (!) in parameters | 
| qty_space_params | count ( ) in parameters | 
| qty_tilde_params | count (~) in parameters | 
| qty_comma_params | count (,) in parameters | 
| qty_plus_params | count (+) in parameters | 
| qty_asterisk_params | count (*) in parameters | 
| qty_hashtag_params | count (#) in parameters | 
| qty_dollar_params | count ($) in parameters | 
| qty_percent_params | count (%) in parameters | 
| params_length | parameters length | 
| tld_present_params | TLD presence in arguments | 
| qty_params | number of parameters | 
| email_in_url | email present in URL | 
| time_response | search time (response) domain (lookup) | 
| domain_spf | domain has SPF | 
| asn_ip | AS Number (or ASN) | 
| time_domain_activation | time (in days) of domain activation | 
| time_domain_expiration | time (in days) of domain expiration | 
| qty_ip_resolved | number of resolved IPs | 
| qty_nameservers | number of resolved name servers (NameServers - NS) | 
| qty_mx_servers | number of MX Servers | 
| ttl_hostname | time-to-live (TTL) value associated with hostname | 
| tls_ssl_certificate | valid TLS / SSL Certificate | 
| qty_redirects | number of redirects | 
| url_google_index | check if URL is indexed on Google | 
| domain_google_index | check if domain is indexed on Google | 
| url_shortened | check if URL is shortened | 
| phishing | is phishing website | 
G. Vrbančič, I. Jr. Fister, V. Podgorelec. Datasets for Phishing Websites Detection. Data in Brief, Vol. 33, 2020, DOI: 10.1016/j.dib.2020.106438