jorgonzalez/mail-extraction-tool

mail-extraction-tool

Tool to scrape email addresses from a list of websites. It takes the list of websites as a parameter, in TSV (tab-separated values) format, e.g.:

WEBUID        WEBSITE        URL

abc123        Website 1        http://websitename1.url

azk988        Website 2        http://websitename2.url

gju386        Website N        http://websitenameN.url

Usage example: ./extract_mail_from_url.sh list_of_websites_and_urls_in_tbs_format.tbs.
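
As a rough illustration of how such a list can be read, here is a minimal sketch that walks the rows of a tab-separated file. The function name `list_urls` and the `WEBUID|URL` output format are hypothetical and not taken from the actual script; the sketch only assumes the three-column layout with a header row shown above.

```shell
# Hypothetical helper (not part of the original script): print "WEBUID|URL"
# for every data row of a tab-separated website list.
list_urls() {
    tab=$(printf '\t')
    # tail -n +2 skips the header row (WEBUID, WEBSITE, URL).
    tail -n +2 "$1" | while IFS="$tab" read -r webuid website url || [ -n "$webuid" ]; do
        # Ignore empty lines between rows.
        [ -n "$webuid" ] || continue
        printf '%s|%s\n' "$webuid" "$url"
    done
}
```

Setting `IFS` to a tab only for `read` keeps website names that contain spaces (such as "Website 1") in a single field.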

The tool downloads each website using wget and then searches it for different email address formats using regular expressions:

  • text@text.domain [a-z0-9.-]@[a-z0-9.-].[a-z]
  • text (at) text.domain [a-z0-9.-] (at) [a-z0-9.-].[a-z]
  • text(at)text.domain [a-z0-9.-](at)[a-z0-9.-].[a-z]
  • text[at]text[dot]domain [a-z0-9.-][at][a-z0-9.-][dot][a-z]
  • text[ät]text.domain [a-z0-9.-][ät][a-z0-9.-].[a-z]
  • text [at] text.domain [a-z0-9.-][at][a-z0-9.-].[a-z]
  • text [at] text [punkt] domain [a-z0-9.-][at][a-z0-9.-][punkt][a-z]
  • text(at)text(dot)domain [a-z0-9.-](at)[a-z0-9.-](dot)[a-z]
  • text at text.domain [a-z0-9.-] at [a-z0-9.-].[a-z]
  • text [at] text [dot] domain [a-z0-9.-] [at] [a-z0-9.-] [dot] [a-z]
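
The formats above can be handled in more than one way. The following is a minimal sketch of one possible approach, not the script's actual implementation: it first rewrites the obfuscated separators into `@` and `.`, then extracts plain addresses with a single expression. It assumes the patterns listed above are shorthand and that a working expression needs quantifiers (`+`, `{2,}`). The function name `extract_emails` is hypothetical.

```shell
# Hypothetical helper: read text on stdin, print one normalized address per line.
extract_emails() {
    # 1. Lowercase ASCII letters so the character classes stay simple.
    # 2. Rewrite "(at)", "[at]", "[ät]" (with optional spaces) into "@".
    # 3. Rewrite "(dot)", "[dot]", "[punkt]" (with optional spaces) into ".".
    # 4. Rewrite a bare " at " between two alphanumerics into "@".
    #    This step is prone to false positives on ordinary prose.
    # 5. Extract anything that now looks like a plain address and deduplicate.
    tr 'A-Z' 'a-z' |
    sed -E \
        -e 's/ ?[[(](at|ät)[])] ?/@/g' \
        -e 's/ ?[[(](dot|punkt)[])] ?/./g' \
        -e 's/([a-z0-9]) at ([a-z0-9])/\1@\2/g' |
    grep -Eo '[a-z0-9._-]+@[a-z0-9.-]+\.[a-z]{2,}' |
    sort -u
}
```

For example, `printf '%s\n' 'sales[at]example[dot]org' | extract_emails` prints `sales@example.org`.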

Configurable variables in the tool:

  • TIMEOUT: default 180 seconds; can be passed as an environment variable. It doesn't work on macOS, since timeout is not a standard command on Darwin. If you want this option to work on macOS, read https://gist.github.com/dasgoll/7b1a796d6e42cb66508bc504bb518f82
  • RETRIES: default 3; number of attempts made to download the website.
  • FILTER_LIST_FILE: default filter_list; name of the file containing an optional list of words to exclude from the email addresses scraped by the tool.
  • TMP_FILE: default "website_"${WEBSITE_LIST_FILE}; temporary file where the website is downloaded; it is deleted after being processed for email scraping.
  • OUTPUT_FILE: default ${WEBSITE_LIST_FILE}"_WITH_MAILS.tsv"; name of the file where the extraction tool writes the scraping results.
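
To make the role of these variables more concrete, here is a minimal sketch of how they could be wired together. It is written under several assumptions and is not the script's actual code: the helper names `run_with_timeout`, `download` and `filter_emails` are hypothetical, every variable is treated as overridable from the environment (the list above only states this for TIMEOUT), RETRIES is mapped onto wget's `--tries` option, and the filter list is assumed to hold one word per line with no empty lines.

```shell
# Defaults mirror the list above. "websites.tsv" is a hypothetical fallback
# used only when no list file is given as the first argument.
WEBSITE_LIST_FILE="${1:-websites.tsv}"
TIMEOUT="${TIMEOUT:-180}"
RETRIES="${RETRIES:-3}"
FILTER_LIST_FILE="${FILTER_LIST_FILE:-filter_list}"
TMP_FILE="${TMP_FILE:-website_${WEBSITE_LIST_FILE}}"
OUTPUT_FILE="${OUTPUT_FILE:-${WEBSITE_LIST_FILE}_WITH_MAILS.tsv}"

# Run a command under `timeout` when it is installed; otherwise (for example
# on a stock macOS system) run it without a time limit.
run_with_timeout() {
    secs="$1"; shift
    if command -v timeout >/dev/null 2>&1; then
        timeout "$secs" "$@"
    else
        "$@"
    fi
}

# Download one URL into the temporary file (assumed mapping of RETRIES).
download() {
    run_with_timeout "$TIMEOUT" wget -q --tries="$RETRIES" -O "$TMP_FILE" "$1"
}

# Read addresses on stdin and drop those containing any word from the
# filter list (case-insensitive, fixed strings). An empty line in the
# filter file would match every address, so the file must not contain one.
filter_emails() {
    if [ -s "$FILTER_LIST_FILE" ]; then
        grep -v -i -F -f "$FILTER_LIST_FILE"
    else
        cat
    fi
}
```

With this layout, a call such as `TIMEOUT=60 ./extract_mail_from_url.sh list.tsv` would shorten the per-site limit without editing the script.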
