jorgonzalez/mail-extraction-tool

mail-extraction-tool

Tool to scrape email addresses from a list of websites. It takes the list of websites as a parameter, in TSV (tab-separated values) format, e.g.:

WEBUID        WEBSITE        URL

abc123        Website 1        http://websitename1.url

azk988        Website 2        http://websitename2.url

gju386        Website N        http://websitenameN.url
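
A list in this format can be read line by line in the shell. The following is a minimal sketch, not the tool's actual code: the sample file, the `parse_list` function and the `|`-separated output are assumptions made for illustration.

```shell
# Hypothetical sample list; the real one is passed to the script as a parameter.
list_file="$(mktemp)"
printf 'WEBUID\tWEBSITE\tURL\n' > "$list_file"
printf 'abc123\tWebsite 1\thttp://websitename1.url\n' >> "$list_file"
printf 'azk988\tWebsite 2\thttp://websitename2.url\n' >> "$list_file"

# A literal tab character, used as the only field separator.
tab="$(printf '\t')"

# Skip the header row, then split every remaining line on tabs.
parse_list() {
    tail -n +2 "$1" | while IFS="$tab" read -r webuid website url; do
        printf '%s|%s|%s\n' "$webuid" "$website" "$url"
    done
}

parsed="$(parse_list "$list_file")"
printf '%s\n' "$parsed"
rm -f "$list_file"
```

Since a tab counts as whitespace for `read`, consecutive tabs collapse into one separator, so this approach assumes that no field is empty.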

Usage example: ./extract_mail_from_url.sh list_of_websites_and_urls_in_tbs_format.tbs.

The tool downloads each website using wget and then searches for different email address formats with regular expressions:

  • text@text.domain [a-z0-9.-]@[a-z0-9.-].[a-z]
  • text (at) text.domain [a-z0-9.-] (at) [a-z0-9.-].[a-z]
  • text(at)text.domain [a-z0-9.-](at)[a-z0-9.-].[a-z]
  • text[at]text[dot]domain [a-z0-9.-][at][a-z0-9.-][dot][a-z]
  • text[ät]text.domain [a-z0-9.-][ät][a-z0-9.-].[a-z]
  • text [at] text.domain [a-z0-9.-][at][a-z0-9.-].[a-z]
  • text [at] text [punkt] domain [a-z0-9.-][at][a-z0-9.-][punkt][a-z]
  • text(at)text(dot)domain [a-z0-9.-](at)[a-z0-9.-](dot)[a-z]
  • text at text.domain [a-z0-9.-] at [a-z0-9.-].[a-z]
  • text [at] text [dot] domain [a-z0-9.-] [at] [a-z0-9.-] [dot] [a-z]
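
One way to handle several of these formats at once is to rewrite the obfuscated spellings to a plain address first and then match a single pattern. This is a sketch under assumptions, not the tool's own implementation: the `extract_mails` function and the sample page are hypothetical, and the `+` quantifiers are added here because the patterns listed above show only the character classes. The bare "text at text.domain" form is deliberately left out, since rewriting every " at " would produce false matches in ordinary prose.

```shell
# Hypothetical page content standing in for a downloaded website.
page="$(mktemp)"
cat > "$page" <<'EOF'
<p>Contact: info@example.com or Sales (at) example.org</p>
<p>Support: help[at]example[dot]net, office [at] example [punkt] de</p>
EOF

extract_mails() {
    # 1. Lowercase the page so one character class covers all letters.
    # 2. Rewrite (at), [at], [ät] (with optional spaces) to "@".
    # 3. Rewrite (dot), [dot], [punkt] (with optional spaces) to ".".
    # 4. Print every match shaped like local@domain.tld, without duplicates.
    tr 'A-Z' 'a-z' < "$1" \
        | sed -E 's/[[:space:]]*[[(](at|ät)[])][[:space:]]*/@/g' \
        | sed -E 's/[[:space:]]*[[(](dot|punkt)[])][[:space:]]*/./g' \
        | grep -Eo '[a-z0-9.-]+@[a-z0-9.-]+\.[a-z]+' \
        | sort -u
}

mails="$(extract_mails "$page")"
printf '%s\n' "$mails"
rm -f "$page"
```

Normalizing before matching keeps the final expression short, at the cost of losing the original spelling of each address.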

Configurable variables in the tool:

  • TIMEOUT: default 180 seconds; can be passed as an environment variable. It does not work on macOS, since timeout is not a standard command on Darwin. To make this option work on macOS, see https://gist.github.com/dasgoll/7b1a796d6e42cb66508bc504bb518f82
  • RETRIES: default 3; number of attempts made to download each website.
  • FILTER_LIST_FILE: default filter_list; name of the file with the optional list of words to exclude from the email addresses scraped by the tool.
  • TMP_FILE: default "website_"${WEBSITE_LIST_FILE}; temporary file where the website is downloaded; it is deleted after being processed for email scraping.
  • OUTPUT_FILE: default ${WEBSITE_LIST_FILE}"_WITH_MAILS.tsv"; file where the extraction tool writes the scraping results.
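
The sketch below shows how variables like these could be wired together in a shell script. It is an illustration under assumptions and may differ from the actual script: only TIMEOUT is documented as settable from the environment, so making the others overridable is an assumption, as are the `fetch_page` and `filter_mails` helpers, the `websites.tsv` fallback name, and the reading of the filter list as one word per line.

```shell
# Defaults taken from the list above; each can be overridden from the
# environment in this sketch (an assumption for everything except TIMEOUT).
WEBSITE_LIST_FILE="${1:-websites.tsv}"   # hypothetical fallback file name
TIMEOUT="${TIMEOUT:-180}"
RETRIES="${RETRIES:-3}"
FILTER_LIST_FILE="${FILTER_LIST_FILE:-filter_list}"
TMP_FILE="${TMP_FILE:-website_${WEBSITE_LIST_FILE}}"
OUTPUT_FILE="${OUTPUT_FILE:-${WEBSITE_LIST_FILE}_WITH_MAILS.tsv}"

# Download one URL into TMP_FILE, letting wget try up to RETRIES times.
# The timeout wrapper is used only where the command exists, so the
# function still works (without the time limit) on a stock macOS.
fetch_page() {
    if command -v timeout >/dev/null 2>&1; then
        timeout "$TIMEOUT" wget -q --tries="$RETRIES" -O "$TMP_FILE" "$1"
    else
        wget -q --tries="$RETRIES" -O "$TMP_FILE" "$1"
    fi
}

# Read addresses on stdin and drop those containing any filter word.
# With a missing or empty filter list, everything passes through unchanged.
filter_mails() {
    if [ -s "$FILTER_LIST_FILE" ]; then
        grep -v -i -F -f "$FILTER_LIST_FILE"
    else
        cat
    fi
}
```

With this layout, a shorter time limit for a single run would be requested as `TIMEOUT=60 ./extract_mail_from_url.sh list_of_websites_and_urls_in_tbs_format.tbs`.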
