A simple Pastebin crawler which looks for interesting things and saves them to disk. Originally forked from https://github.com/FabioSpampinato/Pastebin-Crawler
- PyQuery
- Python 3
Make sure you use PyQuery for Python 3!
The tool periodically checks Pastebin for new pastes and analyzes them. If a paste matches a given pattern, its URL is appended to a .txt file and its content is saved to a file under a predefined directory. For instance, a paste that matches the password pattern can have its URL logged in 'passwords.txt' and its content stored under 'passwords'.
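The match-and-save step described above can be sketched roughly as follows. This is only an illustration, not the tool's actual code: the function name `save_if_match`, the `base_dir` parameter, the case-insensitive matching, the `https://pastebin.com/<id>` URL form and the `<id>.txt` file naming are all assumptions.

```python
import os
import re


def save_if_match(paste_id, text, pattern, url_log, directory, base_dir="."):
    """Log the paste URL and save its content if `pattern` matches `text`.

    Hypothetical helper: names, URL form and file layout are assumptions.
    """
    if re.search(pattern, text, re.IGNORECASE) is None:
        return False

    # Create the storage directory (e.g. 'passwords') on first use.
    target_dir = os.path.join(base_dir, directory)
    os.makedirs(target_dir, exist_ok=True)

    # Append the paste URL to the logging file (e.g. 'passwords.txt').
    with open(os.path.join(base_dir, url_log), "a", encoding="utf-8") as log:
        log.write("https://pastebin.com/" + paste_id + "\n")

    # Store the paste content in its own file under the directory.
    with open(os.path.join(target_dir, paste_id + ".txt"), "w", encoding="utf-8") as out:
        out.write(text)
    return True
```

Calling `save_if_match("abc123", text, r"(password\b|pwd\b)", "passwords.txt", "passwords")` would then log the URL and write `passwords/abc123.txt` only when the text matches.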
The following parameters are configurable:
- Refresh time (time slept between Pastebin checks, in seconds)
- Delay (time between sequential accesses to each of Pastebin's pastes, in seconds)
- Ban wait time (time to wait if a ban is detected, in minutes)
- Connection timeout (time to wait before retrying after a connection times out, in seconds)
- Number of refreshes between flushes (number of refreshes until past Pastes are cleared from memory)
- The regexes (stored in regexes.txt; see Using your own regexes)
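To show how the refresh time, delay and flush count might interact, here is a minimal sketch of a polling loop. It is an assumption about the general shape, not the tool's real implementation: `crawl`, `list_new_ids`, `handle_paste`, `max_refreshes` and the injectable `sleep` are hypothetical names, and ban and timeout handling are left out for brevity.

```python
import time


def crawl(list_new_ids, handle_paste, refresh_time=30, delay=1,
          flush_after_x_refreshes=100, max_refreshes=None, sleep=time.sleep):
    """Poll for new paste IDs, skipping ones already seen (illustrative only)."""
    seen = set()
    refreshes = 0
    while max_refreshes is None or refreshes < max_refreshes:
        for paste_id in list_new_ids():
            if paste_id in seen:
                continue  # already processed since the last flush
            seen.add(paste_id)
            handle_paste(paste_id)
            sleep(delay)  # delay between sequential paste accesses
        refreshes += 1
        if refreshes % flush_after_x_refreshes == 0:
            seen.clear()  # flush past pastes from memory
        sleep(refresh_time)  # wait before checking Pastebin again
    return seen
```

Passing `sleep` and `max_refreshes` as parameters is purely a convenience for trying the loop out without waiting; a real run would leave both at their defaults.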
    ./pastebin_crawler.py -h

    Usage: pastebin_crawler.py [options]

    Options:
      -h, --help            show this help message and exit
      -r REFRESH_TIME, --refresh-time=REFRESH_TIME
                            Set the refresh time (default: 30)
      -d DELAY, --delay-time=DELAY
                            Set the delay time (default: 1)
      -b BAN_WAIT, --ban-wait-time=BAN_WAIT
                            Set the ban wait time (default: 5)
      -f FLUSH_AFTER_X_REFRESHES, --flush-after-x-refreshes=FLUSH_AFTER_X_REFRESHES
                            Set the number of refreshes after which memory is
                            flushed (default: 100)
      -c CONNECTION_TIMEOUT, --connection-timeout=CONNECTION_TIMEOUT
                            Set the connection timeout waiting time (default: 60)
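The help text above has the layout that Python's standard `optparse` module produces. Assuming the options are defined that way (an assumption, as are the integer types and the `parse_args` wrapper name), the definitions might look roughly like this:

```python
from optparse import OptionParser


def parse_args(argv=None):
    """Parse the documented options (sketch; option types are assumed to be int)."""
    parser = OptionParser(usage="%prog [options]", prog="pastebin_crawler.py")
    parser.add_option("-r", "--refresh-time", dest="refresh_time", type="int",
                      default=30, help="Set the refresh time (default: 30)")
    parser.add_option("-d", "--delay-time", dest="delay", type="int",
                      default=1, help="Set the delay time (default: 1)")
    parser.add_option("-b", "--ban-wait-time", dest="ban_wait", type="int",
                      default=5, help="Set the ban wait time (default: 5)")
    parser.add_option("-f", "--flush-after-x-refreshes",
                      dest="flush_after_x_refreshes", type="int", default=100,
                      help="Set the number of refreshes after which memory is "
                           "flushed (default: 100)")
    parser.add_option("-c", "--connection-timeout", dest="connection_timeout",
                      type="int", default=60,
                      help="Set the connection timeout waiting time (default: 60)")
    options, _positional = parser.parse_args(argv)
    return options
```

For example, `parse_args(["-r", "60", "--delay-time=2"])` would override the refresh time and delay while leaving the other values at their documented defaults.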
Regexes are stored in the regexes.txt file. It is trivial to modify this file and add new patterns to match.
The format is:
regex , URL logging file path/name , directory to store pasties
Examples:
(password\b|pass\b|pswd\b|passwd\b|pwd\b|pass\b), passwords.txt, passwords
(serial\b|cd-key\b|key\b|license\b), serials.txt, serials
And yes, you can use commas in the regex. Just don't use them in the file name or the directory name. Really, don't!
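One way a parser can tolerate commas in the regex but not in the other two fields is to split each line from the right. The sketch below illustrates that idea; `parse_regex_line` is a hypothetical name, and trimming the whitespace around every field (including the regex) is an assumption about how the file is read.

```python
def parse_regex_line(line):
    """Split 'regex , url_log , directory' on the LAST two commas only."""
    # rsplit with maxsplit=2 leaves any commas inside the regex untouched.
    regex, url_log, directory = line.rsplit(",", 2)
    return regex.strip(), url_log.strip(), directory.strip()
```

This also shows why a comma in the file name or directory would break things: the split would then land in the wrong place and part of the name would end up in the regex.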
Although it is not exactly the same, I have contributed to another tool: a general-purpose web crawler with much of the functionality of this Pastebin crawler. The project is called NowCrawling, and you should go check it out if you want to do some more advanced crawling (e.g. search the whole web for regexes, download all files/images in a list of URLs, find easily accessible TV series episodes, albums, etc.).