Pastebin Crawler

A simple Pastebin crawler that looks for interesting content in new pastes and saves anything that matches to disk. Originally forked from https://github.com/FabioSpampinato/Pastebin-Crawler

Dependencies

PyQuery
Python 3

Make sure you use PyQuery for Python 3!

How it works

The tool periodically checks for new pastes and analyzes them. If a paste matches a given pattern, its URL is appended to a .txt file and its content is saved to a file under a predefined directory. For instance, if a paste matches the password pattern, its URL can be logged in 'passwords.txt' and its content stored under the 'passwords' directory.
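The matching step described above can be sketched roughly as follows. This is a minimal illustration, not the crawler's actual code: the `RULES` table and the `classify` function are hypothetical names, the patterns are shortened versions of the examples further down, and whether the real tool matches case-insensitively is not stated here.

```python
import re

# Hypothetical rule table: (compiled regex, URL log file, content directory).
RULES = [
    (re.compile(r"(password\b|passwd\b|pwd\b)"), "passwords.txt", "passwords"),
    (re.compile(r"(serial\b|cd-key\b|license\b)"), "serials.txt", "serials"),
]

def classify(paste_text):
    """Return a (log_file, directory) pair for every rule the text matches."""
    return [
        (log_file, directory)
        for regex, log_file, directory in RULES
        if regex.search(paste_text)
    ]
```

A paste that matches several rules would be returned once per rule, so the caller can log and store it under each matching category.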

The following parameters are configurable:

Refresh time (time slept between Pastebin checks, in seconds)
Delay (time between sequential accesses to each of Pastebin's pastes, in seconds)
Ban wait time (time to wait if a ban is detected, in minutes)
Connection timeout (time to wait before a new attempt is made if the connection times out, in seconds)
Number of refreshes between flushes (number of refreshes after which previously seen pastes are cleared from memory)
The regexes (see Using your own regexes)
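To show how the refresh time, delay, and flush count interact, here is a rough sketch of a polling loop. It is an assumption about the general shape of such a loop, not the tool's implementation: `crawl`, `fetch_new_ids`, `handle_paste`, and `max_refreshes` are hypothetical names, and ban detection and connection timeouts are left out for brevity.

```python
import time

def crawl(fetch_new_ids, handle_paste, refresh_time=30, delay=1,
          flush_after=100, max_refreshes=None, sleep=time.sleep):
    """Poll for paste IDs, handle unseen ones, and periodically forget old IDs."""
    seen = set()
    refreshes = 0
    while max_refreshes is None or refreshes < max_refreshes:
        for paste_id in fetch_new_ids():
            if paste_id in seen:
                continue              # already processed since the last flush
            seen.add(paste_id)
            handle_paste(paste_id)
            sleep(delay)              # pause between individual paste accesses
        refreshes += 1
        if refreshes % flush_after == 0:
            seen.clear()              # flush remembered pastes to bound memory use
        sleep(refresh_time)           # wait before checking for new pastes again
```

Passing `sleep` and `max_refreshes` as parameters is only there so the loop can be exercised without real waiting or network access.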

Command line options

./pastebin_crawler.py -h
Usage: pastebin_crawler.py [options]

Options:
  -h, --help            show this help message and exit
  -r REFRESH_TIME, --refresh-time=REFRESH_TIME
                        Set the refresh time (default: 30)
  -d DELAY, --delay-time=DELAY
                        Set the delay time (default: 1)
  -b BAN_WAIT, --ban-wait-time=BAN_WAIT
                        Set the ban wait time (default: 5)
  -f FLUSH_AFTER_X_REFRESHES, --flush-after-x-refreshes=FLUSH_AFTER_X_REFRESHES
                        Set the number of refreshes after which memory is
                        flushed (default: 100)
  -c CONNECTION_TIMEOUT, --connection-timeout=CONNECTION_TIMEOUT
                        Set the connection timeout waiting time (default: 60)
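The layout of the help text above resembles what Python's standard `optparse` module prints. Assuming that is how the options are declared (the integer types and the `build_parser` name are guesses for illustration, while the flags, destinations, and defaults are taken from the help text), a sketch of parsing a sample command line looks like this:

```python
from optparse import OptionParser

def build_parser():
    # Flags and defaults mirror the help text; type="int" is an assumption.
    parser = OptionParser()
    parser.add_option("-r", "--refresh-time", dest="refresh_time", type="int",
                      default=30, help="Set the refresh time (default: 30)")
    parser.add_option("-d", "--delay-time", dest="delay", type="int",
                      default=1, help="Set the delay time (default: 1)")
    parser.add_option("-b", "--ban-wait-time", dest="ban_wait", type="int",
                      default=5, help="Set the ban wait time (default: 5)")
    parser.add_option("-f", "--flush-after-x-refreshes",
                      dest="flush_after_x_refreshes", type="int", default=100,
                      help="Set the number of refreshes after which memory is "
                           "flushed (default: 100)")
    parser.add_option("-c", "--connection-timeout", dest="connection_timeout",
                      type="int", default=60,
                      help="Set the connection timeout waiting time (default: 60)")
    return parser

# Equivalent to running: ./pastebin_crawler.py -r 60 --delay-time=2
options, _ = build_parser().parse_args(["-r", "60", "--delay-time=2"])
print(options.refresh_time, options.delay, options.ban_wait)  # prints: 60 2 5
```

Options that are not given on the command line keep their defaults, which is why the ban wait time stays at 5 here.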

Using your own regexes

Regexes are stored in the regexes.txt file. It is trivial to modify this file and add new patterns to match.

The format is:

regex , URL logging file path/name , directory to store pasties

Examples:

(password\b|pass\b|pswd\b|passwd\b|pwd\b), passwords.txt, passwords
(serial\b|cd-key\b|key\b|license\b),       serials.txt,   serials

And yes, you can use commas in the regex. Just don't use them in the file name or the directory name. Really, don't!
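One way a line in this format can allow commas in the regex but not in the other two fields is to split from the right. The sketch below assumes that approach and that surrounding whitespace is stripped from every field; `parse_rule` is a hypothetical helper, not a function from the crawler.

```python
def parse_rule(line):
    """Split 'regex , log file , directory' on the last two commas only."""
    # rsplit with maxsplit=2 leaves any commas inside the regex untouched.
    regex, log_file, directory = (part.strip() for part in line.rsplit(",", 2))
    return regex, log_file, directory

print(parse_rule(r"(user,pass\b|login\b), creds.txt, creds"))
```

A comma in the file name or directory would shift the split points and corrupt the regex field, which is why those two fields must stay comma-free.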

What about crawling other websites, or automatically downloading the URLs found in a pastie?

Although not exactly the same, I have contributed to another tool, a general-purpose web crawler with much of the functionality of this Pastebin crawler. The project is called NowCrawling, and it is the better choice for more advanced crawling (e.g. searching the whole web for regexes, downloading all files/images in a list of URLs, or finding easily accessible TV series episodes, albums, etc.). Check it out!
