Yet another tiny crawler in Python.
Crawtext starts by crawling seeds, which can be provided by the user or via the Bing Search API. It extracts the relevant content of each page using Boilerpipe. If the page contains the crawl's query, URLs are extracted from the selected content. Those not flagged as spam by an adblock filter are crawled in the next round, until the desired depth is reached.
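The round-by-round behaviour described above can be sketched in a few lines of Python. This is only an illustration: `crawl`, `fetch`, `is_relevant` and `is_spam` are hypothetical names standing in for crawtext's download + Boilerpipe step, the query match and the adblock filter, not its actual internals.

```python
# Minimal sketch of a round-based, depth-limited crawl in the spirit of
# crawtext. fetch, is_relevant and is_spam are hypothetical stand-ins,
# none of these names come from crawtext itself.

def crawl(seeds, fetch, is_relevant, is_spam, depth):
    """Crawl in rounds: round 0 visits the seeds, each later round
    visits the non-spam outlinks of the relevant pages found so far."""
    seen = set(seeds)
    frontier = list(seeds)
    results = {}
    for _ in range(depth + 1):
        next_frontier = []
        for url in frontier:
            content, outlinks = fetch(url)
            if not is_relevant(content):
                continue  # page does not match the query: drop it
            results[url] = {"content": content, "outlinks": outlinks}
            for link in outlinks:
                if link not in seen and not is_spam(link):
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return results

# Tiny in-memory "web" to exercise the loop.
pages = {
    "http://a.example": ("green algae bloom", ["http://b.example", "http://ads.example"]),
    "http://b.example": ("green algae report", []),
    "http://ads.example": ("buy now", []),
}
results = crawl(
    ["http://a.example"],
    fetch=lambda url: pages.get(url, ("", [])),
    is_relevant=lambda content: "algae" in content,
    is_spam=lambda url: "ads" in url,
    depth=1,
)
```

With depth 1, the seed and its relevant outlink are crawled, while the ad URL is filtered out before it is ever fetched.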
Crawtext saves the JSON-formatted results in a file. Each result is a pertinent crawled page with its:

pointers
: The pages in the given dataset pointing to this page.

content
: The content extracted from the page, in text format.

outlinks
: The pages in the given dataset pointed to by this page.
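For illustration, a single result entry might look like the following. Only the three field names come from the description above; the values and the top-level keyed-by-URL layout are assumptions, not crawtext's guaranteed output format.

```json
{
  "http://www.example.com/article": {
    "pointers": ["http://www.example.org/source"],
    "content": "Extracted article text...",
    "outlinks": ["http://www.example.com/next"]
  }
}
```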
Depends on beautifulsoup, requests and boilerpipe, all of them available through pip.
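Assuming the usual PyPI package names (BeautifulSoup is published on PyPI as beautifulsoup4; the Boilerpipe binding's exact package name may differ), installation could look like:

```shell
pip install beautifulsoup4 requests boilerpipe
```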
```python
crawtext('algues vertes OR algue verte', # query
    0, # depth
    '/Users/mazieres/code/crawtext/results.json', # absolute path to result file
    bing_account_key='============================================', # Bing Search API key
    local_seeds='/Users/mazieres/code/crawtext/myseeds.txt') # absolute path to local seeds
```
Arguments are:
- The query that makes a page pertinent or not. It supports AND and OR operators.
- The depth, which indicates the number of rounds done by the crawler.
- The absolute path to the result file.
- The secret key of your Bing Search API account, available for free here.
- The absolute path to your local seeds file, one URL per line.
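A local seeds file is plain text with one URL per line, for example (these URLs are placeholders):

```
http://www.example.com/
http://www.example.org/news
```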
Fork (and pull), or use the Issue tracker.
Released under MIT License.
Developed by @mazieres, forked from @jphcoi, both efforts being part of the Cortext project.