Skip to content

Yet another tiny crawler in python, using Bing Search API, Boilerpipe and Adblock.

Notifications You must be signed in to change notification settings

mazieres/crawtext_last

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Crawtext

Yet another tiny crawler in Python.

Crawtext starts crawling seeds, which can be provided by the user or via Bing Search API. It extracts relevant content of the page using Boilerpipe. If the page contain the crawl's query, URLs are extracted from the selected content. If they are not considered as spam by adblock, they get crawled at the next round until the wished depth is reached.

Crawtext save the JSON-formatted results in a file. Each result is a pertinent crawled page with its:

  • pointers: The pages in the given dataset pointing to this page.
  • content: The extracted content from the page in text format.
  • outlinks: The pages in the given dataset pointed by this page.

Installation

Dependencies on beautifulsoup, requests and boilerpipe, all of them being available through pip.

Usage

crawtext('algues vertes OR algue verte', 				# query
		0, 												# depth
		'/Users/mazieres/code/crawtext/results.json',		# absolute path to result file
		bing_account_key='============================================', # Bing Search API key
		local_seeds='/Users/mazieres/code/crawtext/myseeds.txt') 		# absolute path to local seeds

Arguments are:

  • The query that make a page pertinent or not. It support AND and OR operators.
  • The depth indidactes the number of rounds done by the crawler.
  • The absolute Path to result file.
  • The secret key of your Bing Search API account, available for free here.
  • The absolute path to your local seeds' urls, one url per line.

Contribute

Fork (and pull), or use the Issue tracker.

License

Released under MIT License.

About

Developed by @mazieres, forked from @jphcoi, both efforts being part of Cortext project.

About

Yet another tiny crawler in python, using Bing Search API, Boilerpipe and Adblock.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages