Python Crawler for collecting domain-specific web corpora


python crawl_trial.py launches the crawl according to the parameters declared in the crawl_parameters.yml file, or in any YAML file passed as an argument to crawl_trial.py. The crawler uses the seed sites found in the files of a given directory (path), together with a query (query) used to validate new webpages discovered during the crawling process. inlinks_min defines the minimum number of citations a page must accumulate before being considered as a candidate to enter the corpus. The depth parameter defines the number of corpus extension steps performed from the initial corpus of seed webpages. A hypothetical example configuration is shown below.
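
As an illustration, a minimal YAML configuration might look like the following; the parameter names come from the description above, while the values themselves are hypothetical:

```yaml
# Hypothetical crawl_parameters.yml -- adjust values to your own corpus.
path: ./seeds/            # directory containing the seed site files
query: "climate change"   # query used to validate newly discovered webpages
inlinks_min: 3            # minimum citations before a page becomes a corpus candidate
depth: 2                  # number of extension steps from the initial seed corpus
```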

Required modules: urllib2, BeautifulSoup, urlparse, sqlite3, pyparsing, urllib, random, multiprocessing, lxml, socket, decruft, feedparser, pattern, warnings, chardet, yaml
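
Of these, urllib2, urlparse, sqlite3, urllib, random, multiprocessing, socket, and warnings ship with the Python 2 standard library; the rest are third-party. Assuming the PyPI package names match the module names (decruft may need to be installed from its own source), something like the following should cover them:

```
pip install BeautifulSoup pyparsing lxml feedparser pattern chardet PyYAML
```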


TODO list:

  * better scraping of webpages; for the moment decruft works well but can be enhanced (http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/)
  * automatically extract dates from the extracted texts (good Python solutions exist for English, still lacking for French)
  * feed the database with cleaner and richer information (domain name, number of views, etc.)
  * take the charset into account when crawling webpages
  * crawl updating process
  * automatically grab Google links to initiate a crawl
  * TBD grid-compliant code...
  * clean the code (debug mode, documentation, etc.)
  * write a comprehensive post_processing.py script to keep compatibility with other developments
  * monitoring and reporting (page retrieval problems, successes, distributions, etc.)
  * modular architecture: include a better information extraction process
  * avoid downloading the same content multiple times (md5 comparison; see the first sketch after this list)
  * retry downloading pages that could not be opened
  * targeted and careful crawl of each domain (only follow hypertext links with the query in the URL or in the link text; see the second sketch after this list)
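
For the md5 comparison mentioned above, a minimal sketch could look like this; the in-memory set and the function name are illustrative, not part of the current code:

```python
import hashlib

seen_hashes = set()  # md5 digests of page bodies already stored in the corpus

def is_duplicate(content):
    """Return True if this exact content has already been downloaded."""
    digest = hashlib.md5(content).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```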
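
Likewise, the targeted crawl of the last item could filter outgoing links on the query before following them. A rough sketch, assuming pages are parsed with the BeautifulSoup version listed in the required modules (the function name is hypothetical):

```python
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, matching the Python 2 module list

def query_links(html, query):
    """Keep only hyperlinks whose URL or anchor text contains the query."""
    soup = BeautifulSoup(html)
    needle = query.lower()
    links = []
    for a in soup.findAll('a', href=True):
        anchor_text = ''.join(a.findAll(text=True)).lower()
        if needle in a['href'].lower() or needle in anchor_text:
            links.append(a['href'])
    return links
```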
