crawtext

Python crawler for collecting domain-specific web corpora


python crawl_trial.py launches the crawl according to the parameters declared in the crawl_parameters.yml file, or in any YAML file passed as an argument to crawl_trial.py. The crawler uses the seed sites found in the list of files of a given directory (path) as well as a query that is used to validate new webpages (query) found during the crawling process. inlinks_min defines the minimum number of citations a page must accumulate before being considered as a candidate to enter the corpus. The depth parameter defines the number of corpus-extension steps performed from the initial corpus made of seed webpages.
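For example, the crawl can be launched with a specific configuration file:

    python crawl_trial.py crawl_parameters.yml

A minimal sketch of what such a configuration file might look like, assuming the four parameters above are top-level YAML keys (the layout and the values are illustrative, not taken from the codebase):

    # hypothetical crawl_parameters.yml
    path: seeds/              # directory whose files list the seed sites
    query: "climate change"   # query used to validate newly found webpages
    inlinks_min: 3            # citations required before a page becomes a candidate
    depth: 2                  # corpus-extension steps from the initial seed corpus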

Required modules: urllib2, BeautifulSoup, urlparse, sqlite3, pyparsing, urllib, random, multiprocessing, lxml, socket, decruft, feedparser, pattern, warnings, chardet, yaml
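Of these, urllib2, urlparse, urllib, sqlite3, random, multiprocessing, socket, and warnings ship with the Python 2 standard library. A possible install line for the remaining packages, assuming the usual PyPI package names (an assumption; decruft in particular may need to be installed from source):

    pip install BeautifulSoup lxml pyparsing feedparser chardet Pattern PyYAML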


TODO list:

  * better scraping of webpages; decruft currently does a good job but could be improved (http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/)
  * automatically extract dates from extracted texts (good Python solutions exist for English, still lacking for French)
  * feed the db with cleaner and richer information (domain name, number of views, etc.)
  * take the charset into account when crawling webpages
  * crawl updating process
  * automatically grab Google links to initiate a crawl
  * TBD grid-compliant code...
  * clean the code (debug mode, documentation, etc.)
  * write a comprehensive post_processing.py script to keep compatibility with other developments
  * monitoring and reporting (page-retrieval problems, successes, distributions, etc.)
  * modular architecture: include a better information extraction process
  * avoid downloading the same content multiple times (MD5 comparison; see the sketch after this list)
  * retry downloading pages that could not be opened
  * targeted and careful crawl of each domain (only follow hypertext links with the query in the URL or in the link text)
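As an illustrative sketch of the MD5 comparison mentioned in the deduplication item above (the function and variable names are hypothetical, not part of crawtext):

    import hashlib

    seen_hashes = set()  # MD5 digests of page bodies already stored in the corpus

    def is_duplicate(content):
        # Hash the raw page body and check it against previously seen digests.
        digest = hashlib.md5(content).hexdigest()
        if digest in seen_hashes:
            return True
        seen_hashes.add(digest)
        return False

Hashing the raw body only catches byte-identical duplicates; near-duplicates (e.g. the same article served with different ads) would need a fuzzier fingerprint.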
