A simple web crawler built on Scrapy, with some degree of extensibility.

The crawler handles the simplest use case:
- target webpages are linearly ordered, which means there is no need to crawl subpages.
- given a webpage, it can be determined whether the crawling process should continue.
- given a webpage, the URL of the next page, if any, can be determined.
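Under these assumptions the whole crawl reduces to a simple loop. The sketch below only illustrates the model, not the spider's actual code; `fetch` stands for any callable that downloads a URL and returns a response, and `Page` for a parser class as described in the prototype further down.

```python
def crawl(fetch, Page):
    """Illustrative only: walk a linear chain of pages and yield one record per page."""
    url = Page.start_url
    while True:
        page = Page(fetch(url))   # parse the current page
        yield page.output         # one record per page, as a dict
        if not page.has_next:     # the parser decides whether to continue
            break
        url = page.next           # otherwise follow the link to the next page
```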
A sample parser that crawls basic project information from https://pro.lagou.com/project/kaifa is included in the source
(at SimpleCrawler/cores/dakun.py). Theoretically, any module that conforms to the following prototype will work:
- resides in the folder `cores`
- has a class `Page` in it
- the class `Page` can be constructed from a `scrapy.Response`
- the class `Page` should have a static variable `start_url` that holds the URL of the starting page
- an instance of class `Page` should have:
  - a `has_next` field that indicates whether the crawling process should continue
  - a `next` field that holds the URL of the next page, if any
  - an `output` field that holds a piece of record as a `dict`
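For illustration, a minimal module following this prototype might look like the sketch below. The file name, URL, and CSS selectors are placeholders, not taken from a real site.

```python
# cores/example.py -- hypothetical parser module; URL and selectors are placeholders


class Page:
    # url of the first page to crawl
    start_url = 'https://example.com/projects?page=1'

    def __init__(self, response):
        # `response` is the scrapy Response for the page being parsed

        # one record per page, exported by the spider
        self.output = {
            'title': response.css('h1::text').get(),
            'url': response.url,
        }

        # link to the next page, if the site provides one
        next_href = response.css('a.next::attr(href)').get()
        self.has_next = next_href is not None
        self.next = response.urljoin(next_href) if next_href else None
```

The spider only relies on `start_url`, `has_next`, `next`, and `output`; anything else in the module is up to the parser author.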
Run the crawler with:
scrapy crawl simple -a core=<website_parser_module> -o <output_file_name>.<json|csv|jl|xml> [-L INFO]
`-L INFO` suppresses the verbose DEBUG logging; if the logging level is left at DEBUG (the default value), every exported item is
logged.
jl stands for JSON Lines. The difference between file.jl and file.json, briefly, is that a .jl file contains one JSON object per line,
while a .json file joins them into a single array of objects. The .jl format is convenient for incremental recording, while .json is easy to read with existing JSON decoders.
More information is available in the Scrapy docs.
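As an illustration (with made-up records), the same two items exported in the two formats would look roughly like this:

```
# o.jl -- one JSON object per line
{"title": "foo", "url": "https://example.com/1"}
{"title": "bar", "url": "https://example.com/2"}

# o.json -- a single JSON array
[{"title": "foo", "url": "https://example.com/1"},
 {"title": "bar", "url": "https://example.com/2"}]
```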
For example:
scrapy crawl simple -a core=dakun -o o.json -L INFO
`dakun` can be replaced with any compatible parser.