Open-source crawler for Persian websites. Websites crawled so far:
- **Asriran**: run `asriran/run_asriran.sh`. You can change some parameters in this crawler; see `run_asriran.sh`. Due to some problems during crawling, this job is split into two stages: first crawl all index pages, then use those pages for the actual crawling.
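The two-stage idea above can be sketched as follows. This is a minimal illustration, not the repo's actual code: the `LinkExtractor` class, the `extract_article_links` helper, and the `/news/` URL pattern are all assumptions made for the example.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Stage 1 helper: collect article links from a saved index page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # "/news/" is a hypothetical article-URL pattern, not the site's real one.
            if href and "/news/" in href:
                self.links.append(href)

def extract_article_links(index_html):
    """Return the article URLs found on one crawled index page."""
    parser = LinkExtractor()
    parser.feed(index_html)
    return parser.links

# Stage 1: parse a saved index page into article URLs.
index_html = '<a href="/news/123/sample">article</a><a href="/about">about</a>'
article_links = extract_article_links(index_html)
# Stage 2 (not shown) would iterate over article_links, fetch each page, and save it.
```

Splitting the job this way means a crash during article crawling does not force re-crawling the index pages.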
- **Wikipedia**: run `wikipedia/run_wikipedia.sh`.
- **Tasnim**: this crawler saves Tasnim news pages by category, which makes it appropriate for text-classification tasks since the data is relatively balanced across categories; an equal number of pages is selected per category. The parameter `Number_of_pages` in `tasnim.py` controls how many pages to crawl in each category.
Run `tasnim/run_tasnim.sh`.

All datasets are available for download on Kaggle.
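The balanced per-category paging described for the Tasnim crawler can be sketched like this. Everything here is illustrative: the category slugs, the `category_page_urls` helper, and the `/fa/service/{cat}?page={n}` URL pattern are assumptions, and `NUMBER_OF_PAGES` merely plays the role of the `Number_of_pages` parameter in `tasnim.py`.

```python
NUMBER_OF_PAGES = 5  # stands in for Number_of_pages in tasnim.py; value chosen for illustration
CATEGORIES = ["politics", "sports", "culture"]  # hypothetical category slugs

def category_page_urls(base="https://www.tasnimnews.com"):
    """Build the same number of index-page URLs for every category.

    Crawling equally many pages per category is what keeps the
    resulting dataset balanced for text classification.
    """
    # The "/fa/service/{cat}?page={n}" pattern is an assumption for the example.
    return [f"{base}/fa/service/{cat}?page={n}"
            for cat in CATEGORIES
            for n in range(1, NUMBER_OF_PAGES + 1)]

urls = category_page_urls()
# Every category contributes exactly NUMBER_OF_PAGES index pages.
```

Capping every category at the same page count trades dataset size for class balance, which is the right trade-off for classification benchmarks.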
CSS selectors were mostly extracted via Copy CSS Selector.