This repository was archived by the owner on Jun 10, 2024. It is now read-only.
Activity
Chaffy-0 commented on Dec 6, 2021
Scrapy
JermellB commented on Dec 19, 2021
This project isn't maintained any more because its JavaScript rendering is done by PhantomJS, which is itself no longer maintained.
Like @Chaffy-0 said, Scrapy is likely the best option if you want to build a spider like this.
These days, Elasticsearch comes paired with a crawler if you're doing something simple and don't need to collect and process your own data from the wild.
Most places I've worked at use something like Selenium + Chrome or Firefox, paired with Beautiful Soup for parsing the rendered HTML. You can then keep track of where you've already crawled with something simple like a Bloom filter implemented on top of Redis.
But yeah, Scrapy if you don't feel like getting your hands too dirty.
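For anyone curious what the Bloom filter idea for tracking visited URLs could look like, here is a rough sketch. Everything in it is an assumption for illustration: the class names `SeenUrls` and `InMemoryBits`, the key `crawler:seen`, and the sizing defaults are made up, and the in-memory stand-in is only there so the snippet runs without a Redis server.

```python
import hashlib


class InMemoryBits:
    """Hypothetical stand-in for a Redis client: mimics setbit/getbit on named bitmaps."""

    def __init__(self):
        self._bits = {}

    def setbit(self, name, offset, value):
        bits = self._bits.setdefault(name, set())
        previous = 1 if offset in bits else 0
        if value:
            bits.add(offset)
        else:
            bits.discard(offset)
        return previous  # like Redis, return the bit's previous value

    def getbit(self, name, offset):
        return 1 if offset in self._bits.get(name, set()) else 0


class SeenUrls:
    """Bloom filter over a bitmap: may report false positives, never false negatives."""

    def __init__(self, client, key="crawler:seen", size_bits=1 << 20, num_hashes=5):
        self.client = client          # anything exposing setbit/getbit
        self.key = key                # assumed key name, pick your own
        self.size_bits = size_bits
        self.num_hashes = num_hashes

    def _offsets(self, url):
        # Double hashing: derive num_hashes bit positions from one SHA-256 digest.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never zero
        return [(h1 + i * h2) % self.size_bits for i in range(self.num_hashes)]

    def add(self, url):
        for offset in self._offsets(url):
            self.client.setbit(self.key, offset, 1)

    def __contains__(self, url):
        return all(self.client.getbit(self.key, offset) for offset in self._offsets(url))


seen = SeenUrls(InMemoryBits())
url = "https://example.com/"
if url not in seen:
    seen.add(url)  # a crawler would fetch and parse the page here
```

Against a real server you could pass a redis-py client (`redis.Redis()`) instead of `InMemoryBits()`, since its `setbit(name, offset, value)` and `getbit(name, offset)` take the same arguments. The trade-off is the usual one: memory stays fixed no matter how many URLs you add, but an occasional unvisited URL will be reported as already seen.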
milahu commented on Apr 18, 2022
Some active Python web scraper projects:
https://github.com/Gerapy/Gerapy
https://github.com/howie6879/ruia
roniemartinez commented on Jun 2, 2022
In case anyone is interested in my project 🙇: https://github.com/roniemartinez/dude