I would like to pass in options such as user agent and connect timeout to the various drivers. Perhaps it could also check whether robots.txt allows spidering.
Such options can be handled well in curl; I am not sure about the other drivers.
Hi
Nice to see some interest in this library :) It was mainly developed to facilitate testing, not crawling, so I didn't really have those concerns.
All the drivers already support setting user_agent, so that's one thing crossed off your list.
You can easily add a method to pass arbitrary curl options in the class you referenced, and make a pull request out of it.
Connection timeout and robots.txt checking could also be added to other drivers, but that's work that I don't really have time to do ATM, sorry. I will be very appreciative of pull requests though.
Meanwhile, I notice there are plenty of robots.txt classes on GitHub... I might just throw something together and run with it.
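For anyone wondering what such a robots.txt check boils down to, here is a rough, language-agnostic sketch in Python using the standard library's `urllib.robotparser`. It is not part of spiderling; the helper name `is_allowed`, the `my-spider` user agent, and the sample rules are all made up for illustration, and a real crawler would fetch the robots.txt text from the target site first.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url.

    Hypothetical helper for illustration; the rules are passed in as text,
    so no network request is made here.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules (an assumption for this sketch): everything under /private/ is off limits.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

allowed_public = is_allowed(SAMPLE_ROBOTS, "my-spider", "http://example.com/index.html")
allowed_private = is_allowed(SAMPLE_ROBOTS, "my-spider", "http://example.com/private/page.html")
```

A driver could run a check like this before each request and skip any URL that is not allowed.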
Hello,
Re: RequestFactory https://github.com/OpenBuildings/spiderling/blob/3f2da1a3bc6b8a7b48639ce159e3668ae65e10b8/src/Openbuildings/Spiderling/Driver/Simple/RequestFactory/HTTP.php