feat: lesson about using a framework #1303
base: master
I got your point about avoiding type hints. However, in the case of the handler:

```python
@crawler.router.default_handler
async def handle_listing(context):
    ...
```

it leaves the reader without any code completion or static analysis when working with the `context` object.
In my opinion, type hints should be included here. We have been using them across all docs & examples.
Just a suggestion for you to reconsider, not a request.
Other than that, good job 🙂, and the code seems to be working.
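To make the trade-off concrete, here is a stdlib-only sketch of why the annotation matters. The `Router` and `CrawlingContext` names below are hypothetical stand-ins for illustration, not Crawlee's real classes; the point is only that once the parameter is annotated, editors and type checkers know what `context` is.

```python
from dataclasses import dataclass
from typing import Awaitable, Callable

@dataclass
class CrawlingContext:
    """Hypothetical stand-in for the crawling context object."""
    url: str

class Router:
    """Minimal router that registers a single default handler."""
    def __init__(self) -> None:
        self.handler: Callable[[CrawlingContext], Awaitable[None]] | None = None

    def default_handler(self, func):
        # Store the decorated coroutine function and return it unchanged.
        self.handler = func
        return func

router = Router()

# With the annotation, `context.` completions and static checks work.
@router.default_handler
async def handle_listing(context: CrawlingContext) -> None:
    print(context.url)
```

Without the annotation the code runs identically; the annotation only changes what tooling can infer.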
## Installing Crawlee
When starting with the Crawlee framework, we first need to decide which approach to downloading and parsing we prefer. We want the one based on BeautifulSoup, so let's install the `crawlee` package with the `beautifulsoup` extra specified in brackets. The framework has a lot of dependencies, so expect the installation to take a while.
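Assuming pip, the install command with the extra in brackets would look something like this (quoting so the brackets aren't expanded by the shell):

```shell
pip install 'crawlee[beautifulsoup]'
```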
Maybe we can add a link to BeautifulSoup here?
The reader should know about BS4 from the previous lessons, which work with it extensively.
From the two main open-source options for Python, [Scrapy](https://scrapy.org/) and [Crawlee](https://crawlee.dev/python/), we chose the latter, not just because we're the company financing its development.
We genuinely believe beginners in scraping will like it more, since it lets you create a scraper with less code and less time spent reading docs. Scrapy's long history means it's battle-tested, but it also means its code relies on technologies that aren't really necessary today. Crawlee, on the other hand, builds on modern Python features like asyncio and type hints.
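As a stdlib-only illustration of the style this refers to (a generic sketch, not Crawlee's actual API), a scraper built on asyncio and type hints can schedule page fetches concurrently like this:

```python
import asyncio

async def fetch(url: str) -> str:
    """Toy stand-in for a page fetch; a real crawler would do HTTP here."""
    await asyncio.sleep(0)  # yield control, simulating I/O
    return f"<html>{url}</html>"

async def main(urls: list[str]) -> list[str]:
    # asyncio.gather awaits all the fetch coroutines concurrently.
    return list(await asyncio.gather(*(fetch(u) for u in urls)))

pages = asyncio.run(main(["https://example.com/a", "https://example.com/b"]))
print(len(pages))  # → 2
```

The type annotations buy the same editor support discussed in the review comments above, and `asyncio` replaces the callback machinery older frameworks had to build themselves.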
I think this is fine, but if you want more reasons, you can check out this PR.
Thanks for the review! I see your point, and I will indeed reconsider adding the type hint, at least for the context. It would be an easier decision if the type name weren't 28 characters long, but you're right about the benefits for people with editors like VS Code, where we can assume some level of automatic code completion.
This PR introduces a new lesson to the Python course for beginners in scraping. The lesson is about working with a framework. Decisions I made:
Crawlee feedback
Regarding Crawlee, I didn't have much trouble writing this lesson, apart from the part where I wanted to provide hints on how to do this:
I couldn't find a good example in the docs, and I was afraid that even if I provided pointers to all the individual pieces, the student wouldn't be able to figure it out.
Also, I wanted to link to the docs when pointing out the fact that `enqueue_links()` has a `limit` argument, but I couldn't find `enqueue_links()` in the docs. I found this, which is weird. It's not clear what object is documented, or what it is; it feels like some internals, not regular docs of a method. I probably know how it came to be this way, but I don't think it's useful, and I decided I don't want to send people from the course to that page.

One more thing: I do think that Crawlee should log some "progress" information about requests made or, especially, items scraped. It's weird to run the program and then just stare at it as if it hanged, waiting to see whether something happens. E.g. Scrapy logs how many items per minute I scraped, which I personally find super useful.