docs: blog How to scrape Bluesky with Python #2784

souravjain540 merged 20 commits into apify:master
Conversation
souravjain540
left a comment

Pretty nice one. I've added a few comments.
Also, please follow this: https://www.notion.so/apify/Apify-tone-and-style-cheat-sheet-0fe6873372e44d88a1bd029d5fd76cea
Basic rules for writing: the "A" in Actor is always capitalized, and never use Title Case in titles when writing for Apify, e.g. "Making A Bluesky Actor Using Crawlee" -> "Making a Bluesky Actor using Crawlee".
There are a few more attached; please fix all of them.
Also, please always try to link to the relevant docs/blog/resources by Apify or Crawlee wherever possible.
The GitHub Star CTA section is missing too.
> ### Project setup
>
> 1. If you don't have UV installed yet, follow the [guide](https://docs.astral.sh/uv/getting-started/installation/) or use this command:

I mean, as a new reader I don't know what it is or why we need it.
> When first exploring Bluesky, it might be disconcerting to find that the [main page](https://bsky.app/) lacks a search function without authentication. The same applies when trying to access individual [posts](https://bsky.app/profile/github-trending-js.bsky.social/post/3ldbe7b3ict2v).
>
> Even if you navigate directly to the [search page](https://bsky.app/search?q=apify), while you'll see data, you'll encounter a limitation - the site doesn't allow viewing results beyond the first page.
>
> Fortunately, Bluesky provides a well-documented [API](https://docs.bsky.app/docs/get-started) that's accessible to any registered user without additional permissions. This is what we'll use for data collection.

Maybe adding screenshots would explain this better.
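For readers unfamiliar with the authenticated API flow the article describes, a minimal sketch might help. The endpoints (`com.atproto.server.createSession`, `app.bsky.feed.searchPosts`) come from the official Bluesky API docs; the helper names, the `PDS_URL` constant, and the use of `requests` are illustrative assumptions, not the article's actual code:

```python
import requests

# Assumed base URL for a Bluesky PDS; the helper names below are hypothetical.
PDS_URL = 'https://bsky.social/xrpc'


def build_search_params(query: str, limit: int = 25) -> dict:
    """Query parameters accepted by app.bsky.feed.searchPosts."""
    return {'q': query, 'limit': limit}


def create_session(handle: str, app_password: str) -> dict:
    """Log in via com.atproto.server.createSession; returns tokens and the DID."""
    response = requests.post(
        f'{PDS_URL}/com.atproto.server.createSession',
        json={'identifier': handle, 'password': app_password},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


def search_posts(access_jwt: str, query: str) -> dict:
    """Call the authenticated search endpoint with a Bearer token."""
    response = requests.get(
        f'{PDS_URL}/app.bsky.feed.searchPosts',
        params=build_search_params(query),
        headers={'Authorization': f'Bearer {access_jwt}'},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```

A screenshot of the JSON response alongside a snippet like this would make the "why the API, not the website" argument concrete.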
> ### 5. Saving data to files
>
> For saving results, we'll use the `write_to_json` method in Dataset.

Link to the method doc.
> ---
>
> [Bluesky](https://bsky.app/) is an emerging social network developed by former members of the [Twitter](https://x.com/) development team. The platform has been showing significant growth recently, reaching 132.9 million visits according to [SimilarWeb](https://www.similarweb.com/website/bsky.app/#traffic). Like Twitter, Bluesky generates a vast amount of data that can be used for analysis. In this article, we'll explore how to collect this data using [Crawlee for Python](https://github.com/apify/crawlee-python).

There's a missing part where you list all the sections of the blog, from the intro to making an Actor.
> ## Create Apify actor for Bluesky crawler

Link to what an Actor is, and add a little explanation of why we're making one - because it's the easiest way to deploy software on the cloud, etc.
Also, loved it :)
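When explaining the Actor setup, it might also be worth showing the Actor config the reader will end up with. A minimal sketch of an `.actor/actor.json`, with all names and values illustrative rather than taken from the actual PR:

```json
{
    "actorSpecification": 1,
    "name": "bluesky-crawler",
    "title": "Bluesky crawler",
    "version": "0.1",
    "meta": {
        "templateId": "python-crawlee"
    },
    "dockerfile": "./Dockerfile"
}
```

That gives the "why an Actor" paragraph something tangible to point at before the deploy step.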
> View results in the Dataset:

Maybe also show how to publish it on Apify Store.
Co-authored-by: Saurav Jain <[email protected]>
@janbuchar whenever you have time, can you please see if it's good to go or not :)
janbuchar
left a comment
Overall, the article is fine, maybe a bit more code-heavy than what we usually do? Other than that, I have just a couple of comments. Nice job!
```python
# Variables for storing session data
self._domain: str | None = None
self._did: str | None = None
```

This deserves a better name.

Maybe it would look better named according to the fields in the API response.
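For the naming suggestion above, one option is to group the session state into a small structure whose fields mirror the `createSession` response. A sketch (the class name and field set are assumptions, not the PR's code):

```python
from dataclasses import dataclass


@dataclass
class SessionData:
    """Session state named after the com.atproto.server.createSession response fields."""

    did: str          # the account's decentralized identifier
    handle: str       # e.g. 'someone.bsky.social'
    access_jwt: str   # short-lived token sent as a Bearer header
    refresh_jwt: str  # used to renew the access token when it expires
```

Field names lifted straight from the API response make it obvious where each value came from.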
```python
self._users = await Dataset.open(name='users')
self._posts = await Dataset.open(name='posts')
```

If you run this on Apify, the datasets will be shared with previous runs of the crawler. Perhaps you could erase them beforehand?
```python
# Add user request if not already added in current context
if post['author']['did'] not in user_requests:
    user_requests[post['author']['did']] = Request.from_url(
        url=f'{self._domain}/xrpc/app.bsky.actor.getProfile?actor={post["author"]["did"]}',
```

Perhaps you could use yarl here as well?
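A sketch of what the yarl version could look like, using yarl's `/` path-join and `%` query-update operators; the function name is made up for illustration:

```python
from yarl import URL


def profile_request_url(service_endpoint: str, did: str) -> str:
    """Build the getProfile URL with yarl instead of f-string interpolation."""
    url = URL(service_endpoint) / 'xrpc' / 'app.bsky.actor.getProfile'
    return str(url % {'actor': did})
```

Besides consistency, yarl takes care of query-string encoding, which matters once DIDs or handles contain characters that need escaping.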
```python
Creates a crawler instance, manages the session, and handles the complete
crawling lifecycle including proper cleanup on completion or error.
"""
crawler = BlueskyCrawler()
```

I'm not a huge fan of the name BlueskyCrawler - the fact that it wraps HttpCrawler makes things a bit confusing. Also, needing to call init_crawler, save_data, etc., from the outside is not too convenient - maybe there could be just a single method on the wrapper class, should you decide to keep it?

Changed the naming so it wouldn't be so similar to the Crawlee naming. :)
I think calling methods from the outside will make it a bit easier to understand for users unfamiliar with Crawlee.
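For reference, the single-entry-point variant suggested above could look like this. All class and method names here are hypothetical stand-ins (the bodies are stubs that just record which step ran), sketching only the lifecycle shape:

```python
import asyncio


class BlueskySession:
    """Sketch of a wrapper with one public entry point (all names hypothetical)."""

    def __init__(self) -> None:
        self.steps: list[str] = []

    async def _init_crawler(self) -> None:
        self.steps.append('init')    # would create the underlying HttpCrawler

    async def _crawl(self) -> None:
        self.steps.append('crawl')   # would run the crawl itself

    async def _save_data(self) -> None:
        self.steps.append('save')    # would export datasets and close the session

    async def run(self) -> None:
        """Single public method: callers never touch the internal lifecycle."""
        await self._init_crawler()
        try:
            await self._crawl()
        finally:
            await self._save_data()  # cleanup happens even if the crawl fails


if __name__ == '__main__':
    asyncio.run(BlueskySession().run())
```

The `try/finally` keeps the "proper cleanup on completion or error" promise from the docblock while hiding the step ordering from callers.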
Co-authored-by: Jan Buchar <[email protected]>
```python
data = response.json()

self._service_edpoint = data['didDoc']['service'][0]['serviceEndpoint']
```
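The quoted line digs the PDS endpoint out of the session response; pulled into a standalone helper (the function name is made up, the dict shape is taken from the snippet above), it is easy to sanity-check:

```python
def extract_service_endpoint(session_data: dict) -> str:
    """Return the PDS service endpoint from a createSession-style response."""
    return session_data['didDoc']['service'][0]['serviceEndpoint']
```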
```python
"""A crawler class for extracting data from Bluesky social network using their official API.

This crawler manages authentication, concurrent requests, and data collection for both
posts and user profiles. It uses separate datasets for storing post and user information.
"""
```

The docblock should reflect the new name.
janbuchar
left a comment
I noticed a typo, otherwise it looks good to me.
```diff
  posts = []

- prfile_url = URL(f'{self._service_edpoint}/xrpc/app.bsky.actor.getProfile')
+ prfile_url = URL(f'{self._service_endpoint}/xrpc/app.bsky.actor.getProfile')
```

new draft @souravjain540