
docs: blog How to scrape Bluesky with Python #2784

Merged
souravjain540 merged 20 commits into apify:master from Mantisus:blog-bsky-crawler on Mar 21, 2025

Conversation

@Mantisus (Contributor):

new draft @souravjain540

@souravjain540 (Collaborator) left a comment:

[Screenshot attached: Screenshot 2025-02-17 at 11 42 01 PM]
Pretty nice one. Add a few comments.

Also please follow this: https://www.notion.so/apify/Apify-tone-and-style-cheat-sheet-0fe6873372e44d88a1bd029d5fd76cea

Basic rules for writing: the A in Actor is always capitalized, and never use Title Case in titles when writing for Apify, i.e., "Making An BlueSky Actor Using Crawlee" -> "Making an BlueSky Actor using Crawlee".

A few more are attached. Please fix all of them.

And also please try always to link to the relevant docs/blog/resources by Apify or Crawlee wherever possible.

Missing section for GitHub Star CTA too.


### Project setup

1. If you don't have UV installed yet, follow the [guide](https://docs.astral.sh/uv/getting-started/installation/) or use this command:
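The command referenced here appears to have been lost in extraction; the standalone installer from the uv documentation is the usual one-liner (on Windows, use the PowerShell variant from the same guide):

```shell
# Download and run the official uv standalone installer
curl -LsSf https://astral.sh/uv/install.sh | sh
```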
Collaborator:

a little about UV?

Collaborator:

I mean, as a new reader I don't know what it is or why we need it.

Comment on lines +53 to +57
When first exploring Bluesky, it might be disconcerting to find that the [main page](https://bsky.app/) lacks a search function without authentication. The same applies when trying to access individual [posts](https://bsky.app/profile/github-trending-js.bsky.social/post/3ldbe7b3ict2v).

Even if you navigate directly to the [search page](https://bsky.app/search?q=apify), while you'll see data, you'll encounter a limitation - the site doesn't allow viewing results beyond the first page.

Fortunately, Bluesky provides a well-documented [API](https://docs.bsky.app/docs/get-started) that's accessible to any registered user without additional permissions. This is what we'll use for data collection.
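As a sketch of what using that API involves (endpoint names are from the Bluesky API docs; nothing is sent here, and the identifier and password are placeholders):

```python
import json
from urllib.parse import urlencode

# Base URL of the default Bluesky PDS; com.atproto.server.createSession
# is the documented authentication endpoint for registered users.
BASE = 'https://bsky.social'


def build_session_request(identifier: str, app_password: str) -> tuple[str, bytes]:
    """Build the URL and JSON body for a createSession call (nothing is sent)."""
    url = f'{BASE}/xrpc/com.atproto.server.createSession'
    body = json.dumps({'identifier': identifier, 'password': app_password}).encode()
    return url, body


def build_search_url(query: str, limit: int = 25) -> str:
    """Build an app.bsky.feed.searchPosts URL for an authenticated request."""
    return f'{BASE}/xrpc/app.bsky.feed.searchPosts?{urlencode({"q": query, "limit": limit})}'


url, body = build_session_request('you.bsky.social', 'your-app-password')
print(url)
print(build_search_url('apify'))
```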
Collaborator:

maybe adding screenshots will explain more


### 5. Saving data to files

For saving results, we'll use the `write_to_json` method in Dataset.
Collaborator:

link to the method doc

---

[Bluesky](https://bsky.app/) is an emerging social network developed by former members of the [Twitter](https://x.com/) development team. The platform has been showing significant growth recently, reaching 132.9 million visits according to [SimilarWeb](https://www.similarweb.com/website/bsky.app/#traffic). Like Twitter, Bluesky generates a vast amount of data that can be used for analysis. In this article, we'll explore how to collect this data using [Crawlee for Python](https://github.com/apify/crawlee-python).

Collaborator:

Missing the part where you list all the sections of the blog, from the intro to making an Actor.


![Users Example](./img/users.webp)

## Create an Apify Actor for the Bluesky crawler
Collaborator:

Link to what an Actor is, and add a little explanation of why we're making an Actor - because it's the easiest way to deploy software on the cloud, etc.

also loved it :)

View results in the Dataset:

![Dataset Results](img/actor_results.webp)

Collaborator:

maybe also show how to publish it on Apify Store

@souravjain540 souravjain540 marked this pull request as ready for review March 12, 2025 06:47
@souravjain540 souravjain540 requested a review from janbuchar March 12, 2025 06:48
@souravjain540

@janbuchar whenever you have time, can you please see if it's good to go or not :)

@janbuchar (Contributor) left a comment:

Overall, the article is fine, maybe a bit more code-heavy than what we usually do? Other than that, I have just a couple of comments. Nice job!


# Variables for storing session data
self._domain: str | None = None
self._did: str | None = None
Contributor:

This deserves a better name.

Author (Contributor):

Maybe it would look better if it were named according to the fields in the API response.

Comment on lines +184 to +185
self._users = await Dataset.open(name='users')
self._posts = await Dataset.open(name='posts')
Contributor:

If you run this on Apify, the datasets will be shared with previous runs of the crawler. Perhaps you could erase them beforehand?

# Add user request if not already added in current context
if post['author']['did'] not in user_requests:
user_requests[post['author']['did']] = Request.from_url(
url=f'{self._domain}/xrpc/app.bsky.actor.getProfile?actor={post["author"]["did"]}',
Contributor:

Perhaps you could use yarl here as well?
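For reference, a yarl version of that request URL might look like the following sketch (`domain` and `did` stand in for `self._domain` and `post['author']['did']`, and the host value is a made-up example):

```python
from yarl import URL

# Hypothetical stand-ins for the instance attributes used in the article
domain = 'https://enoki.us-east.host.bsky.network'
did = 'did:plc:example123'

# yarl handles path joining and query-string encoding,
# avoiding manually interpolated f-strings
profile_url = URL(f'{domain}/xrpc/app.bsky.actor.getProfile').with_query({'actor': did})
print(profile_url)
```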

Creates a crawler instance, manages the session, and handles the complete
crawling lifecycle including proper cleanup on completion or error.
"""
crawler = BlueskyCrawler()
Contributor:

I'm not a huge fan of the name BlueskyCrawler - the fact that it wraps HttpCrawler makes things a bit confusing. Also, needing to call `init_crawler`, `save_data`, etc., from the outside is not too convenient - maybe there could be just a single method on the wrapper class, should you decide to keep it?

Author (Contributor):

Changed the naming so it wouldn't be so similar to the Crawlee naming. :)

I think calling methods from the outside will make it a bit easier to understand for users unfamiliar with Crawlee.
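As a sketch of the single-entry-point shape discussed here (the class name `BlueskyApiScraper` and the private method bodies are hypothetical placeholders, not the article's actual code):

```python
import asyncio


class BlueskyApiScraper:
    """Sketch of a wrapper with one public entry point for the whole lifecycle."""

    def __init__(self) -> None:
        self.steps: list[str] = []  # records lifecycle order, for illustration only

    async def _create_session(self) -> None:
        self.steps.append('session')  # authenticate against the API

    async def _crawl(self, queries: list[str]) -> None:
        self.steps.append(f'crawl:{len(queries)}')  # run the underlying crawler

    async def _save_data(self) -> None:
        self.steps.append('save')  # export datasets, release resources

    async def run(self, queries: list[str]) -> None:
        """Single public method: session, crawl, save, with cleanup guaranteed."""
        await self._create_session()
        try:
            await self._crawl(queries)
        finally:
            await self._save_data()


scraper = BlueskyApiScraper()
asyncio.run(scraper.run(['apify']))
print(scraper.steps)
```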

@janbuchar

Looks like there are some broken links - can you fix them?

[screenshot attached]


data = response.json()

self._service_edpoint = data['didDoc']['service'][0]['serviceEndpoint']
Contributor:

typo in attribute name

Comment on lines +121 to +125
"""A crawler class for extracting data from Bluesky social network using their official API.

This crawler manages authentication, concurrent requests, and data collection for both
posts and user profiles. It uses separate datasets for storing post and user information.
"""
Contributor:

The docblock should reflect the new name

@Mantisus Mantisus requested a review from janbuchar March 18, 2025 12:16
@janbuchar (Contributor) left a comment:

I noticed a typo, otherwise it looks good to me.

posts = []

- prfile_url = URL(f'{self._service_edpoint}/xrpc/app.bsky.actor.getProfile')
+ prfile_url = URL(f'{self._service_endpoint}/xrpc/app.bsky.actor.getProfile')
Contributor:

typo 🙂

@souravjain540 souravjain540 merged commit f7f9728 into apify:master Mar 21, 2025
9 checks passed

3 participants