
feat: add Session binding capability via session_id in Request #1086

Open · wants to merge 13 commits into master
Conversation

@Mantisus (Collaborator)

Description

  • Add strict binding of a Request to a specific Session. If the Session is not available in the SessionPool, an error is raised for the Request, which can be handled in the failed_request_handler (see the sketch below).
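
A minimal usage sketch (the URL, session ID, and handler bodies are illustrative; RequestCollisionError is the error this PR raises for a missing bound session):

import asyncio

from crawlee import Request
from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.errors import RequestCollisionError

async def main() -> None:
    crawler = HttpCrawler()

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processed {context.request.url}')

    @crawler.failed_request_handler
    async def failed_handler(context, error: Exception) -> None:
        # Raised when the bound session is missing from the SessionPool.
        if isinstance(error, RequestCollisionError):
            context.log.warning(f'Session unavailable for {context.request.url}')

    # Strictly bind the request to the session with ID 'session-1'.
    await crawler.run([Request.from_url('https://example.com', session_id='session-1')])

asyncio.run(main())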

Issues

Testing

Added tests to verify functionality:

  • Binding to a valid session
  • Binding to a non-existent session
  • Catching error in failed_request_handler

@Pijukatel (Collaborator) left a comment:

Nice, I have just two small comments.

@Mantisus Mantisus self-assigned this Mar 14, 2025
@Mantisus Mantisus requested a review from janbuchar March 14, 2025 16:03
@vdusek (Collaborator) left a comment:

Could we please cover this in the docs? 🙏 Maybe Session management? Or find a better place. Thanks.

@Mantisus Mantisus requested a review from vdusek March 20, 2025 01:36
@vdusek (Collaborator) left a comment:

That was easier than I thought, thanks, and good job! 🙂

Comment on lines +1111 to +1114
except RequestCollisionError as request_error:
    # The request is bound to a session that is no longer available:
    # disable retries and route it to the failed_request_handler.
    context.request.no_retry = True
    await self._handle_request_error(context, request_error)

Collaborator:

So in case of a collision, the request will be aborted, right? Unless the user re-enqueues it in the failed_request_handler, which is kinda cumbersome.

I imagine it can happen quite frequently that the session will get blocked and rotated out. I understand that silently using a new, different session might be confusing (but it might also be a viable option). Nevertheless, there should be an easier way to handle this in user code and it should be described in the documentation.

@Mantisus (Author):

Consider the case of a crawler working with authorization: each session is a separate authorized user.

Pages available for this user may not be available for another user. In this case, we cannot pass the request to another session. If a session becomes unavailable for some reason, then all we can do is give the user the opportunity to process these requests later.

Using this feature generally requires additional customizations to the SessionPool, so I don't think that having to use the failed_request_handler makes it any more cumbersome.

In general, I expect that users who will be using this know why they are doing it. 🙂

Collaborator:

The tricky thing about failed_request_handler is that it is called after the request is marked as handled, so if you want to re-enqueue it with a different session, you need to make a new request. Are there any alternatives?

@Mantisus (Author) commented Mar 24, 2025:

Are there any alternatives?

Add a new handler for collisions? But that one already sounds cumbersome 🙂

you need to make a new request.

That's right.

I would prefer it to work like this: a user created a Request with certain parameters, one of which is session_id. This Request has failed because one of those parameters is now invalid. I would expect that if the user wants to return it to the queue, a new Request with new parameters would be created.
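
A hedged sketch of that flow (the fallback session ID is assumed): because the failed request is already marked as handled, the handler constructs a brand-new Request and enqueues it.

from crawlee import Request
from crawlee.errors import RequestCollisionError

@crawler.failed_request_handler
async def failed_handler(context, error: Exception) -> None:
    if isinstance(error, RequestCollisionError):
        # Create a new Request with new parameters (here, a different
        # session); always_enqueue=True bypasses deduplication.
        new_request = Request.from_url(
            context.request.url,
            session_id='fallback-session',
            always_enqueue=True,
        )
        await context.add_requests([new_request])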

@janbuchar janbuchar requested a review from barjin March 24, 2025 13:09
@barjin (Contributor) commented Mar 24, 2025:

I like the idea, thank you for including me!

A few concerns / ideas:

Requests are split between RQ and SessionPool

I am a bit wary of this decentralized state - the request is now effectively split between the RequestQueue (URL, headers, body) and SessionPool (Cookie header specifically). Granted, this divide was there before, but users couldn't rely on this, so the cookies couldn't have been considered required for making the request. Not sure what the better solution would be, though.

Better DX

Do we support passing session ID to requests added by enqueue_links (or other Crawlee-native methods)?

Ideally, I'd like to do something like this:

async def request_handler(context):
    ...
    # I'm logged in as user A in the current request.
    await context.enqueue_links(session_id=context.session.id)  # The crawler will visit all the child links as user A

Unstable proxy?

Maybe I'm thinking about this too much, but some proxy errors can cause a session to get retired (as ProxyError is a descendant of SessionError). Would one proxy hiccup (Apify proxies are afaik quite flaky) cause all the requests bound to the same session to fail? I do agree with @Mantisus 's reasoning (fail request on a missing session), but it still sounds like a very strict behavior (maybe that's what the users want, really).

I'm sorry to provide a fragmentary review like this, I'm sure you Python guys have thought of everything else :)

@Mantisus (Author):

Do we support passing session ID to requests added by enqueue_links (or other Crawlee-native methods)?

No. Using session_id requires additional SessionPool configuration (at the very least, increasing the number of times a session can be used when crawling extensively), so I would prefer not to expose it via enqueue_links; it is meant for users who need more control. 🙂

Requests are split between RQ and SessionPool

I understand your concerns. On the other hand, if we are talking about Session, then Cookie is a logical part of that entity.

If the user doesn't want to rely on Session, they can still pass Cookie as part of the Request headers.

Would one proxy hiccup (Apify proxies are afaik quite flaky) cause all the requests bound to the same session to fail?

If these are proxy errors related to timeouts and connections, the user should configure SessionPool so that the session does not die after a few errors 🙂
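
A rough sketch of such a configuration (the exact numbers are illustrative):

from crawlee.crawlers import HttpCrawler
from crawlee.sessions import SessionPool

session_pool = SessionPool(
    max_pool_size=20,
    create_session_settings={
        'max_usage_count': 999_999,  # let a bound session survive many requests
        'max_error_score': 10,       # tolerate a few proxy hiccups before retiring it
    },
)
crawler = HttpCrawler(session_pool=session_pool)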

but it still sounds like a very strict behavior

That strictness is exactly what I expected for the cases in which I think this feature will be used.

@@ -119,6 +123,7 @@ class RequestOptions(TypedDict):
    headers: NotRequired[HttpHeaders | dict[str, str] | None]
    payload: NotRequired[HttpPayload | str | None]
    label: NotRequired[str | None]
    session_id: NotRequired[str | None]
Collaborator:

Shouldn't session_id be used for unique_key computation? I expect that users might get hindered by deduplication if they try to re-enqueue a failed request with a different session.

CC @vdusek - you wrote a big part of the unique key functionality.

@Mantisus (Author):

Yes, deduplication will affect this.

But I expect that users will rely on the existing mechanisms for returning a Request to the queue while avoiding deduplication: passing either a custom unique_key or always_enqueue=True.
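
Both escape hatches in one sketch (identifiers assumed from the current Request API):

# Make the session part of the request identity via an explicit unique_key:
Request.from_url(url, session_id='session-2', unique_key=f'{url}|session-2')

# Or bypass deduplication entirely with a random unique_key:
Request.from_url(url, session_id='session-2', always_enqueue=True)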

Collaborator:

Shouldn't session_id be used for unique_key computation? I expect that users might get hindered by deduplication if they try to re-enqueue a failed request with a different session.

Good point! Currently, it infers the unique_key from the URL, method, headers, and payload (in its extended form). You can, of course, use session_id together with always_enqueue and it will work, but that feels like a workaround to me. I believe we should include the session_id in the extended unique_key computation.
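
A hypothetical illustration of folding session_id into the extended computation (crawlee's real helper hashes more fields and lives elsewhere; this only shows the shape of the idea):

from hashlib import sha256

def compute_extended_unique_key(
    url: str,
    method: str = 'GET',
    payload: bytes | None = None,
    session_id: str | None = None,
) -> str:
    # Today the extended form covers URL, method, headers, and payload;
    # appending session_id would keep re-enqueued requests from colliding.
    parts = [method.upper(), url, sha256(payload or b'').hexdigest()]
    if session_id is not None:
        parts.append(session_id)
    return '|'.join(parts)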

@Mantisus Mantisus requested a review from janbuchar March 27, 2025 13:38
Development

Successfully merging this pull request may close these issues:

  • Add support for request to use a specific session
5 participants