feat: add Session binding capability via session_id in Request
#1086
base: master
Conversation
Nice, I have just two small comments.
Could we please cover this in the docs? 🙏 Maybe Session management? Or find a better place. Thanks.
That was easier than I thought, thanks, and good job! 🙂
```python
except RequestCollisionError as request_error:
    context.request.no_retry = True
    await self._handle_request_error(context, request_error)
```
So in case of a collision, the request will be aborted, right? Unless the user re-enqueues it in the `failed_request_handler`, which is kinda cumbersome.
I imagine it can happen quite frequently that the session will get blocked and rotated out. I understand that silently using a new, different session might be confusing (but it might also be a viable option). Nevertheless, there should be an easier way to handle this in user code, and it should be described in the documentation.
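For illustration, a minimal sketch of the re-enqueueing flow discussed here, assuming the `failed_request_handler` decorator and the `add_requests` context helper of Crawlee for Python; the crawler setup, the `backup-session` ID, and the retry strategy are made-up assumptions, not code from this PR:

```python
from crawlee import Request
from crawlee.crawlers import HttpCrawler

crawler = HttpCrawler()


@crawler.failed_request_handler
async def retry_with_another_session(context, error: Exception) -> None:
    # The failed request is already marked as handled, so a brand-new Request
    # has to be created instead of reusing the old one.
    retry_request = Request.from_url(
        context.request.url,
        session_id='backup-session',  # hypothetical replacement session
        always_enqueue=True,  # bypass deduplication against the failed request
    )
    await context.add_requests([retry_request])
```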
If we consider the case of a crawler working with authorization, each session is a separate authorized user. Pages available to this user may not be available to another user. In this case, we cannot pass the request to another session. If a session becomes unavailable for some reason, then all we can do is give the user the opportunity to process these requests later.
Using this feature generally requires additional customizations to the `SessionPool`, so I don't think that having to use the `failed_request_handler` makes it any more cumbersome.
In general, I expect that users who will be using this know why they are doing it. 🙂
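As a rough sketch of this authorization scenario (not taken from the PR), start requests could be bound to per-user sessions that are assumed to have been registered in a customized `SessionPool` beforehand; the user-to-session mapping and URLs are invented:

```python
from crawlee import Request

# Hypothetical mapping of authorized users to sessions that already live in the pool.
USER_SESSIONS = {
    'user_a': 'session-user-a',
    'user_b': 'session-user-b',
}

# Each start request is bound to the session of "its" user via the new option.
start_requests = [
    Request.from_url(f'https://example.com/{user}/dashboard', session_id=session_id)
    for user, session_id in USER_SESSIONS.items()
]
```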
The tricky thing about `failed_request_handler` is that it is called after the request is marked as handled, so if you want to re-enqueue it with a different session, you need to make a new request. Are there any alternatives?
> Are there any alternatives?

Add a new handler for collisions? But this one already sounds cumbersome 🙂

> you need to make a new request.

That's right. I would prefer it to be like this: a user created a `Request` with certain parameters, part of which is `session_id`. This `Request` has failed because one of the parameters is now invalid. I would expect that if the user wants to return it to the queue, a new `Request` with new parameters would be created.
I like the idea, thank you for including me! Few concerns / ideas:

Requests are split between RQ and SessionPool

I am a bit wary of this decentralized state - the request is now effectively split between the request queue and the `SessionPool`.

Better DX

Do we support passing the session ID to requests added by `enqueue_links`? Ideally, I'd like to do something like this:

```python
async def request_handler(context):
    ...
    # I'm logged in as user A in the current request
    await context.enqueue_links(session_id=context.session_id)  # The crawler will visit all the child links as user A
```

Unstable proxy?

Maybe I'm thinking about this too much, but some proxy errors can cause a session to get retired (as …).

I'm sorry to provide a fragmentary review like this, I'm sure you Python guys have thought of everything else :)
No. Since using …

I understand your concerns. On the other hand, if we are talking about … If the user doesn't want to rely on …

If these are proxy errors related to timeouts and connections, the user should configure …

It is exactly what I expected in the cases for which I think it can be used.
```diff
@@ -119,6 +123,7 @@ class RequestOptions(TypedDict):
     headers: NotRequired[HttpHeaders | dict[str, str] | None]
     payload: NotRequired[HttpPayload | str | None]
     label: NotRequired[str | None]
+    session_id: NotRequired[str | None]
```
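A small sketch of how the extended `RequestOptions` could be populated with the new key; the import path and the session ID are assumptions made for illustration:

```python
from crawlee._request import RequestOptions  # import path is an assumption

# A request options dict carrying the new, optional session binding.
options: RequestOptions = {
    'url': 'https://example.com/account',
    'label': 'account',
    'session_id': 'user-a-session',
}
```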
Shouldn't `session_id` be used for `unique_key` computation? I expect that users might get hindered by deduplication if they try to re-enqueue a failed request with a different session.
CC @vdusek - you wrote a big part of the unique key functionality.
Yes, deduplication will affect this. But I expect that users will use existing mechanisms to return a `Request` to the `Queue` while avoiding deduplication, by passing either `unique_key` or `always_enqueue=True`.
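A brief sketch of these two existing mechanisms, assuming the `unique_key` and `always_enqueue` options of `Request.from_url`; URLs and session IDs are illustrative:

```python
from crawlee import Request

# Option 1: supply an explicit unique_key so the retry is not treated as a duplicate.
retry = Request.from_url(
    'https://example.com/protected',
    session_id='session-user-b',
    unique_key='https://example.com/protected#session-user-b',
)

# Option 2: keep the computed unique_key but force the request into the queue.
forced = Request.from_url(
    'https://example.com/protected',
    session_id='session-user-b',
    always_enqueue=True,
)
```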
> Shouldn't session_id be used for unique_key computation? I expect that users might get hindered by deduplication if they try to re-enqueue a failed request with a different session.

Good point! Currently, it infers the `unique_key` from the URL, method, headers, and payload (in its extended form). You can, of course, use `session_id` together with `always_enqueue` and it will work, but that feels like a workaround to me. I believe we should include the `session_id` in the extended `unique_key` computation.
Description
Adds the ability to bind a `Request` to a specific `Session` via `session_id`. If the `Session` is not available in the `SessionPool`, an error will be raised for the `Request`, which can be handled in the `failed_request_handler`.
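A compact, hypothetical end-to-end sketch of the described behavior: a request bound to a session that is missing from the pool fails and ends up in the `failed_request_handler`. The crawler class, URL, and session ID are illustrative assumptions:

```python
import asyncio

from crawlee import Request
from crawlee.crawlers import HttpCrawler


async def main() -> None:
    crawler = HttpCrawler()

    @crawler.failed_request_handler
    async def on_failed(context, error: Exception) -> None:
        # The binding error for the missing session surfaces here.
        context.log.warning(f'{context.request.url} failed: {error}')

    await crawler.run([
        Request.from_url('https://example.com', session_id='nonexistent-session'),
    ])


if __name__ == '__main__':
    asyncio.run(main())
```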
Issues
Testing
Added tests to verify functionality:
- handling of requests with an unavailable `Session` via the `failed_request_handler`