feat: add Session binding capability via session_id in Request #1086
Nice, I have just two small comments.
Could we please cover this in the docs? 🙏 Maybe Session management? Or find a better place. Thanks.
That was easier than I thought, thanks, and good job! 🙂
I like the idea, thank you for including me! A few concerns / ideas:

Requests are split between RQ and SessionPool
I am a bit wary of this decentralized state - the request is now effectively split between the

Better DX
Do we support passing session ID to requests added by

Ideally, I'd like to do something like this:

request_handler(context):
    ...
    # I'm logged in as user A in the current request
    context.enqueue_links(session_id=context.session_id)  # The crawler will visit all the child links as user A

Unstable proxy?
Maybe I'm thinking about this too much, but some proxy errors can cause a session to get retired (as

I'm sorry to provide a fragmentary review like this, I'm sure you Python guys have thought of everything else :)
No. Since using
I understand your concerns. On the other hand, if we are talking about
If the user doesn't want to rely on
If these are proxy errors related to timeouts and connections, the user should configure
It is exactly what I expected, in the cases for which I think it can be used
@@ -119,6 +123,7 @@ class RequestOptions(TypedDict):
    headers: NotRequired[HttpHeaders | dict[str, str] | None]
    payload: NotRequired[HttpPayload | str | None]
    label: NotRequired[str | None]
    session_id: NotRequired[str | None]
Shouldn't session_id be used for unique_key computation? I expect that users might get hindered by deduplication if they try to re-enqueue a failed request with a different session.
CC @vdusek - you wrote a big part of the unique key functionality.
Yes, deduplication will affect this.
But I expect that users will rely on the existing mechanisms for returning a Request to the queue without it being deduplicated: passing either unique_key or always_enqueue=True.
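The interaction with deduplication can be illustrated with a toy model. This is only a sketch, not Crawlee's implementation: `ToyRequest`, `ToyQueue`, and the key derivation are hypothetical stand-ins. It assumes only that the queue drops any request whose unique_key it has already seen, that the default key ignores the session, and that always_enqueue=True yields a key that never collides.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a crawler request (not the Crawlee class)."""

    url: str
    session_id: str | None = None
    unique_key: str | None = None
    always_enqueue: bool = False

    def __post_init__(self) -> None:
        if self.unique_key is None:
            # Assumption: the default key depends on the URL only, so the session is ignored.
            self.unique_key = self.url
        if self.always_enqueue:
            # Assumption: a random suffix makes the key unique on every call.
            self.unique_key = f'{self.unique_key}_{uuid.uuid4().hex}'


class ToyQueue:
    """Hypothetical queue that deduplicates purely on unique_key."""

    def __init__(self) -> None:
        self._seen: set[str] = set()
        self.pending: list[ToyRequest] = []

    def add(self, request: ToyRequest) -> bool:
        """Return True if the request was enqueued, False if it was deduplicated."""
        if request.unique_key in self._seen:
            return False
        self._seen.add(request.unique_key)
        self.pending.append(request)
        return True


url = 'https://example.com/a'
queue = ToyQueue()

first = queue.add(ToyRequest(url, session_id='session-1'))
# Same URL with a different session: dropped, because the key is unchanged.
retry_plain = queue.add(ToyRequest(url, session_id='session-2'))
# Workaround 1: force the request through.
retry_forced = queue.add(ToyRequest(url, session_id='session-2', always_enqueue=True))
# Workaround 2: supply an explicit key.
retry_keyed = queue.add(ToyRequest(url, session_id='session-2', unique_key='a-retry-session-2'))
```

In this model the plain retry is silently dropped, which is the hindrance raised in the review; both workarounds get the request back into the queue.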
Shouldn't session_id be used for unique_key computation? I expect that users might get hindered by deduplication if they try to re-enqueue a failed request with a different session.
Good point! Currently, it infers the unique_key from the URL, method, headers, and payload (in its extended form). You can, of course, use session_id together with always_enqueue and it will work, but that feels like a workaround to me. I believe we should include the session_id in the extended unique_key computation.
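A rough sketch of what including the session in the extended key could look like is shown below. The function name `extended_unique_key`, the helper `_short_hash`, and the key layout are assumptions made for illustration and do not reproduce Crawlee's actual `compute_unique_key`. The only detail taken from this PR is that a missing session is normalized to an empty string and a present one is lowercased.

```python
from __future__ import annotations

import hashlib


def _short_hash(data: bytes) -> str:
    """Hypothetical helper: a short, stable digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()[:8]


def extended_unique_key(
    url: str,
    method: str = 'GET',
    headers: dict[str, str] | None = None,
    payload: bytes | None = None,
    session_id: str | None = None,
) -> str:
    """Build a deduplication key from URL, method, headers, payload, and session."""
    # Sort and lowercase header names so that header order and casing do not matter.
    normalized_headers = sorted((name.lower(), value) for name, value in (headers or {}).items())
    headers_hash = _short_hash(repr(normalized_headers).encode())
    payload_hash = _short_hash(payload or b'')
    # Mirrors the normalization in the diff under review: empty string when unbound.
    normalized_session = '' if session_id is None else session_id.lower()
    return f'{method.upper()}|{headers_hash}|{payload_hash}|{normalized_session}|{url}'


key_a = extended_unique_key('https://example.com/a', session_id='session-a')
key_b = extended_unique_key('https://example.com/a', session_id='session-b')
key_unbound = extended_unique_key('https://example.com/a')
```

With the session folded into the key, the same URL bound to two different sessions no longer collides, so re-enqueueing under a new session would not need always_enqueue.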
Some minor nits, but it's in good shape overall
docs/guides/code_examples/session_management/multi_sessions_http.py
src/crawlee/_utils/requests.py
@@ -114,9 +116,13 @@ def compute_unique_key(
    if use_extended_unique_key:
        payload_hash = _get_payload_hash(payload)
        headers_hash = _get_headers_hash(headers)
        normilizead_session = '' if session_id is None else session_id.lower()
typo
@@ -1226,3 +1250,18 @@ def _raise_for_session_blocked_status_code(self, session: Session | None, status
            ignore_http_error_status_codes=self._ignore_http_error_status_codes,
        ):
            raise SessionError(f'Assuming the session is blocked based on HTTP status code {status_code}')

    def _raise_request_collision(self, request: Request, session: Session | None) -> None:
I'd prefer to call this check_request_collision or something like that. Now the name sounds like it's always going to raise an error, which is not the case.
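To make the naming point concrete, here is a minimal sketch of a check that raises only in the collision case. The classes `SketchRequest` and `SketchSession`, the exception `RequestCollisionError` as defined here, and the exact condition are assumptions for illustration; the method body is not shown in this thread, so the condition is inferred from the PR description (a bound Request whose Session is not available).

```python
from __future__ import annotations

from dataclasses import dataclass


class RequestCollisionError(Exception):
    """Hypothetical error for a request bound to a session that is unavailable."""


@dataclass
class SketchSession:
    """Hypothetical stand-in for a session; only the id matters here."""

    id: str


@dataclass
class SketchRequest:
    """Hypothetical stand-in for a request that may be bound to a session."""

    url: str
    session_id: str | None = None


def check_request_collision(request: SketchRequest, session: SketchSession | None) -> None:
    """Raise only when the request is bound to a session that could not be obtained."""
    if request.session_id is not None and session is None:
        raise RequestCollisionError(
            f'Session {request.session_id!r} bound to {request.url} is not available'
        )


# Unbound request: passes without a session.
check_request_collision(SketchRequest('https://example.com'), None)
# Bound request with its session present: passes.
check_request_collision(SketchRequest('https://example.com', 'user-a'), SketchSession('user-a'))

# Bound request whose session is missing: the only case that raises.
try:
    check_request_collision(SketchRequest('https://example.com', 'user-a'), None)
    collision_raised = False
except RequestCollisionError:
    collision_raised = True
```

Two of the three calls return normally, which is why a `check_` prefix describes the behavior better than `raise_`.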
Description
Adds the ability to bind a Request to a specific Session. If the Session is not available in the SessionPool, an error will be raised for the Request, which can be handled in the failed_request_handler.

Issues

Testing
Added tests to verify functionality:
failed_request_handler