feat: add Session binding capability via `session_id` in `Request` #1086

Mantisus · 2025-03-14T02:07:09Z

Description

Add strict binding of a Request to a specific Session. If the Session is not available in the SessionPool, an error will be raised for the Request which can be handled in the failed_request_handler.

Issues

Closes: Add support for request to use a specific session #1076

Testing

Added tests to verify functionality:

Binding to a valid session
Binding to a non-existent session
Catching error in failed_request_handler

Pijukatel

Nice, I have just two small comments.

src/crawlee/_request.py

tests/unit/crawlers/_basic/test_basic_crawler.py

src/crawlee/_request.py

vdusek

Could we please cover this in the docs? 🙏 Maybe Session management? Or find a better place. Thanks.

vdusek

That was easier than I thought, thanks, and good job! 🙂

src/crawlee/_request.py

src/crawlee/crawlers/_basic/_basic_crawler.py

barjin · 2025-03-24T14:10:05Z

I like the idea, thank you for including me!

Few concerns / ideas:

Requests are split between RQ and SessionPool

I am a bit wary of this decentralized state - the request is now effectively split between the RequestQueue (URL, headers, body) and SessionPool (Cookie header specifically). Granted, this divide was there before, but users couldn't rely on this, so the cookies couldn't have been considered required for making the request. Not sure what the better solution would be, though.

Better DX

Do we support passing session ID to requests added by enqueue_links (or other Crawlee-native methods)?

Ideally, I'd like to do something like this:

request_handler(context):
    ... 
    # I'm logged in as user A in the current request
    context.enqueue_links(session_id=context.session_id) # The crawler will visit all the child links as user A

Unstable proxy?

Maybe I'm thinking about this too much, but some proxy errors can cause a session to get retired (as ProxyError is a descendant of SessionError). Would one proxy hiccup (Apify proxies are afaik quite flaky) cause all the requests bound to the same session to fail? I do agree with @Mantisus 's reasoning (fail request on a missing session), but it still sounds like a very strict behavior (maybe that's what the users want, really).

I'm sorry to provide a fragmentary review like this, I'm sure you Python guys have thought of everything else :)

Mantisus · 2025-03-24T14:49:07Z

Do we support passing session ID to requests added by enqueue_links (or other Crawlee-native methods)?

No. Since using session_id requires additional SessionPool configuration (at least increasing the number of times a session is used, with extensive crawling). I would prefer not to give access to it via enqueue_links, this is for users who need more control. 🙂

Requests are split between RQ and SessionPool

I understand your concerns. On the other hand, if we are talking about Session, then Cookie is a logical part of that entity.

If the user doesn't want to rely on Session, they can still pass Cookie as part of the Request headers

Would one proxy hiccup (Apify proxies are afaik quite flaky) cause all the requests bound to the same session to fail?

If these are proxy errors related to timeouts and connections, the user should configure SessionPool so that the session does not die after a few errors 🙂

but it still sounds like a very strict behavior

It is exactly what I expected, in the cases for which I think it can be used

janbuchar · 2025-03-24T16:20:21Z

src/crawlee/_request.py

@@ -119,6 +123,7 @@ class RequestOptions(TypedDict):
    headers: NotRequired[HttpHeaders | dict[str, str] | None]
    payload: NotRequired[HttpPayload | str | None]
    label: NotRequired[str | None]
+    session_id: NotRequired[str | None]


Shouldn't session_id be used for unique_key computation? I expect that users might get hindered by deduplication if they try to re-enqueue a failed request with a different session.

CC @vdusek - you wrote a big part of the unique key functionality.

Yes, deduplication will affect this.

But I expect that users will use existing mechanisms to return a Request to the Queue avoiding deduplication. By passing either unique_key or always_enqueue=True.

Shouldn't session_id be used for unique_key computation? I expect that users might get hindered by deduplication if they try to re-enqueue a failed request with a different session.

Good point! Currently, it infers the unique_key from the URL, method, headers, and payload (in its extended form). You can, of course, use session_id together with always_enqueue and it will work, but that feels like a workaround to me. I believe we should include the session_id in the extended unique_key computation.

janbuchar

Some minor nits, but it's in good shape overall

docs/guides/code_examples/session_management/multi_sessions_http.py

janbuchar · 2025-03-31T10:53:42Z

src/crawlee/_utils/requests.py

@@ -114,9 +116,13 @@ def compute_unique_key(
    if use_extended_unique_key:
        payload_hash = _get_payload_hash(payload)
        headers_hash = _get_headers_hash(headers)
+        normilizead_session = '' if session_id is None else session_id.lower()


janbuchar · 2025-03-31T11:00:28Z

src/crawlee/crawlers/_basic/_basic_crawler.py

@@ -1226,3 +1250,18 @@ def _raise_for_session_blocked_status_code(self, session: Session | None, status
            ignore_http_error_status_codes=self._ignore_http_error_status_codes,
        ):
            raise SessionError(f'Assuming the session is blocked based on HTTP status code {status_code}')
+
+    def _raise_request_collision(self, request: Request, session: Session | None) -> None:


I'd prefer to call this check_request_collision or something like that. Now the name sounds like it's always going to raise an error, which is not the case.

…tp.py Co-authored-by: Jan Buchar <[email protected]>

src/crawlee/_utils/requests.py

Mantisus added 3 commits March 13, 2025 23:30

basic implementation for bind request to session

38b5531

update docs

08a7d01

add tests

a67d3b3

Mantisus requested review from vdusek, janbuchar and Pijukatel March 14, 2025 02:07

Pijukatel reviewed Mar 14, 2025

View reviewed changes

src/crawlee/_request.py Outdated Show resolved Hide resolved

tests/unit/crawlers/_basic/test_basic_crawler.py Outdated Show resolved Hide resolved

janbuchar reviewed Mar 14, 2025

View reviewed changes

src/crawlee/_request.py Outdated Show resolved Hide resolved

Mantisus added 2 commits March 14, 2025 12:20

move session_id to crawlee_data

039f12f

resolve

88c0e8c

Mantisus self-assigned this Mar 14, 2025

Pijukatel approved these changes Mar 14, 2025

View reviewed changes

Mantisus requested a review from janbuchar March 14, 2025 16:03

vdusek requested changes Mar 18, 2025

View reviewed changes

add docs and examples

7fb5fde

Mantisus requested a review from vdusek March 20, 2025 01:36

Polishment

300cf4f

vdusek approved these changes Mar 24, 2025

View reviewed changes

src/crawlee/_request.py Outdated Show resolved Hide resolved

src/crawlee/_request.py Outdated Show resolved Hide resolved

src/crawlee/crawlers/_basic/_basic_crawler.py Outdated Show resolved Hide resolved

vdusek added 3 commits March 24, 2025 13:30

Update src/crawlee/_request.py

6037829

Update src/crawlee/_request.py

5e94912

Update src/crawlee/crawlers/_basic/_basic_crawler.py

25a21a0

janbuchar reviewed Mar 24, 2025

View reviewed changes

src/crawlee/crawlers/_basic/_basic_crawler.py Show resolved Hide resolved

janbuchar requested a review from barjin March 24, 2025 13:09

janbuchar reviewed Mar 24, 2025

View reviewed changes

Mantisus added 3 commits March 27, 2025 13:18

add session_id in extended_unique_key

14b8974

resolve

d633930

remove empty part from extended if session_id empty

b5fc9a4

Mantisus requested a review from janbuchar March 27, 2025 13:38

janbuchar reviewed Mar 31, 2025

View reviewed changes

Mantisus and others added 3 commits March 31, 2025 15:14

Update docs/guides/code_examples/session_management/multi_sessions_ht…

1a8cd22

…tp.py Co-authored-by: Jan Buchar <[email protected]>

fix typo

b5d1dc5

rename _raise_request_collision to _check_request_collision

9557ac5

janbuchar reviewed Mar 31, 2025

View reviewed changes

src/crawlee/_utils/requests.py Outdated Show resolved Hide resolved

fix typo

448f9e6

Mantisus requested a review from janbuchar March 31, 2025 13:28

janbuchar approved these changes Mar 31, 2025

View reviewed changes

vdusek merged commit cda7b31 into apify:master Mar 31, 2025
23 checks passed

feat: add Session binding capability via session_id in Request #1086

feat: add Session binding capability via session_id in Request #1086

Conversation

Mantisus commented Mar 14, 2025

Description

Issues

Testing

Uh oh!

Pijukatel left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

vdusek left a comment

Choose a reason for hiding this comment

Uh oh!

vdusek left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

barjin commented Mar 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Requests are split between RQ and SessionPool

Better DX

Unstable proxy?

Uh oh!

Mantisus commented Mar 24, 2025

Uh oh!

janbuchar Mar 24, 2025

Choose a reason for hiding this comment

Uh oh!

Mantisus Mar 25, 2025

Choose a reason for hiding this comment

Uh oh!

vdusek Mar 27, 2025

Choose a reason for hiding this comment

Uh oh!

janbuchar left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

janbuchar Mar 31, 2025

Choose a reason for hiding this comment

Uh oh!

janbuchar Mar 31, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

feat: add Session binding capability via `session_id` in `Request` #1086

feat: add Session binding capability via `session_id` in `Request` #1086

barjin commented Mar 24, 2025 •

edited

Loading