Improved handling of SSO-based crawling #828

ikreymer · 2025-05-05T17:54:04Z

If a seed page redirects and then redirects back, it is likely performing some sort of SSO check, and the initial capture should not be included.
Also add a delay after the initial redirect, in case the page redirects back.
Don't store resources with no mime type, as these are likely from SSO.
If a page response has no mime type or is non-HTML, allow it to be captured again (possibly an SSO 403 error that may work on a second try).
Don't treat pages with no mime type as HTML.
Don't add --lang when using profiles, as it may interfere with the language expected in the profile.
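
For illustration, the "redirects away, then redirects back" check described above could be sketched roughly as follows. This is not the code from this PR: `normalizeUrl`, `redirectedBackToSeed`, and the idea of passing the visited URLs as an ordered `chain` array are all assumptions made for the example.

```typescript
// Hypothetical helper (not from the PR): compare URLs without their fragment.
function normalizeUrl(url: string): string {
  const u = new URL(url);
  u.hash = "";
  return u.href;
}

// Hypothetical helper (not from the PR): returns true if the navigation
// left the seed URL and later arrived back at it, which suggests an
// SSO-style round trip. `chain` is the ordered list of URLs visited.
function redirectedBackToSeed(seedUrl: string, chain: string[]): boolean {
  const seed = normalizeUrl(seedUrl);
  let leftSeed = false;
  for (const url of chain) {
    const current = normalizeUrl(url);
    if (current !== seed) {
      // Navigation moved away from the seed (e.g. to an SSO endpoint).
      leftSeed = true;
    } else if (leftSeed) {
      // Back on the seed after having left it.
      return true;
    }
  }
  return false;
}
```

When such a check returns true, the first capture of the seed would be the one to discard, with the original seed still used for link extraction.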

- if a seed page redirects to another page, and then back (such as for sso), ensure original seed is used for link extraction
- don't allow direct fetch if no mime type at all
- don't add --lang if using profile, display warning, as language override may invalidate profile settings
- add temp extra delay if seed page redirects, to ensure any sso-related redirects finish
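
As a rough sketch of the `--lang` behavior in the list above, assuming a hypothetical `LaunchOptions` shape and `languageArgs` helper (neither is taken from the crawler's code), the language flag could be skipped with a warning whenever a profile is supplied:

```typescript
// Hypothetical options shape (not from the PR).
interface LaunchOptions {
  lang?: string;
  profile?: string;
}

// Hypothetical helper (not from the PR): builds the browser language
// argument, omitting it and warning when a profile is in use.
function languageArgs(
  opts: LaunchOptions,
  warn: (msg: string) => void = console.warn,
): string[] {
  if (!opts.lang) {
    return [];
  }
  if (opts.profile) {
    // A language override may invalidate settings stored in the profile.
    warn(`Ignoring --lang=${opts.lang} because a profile is in use`);
    return [];
  }
  return [`--lang=${opts.lang}`];
}
```

Passing the warning function as a parameter keeps the sketch easy to test; a real implementation would presumably use the crawler's own logger.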


ikreymer added 3 commits May 5, 2025 10:11

direct fetch: don't allow direct fetch if no mime type provided

7f6d5d0

edge cases: check for page responses which are non-400 / or missing m…

34e1579

…ime, possibly a captcha/sso check, remove from dupe check to allow recapture
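
The mime-type handling in these two commits could look something like the sketch below. It follows the PR description (no mime type is never treated as HTML, direct fetch requires a mime type, and a page response with no mime type or a non-HTML one is removed from the dupe check). All names here (`PageResponse`, `isHTMLMime`, `allowDirectFetch`, `SeenPages`) are assumptions for illustration, and the exact condition used in the PR may differ.

```typescript
// Hypothetical response shape (not from the PR).
type PageResponse = { url: string; status: number; mime?: string };

// A missing mime type is never treated as HTML.
function isHTMLMime(mime?: string): boolean {
  if (!mime) {
    return false;
  }
  const base = (mime.split(";")[0] ?? "").trim().toLowerCase();
  return base === "text/html" || base === "application/xhtml+xml";
}

// Direct (non-browser) fetch only when a mime type is present and not HTML.
function allowDirectFetch(mime?: string): boolean {
  return !!mime && !isHTMLMime(mime);
}

// Hypothetical dupe tracker (not from the PR).
class SeenPages {
  private seen = new Set<string>();

  // Returns true if the URL was not seen before and is now marked as seen.
  markSeen(url: string): boolean {
    if (this.seen.has(url)) {
      return false;
    }
    this.seen.add(url);
    return true;
  }

  // If the page response has no mime type or is non-HTML (possibly a
  // captcha/SSO check), forget the URL so it can be captured again.
  recordPageResponse(resp: PageResponse): void {
    if (!isHTMLMime(resp.mime)) {
      this.seen.delete(resp.url);
    }
  }
}
```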

ikreymer requested a review from tw4l May 5, 2025 17:55
