Improved handling of SSO-based crawling #828

Open: 3 commits to be merged into base: main

Conversation

ikreymer (Member) commented May 5, 2025

  • If a seed page redirects away and then redirects back, it is likely performing an SSO check, so the initial capture should not be included.
  • Add a delay after the initial redirect, in case the page redirects back.
  • Don't store resources that have no MIME type, as they are likely from an SSO flow.
  • If a page response has no MIME type or is not HTML, allow it to be captured again (it may be an SSO 403 error that succeeds on a second try).
  • Don't treat pages with no MIME type as HTML.
  • Don't add --lang when using profiles, as it may interfere with the language expected in the profile.
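
The "redirects, and then redirects back" check in the first bullet could be sketched roughly as follows. This is only an illustration, not the code in this PR: the function name `isRedirectRoundTrip` and the idea of passing the navigation chain as an array of URLs are assumptions made for the example.

```typescript
// Hypothetical helper, not the crawler's actual implementation.
// navigationChain is assumed to list, in order, every URL the page
// navigated to while loading the seed (including the seed itself).
function isRedirectRoundTrip(seedUrl: string, navigationChain: string[]): boolean {
  // Compare URLs without fragments so "#..." differences are ignored.
  const normalize = (u: string): string => {
    const parsed = new URL(u);
    parsed.hash = "";
    return parsed.href;
  };

  const seed = normalize(seedUrl);
  let leftSeed = false;

  for (const url of navigationChain) {
    if (normalize(url) !== seed) {
      // The page has navigated away from the seed (e.g. to an SSO host).
      leftSeed = true;
    } else if (leftSeed) {
      // Back on the seed after leaving it: likely an SSO-style round trip.
      return true;
    }
  }
  return false;
}
```

When this returns true, the first capture of the seed would be the one to discard, keeping only the capture made after the redirect back.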

ikreymer added 3 commits May 5, 2025 10:11
- if a seed page redirects to another page, and then back (such as for sso), ensure original seed is used for link extraction
- don't allow direct fetch if no mime type at all
- don't add --lang if using profile, display warning, as language override may invalidate profile settings
- add temp extra delay if seed page redirects, to ensure any sso-related redirects finish
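
As a rough sketch of the --lang / profile rule from the commit message above: the option names (`lang`, `profile`), the function `buildLangArgs`, and the warning text are all hypothetical and chosen for illustration; only the `--lang` browser flag itself comes from the PR.

```typescript
// Hypothetical option shape; the crawler's real option names may differ.
interface LaunchOptions {
  lang?: string;    // requested browser language, e.g. "en-US"
  profile?: string; // path or URL of a browser profile, if any
}

// Returns the extra browser arguments for the language setting.
// With a profile present, --lang is skipped and a warning is emitted,
// since overriding the language may invalidate the profile's settings.
function buildLangArgs(
  opts: LaunchOptions,
  warn: (msg: string) => void = console.warn,
): string[] {
  if (!opts.lang) {
    return [];
  }
  if (opts.profile) {
    warn("Ignoring language override because a profile is in use");
    return [];
  }
  return [`--lang=${opts.lang}`];
}
```

Passing the warning function as a parameter is just a convenience for testing; a real implementation would presumably use the project's own logger.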
…ime, possibly a captcha/sso check, remove from dupe check to allow recapture
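
The "remove from dupe check to allow recapture" idea, together with "don't consider pages with no mime type as HTML", might look something like the sketch below. It uses an in-memory `Set` as a stand-in for the duplicate check; the class `SeenPages` and the functions `isHtmlMime` and `handlePageResponse` are hypothetical names, and the crawler's real state handling is not shown in this PR description.

```typescript
// Illustrative in-memory stand-in for a duplicate-URL check.
class SeenPages {
  private seen = new Set<string>();

  // Returns true if the URL has not been seen yet and should be crawled.
  markSeen(url: string): boolean {
    if (this.seen.has(url)) {
      return false;
    }
    this.seen.add(url);
    return true;
  }

  // Forget a URL so that a later attempt is not rejected as a duplicate.
  allowRecapture(url: string): void {
    this.seen.delete(url);
  }
}

// A missing or empty MIME type is NOT treated as HTML.
function isHtmlMime(mime?: string): boolean {
  if (!mime) {
    return false;
  }
  const [base = ""] = mime.split(";");
  const type = base.trim().toLowerCase();
  return type === "text/html" || type === "application/xhtml+xml";
}

// After a page response: if it has no MIME type or is not HTML
// (possibly a captcha/SSO check), drop it from the duplicate check
// so the same URL can be captured again.
function handlePageResponse(seen: SeenPages, url: string, mime?: string): void {
  if (!isHtmlMime(mime)) {
    seen.allowRecapture(url);
  }
}
```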
ikreymer requested a review from tw4l May 5, 2025 17:55