
NUTCH-3087 BasicURLNormalizer to keep userinfo for protocols which might require it #845


sebastian-nagel
Contributor

@sebastian-nagel sebastian-nagel commented Dec 4, 2024

  • strip the userinfo from the authority only for HTTP and HTTPS
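To illustrate the rule in this description, here is a minimal sketch. It is not the code from the patch: the class and method names are hypothetical, and it relies on `java.net.URI` rather than on the normalizer's own parsing. It assumes the URL parses as a server-based URI and returns the input unchanged otherwise.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical illustration only, not the BasicURLNormalizer implementation.
final class UserinfoSketch {

  /** Drops "user:password@" from the authority, but only for http and https URLs. */
  static String stripUserinfoForHttp(String url) {
    URI uri;
    try {
      uri = new URI(url);
    } catch (URISyntaxException e) {
      return url; // not parseable: leave untouched
    }
    String scheme = uri.getScheme();
    String host = uri.getHost();
    if (scheme == null || host == null || uri.getRawUserInfo() == null) {
      return url; // nothing to strip
    }
    String lower = scheme.toLowerCase(Locale.ROOT);
    if (!lower.equals("http") && !lower.equals("https")) {
      return url; // other protocols (e.g. ftp) may need the userinfo
    }
    // Rebuild the URL from its raw (still percent-encoded) components,
    // leaving out the userinfo.
    StringBuilder sb = new StringBuilder(scheme).append("://").append(host);
    if (uri.getPort() != -1) {
      sb.append(':').append(uri.getPort());
    }
    if (uri.getRawPath() != null) {
      sb.append(uri.getRawPath());
    }
    if (uri.getRawQuery() != null) {
      sb.append('?').append(uri.getRawQuery());
    }
    if (uri.getRawFragment() != null) {
      sb.append('#').append(uri.getRawFragment());
    }
    return sb.toString();
  }
}
```

With this sketch, an `http://` or `https://` URL loses its `user:pass@` part, while an `ftp://` URL with credentials is returned as given.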

@sebastian-nagel sebastian-nagel changed the title NUTCH-3087 BasicURLNormalizer to keep userinfo for protocols which mi…ght require it NUTCH-3087 BasicURLNormalizer to keep userinfo for protocols which might require it Dec 4, 2024
@HiranChaudhuri
Contributor

Does it make sense to decide whether to strip authority data based on the protocol? I acknowledge that most users want to scan the internet anonymously. But intranets or users interested to index 'their' content, be it on local or remote servers will need authority data to be preserved while they have no control over the protocol. Thus I suspect it may sometimes be required even when HTTPS is used.

How about making it configurable, maybe via a regexp? This would allow Nutch users to define the protocol, the site, or ... where to preserve the authority.
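A rough sketch of what such a regexp-based option could look like. This is purely hypothetical: no such setting is part of this pull request, and the class name, constructor argument, and matching rule are assumptions made for illustration. The pattern is matched against the URL without its userinfo, so that it can refer to protocol and host directly.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Pattern;

// Hypothetical illustration of the proposal, not an existing Nutch feature.
final class ConfigurableUserinfoSketch {

  // Assumed setting: URLs matching this pattern keep their userinfo.
  private final Pattern keepUserinfo;

  ConfigurableUserinfoSketch(String keepUserinfoRegex) {
    this.keepUserinfo = Pattern.compile(keepUserinfoRegex);
  }

  String normalize(String url) {
    URI uri;
    try {
      uri = new URI(url);
    } catch (URISyntaxException e) {
      return url; // not parseable: leave untouched
    }
    if (uri.getScheme() == null || uri.getHost() == null
        || uri.getRawUserInfo() == null) {
      return url; // nothing to strip
    }
    // Build the candidate URL without userinfo from the raw components.
    StringBuilder sb =
        new StringBuilder(uri.getScheme()).append("://").append(uri.getHost());
    if (uri.getPort() != -1) {
      sb.append(':').append(uri.getPort());
    }
    if (uri.getRawPath() != null) {
      sb.append(uri.getRawPath());
    }
    if (uri.getRawQuery() != null) {
      sb.append('?').append(uri.getRawQuery());
    }
    if (uri.getRawFragment() != null) {
      sb.append('#').append(uri.getRawFragment());
    }
    String stripped = sb.toString();
    // Keep the original URL when the pattern matches, strip otherwise.
    return keepUserinfo.matcher(stripped).find() ? url : stripped;
  }
}
```

For example, constructing the sketch with `^https://intranet\.example\.org/` (a made-up host) would preserve credentials for that site only and strip them everywhere else, regardless of protocol.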

@sebastian-nagel
Contributor Author

intranets or users interested to index 'their' content, be it on local or remote servers will need authority data to be preserved

@HiranChaudhuri, I understand your argument. Thanks!

However, let's keep it simple here. If the authority parts are really required, the simple solution would be to disable the basic URL normalizer by removing the plugin from plugin.includes. When crawling an intranet or a specific site, strict URL normalization is less of a requirement than for a broad web crawl. Of course, this means you need a separate configuration for the intranet crawl. But in my experience, this is often necessary anyway because of other specific configuration options, e.g. a different revisit schedule.

@sebastian-nagel sebastian-nagel merged commit a077ffc into apache:master Jul 9, 2025
4 checks passed