NUTCH-3087 BasicURLNormalizer to keep userinfo for protocols which might require it #845

sebastian-nagel · 2024-12-04T19:31:59Z

Strip the userinfo from the authority only for HTTP and HTTPS.
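The behavior described above can be sketched roughly as follows. This is a minimal illustration built on `java.net.URI`, not the code from this pull request; the class name `UserinfoSketch` and the method `stripUserinfoForHttp` are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical sketch, not the actual BasicURLNormalizer implementation.
final class UserinfoSketch {

    // Removes "user:password@" from the authority, but only for http and https URLs.
    static String stripUserinfoForHttp(String url) {
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return url; // assumption: leave unparsable URLs untouched
        }
        String scheme = uri.getScheme();
        if (scheme == null || uri.getRawUserInfo() == null) {
            return url; // nothing to strip
        }
        String lower = scheme.toLowerCase(Locale.ROOT);
        if (!lower.equals("http") && !lower.equals("https")) {
            return url; // other protocols (e.g. ftp) may need the userinfo
        }
        // Rebuild the URL from its raw components, leaving out the userinfo.
        StringBuilder sb = new StringBuilder();
        sb.append(scheme).append("://").append(uri.getHost());
        if (uri.getPort() != -1) {
            sb.append(':').append(uri.getPort());
        }
        if (uri.getRawPath() != null) {
            sb.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            sb.append('#').append(uri.getRawFragment());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints https://example.org/a?b=1
        System.out.println(stripUserinfoForHttp("https://user:secret@example.org/a?b=1"));
        // prints ftp://user:secret@ftp.example.org/file.txt
        System.out.println(stripUserinfoForHttp("ftp://user:secret@ftp.example.org/file.txt"));
    }
}
```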


HiranChaudhuri · 2024-12-06T06:33:39Z

Does it make sense to decide whether to strip authority data based on the protocol? I acknowledge that most users want to scan the internet anonymously. But intranets or users interested to index 'their' content, be it on local or remote servers will need authority data to be preserved, and they have no control over the protocol. Thus I suspect that preserving it may sometimes be required even when HTTPS is used.

How about making it configurable, maybe via a regexp? This would allow Nutch users to define the protocol, the site, or whatever else determines where to preserve the authority data.
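To make the suggestion concrete, here is one way a regexp-driven variant could look. This is purely a sketch of the idea: the class `ConfigurableUserinfoSketch`, its constructor parameter, and the choice to match the pattern against the URL with the userinfo already removed are all assumptions, and no such option exists in Nutch as far as this thread shows.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Pattern;

// Hypothetical sketch of a configurable rule; not part of Nutch.
final class ConfigurableUserinfoSketch {

    // URLs whose userinfo-free form matches this pattern keep their userinfo.
    private final Pattern keepUserinfo;

    ConfigurableUserinfoSketch(String keepRegex) {
        this.keepUserinfo = Pattern.compile(keepRegex);
    }

    String normalize(String url) {
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return url; // assumption: leave unparsable URLs untouched
        }
        if (uri.getScheme() == null || uri.getRawUserInfo() == null) {
            return url; // nothing to strip
        }
        // Build the URL without userinfo first, so that patterns can be
        // written against scheme and host without knowing any credentials.
        StringBuilder sb = new StringBuilder();
        sb.append(uri.getScheme()).append("://").append(uri.getHost());
        if (uri.getPort() != -1) {
            sb.append(':').append(uri.getPort());
        }
        if (uri.getRawPath() != null) {
            sb.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            sb.append('#').append(uri.getRawFragment());
        }
        String stripped = sb.toString();
        // Keep the original URL when the configured pattern matches.
        return keepUserinfo.matcher(stripped).find() ? url : stripped;
    }

    public static void main(String[] args) {
        ConfigurableUserinfoSketch n = new ConfigurableUserinfoSketch(
            "^(ftp|sftp)://|^https://intranet\\.example\\.com/");
        System.out.println(n.normalize("https://user:pw@intranet.example.com/docs"));
        System.out.println(n.normalize("https://user:pw@www.example.org/"));
    }
}
```

With the pattern shown in `main`, userinfo would be kept for FTP and SFTP URLs and for the one named intranet host, and removed everywhere else.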

sebastian-nagel · 2025-07-09T21:18:57Z

> intranets or users interested to index 'their' content, be it on local or remote servers will need authority data to be preserved

@HiranChaudhuri, I understand your argument. Thanks!

However, let's keep it simple here. If the authority parts are really required, the simple solution would be to disable the basic URL normalizer by removing the plugin from the `plugin.includes` property. When crawling an intranet or a specific site, strict URL normalization is less of a requirement than for a broad web crawl. Of course, this means you need a separate configuration for the intranet crawl. But in my experience, this is often already necessary because of other specific configuration options, e.g. a different revisit schedule.

NUTCH-3087 BasicURLNormalizer to keep userinfo for protocols which mi…

df115cb

…ght require it - strip the userinfo from the authority only for HTTP and HTTPS

sebastian-nagel changed the title ~~NUTCH-3087 BasicURLNormalizer to keep userinfo for protocols which mi…ght require it~~ NUTCH-3087 BasicURLNormalizer to keep userinfo for protocols which might require it Dec 4, 2024

sebastian-nagel merged commit a077ffc into apache:master Jul 9, 2025
4 checks passed
