I updated the robots.txt in #334. Unfortunately, we still see a sizable number of crawlers stuck because of two issues (see also #335). One issue is that URLs may contain arbitrary prefixes in their path; e.g., http://openml.org/not-really-something-we-want/d/151 will gladly redirect to the dataset page instead of returning a 404. As I understand it, this means crawlers will happily crawl these pages (in any case, crawlers do visit pages with prefixes that don't do anything). I am hoping/assuming that disallowing these arbitrary prefixes will significantly reduce traffic, since there are fewer URLs to explore.
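One way to express "disallow arbitrary prefixes" in robots.txt is an allow-list: disallow everything, then explicitly allow the known top-level entity paths. A minimal sketch (the `/d/` and `/t/` prefixes here are illustrative, not necessarily OpenML's actual routes, and note that `Allow` directives are honored by major crawlers like Googlebot but are not part of the original robots.txt convention):

```text
User-agent: *
Disallow: /
# Allow only the known entity routes (hypothetical prefixes)
Allow: /d/
Allow: /t/
```

Longest-match precedence means `Allow: /d/` overrides `Disallow: /` for `/d/151`, while `/not-really-something-we-want/d/151` stays disallowed because no `Allow` rule matches it from the start of the path.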
I am also not sure why crawlers try to crawl these pages in the first place; that's probably a separate issue to figure out.
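The more robust fix is server-side: only redirect paths that exactly match a known route, and 404 everything else. A minimal sketch of that strict matching, assuming hypothetical `/d/<id>` and `/t/<id>` routes (the real OpenML routing may differ):

```python
import re

# Hypothetical allow-list of exact entity routes; anything that does not
# match from the start of the path should get a 404 instead of a redirect.
VALID_PATH = re.compile(r"^/(d|t)/\d+$")

def should_redirect(path: str) -> bool:
    """Return True only for exact, known entity paths."""
    return bool(VALID_PATH.match(path))

print(should_redirect("/d/151"))                               # True
print(should_redirect("/not-really-something-we-want/d/151"))  # False
```

Anchoring the pattern with `^` and `$` is the key point: a substring match would still accept the arbitrary-prefix URLs that the crawlers are getting stuck on.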