Don't allow arbitrary prefixes to our paths #336

Open
PGijsbers opened this issue Jul 10, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@PGijsbers
Contributor

I updated the robots.txt in #334. Unfortunately, we still see a sizable number of crawlers getting stuck because of two issues (see also #335). One issue is that URLs may contain arbitrary prefixes in their path, e.g. http://openml.org/not-really-something-we-want/d/151 will happily redirect to the dataset page instead of returning a 404 page. As I understand it, this means crawlers treat these as valid pages and crawl them (in any case, crawlers do visit pages with prefixes that don't do anything). I am hoping/assuming that disallowing these arbitrary prefixes will significantly reduce traffic, as there are fewer URLs to explore.
I am also not sure why crawlers try to crawl these pages in the first place; that's probably a separate issue to figure out.
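
To make the intended behaviour concrete, here is a minimal sketch of strict path matching. It is not the actual OpenML routing code. It assumes routes can be written as regular expressions matched against the whole path. The `ROUTES` table and the `resolve` function are hypothetical names, and only the `/d/<id>` route comes from the example above.

```python
import re

# Hypothetical route table: only "/d/<id>" is taken from the example URL;
# a real deployment would list every supported route here.
ROUTES = {
    "dataset": re.compile(r"/d/(?P<id>\d+)"),
}


def resolve(path: str):
    """Return (status, route_name, params) for a request path.

    fullmatch() requires the pattern to cover the entire path, so any
    extra leading segment results in a 404 instead of a redirect.
    """
    for name, pattern in ROUTES.items():
        match = pattern.fullmatch(path)
        if match:
            return 200, name, match.groupdict()
    return 404, None, {}


print(resolve("/d/151"))
print(resolve("/not-really-something-we-want/d/151"))
```

With anchored matching like this, the prefixed URL would no longer resolve to the dataset page, so crawlers would have no valid target to follow.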

@PGijsbers PGijsbers added the bug Something isn't working label Jul 10, 2024