I updated the robots.txt in #334. Unfortunately, we still see a sizable number of crawlers stuck because of two issues (see also #335). One issue is that URLs may contain arbitrary prefixes in their path; e.g., http://openml.org/not-really-something-we-want/d/151 will gladly redirect to the dataset page instead of returning a 404. As I understand it, this means crawlers will happily crawl these pages (in any case, crawlers do visit pages with prefixes that don't do anything). I am hoping/assuming that disallowing these arbitrary prefixes will significantly reduce traffic, since there are fewer URLs to explore.
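One way to express "disallow arbitrary prefixes" in robots.txt is an allow-list: disallow everything, then explicitly allow the known top-level entity paths. A minimal sketch (the `/d/` and `/t/` prefixes here are illustrative, not necessarily OpenML's actual routes, and note that `Allow` directives are honored by major crawlers like Googlebot but are not part of the original robots.txt convention):

```text
User-agent: *
Disallow: /
# Allow only the known entity routes (hypothetical prefixes)
Allow: /d/
Allow: /t/
```

Longest-match precedence means `Allow: /d/` overrides `Disallow: /` for `/d/151`, while `/not-really-something-we-want/d/151` stays disallowed because no `Allow` rule matches it from the start of the path.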
I am also not sure why crawlers try to crawl these pages in the first place; that's probably a separate issue to figure out.
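The more robust fix is server-side: only redirect paths that exactly match a known route, and 404 everything else. A minimal sketch of that strict matching, assuming hypothetical `/d/<id>` and `/t/<id>` routes (the real OpenML routing may differ):

```python
import re

# Hypothetical allow-list of exact entity routes; anything that does not
# match from the start of the path should get a 404 instead of a redirect.
VALID_PATH = re.compile(r"^/(d|t)/\d+$")

def should_redirect(path: str) -> bool:
    """Return True only for exact, known entity paths."""
    return bool(VALID_PATH.match(path))

print(should_redirect("/d/151"))                               # True
print(should_redirect("/not-really-something-we-want/d/151"))  # False
```

Anchoring the pattern with `^` and `$` is the key point: a substring match would still accept the arbitrary-prefix URLs that the crawlers are getting stuck on.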