Initially, the news crawler was seeded with URLs from news sites listed in DMOZ; see #8 for the procedure. DMOZ is no longer updated, but Wikidata could serve as a replacement to complete the seed list:
Wikidata-based seed URLs will probably require significant deduplication, filtering, reranking, etc., but here's a version of the query that adds the language of the URL, to account for sites like Blick that have different base URLs for different languages. It also expands the language list (because `*` doesn't work), though it could be generalized further. As an example of the kind of filtering needed, the Hubei Daily item has three URLs: a corporate site, an e-paper, and a 404.
Query
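The actual query is collapsed above; as a rough illustration of its general shape, a Wikidata query along these lines pulls the homepage from the official-website statement and reads the URL's language from a qualifier on that statement. The class and property IDs below (Q11032, P856, P407, P220) are assumptions about the modelling, not necessarily the ones the real query uses:

```sparql
# Rough sketch only -- not the collapsed query above.
# Assumes news outlets are instances of (subclasses of) newspaper (Q11032),
# with the homepage on "official website" (P856) and the URL's language
# attached as a "language of work or name" (P407) qualifier,
# mapped to its ISO 639-3 code (P220).
SELECT ?item ?itemLabel ?url ?langCode WHERE {
  ?item wdt:P31/wdt:P279* wd:Q11032 .
  ?item p:P856 ?websiteStatement .
  ?websiteStatement ps:P856 ?url .
  OPTIONAL {
    ?websiteStatement pq:P407 ?lang .
    ?lang wdt:P220 ?langCode .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
```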
As of today, there are 11,177 results. More than 200 languages are represented, plus a couple of thousand sites with no language tag, and the distribution looks about like what you'd expect (the two-letter codes are TLDs, not language codes, e.g. hk, ru, uk, de, au, cn):
eng 3562
fra 826
spa 586
rus 467
deu 316
ita 177
ara 168
ukr 166
fin 152
zho 146
jpn 145
swe 140
nor 122
hk 112
ru 112
por 108
hun 103
nld 93
uk 90
de 86
kor 86
au 78
cn 78
pol 66
hin 60
bel 59
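As an example of the per-item deduplication mentioned above, collapsing multiple official-website statements (as on the Hubei Daily item) down to one URL per item can be done in the query itself, at the cost of picking an arbitrary survivor. This is only a sketch, again assuming the Q11032/P856 modelling:

```sparql
# Keep a single URL per item; SAMPLE picks an arbitrary one of the
# official-website statements, so an item like Hubei Daily stops
# contributing a corporate site, an e-paper and a dead link all at once.
SELECT ?item (SAMPLE(?url) AS ?seedUrl) WHERE {
  ?item wdt:P31/wdt:P279* wd:Q11032 ;
        wdt:P856 ?url .
}
GROUP BY ?item
```

SAMPLE is crude; preferring the statement with the higher rank, or the URL whose language matches the outlet, would be better, but that's probably a post-processing job rather than something to push into the query.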