Avi197/Vietnamese-news-corpus


A corpus of 2 million news articles for Vietnamese NLP tasks

News articles crawled from several popular Vietnamese news sources: Dantri, Tuoitre, Thanhnien, Vnexpress, Vtv, Vietnamnet.

All data has been preprocessed: duplicates, invisible whitespace characters, etc. were removed.
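
The exact preprocessing pipeline is not described here, so the following is only a minimal sketch of what this kind of cleanup could look like. The set of invisible characters, the NFC normalization step, and the `title`/`content` field names are assumptions made for illustration, not the author's actual method.

```python
import re
import unicodedata

# Assumed set of invisible characters often found in crawled text:
# zero-width space/joiners, word joiner, BOM, and soft hyphen.
_INVISIBLE = re.compile("[\u200b\u200c\u200d\u2060\ufeff\u00ad]")
# \s on str patterns matches Unicode whitespace, including no-break space.
_WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Normalize Unicode, drop invisible characters, collapse whitespace."""
    # NFC composes Vietnamese diacritics into a single canonical form (assumption).
    text = unicodedata.normalize("NFC", text)
    text = _INVISIBLE.sub("", text)
    return _WHITESPACE.sub(" ", text).strip()


def deduplicate(articles):
    """Keep the first article for each distinct (title, content) pair.

    `articles` is an iterable of dicts with hypothetical "title" and
    "content" keys; comparison is done on the cleaned text.
    """
    seen = set()
    unique = []
    for article in articles:
        key = (clean_text(article["title"]), clean_text(article["content"]))
        if key in seen:
            continue
        seen.add(key)
        unique.append({**article, "title": key[0], "content": key[1]})
    return unique
```

Comparing on cleaned text means two copies of an article that differ only by stray zero-width characters or spacing are treated as the same item.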

MongoDB (all information: author, images, cover, etc.): ~6GB uncompressed
Download

MongoDB demo
dantri demo.png detail demo.png

Title and description only (for classification): ~500MB uncompressed
Download

Raw text
non tokenized demo.png

Tokenized text
tokenized demo.png

Title, description, content tokenized (raw text): ~5GB uncompressed, ~1GB compressed
Download

There is a larger news corpus by binhvq, drawn from different news sources, that contains around 14 million news articles (raw, not preprocessed). Use that one if you need more data.
Binhvq news corpus
