News crawled from several popular Vietnamese news sources: Dantri, Tuoitre, Thanhnien, Vnexpress, Vtv, Vietnamnet
All data was preprocessed: duplicates removed, invisible spaces stripped, ...
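The exact preprocessing script is not described here, so the following is only a minimal sketch of what removing invisible spaces and duplicates might look like. The character set in `INVISIBLE`, the function names, and the `title` / `content` field names are assumptions made for illustration, not the corpus' actual pipeline.

```python
import re

# Assumed set of invisible characters: zero-width space/joiners, word joiner,
# byte-order mark, soft hyphen. The real preprocessing may use a different set.
INVISIBLE = re.compile("[\u200b\u200c\u200d\u2060\ufeff\u00ad]")


def clean_text(text):
    """Drop invisible characters and collapse whitespace runs to one space."""
    text = INVISIBLE.sub("", text)
    # For str patterns, \s also matches Unicode whitespace such as the
    # no-break space (U+00A0), so it is folded into a normal space here.
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(articles):
    """Keep the first article for each distinct cleaned (title, content) pair.

    `articles` is assumed to be an iterable of dicts with hypothetical
    "title" and "content" keys.
    """
    seen = set()
    unique = []
    for article in articles:
        key = (clean_text(article["title"]), clean_text(article["content"]))
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique
```

This only catches exact duplicates after cleaning; near-duplicates (the same story republished with small edits) would need a fuzzier comparison.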
MongoDB (all information: author, images, cover, ...): ~6GB uncompressed
Title and description only (classification): ~500MB uncompressed
Title, description, content tokenized (raw text): ~5GB uncompressed, ~1GB compressed
There is a larger news corpus by binhvq, drawn from different news sources, containing around 14 million articles (raw, not preprocessed); use that one if you need a lot of data
Binhvq news corpus