This project is a simple text classifier 📝 built on multilingual BERT. 🇬🇧 | 🇧🇷
Details of the BERT model used:
This project was inspired by this source:
Training and validation accuracy:
Note:
We reached 98% accuracy.
Requirements:
Note:
A machine with a GPU is not required, but it is recommended to accelerate training.
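The CPU fallback described above could be handled along these lines. This is a minimal sketch, not the project's own code: the `pick_device` helper is a hypothetical name, and it assumes training runs on PyTorch.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    # If PyTorch is not installed at all, there is nothing to accelerate.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    # torch.cuda.is_available() is False on machines without a usable GPU.
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Training would run on: {device}")
```

The returned string can be passed to `torch.device(...)` so the same script runs unchanged on both kinds of machine.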
The BBC dataset was used to classify texts into the following labels:
- Business 💼
- Entertainment 🎬
- Sport ⚽
- Tech 💻
- Politics 🏛️
The texts in the dataset were first translated into Brazilian Portuguese, using the Google Translator API. After that, the English 🇬🇧 and Portuguese 🇧🇷 texts were combined to create a multilingual version.
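The combination step can be sketched as follows. This is only an illustration under stated assumptions, not the project's actual preprocessing: `translate_to_pt` is a hypothetical stand-in for a call to a translation service, and the record layout (`text`, `label`, `lang`) is assumed.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def build_multilingual(
    english: List[Record], translate: Callable[[str], str]
) -> List[Record]:
    """Return the English records plus a Portuguese copy of each one."""
    combined: List[Record] = []
    for row in english:
        # Keep the original English example, tagged with its language.
        combined.append({"text": row["text"], "label": row["label"], "lang": "en"})
        # Add the translated example; translation does not change the label.
        combined.append(
            {"text": translate(row["text"]), "label": row["label"], "lang": "pt"}
        )
    return combined


def translate_to_pt(text: str) -> str:
    """Hypothetical placeholder: a real version would call a translation API."""
    return f"[pt] {text}"


sample = [{"text": "Shares rose sharply.", "label": "business"}]
dataset = build_multilingual(sample, translate_to_pt)
```

Each English example yields two rows with the same label, so the combined set is twice the size of the original and stays balanced across the two languages.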
Clone the project to your computer using Git and go to the project root folder:
git clone [email protected]:NeuroQuestAi/ml-text-classification.git && \
cd ml-text-classification
Activate the project's Poetry environment:
poetry shell
Install all dependencies:
poetry install && poetry update
Run the model training and evaluation:
./train
This generates the PyTorch model in the models folder. You can then test its predictions with the command:
./predictor
Output example:
Text | Lang | Prediction |
---|---|---|
Os negócios são o tecido vital da economia... | 🇧🇷 | BUSINESS |
A variedade de formas de entretenimento reflete... | 🇧🇷 | ENTERTAINMENT |
Os valores como fair play, respeito e camaradagem são... | 🇧🇷 | SPORT |
Desde a revolução digital até as últimas descobertas... | 🇧🇷 | TECH |
A política reflete as diferentes visões, valores... | 🇧🇷 | POLITICS |
Businesses are the lifeblood of the economy, where ideas... | 🇬🇧 | BUSINESS |
From thrilling movies to engaging games, and soul-touching... | 🇬🇧 | ENTERTAINMENT |
Sports are a universal passion that brings people... | 🇬🇧 | SPORT |
Artificial intelligence, cloud computing, the Internet... | 🇬🇧 | TECH |
Active citizen participation in political life... | 🇬🇧 | POLITICS |
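A classifier of this kind typically produces one raw score per class, which is then turned into one of the labels shown in the table. The sketch below illustrates that last step with a softmax and an argmax. The order of `LABELS` and the `to_label` helper are assumptions for illustration; the real index-to-label mapping depends on how the model was trained.

```python
import math
from typing import List, Tuple

# Assumed label order, for illustration only.
LABELS = ["BUSINESS", "ENTERTAINMENT", "SPORT", "TECH", "POLITICS"]


def softmax(scores: List[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def to_label(scores: List[float]) -> Tuple[str, float]:
    """Return the most probable label and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


label, confidence = to_label([0.1, 0.3, 4.2, 0.0, -1.0])
print(label, round(confidence, 3))
```

Returning the probability alongside the label makes it possible to flag low-confidence predictions instead of reporting them as certain.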
Note:
Model settings are in the config.json file.
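Reading such a file from Python might look like the sketch below. The keys `epochs` and `batch_size` and their default values are hypothetical examples; the real keys are whatever the project's config.json defines.

```python
import json
from pathlib import Path
from typing import Any, Dict

# Hypothetical defaults, used only when a key or the file is missing.
DEFAULTS: Dict[str, Any] = {"epochs": 3, "batch_size": 16}


def load_settings(path: str = "config.json") -> Dict[str, Any]:
    """Read JSON settings and fill any missing keys from DEFAULTS."""
    file = Path(path)
    loaded = json.loads(file.read_text(encoding="utf-8")) if file.exists() else {}
    # Values from the file override the defaults.
    return {**DEFAULTS, **loaded}
```

Merging over a dictionary of defaults keeps the script usable even when the file omits a setting.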