Fine-tuned BERT-base-uncased pre-trained model for Indonesian-English hate comments sentiment analysis

This is my first Natural Language Processing (NLP) project. I fine-tuned a bert-base-uncased model to classify hate comments in a bilingual (Indonesian-English) dataset. The project uses sentiment analysis to detect the toxic, offensive, or hateful language commonly found on social media and other online platforms.

TODO

✅ Uses bert-base-multilingual-uncased, a widely used multilingual model.
✅ Clean Dataset class for handling data.
✅ Uses Hugging Face's Trainer API for the training and evaluation loop.
✅ Includes training and evaluation splits.
✅ Saves the model and tokenizer.
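
The Dataset class from the checklist is not reproduced in this README, so the block below is only a minimal sketch of what a map-style dataset for this task could look like. The names `CommentDataset` and `toy_tokenizer`, the `max_length` default, and the 0/1 label values are all assumptions made for illustration. The class is kept dependency-free; in a real training script it would typically subclass `torch.utils.data.Dataset` and receive a Hugging Face tokenizer instead of the stand-in.

```python
from typing import Callable, Dict, List, Sequence


class CommentDataset:
    """Hypothetical map-style dataset: it implements __len__ and __getitem__."""

    def __init__(self, texts: Sequence[str], labels: Sequence[int],
                 tokenizer: Callable[..., Dict[str, List[int]]],
                 max_length: int = 128) -> None:
        if len(texts) != len(labels):
            raise ValueError("texts and labels must have the same length")
        self.texts = list(texts)
        self.labels = list(labels)
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self) -> int:
        return len(self.texts)

    def __getitem__(self, idx: int) -> Dict[str, object]:
        # Tokenize lazily, one comment at a time, padded to a fixed length.
        enc = self.tokenizer(
            self.texts[idx],
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
        )
        return {
            "input_ids": list(enc["input_ids"]),
            "attention_mask": list(enc["attention_mask"]),
            "labels": int(self.labels[idx]),
        }


def toy_tokenizer(text: str, truncation: bool = True,
                  padding: str = "max_length",
                  max_length: int = 8) -> Dict[str, List[int]]:
    """Stand-in tokenizer (assumption): fake ids are the token lengths."""
    ids = [len(token) for token in text.split()][:max_length]
    pad = max_length - len(ids)
    return {
        "input_ids": ids + [0] * pad,
        "attention_mask": [1] * len(ids) + [0] * pad,
    }


# Usage with two made-up comments and made-up labels.
dataset = CommentDataset(["terima kasih banyak", "this is bad"], [0, 1],
                         toy_tokenizer, max_length=4)
print(len(dataset), dataset[0])
```

Tokenizing inside `__getitem__` keeps memory use low for large CSV files; tokenizing everything once in `__init__` is the usual alternative when the dataset is small.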

✅ INSTALL REQUIREMENTS

Install the required dependencies:

pip install --upgrade pip
pip install -r requirements.txt

✅ ADD BERT virtual env

Run the commands below:

# ✅ Create and activate a virtual environment
python -m venv bert-env
source bert-env/bin/activate    # On Windows use: bert-env\Scripts\activate

✅ INSTALL CUDA

Check if your GPU supports CUDA:

nvidia-smi

Then install the PyTorch build for CUDA 12.1 and set the allocator configuration:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False    # On Windows use: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
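
The allocator setting can also be applied from Python instead of the shell. The sketch below assumes it runs at the very top of the training script, before PyTorch is imported, which is the safe order because the variable has to be in place before PyTorch makes its first CUDA allocation. Where exactly this would go in scripts/train.py is an assumption.

```python
import os

# Assumption: this is the first statement of the script, ahead of any torch import.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:False"

# import torch  # import PyTorch only after the variable has been set
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```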

🔧 HOW TO USE

Check your device and CUDA availability:

python check_device.py
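
The contents of check_device.py are not shown in this README. As a rough idea of what such a check can look like, here is a minimal sketch; the function name `describe_device` and the returned strings are hypothetical. It only uses `torch.cuda.is_available()` and `torch.cuda.get_device_name()`, and it falls back to a CPU message when PyTorch is not installed.

```python
import importlib.util


def describe_device() -> str:
    """Report whether training would run on CUDA or on the CPU."""
    if importlib.util.find_spec("torch") is None:
        return "cpu (PyTorch is not installed)"
    import torch  # imported lazily so the check itself never crashes

    if torch.cuda.is_available():
        # Index 0 is the first visible GPU.
        return f"cuda ({torch.cuda.get_device_name(0)})"
    return "cpu (CUDA is not available)"


print(describe_device())
```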

⚠️ Training on CPU is not advisable. Confirm that CUDA is available before you train.

Train the model:

python scripts/train.py

⚠️ After training, remove unneeded checkpoints in models/pretrained to save storage.
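
The cleanup can be scripted. The sketch below assumes the checkpoints are stored as `checkpoint-<step>` folders, which is how the Hugging Face Trainer names them, directly under models/pretrained. The helper name `prune_checkpoints` and its parameters are hypothetical. It defaults to a dry run so nothing is deleted until you ask for it.

```python
import re
import shutil
from pathlib import Path
from typing import List


def prune_checkpoints(root: str, keep: int = 1, dry_run: bool = True) -> List[Path]:
    """Return (and optionally delete) all but the `keep` newest checkpoint folders."""
    root_path = Path(root)
    if not root_path.is_dir():
        return []
    pattern = re.compile(r"^checkpoint-(\d+)$")
    found = []
    for child in root_path.iterdir():
        match = pattern.match(child.name)
        if child.is_dir() and match:
            found.append((int(match.group(1)), child))
    found.sort(key=lambda pair: pair[0])  # ascending by training step
    cutoff = max(len(found) - max(keep, 0), 0)
    stale = [path for _, path in found[:cutoff]]
    if not dry_run:
        for path in stale:
            shutil.rmtree(path)
    return stale


# Dry run: list what would be removed, keeping only the highest step.
for folder in prune_checkpoints("models/pretrained", keep=1, dry_run=True):
    print("would remove", folder)
```

If the training script builds a `TrainingArguments` object, its `save_total_limit` option can also cap how many checkpoints are kept while training runs.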

Run prediction:

python scripts/predict.py
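
How scripts/predict.py turns model output into a label is not described here. A BERT sequence classifier produces one raw score (logit) per class, and a common way to read them is a softmax followed by picking the highest probability. The sketch below shows only that step in plain Python; the label names and their order in `LABELS` are assumptions, not taken from the project.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical label order; the project's real mapping may differ.
LABELS = ["non-hate", "hate"]


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def to_prediction(logits: Sequence[float]) -> Tuple[str, float]:
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


label, confidence = to_prediction([-1.0, 2.0])
print(label, round(confidence, 3))
```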

✅ Dataset location: data/dataset.csv. Modify the dataset to adapt the model to your needs.

License and Usage

This repository is intended for research and educational purposes only.
Commercial use is strictly prohibited.

If you are interested in commercial licensing, please contact [email protected].

Licensed under Creative Commons Attribution-NonCommercial (CC BY-NC).

Leave a ⭐ if you find this project helpful. Contributions are welcome.

🚫 This repository is for research and educational purposes only. Commercial use is not allowed.

@article{DBLP:journals/corr/abs-1810-04805,
  author    = {Jacob Devlin and
               Ming{-}Wei Chang and
               Kenton Lee and
               Kristina Toutanova},
  title     = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
               Understanding},
  journal   = {CoRR},
  volume    = {abs/1810.04805},
  year      = {2018},
  url       = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
  eprint    = {1810.04805},
  timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Dataset (MIX): https://github.com/abusifyid/Indonesian-Multimodal-Hate-Speech-Dataset

fzn0x/bert-indonesian-english-hate-comments
