Electra model for Named Entity Recognition (NER) with job recruitment information in Vietnam.
Feel free to watch ⭐, star ⭐, or fork 🍴 this repository.
- 📌 Introduction
- 📂 VNJob Dataset
- ⚙️ Requirements
- 📈 Results
- 💻 Usage
- 🤝 Contributing
- 📜 License
- 🔗 References
This repository contains an implementation of the Electra model for Named Entity Recognition (NER), tailored for processing job recruitment data in Vietnam.
In automated job-matching systems, NER identifies and categorizes entities such as job titles, skills, locations, and salary ranges in job postings. These structured fields improve search relevance and recommendations on recruitment platforms.
We chose Electra, a transformer-based model, because its replaced-token-detection pretraining is compute-efficient and the resulting encoder fine-tunes well on downstream tasks. Fine-tuning it on VNJob lets the model pick up domain-specific linguistic patterns in Vietnamese job postings.
The VNJob dataset consists of:
- Training set: data/vnjob_train.csv
- Validation set: data/vnjob_val.csv
There are 44,273 training samples and 11,086 validation samples.
🚨 No separate test set is provided. If needed, you can split the validation set (e.g., 80% for validation, 20% for testing).
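The 80/20 split suggested above could be done along these lines. This is a minimal sketch, not part of the repository: `split_validation` is a hypothetical helper, and it assumes each row of data/vnjob_val.csv is one complete sample (if the file stores one token per row, split by sentence instead so that no sentence is cut in half).

```python
import random


def split_validation(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and hold out a test portion.

    Hypothetical helper: assumes every element of `rows` is one full sample.
    Returns a (validation_rows, test_rows) tuple.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split stable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in rows; in practice these would be read from data/vnjob_val.csv,
# for example with csv.DictReader.
rows = [{"id": i} for i in range(10)]
val_rows, test_rows = split_validation(rows)
print(len(val_rows), len(test_rows))  # 8 2
```

With the 11,086 validation samples reported above, a 20% hold-out gives 2,217 test samples and leaves 8,869 for validation. Fixing the seed matters here: without it, every run would evaluate on a different test set.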
The dataset uses 9 labels: 8 entity types plus O for tokens that belong to no entity:
- 🏷️ Job title (job_title)
- 🏢 Job type (job_type)
- 🏆 Position (position)
- 🌍 City (city)
- 🎓 Experience (experience)
- 🛠️ Skills (skills)
- 📌 Job fields (job_fields)
- 💰 Salary (salary)
- ❓ Other (O)
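Tags follow the BIO scheme: B- marks the first token of an entity, I- marks a continuation of the same entity, and O marks a token outside any entity. For example:

Token | Tag |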
---|---|
Software | B-job_title |
Engineer | I-job_title |
at | O |
Hanoi | B-city |
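To make the tagging scheme concrete, the sketch below groups BIO-tagged tokens back into entities, using the rows of the table above as input. `bio_to_spans` is a hypothetical helper written for illustration; it is not taken from the repository, and it simply drops an I- tag that does not continue an open entity of the same type.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    spans = []
    current_type, current_tokens = None, []

    def flush():
        # Close the entity being built, if there is one.
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not match the open entity.
            flush()
            current_type, current_tokens = None, []
    flush()
    return spans


tokens = ["Software", "Engineer", "at", "Hanoi"]
tags = ["B-job_title", "I-job_title", "O", "B-city"]
print(bio_to_spans(tokens, tags))
# [('job_title', 'Software Engineer'), ('city', 'Hanoi')]
```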
This project was developed using Python with PyTorch.
📦 Install the dependencies before running the model:
pip install -r requirements.txt
torch==2.5.1
numpy==1.26.4
matplotlib==3.7.2
pathlib==1.0.1
transformers==4.47.0
datasets==3.2.0
tqdm==4.66.5
torchmetrics==1.6.0
pandas==2.0.3
The model's performance on the VNJob training and validation sets (all values in %):
Dataset | 📊 Accuracy | 🔎 Recall | 🎯 Precision | 🏆 F1 Score |
---|---|---|---|---|
Training set | 99.99 | 99.95 | 99.94 | 99.94 |
Validation set | 99.51 | 98.48 | 97.99 | 98.24 |
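The table does not say whether the scores are computed per token or per entity, or how they are averaged across labels. As one possible reading, the sketch below computes token-level precision, recall, and F1 with macro averaging in plain Python. The function name `macro_scores` and the averaging choice are assumptions made for illustration; they are not confirmed by the repository, which lists torchmetrics among its dependencies.

```python
from collections import Counter


def macro_scores(gold, pred):
    """Token-level macro precision, recall, and F1 over all labels seen.

    Assumption: every label counts equally (macro averaging).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p where it was wrong
            fn[g] += 1  # missed the true label g
    labels = sorted(set(gold) | set(pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


gold = ["B-city", "O", "O", "B-salary"]
pred = ["B-city", "O", "B-city", "B-salary"]
precision, recall, f1 = macro_scores(gold, pred)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

Macro F1 averages the per-label F1 values, so it need not equal the harmonic mean of the macro precision and macro recall.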
git clone https://github.com/tinh2044/Electra-VNJob-NER.git
cd Electra-VNJob-NER
conda create --name ElectraNER python=3.9
conda activate ElectraNER
pip install -r requirements.txt
Download the dataset from Google Drive.
Ensure the data/ folder has the following structure:
|——data
    |——vnjob_train.csv
    |——vnjob_val.csv
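Before starting a long training run, it can help to confirm that both files are where the structure above expects them. A minimal sketch using only the standard library follows; `missing_data_files` is a hypothetical helper and is not part of the repository.

```python
from pathlib import Path

# File names taken from the folder structure above.
EXPECTED_FILES = ("vnjob_train.csv", "vnjob_val.csv")


def missing_data_files(data_dir="data"):
    """Return the expected dataset files that are absent from data_dir."""
    data_dir = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (data_dir / name).is_file()]


missing = missing_data_files()
if missing:
    print("Missing from data/:", ", ".join(missing))
else:
    print("Dataset files found.")
```

Run it from the repository root so that the relative data/ path resolves correctly.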
Run the following command to train the model:
python -m main --task train --epoch 200 --lr 0.001 --batch_size 32 --repo_id tinh2312/Electra-VNJob-NER
Run the following command to evaluate the trained model:
python -m main --task eval --batch_size 32 --repo_id tinh2312/Electra-VNJob-NER
Run the following command to launch the Gradio demo:
python app.py
or, to run it with Gradio's auto-reload mode during development:
gradio app.py
🚀 Contributions are welcome!
To contribute:
- Fork this repository.
- Create a new branch: git checkout -b feature/your-feature-name
- Make your changes and commit: git commit -m "feat: add new preprocessing step"
- Push to your fork and submit a pull request.
For major changes, please open an issue first to discuss your proposal.
This project is licensed under the MIT License.