This project implements an n-gram language model in Python. The script provides functionality for training the model on text data, scoring sequences of tokens, calculating perplexity, and generating sentences from the trained model. Sentence generation is probabilistic and uses the Shannon technique.
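For orientation, here is a minimal sketch of the underlying idea (maximum-likelihood bigram probabilities and perplexity). It is a toy illustration with made-up data, not the code in `language_model.py`:

```python
from collections import Counter
from math import exp, log

# Toy corpus; the real script trains on a text file such as
# training_files/berp-training.txt instead.
tokens = "<s> i want chinese food </s> <s> i want thai food </s>".split()

# Count bigrams and their unigram histories.
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)

def bigram_prob(w1, w2):
    """Maximum-likelihood estimate P(w2 | w1) = count(w1 w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

def perplexity(sequence):
    """exp of the negative average log-probability of the sequence."""
    log_prob = sum(log(bigram_prob(w1, w2))
                   for w1, w2 in zip(sequence, sequence[1:]))
    return exp(-log_prob / (len(sequence) - 1))

print(perplexity("<s> i want thai food </s>".split()))  # low perplexity: seen data
```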
- Python 3.x
- NumPy
- Training the Model: To train the language model, provide a text file containing training data and update the training file path in the script to point to it. The script will tokenize the data and train the model.
- Testing: To evaluate the model, provide a separate text file and update the testing file path in the script. The script will tokenize the test data and evaluate the model's performance by scoring it and calculating perplexity.
- Generating Sentences: After training the model, you can generate sentences with the `generate()` method. Specify the number of sentences you want, and the script will produce them from the trained model; a hedged usage sketch follows this list.
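The exact API below is an assumption for illustration: only the `generate()` method is named above, while the class name `LanguageModel`, the `train()`, `score()`, and `perplexity()` methods, and their arguments are hypothetical placeholders. A workflow might look like this:

```python
# Hypothetical usage sketch; adjust names to whatever language_model.py defines.
from language_model import LanguageModel  # assumed class name

model = LanguageModel(n=2)  # assumed constructor; n-gram order is configurable

with open("training_files/berp-training.txt") as f:
    model.train(f.read())  # assumed training method

with open("testing_files/berp-test.txt") as f:
    test_text = f.read()

print(model.score(test_text))       # assumed scoring method
print(model.perplexity(test_text))  # assumed perplexity method

# generate() is the method described above; the argument name is assumed.
for sentence in model.generate(num_sentences=5):
    print(sentence)
```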
- `language_model.py`: The main Python script containing the implementation of the language model.
- `training_files/berp-training.txt`: Sample training data file. Update with your own training data.
- `testing_files/berp-test.txt`: Sample testing data file. Update with your own testing data.
- Clone or download the repository to your local machine.
- Ensure Python 3.x and NumPy are installed.
- Update the file paths in the script to point to your training and testing data.
- Run the script with `python language_model.py`.
- The script provides options for adjusting the n-gram order and the tokenization method (character-level or word-level); a tokenization sketch follows this list.
- Performance can be further optimized, especially for large datasets, by implementing more efficient algorithms for n-gram creation and probability calculation; one common counting approach is sketched below.
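As a rough illustration of the character-level versus word-level option mentioned above (the actual function name and flag in the script may differ):

```python
def tokenize(text, by_char=False):
    """Split text into character-level or word-level tokens.

    Hypothetical helper mirroring the option described above; the real
    script's name and signature may differ.
    """
    if by_char:
        return list(text)   # every character, including spaces, is a token
    return text.split()     # whitespace-separated words are tokens

print(tokenize("i want food"))                # ['i', 'want', 'food']
print(tokenize("i want food", by_char=True))  # ['i', ' ', 'w', 'a', 'n', ...]
```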
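One common way to make n-gram creation cheaper on large datasets is to build all n-grams in a single pass with `zip` over shifted views of the token list and count them with `collections.Counter`. This is a sketch of that general technique, not necessarily how `language_model.py` does it:

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count all n-grams in one pass using zip over n shifted slices."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

tokens = "the food was good and the service was good".split()
print(count_ngrams(tokens, 3).most_common(3))
```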
The data used in this project was obtained from here. I would like to express my gratitude to the creators of the "BeRP" dataset for making it available for research and development purposes.