This project implements an n-gram language model in Python. The script provides functionality for training the model on text data, scoring sequences of tokens, calculating perplexity, and generating sentences from the trained model. Sentence generation is probabilistic and uses the Shannon technique.
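For orientation, here is a minimal sketch of the underlying idea (maximum-likelihood bigram probabilities and perplexity). It is a toy illustration with made-up data, not the code in `language_model.py`:

```python
from collections import Counter
from math import exp, log

# Toy corpus; the real script trains on a text file such as
# training_files/berp-training.txt instead.
tokens = "<s> i want chinese food </s> <s> i want thai food </s>".split()

# Count bigrams and their unigram histories.
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)

def bigram_prob(w1, w2):
    """Maximum-likelihood estimate P(w2 | w1) = count(w1 w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

def perplexity(sequence):
    """exp of the negative average log-probability of the sequence."""
    log_prob = sum(log(bigram_prob(w1, w2))
                   for w1, w2 in zip(sequence, sequence[1:]))
    return exp(-log_prob / (len(sequence) - 1))

print(perplexity("<s> i want thai food </s>".split()))  # low perplexity: seen data
```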
- Python 3.x
- NumPy
- Training the Model: To train the language model, provide a text file containing training data and update the training file path in the script to point to it. The script will tokenize the data and train the model.
- Testing: To evaluate the model, provide a separate text file and update the testing file path in the script. The script will tokenize the test data and evaluate the model's performance by scoring it and calculating perplexity.
- Generating Sentences: After training the model, you can generate sentences with the `generate()` method. Specify the number of sentences you want, and the script will produce them from the trained model; a hedged usage sketch follows this list.
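The exact API below is an assumption for illustration: only the `generate()` method is named above, while the class name `LanguageModel`, the `train()`, `score()`, and `perplexity()` methods, and their arguments are hypothetical placeholders. A workflow might look like this:

```python
# Hypothetical usage sketch; adjust names to whatever language_model.py defines.
from language_model import LanguageModel  # assumed class name

model = LanguageModel(n=2)  # assumed constructor; n-gram order is configurable

with open("training_files/berp-training.txt") as f:
    model.train(f.read())  # assumed training method

with open("testing_files/berp-test.txt") as f:
    test_text = f.read()

print(model.score(test_text))       # assumed scoring method
print(model.perplexity(test_text))  # assumed perplexity method

# generate() is the method described above; the argument name is assumed.
for sentence in model.generate(num_sentences=5):
    print(sentence)
```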
- `language_model.py`: The main Python script containing the implementation of the language model.
- `training_files/berp-training.txt`: Sample training data file. Update with your own training data.
- `testing_files/berp-test.txt`: Sample testing data file. Update with your own testing data.
- Clone or download the repository to your local machine.
- Ensure Python 3.x and NumPy are installed.
- Update the file paths in the script to point to your training and testing data.
- Run the script with `python language_model.py`.
- The script provides options for adjusting the n-gram order and the tokenization method (character-level or word-level); a tokenization sketch follows this list.
- Performance can be further optimized, especially for large datasets, by implementing more efficient algorithms for n-gram creation and probability calculation; one common counting approach is sketched below.
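As a rough illustration of the character-level versus word-level option mentioned above (the actual function name and flag in the script may differ):

```python
def tokenize(text, by_char=False):
    """Split text into character-level or word-level tokens.

    Hypothetical helper mirroring the option described above; the real
    script's name and signature may differ.
    """
    if by_char:
        return list(text)   # every character, including spaces, is a token
    return text.split()     # whitespace-separated words are tokens

print(tokenize("i want food"))                # ['i', 'want', 'food']
print(tokenize("i want food", by_char=True))  # ['i', ' ', 'w', 'a', 'n', ...]
```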
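One common way to make n-gram creation cheaper on large datasets is to build all n-grams in a single pass with `zip` over shifted views of the token list and count them with `collections.Counter`. This is a sketch of that general technique, not necessarily how `language_model.py` does it:

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count all n-grams in one pass using zip over n shifted slices."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

tokens = "the food was good and the service was good".split()
print(count_ngrams(tokens, 3).most_common(3))
```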
The data used in this project was obtained from here. I would like to express my gratitude to the creators of the "BeRP" dataset for making it available for research and development purposes.