Parts of Speech Tagging (using Hidden Markov Models (HMM) -Viterbi Algorithm)

In this project, We will see implementation of viterbi algorithm, (enhancement on HMM which reduce exponential to polynomial time complexity) to perform parts of speech tagging of given sentences.

Given:
No. of Parts of speech tags = N
No. of Tokens per sentence = L

[Bruteforce Approach] O(N^L) `==>` O(L * N²) [Viterbi Algoritm]

Data

Train file: It consists of tagged training data in word/TAG format, with words(tokens) seperated by spaces and each sentence on new line.
Test file: It consist of untagged data, which to be tested on trained model, with words(tokens) seperated by spaces and each sentence on new line.
Test Results file: It consist of true tagged data (which to be used for test score evalute purpose) with in word/TAG format, with words(tokens) seperated by spaces and each sentence on new line.

NOTE: If training file not provided then by default NLTK Brown Corpus will be taken as training data and NLTK Universal Tagset will be used.

Installation

From GitHub

$ git clone https://github.com/parth-np/Parts-of-Speech-Tagging.git
$ cd  POS_Tagging
$ python3 -m venv pos
$ source venv/bin/activate  
$ pip install -r requirements.txt

Usage

Train the Model

$ python hmm_learn.py -h
usage: hmm_learn.py [-h] [-i INPUT_FILE] [-m MODEL_OUTPUT_FILE] [-v]

Train model for Parts of Speech Tagging.

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_FILE,        --input_file INPUT_FILE
                        Input Train file. (Default: NLTK Brown Corpus)
  -m MODEL_OUTPUT_FILE, --model_output_file MODEL_OUTPUT_FILE
                        Model file Path to save model
  -v, --verbose         increase output verbosity (Set Log Level: INFO)

With Training data

$ python hmm_learn.py -i corpus/train_tagged.txt -m ./train_model/custom_pos_model.h5 -v

Without Training data (Using Default NLTK Brown corpus as training data)

$ python hmm_learn.py -m train_model/brown_pos_model.h5 -v 
INFO:__main__:Loading Train Data....
INFO:__main__:Downloading NLTK 'brown' Corpus and 'universal-tagset' for POS Tagging
[nltk_data] Downloading package brown to /home/parth/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     /home/parth/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!
INFO:__main__:Calculating Transition Probabilities...
100%|███████████████████████████████████████████| 14/14 [00:05<00:00,  2.75it/s]
INFO:__main__:Calculating Emission Probabilities...
100%|███████████████████████████████████████████| 14/14 [00:17<00:00,  1.23s/it]
INFO:__main__:Saving model to train_model/pos_model.h5

Test the Model

$ python hmm.py -h
usage: hmm.py [-h] [-o OUTPUT] [-f] [-m MODEL] [-v] sentences

positional arguments:
  sentences             Sentence OR Test file path contaning list of input
                        sentences (Each at newline)

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Test Output file
  -f, --from_file       Read Input Sentences to train from a file
  -m MODEL, --model MODEL
                        Model file path
  -v, --verbose         increase output verbosity

Input: Test file

$ python hmm.py data/test_sents.txt -f -o test_op.txt  -m train_model/custom_pos_model.h5 -v

NOTE: For POS-Tags Refer: Penn Treebank Tagset

Example Capture Context (word:'race')

$ python hmm.py "People continue to inquire the reason for the race for the outer space ."  -m ./train_model/custom_pos_model.h5  

People/NNPS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN the/DT outer/JJ space/NN ./.


$ python hmm.py "James is expected to race tomorrow ."  -m ./train_model/custom_pos_model.h5 

James/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN ./.

Evaluate Model

$ python evaluate.py -h
usage: evaluate.py [-h] -t TEST_OUTPUT -c CORRECT_OUTPUT

optional arguments:
  -h, --help            show this help message and exit
  -t TEST_OUTPUT, --test_output TEST_OUTPUT
                        Test Output file contains tagged sentences generated
                        by Model
  -c CORRECT_OUTPUT, --correct_output CORRECT_OUTPUT
                        Test Output file contains correct tagged sentence

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Parts of Speech Tagging (using Hidden Markov Models (HMM) -Viterbi Algorithm)

[Bruteforce Approach] O(N^L) `==>` O(L * N²) [Viterbi Algoritm]

Data

Installation

From GitHub

Usage

Train the Model

With Training data

Without Training data (Using Default NLTK Brown corpus as training data)

Test the Model

Input: Test file

Example Capture Context (word:'race')

Evaluate Model

References

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
train_model		train_model
README.md		README.md
evaluate.py		evaluate.py
hmm.py		hmm.py
hmm_learn.py		hmm_learn.py
requirements.txt		requirements.txt

parth-gm/POS_Tagging

Folders and files

Latest commit

History

Repository files navigation

Parts of Speech Tagging (using Hidden Markov Models (HMM) -Viterbi Algorithm)

[Bruteforce Approach] O(NL) ==> O(L * N2) [Viterbi Algoritm]

Data

Installation

From GitHub

Usage

Train the Model

With Training data

Without Training data (Using Default NLTK Brown corpus as training data)

Test the Model

Input: Test file

Example Capture Context (word:'race')

Evaluate Model

References

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

[Bruteforce Approach] O(N^L) `==>` O(L * N²) [Viterbi Algoritm]

Packages