Hidden Markov Model for Part-of-Speech Tagging
Part-of-speech tagging is a common sequence-tagging problem in Natural Language Processing. It is the process of assigning a single part-of-speech (POS) tag to each token/word in an input sentence.
For example, for the input: From the AP comes this story
the output of the tagger is: From/IN the/DT AP/NNP comes/VBZ this/DT story/NN
in which each POS tag describes what its corresponding word is. In this particular example, the DT tag tells us that the word "the" is a determiner.
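For concreteness, the tagged output above can be represented in Python as a list of (word, tag) tuples. This is only an illustrative representation, not a format prescribed by the example:

```python
# The tagged sentence from the example above, as (word, tag) tuples.
tagged_sentence = [
    ("From", "IN"), ("the", "DT"), ("AP", "NNP"),
    ("comes", "VBZ"), ("this", "DT"), ("story", "NN"),
]

# Reproduce the word/TAG notation used in the example.
print(" ".join(f"{word}/{tag}" for word, tag in tagged_sentence))
# From/IN the/DT AP/NNP comes/VBZ this/DT story/NN
```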
There are numerous approaches to part-of-speech tagging, including rule-based linguistic approaches, stochastic approaches, and machine learning approaches.
In this post, we will discuss part-of-speech tagging using Hidden Markov Models and the Viterbi algorithm. Along with this, we will implement a basic POS tagger in Python.
We will divide our POS tagger implementation into two separate phases:
- Learning/Training of the HMM Model
- Assigning POS tags to unseen sentences (Testing)
Now, before actually implementing these two phases, let's understand them one by one.
1. Learning/Training of the HMM Model.
Our HMM Model is composed of the following elements (a minimal training sketch follows this list):
- State Transition Probability Matrix (A)
- Word Emission/Observation Probability Matrix (B)
- Initial Probabilities: the probability of each tag being associated with the first word of a sentence
- Hidden States of the HMM, i.e. the POS tags like DT, NN, NNP, VB, etc.
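To make the training phase concrete, here is a minimal sketch (not the repo's actual code) of how A, B, and the initial probabilities could be estimated by relative-frequency counting over a tagged corpus. The function name train_hmm, the corpus format (a list of sentences, each a list of (word, tag) tuples), and the lack of smoothing are assumptions made for illustration:

```python
from collections import defaultdict, Counter

def train_hmm(tagged_sentences):
    """Estimate HMM parameters from sentences of (word, tag) pairs.

    Returns (initial, transitions, emissions) as nested dictionaries of
    relative frequencies. Minimal sketch: no smoothing, no unknown-word
    handling.
    """
    initial = Counter()                 # P(tag at sentence start)
    transitions = defaultdict(Counter)  # A: counts of tag_prev -> tag
    emissions = defaultdict(Counter)    # B: counts of tag -> word

    for sentence in tagged_sentences:
        prev_tag = None
        for word, tag in sentence:
            emissions[tag][word] += 1
            if prev_tag is None:
                initial[tag] += 1
            else:
                transitions[prev_tag][tag] += 1
            prev_tag = tag

    def normalize(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    return (
        normalize(initial),
        {tag: normalize(next_tags) for tag, next_tags in transitions.items()},
        {tag: normalize(words) for tag, words in emissions.items()},
    )

# Example usage with a toy corpus of two tagged sentences:
corpus = [[("the", "DT"), ("story", "NN")], [("AP", "NNP"), ("comes", "VBZ")]]
init_p, trans_p, emit_p = train_hmm(corpus)
print(init_p)          # {'DT': 0.5, 'NNP': 0.5}
print(emit_p["NN"])    # {'story': 1.0}
```

In practice you would also add smoothing (e.g. add-one) and some handling for unknown words, but plain relative frequencies are enough to show how the matrices are built from counts.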
The entire working code can be found at the Github Repo HMM-Part-of-Speech-Tagging.