Rakesh Dugad and U. B. Desai
Signal Processing and Artificial Neural Networks Laboratory
Department of Electrical Engineering
Indian Institute of Technology --- Bombay
Powai, Bombay 400 076,
India
Contact Email : dugad@uiuc.edu, ubdesai@ee.iitb.ernet.in
Technical Report No. : SPANN-96.1
May 1996
Suppose a person has say three coins and is sitting inside a room tossing them in some sequence--this room is closed and what you are shown (on a display outside the room) is only the outcomes of his tossings TTHTHHTT... this will be called the observation sequence . You do not know the sequence in which he is tossing the different coins, nor do you know the bias of the various coins. To appreciate how much the outcome depends on the individual biasing and the order of tossing the coins, suppose you are given that the third coin is highly biased to produce heads and all coins are tossed with equal probability. Then, we naturally expect there to be far greater number of heads than tails in the output sequence. Now if it be given that besides the bias the probability of going to the third coin (state) from either the first or the second coin (state) is zero; then assuming that we were in the first or second state to begin with the heads and tails will appear with almost equal probability inspite of the bias. So we see that the output sequence depends very much on the individual bias, the transition probabilities between various states, as well as on which state is chosen to begin the observations. The three sets, namely, the set of individual bias of the three coins, the set of transition probabilities from one coin to the next and the set of initial probabilities of choosing the states characterize what is called as the HIDDEN MARKOV MODEL(HMM) for this coin tossing experiment. We shall shortly see the problems and their solutions that come under the framework of HMMs.
In the above experiment the outcomes, of tossing of each coin, T or H are called the observation symbols. Because we considered coins, only two observation symbols were possible. To be more general, consider a set of N urns each consisting of a number of marbles. The marbles are of M distinct colors. Within each urn the marbles are of distinct colors. Our experiment consists of drawing marbles from these urns in some sequence; only the sequence of marbles drawn is shown to us. Marbles, here, correspond to the T or H and urns to coins in the above experiment. We now define the following notation for our model :
denotes the state in which we are at time t
,...,
} the discrete set of possible observation symbols
= P(
= i), the
probability of being in state i at the beginning of the
experiment i.e. at t=1
} where
= P(
=j
= i),
the probability of being in state j at time t+1 given that we were in state i at time t.
We assume that
's are independent of time.
( k)},
( k) = P(
at t
= j),
the probability of observing the symbol
given that we are in state j.
will denote the observation symbol observed at instant t
will be used as a compact notation to denote an HMM
Using the model, an observation sequence O =
,
,...,
is generated as follows :
We start our experiment by choosing one of the urns (according to the initial
probability distribution
), then we choose a marble (observation symbol) from this
urn -- this beginning instant of time is taken as t=1 and the state and observation symbol chosen
at this t=1 are denoted by
and
respectively. After this we choose an urn (may be same or different from
the urn at t=1) according to the transition probability distribution A and again select a
marble (denoted by
) from this urn depending on the observation symbol probability distribution
( k) for that urn (state).
Continuing this upto time t=T, generates the observation sequence O =
,
,...,
.
Most applications of HMMs are finally reduced to solving three main problems. These are :
how do we compute P( O
), the probability of occurrence of the observation sequence O =
,
,...,
.
how do we choose a state sequence I =
,
,...,
so that P( O,I
),the joint
probability of the observation sequence O =
,
,...,
and the state sequence given the model is maximized.
so that P( O
) (or P( O,I
) ) is maximized.
Problems 1 and 2 can be viewed as analysis problems while Problem 3 is a typical synthesis (or model identification or training) problem.
A most straightforward way to determine P( O
) is to find P(O
) for a fixed state sequence I =
,
,...,
then multiply it by P( I
) and then sum up over all possible I's. We have

Hence we have:
where I =
,
,...,
.
From Eq. 4 we see that the summand of this equation involves 2 T--1
multiplications and there exists
distinct possible state sequences I. Hence a direct computation from
( 4) will involve of the order of 2 T
multiplications. Even for small values, N =5 and
T=100, this means approximately
multiplications which would take aeons to complete even
for a supercomputer. Hence we see that a more efficient procedure is required to solve Problem 1;
such a procedure exists and is called the forward-backward procedure.
Consider the forward variable
( i) defined as :
i.e the probability of the partial observation sequence upto time t and the state i at time t,
given the model
.
( i) can be computed inductively as follows:
) independently from any of
the N states at time t. The summation in Eq. 7 refers
to this fact. Also the summand gives observation sequence upto time
t; hence the
(
) outside the brackets. In Step 3 we just sum
up all possible (independent) ways of realizing the given observation
sequence. Let us illustrate this by taking the case of

(We have dropped dependence on
for convenience).
The sequence
,
,
=j occurs as: first
, then state
, then object
. But
can occur through any of the following mutually exclusive and
exhaustive ways : state 1 then
, state 2 then
, and so on upto state N. We know that if {
} be a set of
mutually exclusive and exhaustive events then for any event E we
have
Hence here we can write

which is the same as (7) above. Hence this simple example gives an insight into the mathematical justification of (7).
As for (8) using (9 ) we can write

Hence proved.
Now let us examine the number of computations involved in this algorithm. Step 1
involves N multiplications. In Step 2 the summation involves N multiplications plus one for the out of bracket
(
) term---this
has to be done for j=1 to N and t = 1 to T--1, making the total number of multiplications in Step 2
as (N+1)N(T--1). Step 3 involves no multiplications. Hence total number of multiplications is
N+N(N+1)(T--1) i.e. of the order of
as compared to
required for the direct method
.
For N=5 and T=100 we need about
3000 computations for the forward method as compared to
required by the
direct method--a saving of about 69 orders of magnitude!
In a similar manner we may define a backward variable
( i) as:
i.e. the probability of the observation sequence from t+1 to T given the state i
at time t and the model
. Note that here
=i has already been given (it wasn't in the case of the forward variable).
This distinction has been made to be able to combine the forward and the backward variables to produce useful results, as we shall
see soon. We can easily solve for
( i) as done for
( i) :
The proof of (12) and (13) is similar to the one given for (7) and (8) and is left as an exercise for the reader.
The computation of P( O
) using
( i) also involves of the
order of
calculations. Hence both the forward as well as the backward method is
equally efficient for the computation of P( O
).
This solves Problem 1.
Here we have to find a state sequence I =
,
,...,
such that probability of
occurrence of the observation sequence O =
,
,...,
from this state sequence
is greater than that from any other state sequence. In other words,
our problem is to find I that will maximize P( O,I
). There is a
famous algorithm to do this, called the Viterbi Algorithm\
[2]. It is an inductive algorithm in which at each
instant you keep the best (i.e. the one giving maximum probability)
possible state sequence for each of the N states as the intermediate
state for the desired observation sequence O =
,
,...,
. In this way you
finally have the best path for each of the N states as the last state
for the desired observation sequence. Out of these, we select the one
which has highest probability.
In order to get an idea of the Viterbi algorithm as applied to the optimum state estimation problem a simple reformulation of the problem will be useful
Consider the expression for
; from (1)-(2) we
have

Now define

then it is easily seen that

consequently the problem of optimal state estimation, namely,

becomes equivalent to

This reformulation now enables us to view terms like
as the cost associated in
going form state
to state
at time t.
The Viterbi algorithm to find the optimum state sequence can now be
described as follows:
Suppose we are currently in state i and we are considering visiting
state j next. We shall say that the weight on the path from state
i to state j is
(i.e. the negative of
logarithm of probability of going
from state i to state j and selecting the observation symbol
in state j) where
is the
observation symbol selected after visiting state j--this is the same
symbol that appears in the given observation sequence O =
,
,...,
.When the
initial state is selected as state i the corresponding weight is
-- we shall call this the initial weight. We define
the weight of a sequence of states as the sum of the weights
on the adjacent states. Note that this actually corresponds to
multiplying the corresponding probabilities. Now finding the optimum
sequence is merely a
matter of finding the path (i.e. a sequence of states) of minimum
weight through which the given observation sequence occurs.
Note that the Viterbi Algorithm is essentially a dynamic programming
approach for minimizing U(
,
,...,
).
We shall illustrate with an example on how the minimization is done. Consider a three state HMM. Fig.1(a) shows the weights assigned to various paths as described above. The numbers in circles show the initial weights assigned to the corresponding states. To simplify understanding of the algorithm we have assumed that the weights are symmetrical (i.e. the weight from state i to state j is same as the weight from state j to state i). Also we have assumed the weights do not change with time which in general will not be true but then once this simplified case is understood the extension to practical case is straight forward.
Figure 1: (a) The state machine. The weights for transitions between the
states have been shown. The circled numbers are the initial weights
of the corresponding states. (b) How a path of minimum cost is traced
out using the Viterbi algorithm. The cumulative weights are shown in
bold and the final minimum cost path is also shown in bold.
Consider Fig. 1(b). We take an observation
sequence
,...,
of length four. At time t=2 state 1
can be visited from any
of the three states that we
had at time t=1. We find out the weights on each of these paths and
add corresponding weights to the initial weights ( we shall call this
cumulative weight at state 1
as the cost at state 1
at time t=2). Thus going from
state 1
to state 1
the cost at state 1
is 3+6=9. Similarly the cost at
state 1
in going from state 2
to state 1
is 3+4=7 and the cost in going
from state 3
to state 1
is 2+3=5. Of these the cost 5 is
minimum. Hence we retain this cost at state 1
for further
calculations. The minimum cumulative weight paths are shown in the
figure by arrowed lines. The cumulative weights have been shown in
bold alongside the respective states at each time instant.
We repeat the same procedure for state 2
and state 3
. We
see that the minimum costs at state 2
and state 3
are 6 (through state 3
) and 6 (through state 1 ) respectively.
We repeat the above procedure again for t=3 but now using the costs calculated above for each state rather than the initial weights ( we used them above because we started with t=1). And the same procedure is repeated for t=4. Now select out the state which has minimum cost of all the states. We see that state 1 is the required state with a cost of 11. Back tracing the sequence of states through which we got at state 1
at time t=4 gives the required sequence of states through which the given observation sequence has highest probability of occurrence. As can be seen from the figure this state sequence is state 3 ,state 1 ,state 3 ,state 1 . This sequence has been shown in bold in the figure.
To prove our point suppose you were given that the length of
observation sequence is required to be two and that the last state is
to be state 3
. What path would you choose to minimize the cost (i.e. to
maximize the probability of the observation sequence
) ? The
procedure outlined above clearly shows that we would choose the path
beginning at state 1
and ending at state 3
(at t=2) as that gave the
minimum cost (viz. 6) at state 3
. All other paths have a higher
cost. Similar argument applies if any other state were required
to be the last state. Similarly if the observation sequence were to be
of length three and ending in say state 1
we would choose the path
state 1
,state 3
,state 1
as outlined in the figure and described above. This
means that at any given instant the path tracked upto any state by the
above procedure is the minimum cost path if we were to stop at that
instant at that state. Proceeding to t=4 we see that we have the
minimum cost paths corresponding to stopping in
state 1
,state 2
or state 3
respectively. We just pick up the smallest of
these three because we are interested in the minimum cost and
hence we choose to stop in state 1
which gives us the minimum cost.
Now that we know how the Viterbi Algorithm works here is how it can be implemented [1,3] :
Note :
( i)
denotes the weight accumulated when we are
in state i at time t as the algorithm proceeds and
( j)
represents the state at time t-1 which has the
lowest cost corresponding to the state transition to state j at time
t. For example in Fig. 1(b),

and

for


gives the required state-optimized probability , and
is the
optimal state sequence.
A little reflection over the above steps will show that
computationally the Viterbi Algorithm is similar to the forward
(-backward) procedure except for the comparisons involved for finding
the maximum value. Hence its complexity is also of the order of
. This solves Problem 2.
This problem deals with training the HMM such that it encodes the observation sequence in such a way that if a observation sequence having many characteristics similar to the given one be encountered later it should be able to identify it. There are two methods that we shall discuss depending on which probability is chosen for identification:
are adjusted to maximize P( O,I
)
where I
here is the optimum sequence
as given by the solution to Problem 2 above.
are adjusted so as to increase P( O
)
until a maximum value is
reached. As seen before calculating P( O
)
involves summing up
P( O,I
)
over all possible state sequences I. Hence here our focus is not on a particular state sequence.
This method takes us from
to
such that
where,
is the optimum sequence for O =
,
,...,
and
, found according to
the solution to Problem 2 . This criterion of optimization is called the maximum state optimized likelihood criterion.
This function
is called the state optimized likelihood function.
In this method to train the model a number of (training) observation sequences are required.
Let there be
number of such sequences available. Each sequence O =
,
,...,
consists of T observation
symbols. Instead of
number of such sequences we might be
given just one very long
sequence; in that case we segment it into some conveniently chosen
number of short sequences each of length T.
Each observation symbol (
) is assumed
to be a vector of dimension D (
).
The algorithm then consists of the following steps:
observation vectors
to one of these N vectors from which its Euclidean distance is minimum.
Hence we have formed N clusters each called a state (1 to N).
This initial choice of clustering vectors doesn't decide the final
HMM that we get but can decide the number of iterations required for
the HMM training. In our case we just pick up N equally spaced
sequences of feature vectors and pick one vector from each of
these sequences. The first vector is taken as the first vector of
the first of these sequences, the second as the second of the
second of these sequences and so on. Of course this is just to
make the initial choice of clusters as widely distributed as possible and one is at liberty to choose the vectors as per his
need.

For
and



(as given by the
solution to Problem 2) for each training sequence using
computed
in steps 2 to 4 above. A vector is reassigned a state if its
original assignment is different from the corresponding
estimated optimum state; for example suppose
of the
5th training sequence was assigned to state 3 (i.e. the 3rd
cluster) and now we find that the in the optimal state sequence
corresponding to the 5th training sequence,
is not 3 but say 4. Hence now we reassign
of the 5th training sequence to state 4. Hence we
assign
(of say kth training sequence ) to state
i
if
(corresponding to the kth training
sequence ) is state i
and this is done for all training
sequences (i.e. for k=1 to
).
This method assumes an initial HMM model which is improved upon by using the formulas
(given below) so as to maximize P( O
)
. An initial HMM can be constructed in any way
but we may use the first five steps of the Segmental K-means Algorithm described above to
give us a reasonable initial estimate of the HMM. Henceforth we shall assume that an
initial HMM is known. This algorithm maximizes P( O
)
by adjusting the parameters of
.
This optimization criterion is called the maximum likelihood criterion.
The function P( O
)
is called the likelihood function.
Before we get down to the actual formulas we shall introduce some concepts and
notations that shall be required in the final formulas. Consider
( i)
defined as follows:

i.e. the probability of being in state i
at time t given the observation sequence
O =
,
,...,
and the model
. By Bayes law we have :
where
( i)
and
( i)
are defined by (5)
and (10) respectively.
( i)
accounts for
,...,
and state i
at
time t, and
( i)
accounts for
,...,
given state i
at time t .
We also define
( i,j)
as :

i.e. the probability of being in state i
at time t and making a transition to state j at time t+1,
given the observation sequence O =
,
,...,
and the model
.
Using Bayes law it is seen that
But O =
,
,...,
. Hence
Consider the second term. The observation symbol at time t+1
is
at which time we are required to be in state j; this
means that the observation symbol
has to be picked up
from state j
.Hence we have
Hence combining (5),(32),(34) and (37) we get

Here
( i)
accounts for
,...,
,
accounts
for the transition to state j
,
(
)
picks up the symbol
from state j ,and
( j)
accounts for
the remaining observation sequence
,...,
.
If we sum up
( i)
from t=1 to T we get a quantity which can be viewed
as the expected number of times state i is visited or if we sum up only upto T-1 then we shall get
the expected number of transitions out of state i (as no transition
is made at t=T). Similarly
if
( i,j)
be summed up from t=1 to T-1 , we shall get the expected number of transitions from
state i
to state j
. Hence

Now we are prepared to present the Baum-Welch re-estimation formulas:

The re-estimation formula for
is simply the probability of
being in state i at time t. The formula for
is the ratio of expected number of time of making
a transition from state i to state j to the expected number of times of making a
transition out of state i. The formula for
( i)
s the ratio of the expected number of
times of being in state j and observing symbol
to the
expected number of times of being in state j .
Note that the summation in the last formula goes upto t=T.
If we denote the initial model by
and the re-estimation model by
consisting of
the parameters estimated above, then it can be shown that either:
is a critical point of the likelihood function in which case
, or
i.e. we have found a better model from which
the observation sequence
,
,...,
is more likely to be produced.
)
is maximized.
This solves Problem 3.
We have seen that the Baum-Welch algorithm maximizes P( O
)
;
given by (4)
. Each term in the summand of this equation is
between 0 and 1 and hence the summand (which is a product of many such
terms) is going to be very small and there are as many such small
terms as the number of possible state sequences -- for a computer to
be able to compute such terms individually, it should be able to store
within its precision the smallest of these terms. If T=100 there
would be around 200 numbers (each between 0 and 1) to be multiplied
and hence each term could comfortably go beyond the
precision of any computer. Taking logarithm does not help here since
P( O
)
is given by a summation of such small terms. Perhaps some
clever scaling procedure could be used to get over this problem.
On the other hand the Segmental k-means algorithm maximizes P( O,I
)
which is calculated using the Viterbi Algorithm as given by (18) --(24) . Here we can use the logarithm to take care of product of many small numbers as no summation is involved. The Segmental k-means algorithm is also preferred because most of the time we are interested in the occurrence of the observation sequence from the best possible (i.e. the optimum) state sequence and considering the occurrence of the observation sequence through all the possible state sequences is not the intended representation of the problem in most cases. Moreover the Segmental k-means algorithm requires much less computation as compared to the Baum-Welch method. For all these reasons we have used the Segmental k-means algorithm in our work.
It is hoped that this would serve as good introduction to HMMs for anyone to venture further into the realm of HMMs.
A TUTORIAL ON HIDDEN MARKOV MODELS
This document was generated using the LaTeX2HTML translator Version 95.1 (Fri Jan 20 1995) Copyright © 1993, 1994, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
The command line arguments were:
latex2html -split 0 /home/yaman/ubdesai/html/HMM_TUT/newhmmtut.tex.
The translation was initiated by Prof U B Desai on Wed May 29 14:48:23 IST 1996