Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter.
Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.
Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.
OCR for page 217
11
Markov Moclels for
Speech Recognition
Alan F. Lippman
University of Washington
~ I. ~ Introduction
The goal of speech recognition research is to enable machines to reduce
human speech to a list of printed words. By imposing restrictions such as
a limited vocabulary and grammar, automated speech recognition systems
have been produced during the last 15 years that, within those constraints,
approach human performance. Sustained research efforts have resulted in
systems that place substantially fewer restraints on the speaker. Earlier
recognition systems typically required that the words spoken belong to a
fixed small vocabulary, that the speaker pause between each word, and that
the speaker participate in a training period during which the system would
automatically adjust to that particular speaker (and no other).
Each of the constraints above was used to reduce the complexity inherent
in natural speech. This chapter presents an introduction to concepts under-
lying much of the work in speech recognition anti, in the process, explains
how the constraints above simplify the problem. The chapter then presents
a detailed description of a simple probabilistic speech recognition system,
modeled after the SPHINX system t14], that implements hidden Markov
moclels (HMMs).
Hidden Markov models are the basis for many recently developed speech
recognition systems and are related to Markov Random Fields (MRFs),
which have been successfully applied to some image-processing tasks (see
chapter 3~. Both approaches rely on similar probabilistic frameworks to de-
scribe and exploit the relationship between the items of interest (e.g., the
2~7
OCR for page 218
218
microphone recording and its transcript, the degraded and "true" images).
Both approaches share some fundamental tools: maximum likelihood esti-
mation to estimate parameters, maximum a posterior) (MAP) estimation to
perform recognition/restoration, and the use of Markovian models to make
these estimation problems tractable. However, recorded speech, unlike im-
ages, is not a spatial process, it is a function of pressure (or through a
microphone, voltage) versus time. This chapter provides a quite different
view of some of the methods of chapter 3.
Il.2 Basic Speech Concepts
Some of the basic units of speech are the sentence, the word, and the
phoneme. There are approximately 40 phonemes, each corresponding to
a distinctive sound. The symbol list that dictionaries provide after each
word (as a pronunciation guide), is a list of phonemes. Phonemes, words,
and sentences are all fundamental to the way we produce speech; people
think in terms of sentences and words, while phonemic descriptions are nec-
essary for people to pronounce words correctly. Modern speech recognition
systems function in a similar way: a hierarchical approach is used where
sentences are modeled as series of words; words are modeled in terms of
phonemes; and phonemes are modeled as series of features of the signal. By
nesting these three layers of models, printed sentences (our desired goal) are
related to the speech signal.
Neither words nor phonemes are easily identified in the speech signal.
Even finding the boundaries between two words can be a difficult task, since
in the speech signal, phonemes as well as words may merge together and no
simple signal processing can separate them. A spoken phrase like "this is,"
can easily be interpreted as "the sis" (or even by some graduate students
as "thesis"~. Most people have had the experience of listening to a foreign
language and hearing only a steady stream of sounds, illustrating that one
must understand speech to find word boundaries. (This type of difficulty is
familiar to those who work in image segmentation; it is often impossible to
find the boundaries between objects without some knowle(lge about what
the objects are.)
Isolated-word recognition systems require that the boundaries between
words be obvious. This is typically accomplished by requiring a speaker to
pause between words. These silent pauses can be easily identified in the
signal. Individually spoken words also tend to be enunciated more clearly,
OCR for page 219
219
aiding recognition. The main drawback of such systems is that the speaker
is constrained to speak in a slow, unnatural manner. Continuous-speech
recognition systems lack this constraint.
The construction of even an isolated-word speech recognizes is difficult.
The speech signal associated with a word or a phoneme is extremely vari-
able, and can vary greatly depending on both the speaker's identity and the
manner in which the word was spoken. Variability is caused by anatomical
differences between speakers, such as sex or vocal tract length, as well as
by differences in style, health (presence or absence of a cold), speaking rate,
stress, or accent. A quickly spoken word will frequently be slurred or have
parts "missing." Accents can result in the wholesale shifting of vowels [94.
In addition, some words have many allowed (as opposes! to recommended)
pronunciations. This is especially true for common wordsq like "the`" that
are typically articulated poorly ("the" is often pronounced as "dee," "dah,"
"dih," "thah," or "thigh". ~
bpeaker-ctepen~ent systems simplify the recogni-
tion problem by adapting themselves to one particular speaker, removing
some of the causes of variability.
The speech signal associated with a Phoneme also varies depending on
the context in which it is pronounced. This effect is called co-articulation.
As people speak, the tongue, lips, and other articulators must move from
a position necessary to pronounce the current phoneme to a position that
will result in the next phoneme being produced. The articulators tend to
move only far enough for speech to be intelligible, so current positioning
is affected by the previous positioning. The placing of the articulators can
also be influenced by the position that should be occupied next, a form of
"thinking ahead" that people perform automatically.
For small-vocabulary recognition systems, the concept of phonemes typ-
ically is not used. Many examples of each word are used to discover a direct
relationship between the word and the speech signal. In this way the effect of
co-articulation is modeled implicitly. Since the required number of examples
will grow linearly as a function of vocabulary size, this type of approach is
almost impossible for vocabularies containing more than a thousand words.
Phonemes, or some similar concept, are often employed by large-vocabulary
systems.
Perhaps the most challenging problem in speech recognition research is
that of modeling sentences. Uniess words are enunciated very clearly, con-
fusions between phonetically similar words are inescapable. While a person
would pick the sentence that makes the most sense, an automated system
must rely on a sentence model. Different types of models have been used,
OCR for page 220
OCR for page 222
OCR for page 223
OCR for page 224
OCR for page 225
OCR for page 226
OCR for page 227
OCR for page 228
OCR for page 230
OCR for page 231
OCR for page 232
OCR for page 233
OCR for page 234
Representative terms from entire chapter:
speech signal
220
ranging from classification trees [1] to N-step Markov chains t2] (the proba-
bility of the current word conditioned on all past words being only dependent
on the previous N words). If the speaker obeys a highly constrained gram-
mar specified by the system (e.g., the word "come" can only be followed
with "here," "over," or "back"), it becomes much easier to automatically
recognize sentences. A measure of the constraining power of a grammar
is the perplexity (for a definition see [24~. Automated systems that allow
large vocabularies and employ a grammar whose perplexity is close to the
perplexity of spoken English, can be said fairly to handle natural tasks.
It.3 Some Recent Systems
All speech recognition systems require restricted input to achieve good accu-
racy (around one word in twenty wrong). Table 11.1 provides a short guicle
to some recent systems and the constraints under which they operate.
Almost universal is the requirement that the speech be recorded in a
Tow-noise environment; a device that operates in a cocktail lounge or on a
construction site may not be seen for decades. Other standard requirements
are describer! by the terms continuous speech, speaker indepenclent, large vo-
cabulary, and natural task. Large vocabulary systems in this table recognize
more than three hundred words.
TABLE 11.1: Some Speech Recognition Systems and Their Abilities
SYSTEM I DATE | Speaker | Continuous | Large | Natural Ta
221
11.4 Signal Processing
Speech signals are typically sampled at S-16 kHz (S to 16 thousand samples
per second), where the value of each sample is specified by 10-12 bits. The
first step in most recognition systems is to process this signal, both to reduce
the amount of data and in an attempt to extract the features of the signal
that are relevant for recognition.
Some processing methods begin by taking short-term Fourier transforms;
short (on the order of 20 milliseconds), overlapping (by on the order of
10 milliseconds) frames of the signal are transformed into vectors whose
components are the energies associated with (approximately 10) differing
frequency bands. In this manner the speech signal would be represented by
a small number (on the order of 100) of vectors per second. Other processing
methods fit autoregressive (AR) moclels (of order ~ to 16) to these short,
overlapping frames of the speech signal. In that approach the speech signal
is represented by a sequence of AR parameter vectors. Note that whereas
each frame of the signal may contain 320 values (20 ms of a 16 kHz sampled
signal), this first processing step reduces it to a vector of dimension 16 or
less.
The final step of most processing algorithms is the use of vector quan-
tization [19i, which reduces each vector to an acoustic label belonging to a
small discrete set (containing on the order of 256 elements).
Briefly described, the use of vector quantization first requires that stan-
dard techniques be used to find cluster centers in a set of vectors obtained
from a representative variety of recorded speech. Each of these cluster cen-
ters is given a label. Vector quantization replaces any vector in the high-
dimensional space with the label of the closest cluster center. In this way,
a 16-dimensional vector of AR parameters could be represented by a single
byte.
Although good signal processing is critical to the successful performance
of a recognition system, it is beyond the scope of this discussion, and we
refer the reader to [2], [14j, and [7] for further details. For the remainder of
this discussion, it is assumed that the speech signal is already described by
a series of acoustic labels, each of which belongs to a small, fixed set.
Il.5 Probabilistic Recognition
The most successful approaches to the speech recognition problem use proba-
bilistic modeling. The processed speech signal y = Lye, . . ., An) is considered
to be an observation of n random variables (R.V.s) Y = (Y~, . . ., IN ). A
=
222
sentence or string of words w = (w~,...,wm) is considered to be an obser-
vation of m R.V.s W = OWN, . . ., Wm). For a fixed series of recorded acoustic
labels y, the value of the conditional distribution P(W=w~Y=y) specifies
the probability that the words w are the "script" of the recording. Speech
recognition can be accomplished by finding the word string that maximizes
this conditional distribution. (This is MAP estimation.)
By Bayes' rule,
P(W= why =y)
pty=y,W=w)/P(y=y)
P(Y = yew = w)P(W = w)/P(Y = y)
For any fixed recording, the value of P(Y=y) is a constant and the w that
maximizes P(W = why = y) also maximizes P(Y = y, W = w) and P(Y =
yew = w)P(W = w). Instead of constructing the conditional distribution
P(W = why = y), we shall consider two, wholly separate problems. The
first is the construction of a distribution P(W = w) on sentences; this is
the modeling of grammar. The second is the construction of a distribution
P(Y= yew= w) on acoustic label strings; this is the modeling of speech
production.
The remainder of this chapter is designed to provide a brief introduction
to the techniques used to implement the above approach. The construction
of P(W = w) is not discussed. (The interested reader should refer back to
§11.2 for a few references regarding the choice of a grammar.) We concen-
tate instead on presenting in some detail a simplified parametric model for
P(Y=y~W=w) similar to the SPHINX baseline system [14], which formed
the basis for the SPHINX system, a successful large-vocabulary, continu-
ous speech, speaker-indepenclent recognition system. (We recommend [14]
as a detailed guide to the construction of a complex and functional speech
recognition system.) Discussion follows on estimating the parameters of this
distribution (511.11~. The final topic is the identification of the word string
that has maximal probability conditioned on the spoken signal.
11.6 Image-Processing Parallels
This probabilistic approach to speech recognition has many points in com-
mon with Bayesian image processing using MRFs. A typical digital picture
is a specification of intensities or colors at the sites of a finite two-dimensional
lattice 1; = {(i, j)~'N=o. Modeling can be accomplished by introducing two
223
types of random variables, F={Fij}ij~L, corresponding to an observed pic-
ture, and U = {Uij~ijeL, corresponding to an unobserved "true" picture.
One method of image restoration is to calculate for any observed image f
the u that maximizes P(U=u~F=f). Bayes' rule is applied, just as it was
in speech, to reduce the construction of P(U=u~F=f) to the construction
of two separate distributions: P(U = u) and P(F = flu = u). P(U = u)
is caned the prior distribution and has the same role that P(W = w), the
sentence model, did for speech. P(F = MU = u) is the degradation model,
and is the analogue of the speech production model.
Both image-processing tasks and speech recognition require the place-
ment of distributions on very large (~ 2~°°°), but finite, spaces. Both rely
on Markov assumptions to allow computational feasibility. Major differences
are that the speech problem is inherently one-dimensional, whereas pictures
are multidimensional. The inherent one-dimensionality of speech signals al-
tows the use of fast estimation and search techniques. Although some image
models allow the use of similar techniques t11], the class of such models is
highly restricted and may not be particularly useful.
I l.7 Mocleling of Speech
The recognition system we describe uses phoneme models as an intermediate
stage between acoustic labels and words. For every phoneme we will form
a distribution on acoustic label strings produced during the enunciation of
the phoneme. These distributions take the form of HMMs. In our simplified
presentation, the effects of co-articulation wiD be ignored; the distribution
of the acoustic labels associated with a given phoneme will be assumed to
be independent of the neighboring phonemes.
A more ambitious speech recognition system would model phonemes in
context. In such a system, the goal would still be to put a distribution
on acoustic strings produced during the enunciation of the phoneme. How-
ever, the distribution would also depend (commonly) on the preceding and
following phonemes (e.g., a distribution would be formed for the acoustic
label strings associated with the phoneme OR\ when that phoneme is in
the phoneme string \TB\ \R\ \ZY\~. The fundamental difficulty of this ap-
proach is that the number of such (distributions would be approximately 403,
and parameter estimation becomes impossible. However, clever implemen-
tations of context-dependent phoneme models have been made, and we refer
the reader to [14] for details.
224
The use of phoneme models (either context-dependent or context-inde-
pendent3 usually necessitates the assumption that the distribution of the
acoustic labels produced during a given phoneme, or that lies within a given
context for context-dependent models, be independent of the acoustic labels
produced during other phonemes. This assumption allows the production
of acoustic labels for a string of phonemes to be considered as though each
phoneme was produced independently and the results concatenated. This
type of assumption is necessary in order to build models by "instantiation,"
a technique that is described in §11.10.
Although many words have multiple pronunciations this model assumes
con ~ ~ ~ ~
. . . . .
that each word has a unique pronunciation, and therefore a unique phonemic
representation. This assumption is used in some state-of-the art systems.
Such an assumption forces the phoneme models to model implicitly some of
the variability in pronunciation.
In the remainder of this chapter it is assumed that a grammar, and thus
P(W = w3, has been specified. The modeling strategies and assumptions
above will be used to produce P(Y = y jW = w3. Combining this with the
value for P(W = w3 yields P(W = w, Y = y3.
Il.S Hidclen Markov Modeling
First, we introduce hidden Markov moctels (HMMs3, and then describe their
use in speech recognition. A HMM describes the behavior of two series of
discrete R.V.s, cad them X = (X~,X2,...3 and Y = (Y~,Y2,.~-3 The X
series is caHect the hidden R.V.s, and the other series is called the observed
R.V.s. The conditional distribution of Xi given the values of all the previous
R.V.s (Xj, Yj,Vj < it will be dependent only on the value of Xi_~ (and not
on it. The conditional distribution of Yi, given the values of all other R.V.s
(both hidden and observed), will be dependent only on the value of Xi and
Xi_~ (andi not on i):
P(Xi=xi~Xj=xj,Yj=yi'~j < i)
P(Yi = yi~Xj =Xj, Yj = yj, JO ~ i)
P(Xi = Xi Xi-i = xi-i ~
P(Yi=Yi~Xi=Xi,Xi-l =Xi_~) -
The "hidden" part of a HMM arises from its use. Only observations of
the R.V.s Y will be used to estimate the transition matrix P(Xi~Xi_~) and
the conditional distribution P(Yi ~Xi, Xi_, ). These conclitional probabilities
are sometimes called the parameters of the modes.
225
For each phoneme, we construct a distribution on acoustic label strings y
and hidden state strings x. Based on these distributions we can construct the
distribution P(Y = y, X = x~W = w). Note that this distribution specifies
the distribution P(Y = yaw = w).
11.9 The SPHINX Phoneme Mode}
We now present a model for the production of one phoneme. Note that
although a phoneme may be pronounced either quickly or slowly, the acoustic
labels are produced at a constant rate. A phoneme model must allow the
production of variable numbers of acoustic labels. As shown in Figure 11.1~
1 ~~ ~ ~~1,~ ~ ~~ ~11 ~ ~ _.~1 ~~ an_ ~ _~11 ~ ~ ~~ ~
~ — - ,
el Hi bake on seven allowed values, and call sued o1,..,S7. X1 equals S
with probability 1. In Figure 11.1, arrows connect two states Si and Sj
(possibly the same) if P(Xk=Sj~Xk_~=St) is ahowed to be nonzero. With
every allowed transition (from Si to Sj) there is an associated character, B.
M, or E, denoting the beginning, middle, or end of the pronunciation of the
phoneme. The distribution of Yi conditioned on observations of all the other
R.V.s wid only depend on the character associated with the transition from
Xi_~ to Xi (e.g., from Figure 11.1, P(Yi=yt~Xi-~=s2'Xi=s3) = PM(Yi=
Yip. When S7 iS reached, the pronunciation of the phoneme is complete.
Note that PB, PM, and PE are distributions on Y. and recall that an
acoustic label can typically have any one of a few hundred values. The
distributions PB, PM, and PE will not be parametrized, so the specification
of each distribution requires the specification of many probabilities. For the
non-baseline SPHINX system, approximately 700 probabilities are needed
to describe each of the distributions PB, PM, and PK.
Hidden Markov models possess desirable properties. The observed R.V.s
(Y) behave in a very simple fashion (they are independent) when conditioned
on the hidden R.V.s (X), but the marginal distribution of Y possesses no
such independence. The behavior of the observed R.V.s is quite compli-
cated. The model above allows variable length label strings, and allows
the probabilities of strings of length 1, 2, and 3 to be anywhere between O
and 1. The loops at states S2, S3, and S4 allow the phoneme to idle at a
particular state, helping to model the various ways in which phonemes can
be extended in duration (e.g., some phonemes may occasionally have tong
"middles," other phonemes may always have short "middles"~.
~ r ~ _ ~ \
226
Q
-
Ma
it'< Y _~7J
-
-
FIGURE 11.1: Allowed transitions.
11.10 Instantiation
We now use the phoneme models and the assumptions listed in the por-
tion on modeling to construct P(Y = y, X = x~W = w). The distribution
P(Y = y, X = x~W = w) is formed by using the values of the probabili-
t~es ~~_~' and ~tY`~,~_~' as specified by appropriate phoneme
models. The manner in which this is done is called "instantiation," and is
described below. For each sentence w = Owl, we, .., wm) there is an associ-
ated string of phonemes p = (Pi, Pi, · · ·, PI ~ ~ P2 P2 2 pnm ~ where pi
is the Ah phoneme in the ith word. The total number of phonemes associated
with the sentence is n = Hi nit The distribution P(Y = y, X = x~W = w) is
a HMM where the hidden variables Xi can take on values in
. . _ ~ _ _ ~ _ ~ ~ ~ ~ ~ ~ ~ . ~ ~ a
{S1(P1)' · · · ~ 56(P1), S1(P1)' · · ~ 56(P1), · · ~ Sl(Pm )' ~ 57(pmm)} ~
This set contains 6n+ 1 states, six for each phoneme and one state signifying
the end of the sentence. Note that the state of the hidden variable Xi
specifies the word, phoneme, and part of phoneme that is being produced
at time i.
The distribution P(Y = y,X = x~W = w) is formed by defining the
transition probabilities
P(`X~ = Sj(~pk)~X~_~ = Si(~Pt )), 1 ~ i, j < 6, 1 < k < m, 1 < ~ < nk
to be the same as those specified by the phoneme model for the phoneme
227
~ (SH) M ($ H) E(s H)
\ ~ \ /
FIGURE 11.2: Graphical representation of the HMM for the word string
she.
pi. The values of
P(X~=S~(pik i)~X~_~=S,fpi)), 1 ~ i < 6, ~ < k < m, 1 < ~ < nk
P(X~=S~ (pk+~X~_~ =Si~pi )), 1 < i < 6, 1 < k ~ m, ~ = nk
P(X~ = S7 (pi )X_ = Sit )), 1 < i < 6, k = m, ~ = nm
have the values that transitions to S7 had in the phoneme mode! for Pk- AD
the other transition probabilities have value zero. When X~ is observed to
have value S7(p7tmm), the sentence has been completed.
The distributions P(Yi~Xi,Xi_~), similarly, are those associated with
the related phoneme model. To account for between-word pauses, a nonzero
term P(X' = So (pk)~X'_~ = So (pk)) can be added where PLY = silence~X~ =
Sl,fpk),X~_~ = Sl (Pk)) = 1; for the sake of clarity, we wit! not discuss this
detail.
Figure 11.2 is a graphical representation of the HMM for the word string
she. Notice that the distribution P(Y = y, X = x~W = "she") contains no
explicit mention of phonemes or words. It is a distribution on strings of
acoustic labels and hidden states.
As indicated previously, any string of hidden states x is associated with a
unique word string. Note that this forces the value of P(Y=y,X=x,W=
w) to be either zero or equal to P(Y=y, X =x).
228
11.~l Computational Issues: Parameter
Estimation ant! Searches
The model for P(W = w,Y = y) is based on phoneme models. The esti-
mation of parameters for this distribution consists primarily of estimating
the parameters of the phoneme models. The simplest way to estimate the
parameters of phoneme models would be to excise examples of phonemes
from speech, and estimate parameters for each phoneme separately. This is
not done, perhaps because the excision of phonemes (accomplished by hand)
cannot be done in a manner consistent with the assumptions implicit in the
phoneme models. Most current systems estimate parameters using training
data consisting of utterances of whole known sentences. Below we present
the algorithm used to perform this very Circuit computational task. We
also briefly present the typical approach used to speed the computations
associated with the use of a trained system.
We wish to estimate the parameters of a HMM from one or more obser-
~r~tir~nc Of l
229
to consistently produce strings, ant! that one of these terms shouIc! be > .8
because a string of length 22 would otherwise be quite unlikely.
One of the reasons for the success of HMM is the existence of a compu-
tationaLy efficient method for approximate maximum likelihood parameter
estimation (as opposed to the completely ad hoc estimation above). Starting
with an initial guess for the parameters, and an observation y = (ye, . . ., AT)
of Y = EYE,... ,YT), the Baum-Welch algorithm [5,4] (also known of as the
forward-backward aIgorithm) is an iterative method for selecting parameters
that ensures that at every iteration the likelihood increases. Convergence to
a local maxima of the likelihood is guaranteed. The Baum-Welch algorithm
is equivalent to, and predates, the expectation maximization (EM) method
[105.
We present here the Baum-Welch algorithm. Let
ajk = P(Xi=k~Xi_~=j) j,k= 1,...,s
bJk(~) = P(Yi = ll~Xi = k, Xi_ = j) j, k = 1, . . ., s, ~ = 1, . . ., r .
Assume, for simplicity, that aj = P(Xo = j) and bj (lo = P(YT = ~ XT = j) are
known. Let Pab be the distribution on (X0,...,XT, Y1,- -,YT) generated
by a and b. The likelihood of y is ~= P(Y = y, X = x), which can be written
as
s
Pab(Y = y) = At, ado afoul boot ( Y1 tail j2 bil j2 (Y2 ~ · · · at—l iT bar ( YE ~
. . .
}0,}1 ,...,:~1
Note that a naive calculation of the expression above would require an ex-
treme amount of computation, the performance of 2T x sT+i multiplication
operations. A more efficient approach is to rewrite it as
Pab(Y=Y) = ~' ajO ~ ajoj~bjoji(yi) ~ ajij2bji j2(y2) · · · ~ ajT—~jTbjT(YT)
Jo = ~ Jo = ~ i2=1 AT=}
and perform the computation by first calculating ~jTajT_:jTbjT(YT) and
storing its value for ah values of jet-. Then the sum over iT-i can be com-
puted and stored for ah values of iT-2. Repeating this until the likelihood
is evaluated results in 2T x S2 multiplications being conducted.
Each iteration of the Baum-Welch algorithm results in new estimates
(ajk) and {bjk(1~:, based on the slate y and on the previous estimates jacks
and (bJk(~)
- Ei Pab(Xi-l =j'Xi =k~Y=y)
j Zi Pab(Xi—~ = jay = y)
230
b~ (1) Pi you Pab(Xi—1 =i, Xi = klY = y)
Zi Pab(xi-l = i' Xi = klY = y)
These re-estimation equations are the heart of the algorithm. For those
more familiar with the EM algorithm, the above formulas can be interpreted
as the calculation of ~iEab(X~j~k,`~(Xi'Xi_l,Yi)~Y=y), where Xi denotes
the indicator function for the value i; Xi(j) = 1 if i = j, and 0 otherwise.
These indicator functions are the natural sufficient statistics when a HEM
is represented as an exponential family (which can be accomplished via
the Gibbs-MRF equivalence). The maximization step of the EM algorithm
becomes trivial in this case. We shall not present any proof that the above
re-estimation increases the likelihood of y; instead we refer the reader to
t4,54.
The Baum-Welch algorithm in addition to the formulas above, specifies
a method! to calculate the new estimates quickly. This is essential since the
distribution Pab(Xi_~ = i,Xi = key = y) is dependent on all the values of
Y. and a naive calculation would require as many operations as a naive
calculation of the likelihood. However, we can implement a computational
strategy similar to that introduced above to compute thelikelihood. The
re-estimation equations above can be written as
a — Ei °ti(i)Fi+1 (k)ajkbjk(Yi+1 )
ok — Hi (>i(j)3i(i)
b (I) ~' :y' =i ~ (A )>i+l ( k )a jk by k ( Yi+ l )
where cat and ,l] can be defined inductively in i, forward and backward re-
spectively, by
s s
0ti+1(k) = ~°bi(i)ajktjk(Yi+~), pi(i) = ~>i+l(k)aJkbjk(Yi+l)-
j=] j=1
The implementation of the Baum-Welch algorithm for speech recognition
depends, obviously, on the form of training data. The standard scenario, as
mentioned above, is that there is a list of known sentences and pronuncia-
tions of them. We can therefore construct a HMM P(Y=y, X=x~W =w)
for each known sentence (as in §11.10~. The estimation of parameters for
the HMMs can proceed by the Baum-Welch algorithm with the simple mod-
ification that the iterations be performed synchronously. One iteration will
be conducted for each HMM sentence model, then the estimated parameters
231
for each sentence will be combined into one estimate for Al the parameters
of all the phoneme models at the end of every iteration.
The performance of the Baum-Welch algorithm depends highly on the
quality of the initial estimates. While every iteration of the algorithm guar-
antees an increase in the likelihood, a bad initial guess may cause con-
vergence to a bad (Iow) local maximum of the likelihood. Whereas whole
sentences are used to train the phoneme models, the initial estimates for
the distributions P(Yi~Xi,Xi_~) often come from excised phoneme data. In
[15], for every phoneme the three distributions PB(Yi), PM(Yi), and PE(Yi)
are initialized to a histogram of the acoustic label values associated with
that phoneme in hand-labeled data. The initial estimates of the transition
probabilities of the hidden states were chosen so that all allowed transitions
from a state had equal probability.
Another interesting implementational detai!is that the Baum-Welch al-
gorithm is typically used for only 2 or 3 iterations A, page 35], [144. On
the other hand, EM is well known for slowness after the first few iterations.
In [14] the performance of the recognition system is stated to worsen with
continued iterations, suggesting to us that an overfit of the training data is
occurring.
Once parameter estimation is accomplished, there remains the use of the
recognition system. As was stated at the beginning of this section, our goal
is the calculation of the string w that maximized P(W = why = y). The
use of Bayes' rule aBowed us to modify this to the calculation of the string
w that maximized P(Y = y,W = w). Recall that we have constructed
P(Y = y, X = x, W = w), which equals P(Y = y, X = x) when w is the word
string associated with x, and zero otherwise.
The string w that maximizes P(Y=y,W=w) is usually approximated
by the w associated with the x that maximizes P(Y = y,X = x). The
principal justification for this approximation, besides its success and com-
putational simplicity, is that the most likely word string should have at least
one corresponding hidden state string that win also be very likely. The most
likely string of hidden states, for small vocabulary and simple grammar sys-
tems, can be found by a simple dynamic programming [6] scheme called the
Viterbi algorithm [2~. For more complicated systems the search is performed
by more ad hoc methods that are based on dynamic programming [2,14]
11.12 Future Directions
The construction of a large-vocabulary, speaker-independent, complicated-
grammar speech recognition system from scratch is a daunting task. How-
232
ever, it is one that new researchers interested in speech recognition will not
have to face. Databases for speech recognition are becoming commonly avail-
able. As a result, various fascinating and extremely challenging subproblems
can bee approached by single researchers on current generation workstations.
One such problem is the speaker-independent recognition of phonemes in
continuous speech; another is the recognition of connected digits.
Whereas HMMs have been the most successful approach to date, the fun-
damental reason for their current superiority is the dedication and creativity
of those who have implemented them. Preliminary research indicating that
other approaches can be as accurate and computationally feasible is pre-
sented in [174. It is hoped that, as the computational resources to approach
the speech recognition problem become available to a larger community, a
diversification of approaches wiD occur and that this chapter encourages
research in this direction.
Bibliography
[1] BahI, L. R., P. F. Brown, P. V. De Sonza, and R. L. Mercer, A tree
based statistical language model for natural language speech recogni-
tion, IEEE Trans. Acoust. Speech Signal Processing 37 (1989), 1001-
1008.
[2] Bahl, L. R., F. Jelinek, and R. L. Mercer, A maximum likelihood ap-
proach to continuous speech recognition, IEEE Trans. Pattern Anal.
Machine Intel. 5 (1983), 179-190.
[3] Baker, J. K., The DRAGON system—An overview, IEEE Trans.
Acoust. Speech Signal Process. 23 (1975), 24-29.
233
[4] Baum, L. E., An inequality and associated maximization technique in
statistical estimation for probabilistic functions of Markov processes, in
Inequalities Ill, Academic Press, 1972.
t5] Baum, L. E., T. Petrie, G. Soules, and N. Weiss, A maximization tech-
nique occurring in the statistical analysis of probabilistic functions of
Markov chains, Ann. Math. Stat. 41 (1970), 164-171.
[6] Bellman, R. E., Dynamic Programming, Princeton University Press,
Princeton, 1957.
[7] Brown, P. F., The Acoustic-Mode1ting Problem in Automatic Speech
Recognition, Ph.D. thesis, Computer Science Department, Carnegie
MeHon University, Pittsburgh, 1987.
i8] Cole, R. A., R. M. Stern, M. S. Phillips, S. M. Brill, P. Specker, and
A. P. Pilant, Feature-based speaker independent recognition of English
letters, IEEE Inter. Conf. Acoust. Speech Signal Process. (Oct. 1983~.
A] Cooke, P., Are accents out?, New York Times Magazine (Nov. 19, 1989),
50.
t10] Dempster, A. P., N. M. Lair~l, and D. B. Rubin, Maximum likelihood
from incomplete data via the EM algorithm, '7. R. Stat. Soc., B 39
(1977), 1-38.
t11] DeviJver, P. A., and M. M. Dekesel, Learning the parameters of a hicIden
Markov random field image model: A simple example, pp. 141-163
in Pattern Recognition Theory and Applications, P. A. DeviJver and
J. Kittler, eds., NATO AST Series, vol. F30.
[12] IBM speech recognition group, A read-time, isolated word, speech recog-
nition system for dictation transcription, IEEE Inter. Conf. Acoust.
Speech Signal Process. (Mar. 1985~.
[13] Itakura, F., Minimum prediction residual principle applied to speech
recognition, IEEE Trans. Acoust. Speech Signal Process. 23 (1975), 67-
72.
t14] Lee, Kai-Fu, Large- Vocabulary Speaker-Independent Continuous Speech
Recognition: The SPHINX System, Ph.D. thesis, Computer Science
Department, Carnegie Mellon University, Pittsburgh, 1988.
234
[15] Lee, Kai-Fu, and H. W. Hon. Speaker-independent phone recognition
using hidden Markov models, IEEE Trans. Acoust. Speech Signal Pro-
cess. 37 (1989), 1641-1648.
[16] Lesser, V. R., R. D. Fennell, IJ. D. Erman, and R. D. Reddy, The
Hearsay IT speech understanding system, IEEE Trans. Acoust. Speech
Signal Process. 23 (1975), 11-24.
[17] Lippman, A., Data determined Markov models for speech recognition,
Proceediangs of the 1990 Asi70mar Conference on Signals, Systems, and
Computers (~to appear).
[18] Lowerre, B. T., The HARPY Speech Recognition System, Ph.D. the-
sis, Computer Science Department, Carnegie Mellon University, Pitts-
burgh, 1976.
[19] Makhoul, J., S. Roucos, and H. Gish, Vector quantization in speech
coding, Proc. IEEE 73 (1985),1551-1588.
[20] Rabiner, I,. R., J. G. Wilpon, and F. K. Soong, High performance
connected digit recognition using hidden Markov models, IEEE Int.
Conf. Acoust. Speech Sigrzat Process. (Apr. 1988~.
[21] Wilpon, J. G., L. R. Rabiner, and A. Bergh, Speaker-independent iso-
lated word recognition using a 129-word airline vocabulary, J. Acoust.
Soc. Amer. 72 (1982), 390-396.