Read "Spatial Statistics and Digital Image Analysis" at NAP.edu

« Previous: 10. Stereology

Page 217 Cite

Suggested Citation:"11. Markov Models for Speech Recognition." National Research Council. 1991. Spatial Statistics and Digital Image Analysis. Washington, DC: The National Academies Press. doi: 10.17226/1783.

Page 218 Cite

Page 219 Cite

Page 220 Cite

Page 221 Cite

Page 222 Cite

Page 223 Cite

Page 224 Cite

Page 225 Cite

Page 226 Cite

Page 227 Cite

Page 228 Cite

Page 229 Cite

Page 230 Cite

Page 231 Cite

Page 232 Cite

Page 233 Cite

Page 234 Cite

Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

11 Markov Moclels for Speech Recognition Alan F. Lippman University of Washington ~ I. ~ Introduction The goal of speech recognition research is to enable machines to reduce human speech to a list of printed words. By imposing restrictions such as a limited vocabulary and grammar, automated speech recognition systems have been produced during the last 15 years that, within those constraints, approach human performance. Sustained research efforts have resulted in systems that place substantially fewer restraints on the speaker. Earlier recognition systems typically required that the words spoken belong to a fixed small vocabulary, that the speaker pause between each word, and that the speaker participate in a training period during which the system would automatically adjust to that particular speaker (and no other). Each of the constraints above was used to reduce the complexity inherent in natural speech. This chapter presents an introduction to concepts under- lying much of the work in speech recognition anti, in the process, explains how the constraints above simplify the problem. The chapter then presents a detailed description of a simple probabilistic speech recognition system, modeled after the SPHINX system t14], that implements hidden Markov moclels (HMMs). Hidden Markov models are the basis for many recently developed speech recognition systems and are related to Markov Random Fields (MRFs), which have been successfully applied to some image-processing tasks (see chapter 3~. Both approaches rely on similar probabilistic frameworks to de- scribe and exploit the relationship between the items of interest (e.g., the 2~7

218 microphone recording and its transcript, the degraded and "true" images). Both approaches share some fundamental tools: maximum likelihood esti- mation to estimate parameters, maximum a posterior) (MAP) estimation to perform recognition/restoration, and the use of Markovian models to make these estimation problems tractable. However, recorded speech, unlike im- ages, is not a spatial process, it is a function of pressure (or through a microphone, voltage) versus time. This chapter provides a quite different view of some of the methods of chapter 3. Il.2 Basic Speech Concepts Some of the basic units of speech are the sentence, the word, and the phoneme. There are approximately 40 phonemes, each corresponding to a distinctive sound. The symbol list that dictionaries provide after each word (as a pronunciation guide), is a list of phonemes. Phonemes, words, and sentences are all fundamental to the way we produce speech; people think in terms of sentences and words, while phonemic descriptions are nec- essary for people to pronounce words correctly. Modern speech recognition systems function in a similar way: a hierarchical approach is used where sentences are modeled as series of words; words are modeled in terms of phonemes; and phonemes are modeled as series of features of the signal. By nesting these three layers of models, printed sentences (our desired goal) are related to the speech signal. Neither words nor phonemes are easily identified in the speech signal. Even finding the boundaries between two words can be a difficult task, since in the speech signal, phonemes as well as words may merge together and no simple signal processing can separate them. A spoken phrase like "this is," can easily be interpreted as "the sis" (or even by some graduate students as "thesis"~. Most people have had the experience of listening to a foreign language and hearing only a steady stream of sounds, illustrating that one must understand speech to find word boundaries. (This type of difficulty is familiar to those who work in image segmentation; it is often impossible to find the boundaries between objects without some knowle(lge about what the objects are.) Isolated-word recognition systems require that the boundaries between words be obvious. This is typically accomplished by requiring a speaker to pause between words. These silent pauses can be easily identified in the signal. Individually spoken words also tend to be enunciated more clearly,

219 aiding recognition. The main drawback of such systems is that the speaker is constrained to speak in a slow, unnatural manner. Continuous-speech recognition systems lack this constraint. The construction of even an isolated-word speech recognizes is difficult. The speech signal associated with a word or a phoneme is extremely vari- able, and can vary greatly depending on both the speaker's identity and the manner in which the word was spoken. Variability is caused by anatomical differences between speakers, such as sex or vocal tract length, as well as by differences in style, health (presence or absence of a cold), speaking rate, stress, or accent. A quickly spoken word will frequently be slurred or have parts "missing." Accents can result in the wholesale shifting of vowels [94. In addition, some words have many allowed (as opposes! to recommended) pronunciations. This is especially true for common wordsq like "the`" that are typically articulated poorly ("the" is often pronounced as "dee," "dah," "dih," "thah," or "thigh". ~ bpeaker-ctepen~ent systems simplify the recogni- tion problem by adapting themselves to one particular speaker, removing some of the causes of variability. The speech signal associated with a Phoneme also varies depending on the context in which it is pronounced. This effect is called co-articulation. As people speak, the tongue, lips, and other articulators must move from a position necessary to pronounce the current phoneme to a position that will result in the next phoneme being produced. The articulators tend to move only far enough for speech to be intelligible, so current positioning is affected by the previous positioning. The placing of the articulators can also be influenced by the position that should be occupied next, a form of "thinking ahead" that people perform automatically. For small-vocabulary recognition systems, the concept of phonemes typ- ically is not used. Many examples of each word are used to discover a direct relationship between the word and the speech signal. In this way the effect of co-articulation is modeled implicitly. Since the required number of examples will grow linearly as a function of vocabulary size, this type of approach is almost impossible for vocabularies containing more than a thousand words. Phonemes, or some similar concept, are often employed by large-vocabulary systems. Perhaps the most challenging problem in speech recognition research is that of modeling sentences. Uniess words are enunciated very clearly, con- fusions between phonetically similar words are inescapable. While a person would pick the sentence that makes the most sense, an automated system must rely on a sentence model. Different types of models have been used,

220 ranging from classification trees [1] to N-step Markov chains t2] (the proba- bility of the current word conditioned on all past words being only dependent on the previous N words). If the speaker obeys a highly constrained gram- mar specified by the system (e.g., the word "come" can only be followed with "here," "over," or "back"), it becomes much easier to automatically recognize sentences. A measure of the constraining power of a grammar is the perplexity (for a definition see [24~. Automated systems that allow large vocabularies and employ a grammar whose perplexity is close to the perplexity of spoken English, can be said fairly to handle natural tasks. It.3 Some Recent Systems All speech recognition systems require restricted input to achieve good accu- racy (around one word in twenty wrong). Table 11.1 provides a short guicle to some recent systems and the constraints under which they operate. Almost universal is the requirement that the speech be recorded in a Tow-noise environment; a device that operates in a cocktail lounge or on a construction site may not be seen for decades. Other standard requirements are describer! by the terms continuous speech, speaker indepenclent, large vo- cabulary, and natural task. Large vocabulary systems in this table recognize more than three hundred words. TABLE 11.1: Some Speech Recognition Systems and Their Abilities SYSTEM I DATE | Speaker | Continuous | Large | Natural Ta<k Independence Speech Vocabulary NTT 1975 No No No No . DRAGON 1975 No Yes No No HEARSAY 1975 No Yes Yes No HARPY 1976 No Yes Yes BELL '82 1982 Yes No No No FEATURE 1983 Yes No No No _ TANGORA 1985 No No Yes Yes BYBLOS 1987 No Yes Yes No ~ BELL '88 ~ ~ 1988 Yes Yes No ~ No | SPHINX | 1988 l ''es l Yes | Yes | No

221 11.4 Signal Processing Speech signals are typically sampled at S-16 kHz (S to 16 thousand samples per second), where the value of each sample is specified by 10-12 bits. The first step in most recognition systems is to process this signal, both to reduce the amount of data and in an attempt to extract the features of the signal that are relevant for recognition. Some processing methods begin by taking short-term Fourier transforms; short (on the order of 20 milliseconds), overlapping (by on the order of 10 milliseconds) frames of the signal are transformed into vectors whose components are the energies associated with (approximately 10) differing frequency bands. In this manner the speech signal would be represented by a small number (on the order of 100) of vectors per second. Other processing methods fit autoregressive (AR) moclels (of order ~ to 16) to these short, overlapping frames of the speech signal. In that approach the speech signal is represented by a sequence of AR parameter vectors. Note that whereas each frame of the signal may contain 320 values (20 ms of a 16 kHz sampled signal), this first processing step reduces it to a vector of dimension 16 or less. The final step of most processing algorithms is the use of vector quan- tization [19i, which reduces each vector to an acoustic label belonging to a small discrete set (containing on the order of 256 elements). Briefly described, the use of vector quantization first requires that stan- dard techniques be used to find cluster centers in a set of vectors obtained from a representative variety of recorded speech. Each of these cluster cen- ters is given a label. Vector quantization replaces any vector in the high- dimensional space with the label of the closest cluster center. In this way, a 16-dimensional vector of AR parameters could be represented by a single byte. Although good signal processing is critical to the successful performance of a recognition system, it is beyond the scope of this discussion, and we refer the reader to [2], [14j, and [7] for further details. For the remainder of this discussion, it is assumed that the speech signal is already described by a series of acoustic labels, each of which belongs to a small, fixed set. Il.5 Probabilistic Recognition The most successful approaches to the speech recognition problem use proba- bilistic modeling. The processed speech signal y = Lye, . . ., An) is considered to be an observation of n random variables (R.V.s) Y = (Y~, . . ., IN ). A =

222 sentence or string of words w = (w~,...,wm) is considered to be an obser- vation of m R.V.s W = OWN, . . ., Wm). For a fixed series of recorded acoustic labels y, the value of the conditional distribution P(W=w~Y=y) specifies the probability that the words w are the "script" of the recording. Speech recognition can be accomplished by finding the word string that maximizes this conditional distribution. (This is MAP estimation.) By Bayes' rule, P(W= why =y) pty=y,W=w)/P(y=y) P(Y = yew = w)P(W = w)/P(Y = y) For any fixed recording, the value of P(Y=y) is a constant and the w that maximizes P(W = why = y) also maximizes P(Y = y, W = w) and P(Y = yew = w)P(W = w). Instead of constructing the conditional distribution P(W = why = y), we shall consider two, wholly separate problems. The first is the construction of a distribution P(W = w) on sentences; this is the modeling of grammar. The second is the construction of a distribution P(Y= yew= w) on acoustic label strings; this is the modeling of speech production. The remainder of this chapter is designed to provide a brief introduction to the techniques used to implement the above approach. The construction of P(W = w) is not discussed. (The interested reader should refer back to §11.2 for a few references regarding the choice of a grammar.) We concen- tate instead on presenting in some detail a simplified parametric model for P(Y=y~W=w) similar to the SPHINX baseline system [14], which formed the basis for the SPHINX system, a successful large-vocabulary, continu- ous speech, speaker-indepenclent recognition system. (We recommend [14] as a detailed guide to the construction of a complex and functional speech recognition system.) Discussion follows on estimating the parameters of this distribution (511.11~. The final topic is the identification of the word string that has maximal probability conditioned on the spoken signal. 11.6 Image-Processing Parallels This probabilistic approach to speech recognition has many points in com- mon with Bayesian image processing using MRFs. A typical digital picture is a specification of intensities or colors at the sites of a finite two-dimensional lattice 1; = {(i, j)~'N=o. Modeling can be accomplished by introducing two

223 types of random variables, F={Fij}ij~L, corresponding to an observed pic- ture, and U = {Uij~ijeL, corresponding to an unobserved "true" picture. One method of image restoration is to calculate for any observed image f the u that maximizes P(U=u~F=f). Bayes' rule is applied, just as it was in speech, to reduce the construction of P(U=u~F=f) to the construction of two separate distributions: P(U = u) and P(F = flu = u). P(U = u) is caned the prior distribution and has the same role that P(W = w), the sentence model, did for speech. P(F = MU = u) is the degradation model, and is the analogue of the speech production model. Both image-processing tasks and speech recognition require the place- ment of distributions on very large (~ 2~°°°), but finite, spaces. Both rely on Markov assumptions to allow computational feasibility. Major differences are that the speech problem is inherently one-dimensional, whereas pictures are multidimensional. The inherent one-dimensionality of speech signals al- tows the use of fast estimation and search techniques. Although some image models allow the use of similar techniques t11], the class of such models is highly restricted and may not be particularly useful. I l.7 Mocleling of Speech The recognition system we describe uses phoneme models as an intermediate stage between acoustic labels and words. For every phoneme we will form a distribution on acoustic label strings produced during the enunciation of the phoneme. These distributions take the form of HMMs. In our simplified presentation, the effects of co-articulation wiD be ignored; the distribution of the acoustic labels associated with a given phoneme will be assumed to be independent of the neighboring phonemes. A more ambitious speech recognition system would model phonemes in context. In such a system, the goal would still be to put a distribution on acoustic strings produced during the enunciation of the phoneme. How- ever, the distribution would also depend (commonly) on the preceding and following phonemes (e.g., a distribution would be formed for the acoustic label strings associated with the phoneme OR\ when that phoneme is in the phoneme string \TB\ \R\ \ZY\~. The fundamental difficulty of this ap- proach is that the number of such (distributions would be approximately 403, and parameter estimation becomes impossible. However, clever implemen- tations of context-dependent phoneme models have been made, and we refer the reader to [14] for details.

224 The use of phoneme models (either context-dependent or context-inde- pendent3 usually necessitates the assumption that the distribution of the acoustic labels produced during a given phoneme, or that lies within a given context for context-dependent models, be independent of the acoustic labels produced during other phonemes. This assumption allows the production of acoustic labels for a string of phonemes to be considered as though each phoneme was produced independently and the results concatenated. This type of assumption is necessary in order to build models by "instantiation," a technique that is described in §11.10. Although many words have multiple pronunciations this model assumes con ~ ~ ~ ~ . . . . . that each word has a unique pronunciation, and therefore a unique phonemic representation. This assumption is used in some state-of-the art systems. Such an assumption forces the phoneme models to model implicitly some of the variability in pronunciation. In the remainder of this chapter it is assumed that a grammar, and thus P(W = w3, has been specified. The modeling strategies and assumptions above will be used to produce P(Y = y jW = w3. Combining this with the value for P(W = w3 yields P(W = w, Y = y3. Il.S Hidclen Markov Modeling First, we introduce hidden Markov moctels (HMMs3, and then describe their use in speech recognition. A HMM describes the behavior of two series of discrete R.V.s, cad them X = (X~,X2,...3 and Y = (Y~,Y2,.~-3 The X series is caHect the hidden R.V.s, and the other series is called the observed R.V.s. The conditional distribution of Xi given the values of all the previous R.V.s (Xj, Yj,Vj < it will be dependent only on the value of Xi_~ (and not on it. The conditional distribution of Yi, given the values of all other R.V.s (both hidden and observed), will be dependent only on the value of Xi and Xi_~ (andi not on i): P(Xi=xi~Xj=xj,Yj=yi'~j < i) P(Yi = yi~Xj =Xj, Yj = yj, JO ~ i) P(Xi = Xi Xi-i = xi-i ~ P(Yi=Yi~Xi=Xi,Xi-l =Xi_~) - The "hidden" part of a HMM arises from its use. Only observations of the R.V.s Y will be used to estimate the transition matrix P(Xi~Xi_~) and the conditional distribution P(Yi ~Xi, Xi_, ). These conclitional probabilities are sometimes called the parameters of the modes.

225 For each phoneme, we construct a distribution on acoustic label strings y and hidden state strings x. Based on these distributions we can construct the distribution P(Y = y, X = x~W = w). Note that this distribution specifies the distribution P(Y = yaw = w). 11.9 The SPHINX Phoneme Mode} We now present a model for the production of one phoneme. Note that although a phoneme may be pronounced either quickly or slowly, the acoustic labels are produced at a constant rate. A phoneme model must allow the production of variable numbers of acoustic labels. As shown in Figure 11.1~ 1 ~~ ~ ~~1,~ ~ ~~ ~11 ~ ~ _.~1 ~~ an_ ~ _~11 ~ ~ ~~ ~ ~ - , el Hi bake on seven allowed values, and call sued o1,..,S7. X1 equals S with probability 1. In Figure 11.1, arrows connect two states Si and Sj (possibly the same) if P(Xk=Sj~Xk_~=St) is ahowed to be nonzero. With every allowed transition (from Si to Sj) there is an associated character, B. M, or E, denoting the beginning, middle, or end of the pronunciation of the phoneme. The distribution of Yi conditioned on observations of all the other R.V.s wid only depend on the character associated with the transition from Xi_~ to Xi (e.g., from Figure 11.1, P(Yi=yt~Xi-~=s2'Xi=s3) = PM(Yi= Yip. When S7 iS reached, the pronunciation of the phoneme is complete. Note that PB, PM, and PE are distributions on Y. and recall that an acoustic label can typically have any one of a few hundred values. The distributions PB, PM, and PE will not be parametrized, so the specification of each distribution requires the specification of many probabilities. For the non-baseline SPHINX system, approximately 700 probabilities are needed to describe each of the distributions PB, PM, and PK. Hidden Markov models possess desirable properties. The observed R.V.s (Y) behave in a very simple fashion (they are independent) when conditioned on the hidden R.V.s (X), but the marginal distribution of Y possesses no such independence. The behavior of the observed R.V.s is quite compli- cated. The model above allows variable length label strings, and allows the probabilities of strings of length 1, 2, and 3 to be anywhere between O and 1. The loops at states S2, S3, and S4 allow the phoneme to idle at a particular state, helping to model the various ways in which phonemes can be extended in duration (e.g., some phonemes may occasionally have tong "middles," other phonemes may always have short "middles"~. ~ r ~ _ ~ \

226 Q - Ma it'< Y _~7J - - FIGURE 11.1: Allowed transitions. 11.10 Instantiation We now use the phoneme models and the assumptions listed in the por- tion on modeling to construct P(Y = y, X = x~W = w). The distribution P(Y = y, X = x~W = w) is formed by using the values of the probabili- t~es ~~_~' and ~tY`~,~_~' as specified by appropriate phoneme models. The manner in which this is done is called "instantiation," and is described below. For each sentence w = Owl, we, .., wm) there is an associ- ated string of phonemes p = (Pi, Pi, · · ·, PI ~ ~ P2 P2 2 pnm ~ where pi is the Ah phoneme in the ith word. The total number of phonemes associated with the sentence is n = Hi nit The distribution P(Y = y, X = x~W = w) is a HMM where the hidden variables Xi can take on values in . . _ ~ _ _ ~ _ ~ ~ ~ ~ ~ ~ ~ . ~ ~ a {S1(P1)' · · · ~ 56(P1), S1(P1)' · · ~ 56(P1), · · ~ Sl(Pm )' ~ 57(pmm)} ~ This set contains 6n+ 1 states, six for each phoneme and one state signifying the end of the sentence. Note that the state of the hidden variable Xi specifies the word, phoneme, and part of phoneme that is being produced at time i. The distribution P(Y = y,X = x~W = w) is formed by defining the transition probabilities P(`X~ = Sj(~pk)~X~_~ = Si(~Pt )), 1 ~ i, j < 6, 1 < k < m, 1 < ~ < nk to be the same as those specified by the phoneme model for the phoneme

227 ~ (SH) M ($ H) E(s H) \ ~ \ / FIGURE 11.2: Graphical representation of the HMM for the word string she. pi. The values of P(X~=S~(pik i)~X~_~=S,fpi)), 1 ~ i < 6, ~ < k < m, 1 < ~ < nk P(X~=S~ (pk+~X~_~ =Si~pi )), 1 < i < 6, 1 < k ~ m, ~ = nk P(X~ = S7 (pi )X_ = Sit )), 1 < i < 6, k = m, ~ = nm have the values that transitions to S7 had in the phoneme mode! for Pk- AD the other transition probabilities have value zero. When X~ is observed to have value S7(p7tmm), the sentence has been completed. The distributions P(Yi~Xi,Xi_~), similarly, are those associated with the related phoneme model. To account for between-word pauses, a nonzero term P(X' = So (pk)~X'_~ = So (pk)) can be added where PLY = silence~X~ = Sl,fpk),X~_~ = Sl (Pk)) = 1; for the sake of clarity, we wit! not discuss this detail. Figure 11.2 is a graphical representation of the HMM for the word string she. Notice that the distribution P(Y = y, X = x~W = "she") contains no explicit mention of phonemes or words. It is a distribution on strings of acoustic labels and hidden states. As indicated previously, any string of hidden states x is associated with a unique word string. Note that this forces the value of P(Y=y,X=x,W= w) to be either zero or equal to P(Y=y, X =x).

228 11.~l Computational Issues: Parameter Estimation ant! Searches The model for P(W = w,Y = y) is based on phoneme models. The esti- mation of parameters for this distribution consists primarily of estimating the parameters of the phoneme models. The simplest way to estimate the parameters of phoneme models would be to excise examples of phonemes from speech, and estimate parameters for each phoneme separately. This is not done, perhaps because the excision of phonemes (accomplished by hand) cannot be done in a manner consistent with the assumptions implicit in the phoneme models. Most current systems estimate parameters using training data consisting of utterances of whole known sentences. Below we present the algorithm used to perform this very Circuit computational task. We also briefly present the typical approach used to speed the computations associated with the use of a trained system. We wish to estimate the parameters of a HMM from one or more obser- ~r~tir~nc Of l<V Vim It Afar e=~ r~-r_;r`~;~;`rm +1~+ +1~ ~~- '~.~ - ~1 y~ 1' . . ., 11 j. a~ 111~' ow~111 ~~1~1-lllU"lUlV~ U11~L L1~= ~1~11 of a HMM can be estimated. We are, after all, trying to estimate the be- havior of variables that are never observed. However, some thought yields examples of HMMs where we should be able to estimate parameters. For example, consider the simplest possible HMM. Let both the hidden and ob- served values take on only the values O and 1. Let P(Xi= 1~Xi_i = 1) = .9 and P(Xi = 0X_ = 0) = .7, and let P(Yi = Maxi = 1) = .85 and P(Yi = 0~Xi = 0) = A. These four terms completely describe the behav- ior of the HMM. An excerpt from a simulation of this HMM is below: 000000000011001111111111111110111111001111111111111111110111 110000000011001111111111111111111111001111110110010111010111 Notice that there are Tong strings of both Is and Os in the observation y. Remember that the Yi are conditionally in~lependent given X. The strings must be caused by the behavior of the hidden variables. Since Tong strings exist, we can guess that there is not much "noise," and that P(Yi = Taxi = 0) and P(Yi = 1IXi = 1) should both be close to 1 or both close to 0. (The distribution of Y given X might turn most Os to Is, and most Is to Os. It is impossible to tell from the observations whether the underlying process is as above or is governed by P(Xi = 0~Xi_i = 0) = .9, P(Xi = 1~Xi_i = 1) = .7, P(Yi = taxi = 1) = .85 and P(Yi = taxi = 0) = A.) We can then guess that both P(Xi = 0~Xi-i = 0) and P(Xi = lax_ = 1) should be ~ .5 in order

229 to consistently produce strings, ant! that one of these terms shouIc! be > .8 because a string of length 22 would otherwise be quite unlikely. One of the reasons for the success of HMM is the existence of a compu- tationaLy efficient method for approximate maximum likelihood parameter estimation (as opposed to the completely ad hoc estimation above). Starting with an initial guess for the parameters, and an observation y = (ye, . . ., AT) of Y = EYE,... ,YT), the Baum-Welch algorithm [5,4] (also known of as the forward-backward aIgorithm) is an iterative method for selecting parameters that ensures that at every iteration the likelihood increases. Convergence to a local maxima of the likelihood is guaranteed. The Baum-Welch algorithm is equivalent to, and predates, the expectation maximization (EM) method [105. We present here the Baum-Welch algorithm. Let ajk = P(Xi=k~Xi_~=j) j,k= 1,...,s bJk(~) = P(Yi = ll~Xi = k, Xi_ = j) j, k = 1, . . ., s, ~ = 1, . . ., r . Assume, for simplicity, that aj = P(Xo = j) and bj (lo = P(YT = ~ XT = j) are known. Let Pab be the distribution on (X0,...,XT, Y1,- -,YT) generated by a and b. The likelihood of y is ~= P(Y = y, X = x), which can be written as s Pab(Y = y) = At, ado afoul boot ( Y1 tail j2 bil j2 (Y2 ~ · · · atl iT bar ( YE ~ . . . }0,}1 ,...,:~1 Note that a naive calculation of the expression above would require an ex- treme amount of computation, the performance of 2T x sT+i multiplication operations. A more efficient approach is to rewrite it as Pab(Y=Y) = ~' ajO ~ ajoj~bjoji(yi) ~ ajij2bji j2(y2) · · · ~ ajT~jTbjT(YT) Jo = ~ Jo = ~ i2=1 AT=} and perform the computation by first calculating ~jTajT_:jTbjT(YT) and storing its value for ah values of jet-. Then the sum over iT-i can be com- puted and stored for ah values of iT-2. Repeating this until the likelihood is evaluated results in 2T x S2 multiplications being conducted. Each iteration of the Baum-Welch algorithm results in new estimates (ajk) and {bjk(1~:, based on the slate y and on the previous estimates jacks and (bJk(~) - Ei Pab(Xi-l =j'Xi =k~Y=y) j Zi Pab(Xi~ = jay = y)

230 b~ (1) Pi you Pab(Xi1 =i, Xi = klY = y) Zi Pab(xi-l = i' Xi = klY = y) These re-estimation equations are the heart of the algorithm. For those more familiar with the EM algorithm, the above formulas can be interpreted as the calculation of ~iEab(X~j~k,`~(Xi'Xi_l,Yi)~Y=y), where Xi denotes the indicator function for the value i; Xi(j) = 1 if i = j, and 0 otherwise. These indicator functions are the natural sufficient statistics when a HEM is represented as an exponential family (which can be accomplished via the Gibbs-MRF equivalence). The maximization step of the EM algorithm becomes trivial in this case. We shall not present any proof that the above re-estimation increases the likelihood of y; instead we refer the reader to t4,54. The Baum-Welch algorithm in addition to the formulas above, specifies a method! to calculate the new estimates quickly. This is essential since the distribution Pab(Xi_~ = i,Xi = key = y) is dependent on all the values of Y. and a naive calculation would require as many operations as a naive calculation of the likelihood. However, we can implement a computational strategy similar to that introduced above to compute thelikelihood. The re-estimation equations above can be written as a Ei °ti(i)Fi+1 (k)ajkbjk(Yi+1 ) ok Hi (>i(j)3i(i) b (I) ~' :y' =i ~ (A )>i+l ( k )a jk by k ( Yi+ l ) where cat and ,l] can be defined inductively in i, forward and backward re- spectively, by s s 0ti+1(k) = ~°bi(i)ajktjk(Yi+~), pi(i) = ~>i+l(k)aJkbjk(Yi+l)- j=] j=1 The implementation of the Baum-Welch algorithm for speech recognition depends, obviously, on the form of training data. The standard scenario, as mentioned above, is that there is a list of known sentences and pronuncia- tions of them. We can therefore construct a HMM P(Y=y, X=x~W =w) for each known sentence (as in §11.10~. The estimation of parameters for the HMMs can proceed by the Baum-Welch algorithm with the simple mod- ification that the iterations be performed synchronously. One iteration will be conducted for each HMM sentence model, then the estimated parameters

231 for each sentence will be combined into one estimate for Al the parameters of all the phoneme models at the end of every iteration. The performance of the Baum-Welch algorithm depends highly on the quality of the initial estimates. While every iteration of the algorithm guar- antees an increase in the likelihood, a bad initial guess may cause con- vergence to a bad (Iow) local maximum of the likelihood. Whereas whole sentences are used to train the phoneme models, the initial estimates for the distributions P(Yi~Xi,Xi_~) often come from excised phoneme data. In [15], for every phoneme the three distributions PB(Yi), PM(Yi), and PE(Yi) are initialized to a histogram of the acoustic label values associated with that phoneme in hand-labeled data. The initial estimates of the transition probabilities of the hidden states were chosen so that all allowed transitions from a state had equal probability. Another interesting implementational detai!is that the Baum-Welch al- gorithm is typically used for only 2 or 3 iterations A, page 35], [144. On the other hand, EM is well known for slowness after the first few iterations. In [14] the performance of the recognition system is stated to worsen with continued iterations, suggesting to us that an overfit of the training data is occurring. Once parameter estimation is accomplished, there remains the use of the recognition system. As was stated at the beginning of this section, our goal is the calculation of the string w that maximized P(W = why = y). The use of Bayes' rule aBowed us to modify this to the calculation of the string w that maximized P(Y = y,W = w). Recall that we have constructed P(Y = y, X = x, W = w), which equals P(Y = y, X = x) when w is the word string associated with x, and zero otherwise. The string w that maximizes P(Y=y,W=w) is usually approximated by the w associated with the x that maximizes P(Y = y,X = x). The principal justification for this approximation, besides its success and com- putational simplicity, is that the most likely word string should have at least one corresponding hidden state string that win also be very likely. The most likely string of hidden states, for small vocabulary and simple grammar sys- tems, can be found by a simple dynamic programming [6] scheme called the Viterbi algorithm [2~. For more complicated systems the search is performed by more ad hoc methods that are based on dynamic programming [2,14] 11.12 Future Directions The construction of a large-vocabulary, speaker-independent, complicated- grammar speech recognition system from scratch is a daunting task. How-

232 ever, it is one that new researchers interested in speech recognition will not have to face. Databases for speech recognition are becoming commonly avail- able. As a result, various fascinating and extremely challenging subproblems can bee approached by single researchers on current generation workstations. One such problem is the speaker-independent recognition of phonemes in continuous speech; another is the recognition of connected digits. Whereas HMMs have been the most successful approach to date, the fun- damental reason for their current superiority is the dedication and creativity of those who have implemented them. Preliminary research indicating that other approaches can be as accurate and computationally feasible is pre- sented in [174. It is hoped that, as the computational resources to approach the speech recognition problem become available to a larger community, a diversification of approaches wiD occur and that this chapter encourages research in this direction. Bibliography [1] BahI, L. R., P. F. Brown, P. V. De Sonza, and R. L. Mercer, A tree based statistical language model for natural language speech recogni- tion, IEEE Trans. Acoust. Speech Signal Processing 37 (1989), 1001- 1008. [2] Bahl, L. R., F. Jelinek, and R. L. Mercer, A maximum likelihood ap- proach to continuous speech recognition, IEEE Trans. Pattern Anal. Machine Intel. 5 (1983), 179-190. [3] Baker, J. K., The DRAGON systemAn overview, IEEE Trans. Acoust. Speech Signal Process. 23 (1975), 24-29.

233 [4] Baum, L. E., An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes, in Inequalities Ill, Academic Press, 1972. t5] Baum, L. E., T. Petrie, G. Soules, and N. Weiss, A maximization tech- nique occurring in the statistical analysis of probabilistic functions of Markov chains, Ann. Math. Stat. 41 (1970), 164-171. [6] Bellman, R. E., Dynamic Programming, Princeton University Press, Princeton, 1957. [7] Brown, P. F., The Acoustic-Mode1ting Problem in Automatic Speech Recognition, Ph.D. thesis, Computer Science Department, Carnegie MeHon University, Pittsburgh, 1987. i8] Cole, R. A., R. M. Stern, M. S. Phillips, S. M. Brill, P. Specker, and A. P. Pilant, Feature-based speaker independent recognition of English letters, IEEE Inter. Conf. Acoust. Speech Signal Process. (Oct. 1983~. A] Cooke, P., Are accents out?, New York Times Magazine (Nov. 19, 1989), 50. t10] Dempster, A. P., N. M. Lair~l, and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, '7. R. Stat. Soc., B 39 (1977), 1-38. t11] DeviJver, P. A., and M. M. Dekesel, Learning the parameters of a hicIden Markov random field image model: A simple example, pp. 141-163 in Pattern Recognition Theory and Applications, P. A. DeviJver and J. Kittler, eds., NATO AST Series, vol. F30. [12] IBM speech recognition group, A read-time, isolated word, speech recog- nition system for dictation transcription, IEEE Inter. Conf. Acoust. Speech Signal Process. (Mar. 1985~. [13] Itakura, F., Minimum prediction residual principle applied to speech recognition, IEEE Trans. Acoust. Speech Signal Process. 23 (1975), 67- 72. t14] Lee, Kai-Fu, Large- Vocabulary Speaker-Independent Continuous Speech Recognition: The SPHINX System, Ph.D. thesis, Computer Science Department, Carnegie Mellon University, Pittsburgh, 1988.

234 [15] Lee, Kai-Fu, and H. W. Hon. Speaker-independent phone recognition using hidden Markov models, IEEE Trans. Acoust. Speech Signal Pro- cess. 37 (1989), 1641-1648. [16] Lesser, V. R., R. D. Fennell, IJ. D. Erman, and R. D. Reddy, The Hearsay IT speech understanding system, IEEE Trans. Acoust. Speech Signal Process. 23 (1975), 11-24. [17] Lippman, A., Data determined Markov models for speech recognition, Proceediangs of the 1990 Asi70mar Conference on Signals, Systems, and Computers (~to appear). [18] Lowerre, B. T., The HARPY Speech Recognition System, Ph.D. the- sis, Computer Science Department, Carnegie Mellon University, Pitts- burgh, 1976. [19] Makhoul, J., S. Roucos, and H. Gish, Vector quantization in speech coding, Proc. IEEE 73 (1985),1551-1588. [20] Rabiner, I,. R., J. G. Wilpon, and F. K. Soong, High performance connected digit recognition using hidden Markov models, IEEE Int. Conf. Acoust. Speech Sigrzat Process. (Apr. 1988~. [21] Wilpon, J. G., L. R. Rabiner, and A. Bergh, Speaker-independent iso- lated word recognition using a 129-word airline vocabulary, J. Acoust. Soc. Amer. 72 (1982), 390-396.

Next: Plates »

Spatial Statistics and Digital Image Analysis (1991)

Chapter: 11. Markov Models for Speech Recognition

Welcome to OpenBook!

Get Email Updates