State of the Art in Continuous Speech Recognition
Pages 165-198

The Chapter Skim interface presents what has been algorithmically identified as the most significant passage on each page of the chapter.


From page 165...
... , have combined to make high-accuracy, speaker-independent, continuous speech recognition for large vocabularies possible in real time, on off-the-shelf workstations, without the aid of special hardware. These advances promise to make speech recognition technology readily available to the general public.
From page 166...
... Rather than being mostly a laboratory endeavor, speech recognition is fast becoming a technology that is pervasive and will have a profound influence on the way humans communicate with machines and with each other. This paper focuses on speech modeling advances in continuous speech recognition, with an exposition of hidden Markov models (HMMs)
From page 167...
... We will argue that future advances in speech recognition must continue to rely on finding better ways to incorporate our speech knowledge into advanced mathematical models, with an emphasis on methods that are robust to speaker variability, noise, and other acoustic distortions. THE SPEECH RECOGNITION PROBLEM Automatic speech recognition can be viewed as a mapping from a continuous-time signal, the speech signal, to a sequence of discrete entities, for example, phonemes (or speech sounds)
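
The excerpt above frames recognition as a mapping problem; in the HMM literature this mapping is usually stated as a probabilistic decoding rule. The following Bayes formulation is standard (it is not quoted from this chapter), with A the sequence of acoustic observations and W a candidate word sequence:

```latex
\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
        = \arg\max_{W} P(A \mid W)\, P(W)
```

Here P(A | W) is the acoustic model (the HMMs discussed below) and P(W) is the language model; P(A) is constant across hypotheses and can be dropped from the maximization.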
From page 168...
... It is important to note that a significant and important amount of speech knowledge is incorporated in the structural model, including our knowledge of language structure, speech production, and speech perception. Examples of language structure include the fact that continuous speech consists of a concatenation of words and that words are a concatenation of basic speech sounds or phonemes.
From page 169...
... This is typical of continuous speech; the words are connected to each other, with no apparent separation. The human perception that a speech utterance is composed of a sequence of discrete words is a purely perceptual phenomenon.
From page 170...
... To perform the necessary mapping from the continuous speech signal to the discrete phonetic level, we insert a model, a finite-state machine in our case, for each of the allophones that are encountered. We note from Figure 2 that the structures of these models are identical; the differences will be in the values given to the various model parameters.
From page 171...
... Hidden Markov Models A hidden Markov model (HMM) is the same as a Markov chain, except for one important difference: the output symbols in an HMM are probabilistic.
From page 172...
... Given the sample output sequence C D A A B E D B A C C, there is no way to know for sure which sequence of states produced these output symbols. We say that the sequence of states is hidden, in that it is hidden from the observer if all one sees is the output sequence, and that is why these models are known as hidden Markov models.
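
To make the "hidden" property concrete, below is a minimal sketch of a discrete HMM used as a generator. The two-state topology and all probability values are hypothetical; only the five-symbol alphabet A-E matches the example in the text. An observer sees only the emitted symbols, never the state path that produced them:

```python
import random

# Minimal discrete HMM used as a generator. The two-state topology and
# the probability values are hypothetical; only the five-symbol
# alphabet {A..E} comes from the example in the text.
states = [0, 1]
symbols = "ABCDE"
trans = {0: [0.6, 0.4],   # P(next state | current state 0)
         1: [0.3, 0.7]}   # P(next state | current state 1)
emit = {0: [0.1, 0.1, 0.4, 0.3, 0.1],   # P(symbol | state 0)
        1: [0.3, 0.2, 0.1, 0.1, 0.3]}   # P(symbol | state 1)

def sample(n, state=0):
    """Emit n symbols; the state path stays hidden from the observer."""
    out = []
    for _ in range(n):
        out.append(random.choices(symbols, weights=emit[state])[0])
        state = random.choices(states, weights=trans[state])[0]
    return "".join(out)

print(sample(11))  # e.g. "CDAABEDBACC"; many state paths could produce it
```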
From page 173...
... This algorithm ... Phonetic HMMs: We now explain how HMMs are used to model phonetic speech events. Figure 5 shows an example of a three-state HMM for a single phoneme.
From page 174...
... model. As we enter state 1 in Figure 5, one of the 256 output symbols is generated based on the probability distribution corresponding to state 1.
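
The sketch below sets up the kind of model just described: three states in a left-to-right arrangement, each with its own discrete output distribution over 256 symbols (the codebook size given in the text). The transition values are illustrative, not read off Figure 5:

```python
import numpy as np

N_STATES, N_SYMBOLS = 3, 256

# Left-to-right transition matrix: row i gives P(next state | state i);
# each state may repeat (self-loop) or advance to the next state.
# These values are illustrative placeholders.
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.6, 0.4],
              [0.0, 0.0, 1.0]])   # final state loops until the phoneme ends

# One discrete output distribution per state over the 256 codebook
# symbols, initialized uniformly here; training would estimate these.
B = np.full((N_STATES, N_SYMBOLS), 1.0 / N_SYMBOLS)

assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=1), 1.0)
```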
From page 175...
... That workshop prompted a few organizations, such as AT&T and BBN, to start working with HMMs (Levinson et al., 1983; Schwartz et al., 1984). In 1984 a program in continuous speech recognition was initiated by the Advanced Research Projects Agency (ARPA)
From page 176...
... , and the references therein. Research results in this area are usually reported in the following journals and conference proceedings: IEEE Transactions on Speech and Audio Processing; IEEE Transactions on Signal Processing; Speech Communication Journal; IEEE International Conference on Acoustics, Speech, and Signal Processing; EuroSpeech; and the International Conference on Spoken Language Processing.
From page 177...
... However, because of the large variability of the speech signal, it is a good idea to perform some form of feature extraction to reduce that variability. In particular, computing the envelope of the short-term spectrum reduces the variability significantly by smoothing the detailed spectrum, thus eliminating various source characteristics, such as whether the sound is voiced or fricated and, if voiced, the effects of periodicity or pitch.
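
As a rough illustration of this envelope computation, here is a minimal log mel filterbank front end in NumPy. The sampling rate, frame sizes, and filter count are common choices rather than values from the chapter, and real front ends add pre-emphasis and a DCT step to obtain MFCCs:

```python
import numpy as np

def mel_filterbank_features(signal, sr=16000, frame_len=400, hop=160,
                            n_filters=24, n_fft=512):
    """Log mel filterbank energies, a smoothed-envelope representation
    of the short-term spectrum (a sketch; parameters are assumptions)."""
    # Slice the waveform into overlapping frames and apply a Hamming window.
    frames = [signal[i:i + frame_len] * np.hamming(frame_len)
              for i in range(0, len(signal) - frame_len + 1, hop)]
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2  # short-term power spectrum

    # Triangular filters spaced evenly on the mel scale smooth the detailed
    # spectrum into its envelope, suppressing pitch harmonics.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(n_filters):
        l, c, r = bins[j], bins[j + 1], bins[j + 2]
        fbank[j, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[j, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return np.log(power @ fbank.T + 1e-10)  # one feature vector per frame
```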
From page 178...
... These feature vectors form the input to the training and recognition systems. Training Training is the process of estimating the speech model parameters from actual speech data.
From page 179...
... Typically, closed-set word classes are filled out: for example, days of the week, months of the year, numbers. After completing the lexicon, HMM word models are compiled from the set of phonetic models using the phonetic spellings in the lexicon.
From page 180...
... Then, given the sequence of feature vectors, the word HMM models, and the grammar, recognition is simply a large search among all possible word sequences for the word sequence with the highest probability of having generated the computed sequence of feature vectors. In theory the search is exponential in the number of words in the utterance.
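
In practice the search is made tractable by dynamic programming; Viterbi decoding with beam pruning is the standard approach. The sketch below finds the most likely state path through a single discrete HMM; an actual recognizer runs the analogous search over a network of word models:

```python
import numpy as np

def viterbi(obs, log_A, log_B, log_pi):
    """Most likely state sequence for a discrete HMM, in the log domain.
    obs: symbol indices; log_A: (N, N) transitions; log_B: (N, M)
    outputs; log_pi: (N,) initial probabilities. A textbook sketch;
    real systems add beam pruning and search over word networks."""
    T, N = len(obs), log_A.shape[0]
    delta = np.full((T, N), -np.inf)   # best log score ending in each state
    back = np.zeros((T, N), dtype=int)
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # (from_state, to_state)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta[-1].argmax())]             # backtrace from the best end
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1], float(delta[-1].max())
```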
From page 181...
... Improvements in Performance The improvements in speech recognition performance have been so dramatic that in the ARPA program the word error rate has dropped by a factor of 5 in 5 years! This unprecedented advance in the state of the art is due to four factors: use of common speech corpora, improved acoustic modeling, improved language modeling, and a faster research experimentation cycle.
From page 182...
... In addition to the use of feature vectors, such as MFCCs, it has been found that including what are known as delta features, the change in the feature vector over time, can reduce the error rate by a factor of about 2 (Furui, 1986). The delta features are treated like an additional feature vector whose probability distribution must also be estimated from training data.
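
Delta features are commonly computed as a least-squares slope over a short window of frames. The sketch below appends deltas to a feature matrix; the window size and edge padding are conventional choices, not specified in the text:

```python
import numpy as np

def add_deltas(feats, k=2):
    """Append delta features (the standard regression formula over a
    window of +/- k frames) to a (frames x dims) feature matrix."""
    padded = np.pad(feats, ((k, k), (0, 0)), mode="edge")  # repeat edge frames
    num = sum(n * (padded[k + n:len(feats) + k + n] -
                   padded[k - n:len(feats) + k - n])
              for n in range(1, k + 1))
    deltas = num / (2.0 * sum(n * n for n in range(1, k + 1)))
    return np.hstack([feats, deltas])  # doubles the feature dimension
```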
From page 183...
... Because only a small number of the possible feature vector values will occur in any training set, it is important to use probability estimation and smoothing techniques that not only will model the training data well but also will model other possible occurrences in future unseen data. A number of probability estimation and smoothing techniques have been developed that strike a good compromise between computation, robustness, and recognition accuracy and have resulted in error rate reductions of about 20 percent compared to the discrete HMMs presented in the section titled "Hidden Markov Models" (Bellegarda and Nahamoo, 1989; Gauvain and Lee, 1992; Huang et al., 1990; Schwartz et al., 1989)
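
As a concrete, deliberately simple instance of such smoothing (far simpler than the cited methods), one can interpolate a state's maximum-likelihood output distribution with a uniform floor so that symbols unseen in training keep nonzero probability:

```python
import numpy as np

def smoothed_output_distribution(counts, weight=0.9):
    """Interpolate the maximum-likelihood estimate of a discrete output
    distribution with a uniform distribution. The interpolation weight
    is an assumption; in practice it would be tuned on held-out data."""
    counts = np.asarray(counts, dtype=float)
    ml = counts / counts.sum()                  # maximum-likelihood estimate
    uniform = np.full_like(ml, 1.0 / len(ml))   # floor for unseen symbols
    return weight * ml + (1.0 - weight) * uniform
```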
From page 184...
... Sample Performance Figures Figure 7 gives a representative sampling of state-of-the-art continuous speech recognition performance. The performance is shown in terms of the word error rate, which is defined as the sum of word substitutions, deletions, and insertions, as a percentage of the actual number of words in the test.
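
This definition is computed by aligning the recognized word string against the reference string with minimum edit distance. A minimal sketch:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / reference words,
    via Levenshtein alignment of the two word sequences."""
    ref, hyp = ref.split(), hyp.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(ref)

print(word_error_rate("the cat sat", "the cat sat down"))  # one insertion -> 0.33
```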
From page 185...
... With the RM corpus, it has been found that the error rate is inversely proportional to the square root of the amount of training data, so that quadrupling the training data results in cutting the word error rate by a factor of 2. This large reduction in error rate by increasing the training data may have been the result of an artifact of the RM corpus, namely, that the sentence patterns of the test data were the same as those in the training.
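
In equation form, the reported relationship is, with T the amount of training data and E the word error rate:

```latex
E(T) \propto \frac{1}{\sqrt{T}}
\qquad\Longrightarrow\qquad
E(4T) = \frac{E(T)}{\sqrt{4}} = \frac{E(T)}{2}
```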
From page 186...
... A general rule of thumb is that, if the total amount of training speech is fixed at some level, the SI word error rates are about four times the SD error rates. Another way of stating this rule of thumb is that for SI recognition to have the same performance as SD recognition requires about 15 times the amount of training data (Schwartz et al., 1993).
From page 187...
... This would be especially needed for atypical speakers with high error rates who might otherwise find the system unusable. Such speakers would include speakers with unusual dialects and those for whom the SI models simply are not good models of their speech.
From page 188...
... The four speakers were native speakers of Arabic, Hebrew, Chinese, and British English. By collecting two minutes of speech from each of these speakers and using rapid speaker adaptation, the average word error rate for the four speakers decreased five-fold.
From page 189...
... The real-time feats just described have been achieved at a relatively small cost in word accuracy. Typically, the word error rates are less than twice those of the best research systems.
From page 190...
... Segmental features include any measurements that are made on the whole segment or parts of a segment, such as the duration of a segment. Few segmental models have been proposed; among them are stochastic segment models and segmental neural networks (to be described in the next section)
From page 191...
... Using the N-best paradigm with segmental models, with N = 100, has reduced word error rates by as much as 20 percent. The N-best paradigm has also been useful in reducing computation whenever one or more expensive knowledge sources need to be combined, for example, cross-word models and n-gram probabilities for n > 2.
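
Schematically, N-best rescoring proceeds as follows: the HMM recognizer emits its N top-scoring hypotheses with log probabilities, a second knowledge source rescores each hypothesis, and a weighted combination of scores determines the final ranking. The sketch below assumes a hypothetical segmental_score function and illustrative combination weights:

```python
def rescore_nbest(nbest, segmental_score, alpha=1.0, beta=0.3):
    """Re-rank an N-best list with an extra knowledge source.
    nbest: list of (hypothesis_words, hmm_log_prob) pairs from the HMM
    recognizer; segmental_score: hypothetical second scorer; alpha and
    beta are illustrative weights that would be tuned on held-out data."""
    rescored = [(alpha * hmm_logp + beta * segmental_score(words), words)
                for words, hmm_logp in nbest]
    rescored.sort(reverse=True)      # highest combined log score first
    return rescored[0][1]            # best hypothesis after rescoring
```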
From page 192...
... Figure 9 shows a typical feedforward neural network; that is, it has no feedback elements. Although many different types of neural nets have been proposed, the type of network shown in Figure 9 is used by the vast majority of workers in this area.
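
For concreteness, a forward pass through such a network is just alternating affine maps and nonlinearities. The sketch below uses sigmoid units and arbitrary layer sizes; neither choice is prescribed by the chapter:

```python
import numpy as np

def feedforward(x, weights, biases):
    """Forward pass of a fully connected feedforward network: each layer
    computes a weighted sum of its inputs followed by a sigmoid
    nonlinearity, with no feedback connections."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    for W, b in zip(weights, biases):
        x = sigmoid(W @ x + b)
    return x

# Example: 8 inputs -> 5 hidden units -> 3 outputs, random weights.
rng = np.random.default_rng(0)
ws = [rng.standard_normal((5, 8)), rng.standard_normal((3, 5))]
bs = [np.zeros(5), np.zeros(3)]
print(feedforward(rng.standard_normal(8), ws, bs))
```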
From page 193...
... In the case of segmental neural nets, the N-best paradigm is used to generate likely segmentations for the network to score. Using either method, reductions in word error rate by 10 to 20 percent have been reported.
From page 194...
... Schwartz, "Speech Recognition Using Segmental Neural Nets," IEEE International Conference on Acoustics, Speech, and Signal Processing, San Francisco, pp. I-625-628, March 1992.
From page 195...
... 1991-1994, 1986. Furui, S., "Unsupervised Speaker Adaptation Method Based on Hierarchical Spectral Clustering," IEEE International Conference on Acoustics, Speech, and Signal Processing, Glasgow, Scotland, paper S6.9, May 1989.
From page 196...
Hild, H., and A. Waibel, "Multi-Speaker/Speaker-Independent Architectures for the Multi-State Time Delay Neural Network," IEEE International Conference on Acoustics, Speech, and Signal Processing, Minneapolis, pp.
From page 197...
Ney, H., "Improvements in Beam Search for 10000-Word Continuous Speech Recognition," IEEE International Conference on Acoustics, Speech, and Signal Processing, San Francisco, pp. I-9-12, March 1992.
From page 198...
... Krasner, and J. Makhoul, "Improved Hidden Markov Modeling of Phonemes for Continuous Speech Recognition," IEEE International Conference on Acoustics, Speech, and Signal Processing, San Diego, pp.

