Language Log: High-entropy speech recognition, automatic and otherwise

Regular readers of LL know that I've always been a partisan of automatic speech recognition technology, defending it against unfair attacks on its performance, as in the case of "ASR Elevator" (11/14/2010). But Chin-Hui Lee recently showed me the results of an interesting little experiment that he did with his student I-Fan Chen, which suggests a fair (or at least plausible) critique of the currently-dominant ASR paradigm. His interpretation, as I understand it, is that ASR technology has taken a wrong turn, or more precisely, has failed to explore adequately some important paths that it by-passed on the way to its current success.

In order to understand the experiment, you have to know a little something about how automatic speech recognition works. If you already know this stuff, you can skip the next few paragraphs. And if you want a deeper understanding, you can go off and read (say) Larry Rabiner's HMM tutorial, or some of the material available on the Wikipedia page.

Basically, we've got speech, and we want text. (This version of the problem is sometimes called "speech to text" (STT), to distinguish it from systems that derive meanings or some other representation besides standard text.) The algorithm for turning speech into text is a probabilistic one: we have a speech signal S, and for each hypothesis H about the corresponding text, we want to evaluate the conditional probability of H given S; and all (?) we need to do is to find the H for which P(H|S) is highest. We solve this problem by applying Bayes' Theorem, which in this case tells us that

    P(H|S) = P(S|H) P(H) / P(S)

Since P(S), the probability of the speech signal, is the same for all hypotheses H about the corresponding text, we can ignore the denominator, so that the quantity we want to maximize becomes

    P(S|H) P(H)

This expression has two parts: P(S|H), the probability of the speech signal given the hypothesized text; and P(H), the a priori probability of the hypothesized text. In the parlance of the field, the term P(S|H) is called the "acoustic model", and the term P(H) is called the "language model". The standard implementation of the P(S|H) term is a so-called "Hidden Markov Model" (HMM), and the standard implementation of the P(H) term is an "n-gram language model". (We're ignoring many details here, such as how to find the word sequence that actually maximizes this expression — again, see some of the cited references if you want to know more.)

It's well known that large-vocabulary continuous speech recognition is heavily dependent on the "language model" — which is entirely independent of the spoken input, representing simply an estimate of how likely the speaker is to say whatever. This is because simple n-gram language models massively reduce our uncertainty about which word comes next.
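To make the division of labor concrete, here's a minimal sketch of this decomposition in Python. It is emphatically not Lee and Chen's recognizer: the candidate transcripts, the bigram probabilities, and the acoustic scores are all invented for illustration.

```python
import math

# Toy noisy-channel decoding: choose the hypothesis H that maximizes
# log P(S|H) + log P(H). All numbers below are invented for illustration.

# "Acoustic model" scores log P(S|H): how well each candidate transcript
# matches the (imaginary) speech signal S.
acoustic_logprob = {
    ("recognize", "speech"): -210.0,
    ("wreck", "a", "nice", "beach"): -205.0,  # acoustically a bit better
}

# "Language model": bigram log-probabilities log P(word | previous word),
# with "<s>" standing for the start of the sentence.
bigram_logprob = {
    ("<s>", "recognize"): math.log(0.01),
    ("recognize", "speech"): math.log(0.1),
    ("<s>", "wreck"): math.log(0.0001),
    ("wreck", "a"): math.log(0.05),
    ("a", "nice"): math.log(0.01),
    ("nice", "beach"): math.log(0.001),
}

def lm_logprob(words):
    """log P(H): sum of bigram log-probabilities over the word sequence."""
    return sum(bigram_logprob[(prev, w)]
               for prev, w in zip(("<s>",) + words, words))

def decode(hypotheses):
    """Return the hypothesis maximizing log P(S|H) + log P(H)."""
    return max(hypotheses, key=lambda h: acoustic_logprob[h] + lm_logprob(h))

best = decode(acoustic_logprob.keys())
print(" ".join(best))  # "recognize speech": P(H) outweighs the acoustic edge
```

In a real recognizer the acoustic scores come from the HMM, the conditional word probabilities come from the n-gram model, and the maximization runs over an astronomically larger hypothesis space; but the objective being maximized has exactly this shape.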
We can see the language model's contribution in Lee and Chen's experiment, which looked at the effect of varying the language-model component of a recognizer, while keeping the same acoustic models and the same training and testing materials. (For those skilled in the art, they used the classic WSJ0 SI84 training data, and the Nov92 Hub2-C1 5K test set, described at greater length in David S. Pallett et al., "1993 Benchmark Tests for the ARPA Spoken Language Program", and Francis Kubala et al., "The Hub and Spoke Paradigm for CSR Evaluation", both from the Proceedings of the Spoken Language Technology Workshop: March 6-8, 1994.) Here are the results:

                             Cross-entropy   Perplexity   Word Error Rate
    3-gram Language Model         5.87            58            5.1%
    2-gram Language Model         6.78           110            7.4%
    1-gram Language Model         9.53           742           32.8%
    No Language Model            12.28          4987           69.2%

Thus a 3-gram language model, where the probability of a given word is conditioned on the two preceding words, yielded a 5.1% word error rate; a 2-gram language model, where a word's probability is conditioned on the previous word, yielded a 7.4% WER; a 1-gram language model, where just the various unconditioned probabilities of words were used, yielded a 32.8% error rate; and no language model at all, so that every item in the 5,000-word vocabulary was equally likely in all positions, gave a whopping 69.2% WER.

The 3-gram language model allows such a low error rate because it leaves us with relatively little uncertainty about the identity of the next word. In the particular dataset used for this experiment, the resulting 3-gram perplexity was about 58, meaning that (after seeing two words) there was as much left to be learned about the next word as if there were a vocabulary of 58 words all equally likely to occur — despite the fact that the actual vocabulary was about 5,000 words. (The dataset involved a selection of sentences from stories published in the Wall Street Journal, taking only those sentences made up of the commonest 5,000 words.) The bigram perplexity was about 110, and the unigram perplexity about 742. (If you want to know more about such numbers and how they are calculated, look at the documentation for the SRI language modeling toolkit, which was actually used to generate them.)

If we take the log to the base 2 of these perplexities, we get the corresponding entropy, measured in bits.
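To make the perplexity-to-entropy conversion concrete, here's a quick check in Python; the only real numbers in it are the ones from the table above.

```python
import math

# Entropy in bits is log2(perplexity); compare with the cross-entropy
# column of the table above (small differences are rounding).
for model, perplexity in [("3-gram", 58), ("2-gram", 110),
                          ("1-gram", 742), ("no LM", 4987)]:
    print(f"{model:>6}: perplexity {perplexity:>4} = {math.log2(perplexity):5.2f} bits")

# Sanity check: with no language model, all 5,000 words are equally likely,
# so the entropy should be about log2(5000) = 12.29 bits, which is indeed
# where the no-LM row of the table sits (12.28 bits).
print(f"uniform over 5,000 words: {math.log2(5000):.2f} bits")
```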
And there's an interestingly linear relationship between the entropies of the language models used and the logit of the resulting WER, i.e. log(WER/(1-WER)):

    [Plot: language-model entropy in bits against logit(WER), with a fitted regression line]

A different acoustic-model component would have somewhat different performance — the best reported results with the same trigram and bigram models on this dataset are somewhat better — but the overall relationship between entropy and error rate will remain the same, and performance on high-entropy speech recognition tasks will be poor, even with careful speech and good acoustic conditions.

This all seems reasonable enough — so why does Chin think that there's a problem? Well, there's good reason to think that human performance on high-entropy speech recognition tasks can sometimes remain pretty good. Thus George R. Doddington and Barbara M. Hydrick, "High performance speaker‐independent word recognition", J. Acoust. Soc. Am. 64(S1), 1978:

    Speaker‐independent recognition of words spoken in isolation was performed using a very large vocabulary of over 26 000 words taken from the "Brown" data set (Computational Analysis of Present‐Day American English by Kucera and Francis). After discarding 4% of the data judged to be spoken incorrectly, experimental recognition error rate was 2.3% (1.8% substitution and 0.5% rejection), with negligible difference in performance between male and female speakers. Experimental error rate for vocabulary subsets, ordered by frequency of usage, was 1.0% for the first 50 words, 0.8% for the first 120 words, and 1.2% error for the first 1500 words. An analysis of recognition errors and a discussion of ultimate performance limitations will be presented.

If we project the regression line in the entropy versus logit(WER) plot from the Lee & Chen experiment out to a vocabulary of 26,000 words (an entropy of 14.67 bits), we would predict a word error rate of 90.5% — which is a lot more than 2.3%. (A quick numerical check of this projection is sketched at the end of this post.)

Now, this projection is not at all reliable: isolated word recognition is easier than connected word recognition, especially when the words being connected include short monosyllabic function words that might be hypothesized to occur almost anywhere. But still, Chin's guess is that current ASR performance on the Doddington/Hydrick task would be quite poor — strikingly worse than human performance, and perhaps spectacularly so. And he thinks that this striking human/machine divergence points to a basic flaw in the current standard approach to ASR. For his diagnosis of the problem, see his keynote address at Interspeech 2012.

I hope that before long, we'll be able to recreate something like the Doddington/Hydrick dataset: a high-entropy recognition task on which human and machine performance can be directly compared. If this comparison works out the way Chin thinks it will, the plausibility of his diagnosis and his prescription for action will be increased.
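For the curious, here's a back-of-the-envelope version of that projection: an ordinary least-squares fit of logit(WER) against language-model entropy, using just the four points from Lee and Chen's table. The post doesn't specify the base of the logarithm in the logit; a natural log is assumed below, and it comes out at about 90.5%.

```python
import math

# The four (cross-entropy in bits, WER) points from the Lee & Chen table.
points = [(5.87, 0.051), (6.78, 0.074), (9.53, 0.328), (12.28, 0.692)]

def logit(p):
    """Natural-log odds: log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# Ordinary least-squares fit of logit(WER) against entropy.
xs = [h for h, _ in points]
ys = [logit(wer) for _, wer in points]
n = len(points)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

for (h, wer), y in zip(points, ys):
    print(f"H = {h:5.2f} bits: logit(WER) = {y:+.2f}, fitted = {slope * h + intercept:+.2f}")

# Project to a uniform 26,000-word vocabulary: H = log2(26000) = 14.67 bits.
h_26k = math.log2(26000)
wer_26k = 1.0 / (1.0 + math.exp(-(slope * h_26k + intercept)))
print(f"projected WER at {h_26k:.2f} bits: {wer_26k:.1%}")  # roughly 90.5%
```

The fit over the four observed points is fairly tight, which is what makes the (admittedly long) extrapolation to 14.67 bits tempting.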
