I recently picked up a new Windows Vista machine for testing related to my startup and found myself trying out the automatic speech recognition (ASR) engine. Under the terms of my seed funding, a subcontractor of mine is investigating using ASR to transcribe scientific talks. The transcriptions would then be indexed to make the content searchable in a knowledge management service. I thought I would contribute a bit to their work by getting some qualitative and quantitative understanding of the quality of ASR today. I've also been encountering ASR in more and more places: my car integrates with my Bluetooth phone and supports voice dialing; my son has an R2D2 toy with crude ASR.
The quality of the Vista ASR for dictation started off very poor, but I was pleasantly surprised that after a small amount of training I could dictate at around 95% accuracy if I used a very careful, "anchorman"-style enunciating patter. That 95% figure was over some challenging technical terminology, too, like "algorithmic information theory" and "computational linguistics". Not bad, but at a normal pace, with "ums" and asides thrown in, the accuracy dropped to around 40%, which is essentially unusable. Even worse, my subcontractor reports that untrained, speaker-independent recognition with Dragon/Nuance falls to around 10% accuracy, and that the out-of-the-box claims for Naturally Speaking border on outright lies.
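For concreteness: accuracy figures like these are conventionally reported as word accuracy, which is one minus the word error rate (WER), the number of word substitutions, insertions, and deletions needed to turn the recognizer's output into the reference transcript, divided by the length of the reference. Here is a minimal sketch of that computation in Python (my own illustration; the exact scoring Vista or Dragon uses may differ):

```python
# Minimal word error rate (WER) sketch: word-level Levenshtein distance
# between a reference transcript and a recognizer hypothesis, divided by
# the number of reference words.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

if __name__ == "__main__":
    ref = "please dictate the phrase algorithmic information theory"
    hyp = "please dictate a phrase algorithm information theory"
    wer = word_error_rate(ref, hyp)
    print(f"WER: {wer:.1%}, word accuracy: {1 - wer:.1%}")
```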
Still, 95% is an impressive technological achievement, and the methods used to achieve these results are important enough that I want to mention them and describe briefly how I believe they relate to the topic of this blog.
A language model is a statistical distribution over how phonemes, words, characters, phrases, parts of speech, or any other features of language occur in a corpus of text or speech. ASR uses several interacting language models to try to predict what you are saying at any given moment. A speaker-independent engine has a large, generic language model that is supposed to be flexible enough to accommodate the vagaries of accent and intonation, prosody and pace. A trained system has a generic model with some of the probabilities adjusted by hearing you read some material (hopefully not too much material). When recognition takes place, the system has to take the sound bite, after some filtering, and try to match it against the most likely word in the language model, based not just on the word itself but also on the words that came before it. So the language model is not just a library of word probabilities, but also of the contexts of those probabilities. And here is where the systems fail: the context analysis is currently too shallow and lacks higher-order corrections from semantic and syntactic sources. Well, that is a bit too sweeping of me, since there are some higher-order corrections implicit in the sequences of the temporal language models. It's just that language is so sparse that there is not enough context to encode all of the complex interacting variables that go into a perfect model. It works that way for us, too, since we can easily be confused by very rare verbal utterances ("furious the fever fled forth from me..."; "what, say again?").
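To make the "context" point concrete, here is a toy bigram model in Python. It is my own sketch, not what any real recognizer uses: production engines combine acoustic models with much larger, heavily smoothed n-gram models. But it shows the basic idea that the previous word shifts the probability of the next one, and that an unseen context leaves the model with nothing useful to say:

```python
# Toy bigram language model: P(word | previous word) estimated from a tiny
# corpus, with add-one (Laplace) smoothing so unseen pairs don't get zero
# probability. Illustrates context-dependent prediction and data sparsity.
from collections import Counter

corpus = ("the fever fled from me . the fever returned . "
          "speech recognition maps sound to words .").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = set(corpus)

def p_next(prev: str, word: str) -> float:
    """P(word | prev) with add-one smoothing over the observed vocabulary."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

def best_next(prev: str) -> str:
    """Most likely next word given the previous word."""
    return max(vocab, key=lambda w: p_next(prev, w))

# A common context is predicted well; a rare or garbled one is not.
print(best_next("the"))      # 'fever' -- seen twice after 'the'
print(best_next("furious"))  # arbitrary -- 'furious' never occurred, so every
                             # candidate gets the same smoothed probability
```

The recognizer's job is essentially this, scaled up: score candidate words by how well they fit both the acoustics and the surrounding word context, and the thinner the context statistics, the weaker the correction.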
The point here is that with increased computing power and enhanced training there is no reason ASR can't reach 98% or even 99.9% speaker-independent accuracy. I would guess we need another ten years to close that gap, but there are no theoretical obstacles for most mainstream transcription tasks. And when that happens, there will be a shift in the way we regard computing machines. Children will grow up expecting voice control over toys, TVs, dolls, and video games. They will expect conversational capabilities in their entertainment, and will develop a two-tiered mental model for how to interact with people versus the merely verbally reactive. I am guessing it will become a common insult to converse with someone as if they were among the semi-sentient.