Infants master speech so easily that many adults are astonished by how hard it is to get machines to repeat the trick. Ask a computer to type, “Machines can recognize speech,” and on a good day you might get back, “Machines can wreck a nice beach.” To learn how computers are faring with speech, I hooked up with my cousin Richard Wiggins at SpeechTEK 2009, a conference just off Times Square where exhibitors demonstrated the latest efforts at getting machines to recognize speech and to produce spoken language.
Richard is one of the foremost explorers in the field. In the mid-1960s he worked as a mathematician for the National Security Agency. His work remains classified, but the agency’s focus was on getting machines that could eavesdrop on the Russians. In the 1970s he invented the first talking computer chip. It was introduced in a toy called “Speak & Spell” from Texas Instruments, and the chip soon was in elevators and airports. By now everyone is familiar with the most annoying form of this technology: the telephone answered by a machine that says, “If you want to pay a bill, press 1. If this is an emergency, press 2.”
The first exhibitor we noticed was VoiceVault, which had a system for identifying people over the telephone by recognizing voices. The caller is told to repeat a series of numbers like 7 0 5 9 and the system matches the caller’s response against voice prints that are already on file. Richard used the same general technology thirty years ago to secure his speech lab.
“Two things have changed since then,” he said. People are more used to talking to machines, eliminating the problems of getting cooperation. Also, machines are far more powerful. In theory, at least, an identity thief today could record the speech of the proposed victim, call up the telephone checker and have a hand-held device like the iPhone recognize the digits and recite them back in the voice of the stolen identity. “Both sides of the technology are getting smarter.”
Another exhibitor, Neospeech, struck me as offering a very old-fashioned idea, but Richard was impressed. Their system turns printed text into spoken language. The work was the latest refinement of a technology that Richard himself had invented over thirty years ago. He tested it out, using phrases that he knew challenged the machinery. First off was, “The wind would wind around the stairs.” Sure enough, the machinery read each ‘wind’ correctly.
Questions and vowels are another challenge, but the system gave a clear reading of, “Which tea party did the father go to?” and it even had the intonation of a question, with a falling “go to.”
The final test got the demonstration device to read, “Say eighty, please.” Richard listened to it several times until he was positive that the machine really was saying the word as Americans speak it, “Eigh-ddy,” not as it is spelled.
“They must have some special rule for that one,” Richard said.
“Maybe,” the demonstrator agreed. “Or ‘eighty’ could be a part of the original recorded data.”
“Oh.” Richard was surprised. “How much did you record?”
“Thirty to forty hours per voice actor.”
Richard jolted his head back. That was long enough to record a couple of full-length novels. He had never had anything like that amount of data to work with. People have had new ideas in the past three decades, but the critical push has come from the constantly declining cost of computer memory.
The rarest kind of exhibitor provided the ability to translate spoken words into text. Recognition is especially challenging in commercial applications that let a human and a machine converse (“interact” is the preferred term). A Microsoft subsidiary called Tellme was the most prominent recognition company at the show. Richard had many questions for the demonstrators, and it quickly became apparent that the secret of designing a good IVR (Interactive Voice Response) system still lies in tightly limiting the user’s range of possible responses.
Ask, “On a scale of one to five, with five being the best score, how would you rate President Obama’s health care plan?” The machines are sophisticated enough to handle a range of numbers.
Do not ask, “What do you think of President Obama’s healthcare plan?” where the possible answers grow toward infinity. For people, language’s grand breadth is liberating, but for machinery it is still overwhelming.



Your post begins with an egregious equivocation on the concept of "mastering speech". Infants do not easily master conversion of a spoken form to a written form, in the absence of either a rich semantics of the physical world or a rich pragmatics of human interaction in a shared environment. Even when these projects succeed in their stated goals, the successful mechanism and skill set will hardly be comparable to human processing of the speech signal in conversation.
AIs are not trailing far behind human brains in the same race. They are using profoundly different mechanisms to produce complementary skills. After all, Edison's first phonograph already kicked human ass on its own terms, i.e. repeating a speech signal without major phonetic distortion.
Posted by: J. Goard | August 29, 2009 at 11:09 AM