Voice-Recognition Systems Boom
BOSTON
PHONES you can dial simply by speaking someone's name and typewriters that type what you say to them may be commonplace by the middle of the next decade, thanks to recent developments in computerized speech-recognition technology. Word processors that can listen to a person talk and produce a printed transcript have been on the market for almost a year. Although they are expensive and require the speaker to pause between words, the machines have made a dramatic difference in the lives of people with physical disabilities and promise to save time and money in clerically intensive fields such as commodities trading and medical reporting.
``I [can] dictate a letter ... and send it over a network to another computer and have it printed,'' says Frank Whitney, a computer programmer at the United States Department of Defense who is able to use only one finger. ``If I were doing it using my finger, I would be halfway through the first paragraph,'' in the same amount of time, he says.
``I have spoken to half-a-dozen handicapped folks, paraplegics, and others who are using these systems,'' says Christopher R. Seelbach, an analyst at Probe Research, a market-research firm that follows the voice-processing industry. ``For those who can afford the $10,000 to $15,000 for a system, it basically changes their lives.''
The concept is not new. Computers designed to recognize 50 to 100 spoken words have been around for nearly 15 years, says Janet M. Baker, president of Dragon Systems, a Boston-area company that sells voice-recognition equipment and software. ``The early systems didn't work very well,'' often making mistakes, and they were unable to tell the difference between background noise and speech, says Dr. Baker.
By the mid-1980s, however, the accuracy of these small-vocabulary systems had improved. Companies started using them for inventory and quality control.
``Three years ago, Xerox Corporation was able to conduct a cost-effective, 100 percent audit of 2.2 million parts in two months [using such a system],'' Baker says.
Small-vocabulary systems are speaker-dependent. They must be ``trained'' to recognize the user's voice in a 10-minute session, during which the computer flashes words on the screen and the user repeats them. Both Dragon and Kurzweil Applied Intelligence, another Boston-area firm, have recently developed large-vocabulary, speaker-independent systems that do not require training for each new user. Dragon sells a system that Baker says can recognize 30,000 spoken words. In March, Kurzweil plans to introduce a system for medical dictation that will recognize up to 10,000 words, says Vladimir Sejnoha, a research engineer with the company.
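In outline, such a training session is simple to picture in code. The Python sketch below is purely illustrative, with an invented record_utterance stand-in for the microphone; no real system's code is shown here. It flashes each vocabulary word and stores an acoustic template for it:

    import random

    def record_utterance(word):
        # Hypothetical stand-in for microphone capture: a real system would
        # digitize speech; here we fabricate a short feature sequence.
        random.seed(word)                 # deterministic per word, for the demo
        return [random.random() for _ in range(20)]

    def train_vocabulary(words):
        # Flash each word on the screen and store what the user says as an
        # acoustic template, mirroring the 10-minute session described above.
        templates = {}
        for word in words:
            print("Please say:", word)
            templates[word] = record_utterance(word)
        return templates

    templates = train_vocabulary(["start", "stop", "pass", "fail"])
    print(len(templates), "templates stored")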
Although the specific recognition techniques employed by Dragon and Kurzweil are different, basic speech recognition involves converting sound picked up by a microphone into a series of acoustic frames, or segments, each 1/100th of a second long. Each frame is analyzed, and a set of mathematical constants representing tone and change in volume is extracted. The constants are in turn translated into phones, the smallest distinctive elements of spoken language, while silences mark the breaks between words. The phone sequences are matched against a phonetic dictionary and then changed into standard English spelling.
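A toy version of that pipeline makes the steps concrete. The following Python sketch uses invented feature measures (average magnitude and zero-crossing rate), a made-up three-phone inventory, and a one-word dictionary; it illustrates the general frame-to-phone-to-word flow, not Dragon's or Kurzweil's actual methods:

    import math

    FRAME_SEC = 0.01        # one acoustic frame = 1/100th of a second
    RATE = 8000             # hypothetical sampling rate, samples per second

    def frames(samples):
        # Cut the digitized sound into fixed-length acoustic frames.
        n = int(RATE * FRAME_SEC)
        return [samples[i:i + n] for i in range(0, len(samples), n)]

    def features(frame):
        # Toy stand-ins for "tone and change in volume": average magnitude
        # (loudness) and zero-crossing rate (a crude tone correlate).
        energy = sum(abs(s) for s in frame) / len(frame)
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
        return (energy, crossings / len(frame))

    # Invented phone templates in the same two-number feature space.
    PHONE_TEMPLATES = {"t": (0.2, 0.6), "r": (0.5, 0.3), "uw": (0.8, 0.1)}

    # Phonetic dictionary: phone sequence -> standard English spelling.
    DICTIONARY = {("t", "r", "uw"): "through"}

    def recognize(samples):
        phones = []
        for fr in frames(samples):
            feat = features(fr)
            ph = min(PHONE_TEMPLATES,
                     key=lambda p: math.dist(feat, PHONE_TEMPLATES[p]))
            if not phones or phones[-1] != ph:   # collapse repeated frames
                phones.append(ph)
        return DICTIONARY.get(tuple(phones), "?")   # "?" = no match found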
The actual systems are much more complicated, Baker stresses. ``Just doing a phone identification, and then doing a look-up on that, does not work.... You need to make use of many kinds of information simultaneously.'' For example, the software considers the context of the spoken word in the sentence to determine the probability of a match against words in the phonetic dictionary. Such techniques also help the system decide between homophones like ``through'' and ``threw.''
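A simple way to picture that use of context is to weight each candidate's acoustic score by the probability of the word given its predecessor. The Python sketch below does exactly that, with invented probabilities; real systems estimate such numbers from large bodies of text:

    # Hypothetical P(word | previous word).
    BIGRAM = {
        ("he", "threw"): 0.30,     ("he", "through"): 0.01,
        ("went", "through"): 0.25, ("went", "threw"): 0.01,
    }

    def pick(candidates, acoustic, previous):
        # Score = acoustic evidence weighted by contextual probability.
        return max(candidates,
                   key=lambda w: acoustic[w] * BIGRAM.get((previous, w), 0.001))

    # "threw" and "through" sound identical, so their acoustic scores tie;
    # the preceding word breaks the tie.
    scores = {"threw": 0.5, "through": 0.5}
    print(pick(["threw", "through"], scores, "he"))     # -> threw
    print(pick(["threw", "through"], scores, "went"))   # -> through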
Less than a second after a word is spoken, the computer displays it on the screen. If the computer is not confident of its choice, it also displays a box of similar-sounding words; by saying ``take-two'' or ``take-three,'' the speaker can substitute the second or third choice for the computer's first. Depending on the system, the speaker, and the background noise, the computer's first choice is correct anywhere from 80 to 95 percent of the time.
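The correction step can be sketched as well: keep the ranked list of similar-sounding words for each position, and let a ``take-two'' command swap in the second choice. The command names follow the article; the data structures are invented:

    def apply_command(transcript, alternatives, command):
        # alternatives[i] is the recognizer's ranked word list for position i;
        # "take-two" / "take-three" replace the last word with choice 2 or 3.
        ranks = {"take-two": 1, "take-three": 2}
        if command in ranks:
            transcript[-1] = alternatives[-1][ranks[command]]
        return transcript

    words = ["the", "patient", "was", "weak"]               # first choices
    alts = [["the"], ["patient"], ["was"], ["weak", "week", "wick"]]
    print(apply_command(words, alts, "take-two"))           # ... "week"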
``One of the problems with speech recognition is the high degree of variability among individuals,'' says Mr. Sejnoha. Kurzweil's system features a special enrollment procedure in which a person speaks 400 representative words. These are used ``to produce a model for how you say the rest of the vocabulary in the system,'' explains Bernard Bradstreet, the company's president.
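One way such extrapolation can work, shown here as a sketch under invented feature representations rather than as Kurzweil's actual method, is to measure how the enrollment words differ from a reference speaker's templates and shift the entire vocabulary by the average difference:

    def mean(vectors):
        n = len(vectors)
        return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

    def enroll(reference, spoken):
        # reference: enrollment word -> reference speaker's feature vector
        # spoken:    the same words as this user actually pronounced them
        diffs = [[s - r for s, r in zip(spoken[w], reference[w])] for w in spoken]
        return mean(diffs)           # the speaker's average acoustic offset

    def adapt(full_vocab, offset):
        # Extrapolate: predict how the user would say words never enrolled.
        return {w: [x + d for x, d in zip(v, offset)]
                for w, v in full_vocab.items()}

    reference = {"rate": [1.0, 2.0], "dose": [0.5, 1.5]}    # enrollment words
    spoken    = {"rate": [1.2, 2.1], "dose": [0.7, 1.6]}    # this user's versions
    offset = enroll(reference, spoken)
    vocab = {"rate": [1.0, 2.0], "dose": [0.5, 1.5], "lesion": [2.0, 0.5]}
    print(adapt(vocab, offset)["lesion"])                   # shifted template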
Dragon Systems uses an adaptive algorithm, a procedure in which the computer updates its internal model on the basis of each word that is correctly matched.
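A common form of adaptive updating, shown here as a sketch rather than as Dragon's proprietary algorithm, nudges the stored template toward each correctly matched utterance:

    ALPHA = 0.1   # how far to move the template toward the new utterance

    def update(template, utterance, alpha=ALPHA):
        # After a correct match, blend the stored template with what was
        # just heard, so the model slowly tracks the speaker.
        return [(1 - alpha) * t + alpha * u for t, u in zip(template, utterance)]

    template = [1.0, 2.0, 3.0]
    heard    = [1.2, 1.8, 3.1]     # features from a correctly matched word
    print(update(template, heard))  # template drifts toward the speaker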
Although having to pause between spoken words is a drawback, most people can dictate at 15 words per minute the first time they use the system, Baker says.
``My 12-year-old son did all of his social studies reports on it last year,'' she adds. ``It was faster for him than typing. Most 12-year-olds are not skilled typists.''
Neither are the ``majority of people in the professional and business community,'' she says. As an added benefit, every word is perfectly spelled. ``We think [widespread use of speech-recognition technology] would dramatically improve the accessibility of computers and the information available through them.''
Dictation systems for physicians based on both Dragon's and Kurzweil's technology were shown a few weeks ago at the annual meeting of the Radiological Society of North America in Chicago.
``This is the future,'' says Melvyn Conrad, a radiologist at the Nan Travis Memorial Hospital in Jacksonville, Texas. ``One of the major advantages is you can see the words appear in front of you, so you don't have to review the words later.'' Most radiologists today record their reports on tape, and sometimes have to wait as long as a week to review the typed copies. In addition to the delay, ``the secretary doesn't always type what you say,'' says Dr. Conrad.
Medical reporting has been one of the first applications ``to really take off,'' because of the limited vocabulary and the tremendous clerical backlog, says Mr. Seelbach of Probe Research. Kurzweil has 137 medical dictation systems already in the field, says Mr. Bradstreet.
Nevertheless, researchers are working hard to develop systems that are more accurate and able to recognize speech without pauses between words. At its research laboratory in Yorktown Heights, N.Y., IBM has developed a speech-recognition system that averages between 90 and 95 percent accuracy on discrete words. Even that, say some researchers, isn't good enough. ``It puts too much burden on the user to have to correct every 10th word,'' says Lalit Bahl, manager of natural-language speech recognition at the laboratory.
Besides increased accuracy, says Dr. Bahl, speech-recognition systems need to be able to process continuous speech, so that users don't have to pause between each word. IBM has such a system, but ``it's about 50 to 100 times slower than the real-time systems,'' Bahl says. Running on an IBM 3090 mainframe, it takes the computer more than an hour to analyze a minute-long recording of a person speaking in a normal voice.
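A quick check of the arithmetic behind those figures, in the same sketch style (the 65-minute figure below is an assumption standing in for ``more than an hour''):

    def real_time_factor(compute_seconds, speech_seconds):
        # How many seconds of computing per second of speech.
        return compute_seconds / speech_seconds

    print(real_time_factor(65 * 60, 60))   # -> 65.0, within Bahl's 50-100 range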
``If you look toward the future, the area of speech recognition that we are working in will clearly lead to [a system with] a functionally unlimited vocabulary, complete speaker independence, and continuous speech recognition,'' says Kurzweil's Bradstreet.
Voice-recognition researchers expect that it will be three to five years before desktop computers have enough power to do the job.