A clock, a toy, a foreign car, a TV remote control, an elevator, a calculator, an inventory control system, and Hal -- the computer of "2001: A Space Odyssey" -- all have something in common. They are examples of manmade objects that either speak or understand human speech. Yet only the last one, Hal, is science fiction. The rest are real products presently being sold either in the US or abroad.
They are the results of technological advances in the field of speech processing. Helped by advances in very large scale integration (VLSI) -- so-called "microchip" technology -- speech processing promises to revolutionize both industry and commerce. It should also have a significant impact on all of us as individuals in our jobs and in our everyday life.
The field of speech processing may be divided into three major categories -- speech synthesis, speech recognition, and voice transmission.
Speech synthesis involves a system that produces spoken words or phrases. Techniques for electronic speech synthesis (ESS) range from modeling the human vocal tract to storing the basic elements of speech and reassembling them according to established linguistic rules. These elements, called phonemes, play a role in speech equivalent to that of letters in text.
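The stored-elements approach can be pictured with a toy sketch. Everything in it is an assumption made for illustration: the three-word lexicon, the phoneme symbols, and the placeholder "waveform" numbers are hypothetical, and a real talking chip would store recorded or parametrically encoded sound and apply far richer linguistic rules.

```python
# Hypothetical pronunciation lexicon: word -> phoneme symbols (illustrative only).
LEXICON = {
    "turn": ["T", "ER", "N"],
    "off": ["AO", "F"],
    "lights": ["L", "AY", "T", "S"],
}


def make_unit_inventory(lexicon):
    """Build a placeholder store of one short 'waveform' per phoneme.

    The sample values are made-up numbers standing in for stored sound.
    """
    phonemes = sorted({p for phones in lexicon.values() for p in phones})
    return {p: [i + 1] * 3 for i, p in enumerate(phonemes)}


def to_phonemes(text, lexicon):
    """Spell a phrase in phonemes, much as text is spelled in letters."""
    phones = []
    for word in text.lower().split():
        if word not in lexicon:
            raise KeyError(f"no pronunciation stored for {word!r}")
        phones.extend(lexicon[word])
    return phones


def synthesize(text, lexicon, units):
    """Concatenate the stored unit for each phoneme into one sample list."""
    samples = []
    for phoneme in to_phonemes(text, lexicon):
        samples.extend(units[phoneme])
    return samples


UNITS = make_unit_inventory(LEXICON)
print(to_phonemes("turn off lights", LEXICON))
print(len(synthesize("turn off lights", LEXICON, UNITS)), "samples")
```

The point of the sketch is only the division of labor: a small inventory of speech elements, plus rules for stringing them together, can produce phrases that were never stored whole.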
With the prospect of 100,000 circuits on a single "chip" at extremely low cost, theoretical developments in speech synthesis have already led to more than a dozen "talking chips" for generating synthetic speech, either on sale or in design. Estimates of the market potential for ESS range from $1 billion to $5 billion by 1990.
Typical products or applications already available include a talking clock that tells the time when you push a button, a telephone answer-back system that informs you of the incorrect number you have just dialed or asks for more money, a car that requests "please turn off the lights" in a soft feminine voice, and an elevator that announces the floor at which it is stopping. An optical character reader (OCR) with a synthetic voice can help a blind reader. And many children have enjoyed an educational spelling game that repeats the letters you have selected and lets you know whether or not you are right. If not, it gives you one more chance and then tells you the answer.
However, the most exciting prospects by far lie with the second and less advanced area, that of speech recognition. The motivation for developing a system that can accept the spoken word as an input command comes from the almost endless opportunities for systems with electronic ears.
Speech recognition, implemented by "listening chips," may be broadly broken down into two major types of applications -- speaker recognition and word recognition. In speaker recognition, the objective is to determine the identity of the speaker based on his or her speech characteristics rather than to determine what is said. It is being studied both for military use and for handling problems of security such as credit card identification or suspect identification (as with fingerprints).
Needless to say, the combination of speaker and word recognition is a much-sought-after capability. It has been exemplified by the James Bond movies where people can open safes using voice commands, an act which requires recognition both of the numbers spoken and of the speaker. However, this is considerably farther down the road.
In the more immediate future, the next three to five years, the real interest is in word recognition without much regard to the identity of the speaker. Applications are best characterized in terms of the size of the vocabulary required, the number of different speakers the system must handle, and the amount and type of background noise or interference that may need to be tolerated.
Ideally, all of these capabilities should be maximized. In practice, applications usually involve only one or two of these factors in a significant way. For example, dictating directly to a typewriter involves a large vocabulary. (With my handwriting, I wish I could do it as I write this article.) However, it can be adapted to a small number of speakers, would operate in a moderately noisy environment in a typical office, and offers twice the speed of the best typist. On the other hand, an automated telephone message service would involve a large number of speakers, but, using appropriate code identification numbers, could use a relatively small vocabulary and could operate in a moderately noisy environment.
In general, present applications in word recognition use voice recognition systems which are first "trained" by the speaker to "learn" his or her vocal features. In this way, one adaptable system may be used with a relatively large number of speakers by individual training.
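The idea of "training" can be made concrete with a minimal sketch. It assumes, purely for illustration, that each spoken word has already been reduced to a short list of numbers (a feature vector), and it uses simple nearest-template matching; the class name, the feature values, and the distance measure are all hypothetical choices, not a description of any particular product.

```python
import math


class TemplateRecognizer:
    """Toy speaker-dependent recognizer: one averaged template per word."""

    def __init__(self):
        self.templates = {}  # word -> averaged feature vector for this speaker

    def train(self, word, examples):
        """Average several examples of one word spoken by the same speaker."""
        count = len(examples)
        dims = len(examples[0])
        self.templates[word] = [
            sum(example[d] for example in examples) / count for d in range(dims)
        ]

    def recognize(self, features):
        """Return the trained word whose template lies closest to the input."""
        return min(
            self.templates,
            key=lambda word: math.dist(self.templates[word], features),
        )


# Training phase: the speaker says each word a couple of times.
recognizer = TemplateRecognizer()
recognizer.train("yes", [[1.0, 0.2], [0.8, 0.0]])
recognizer.train("no", [[0.1, 0.9], [0.3, 1.1]])

# Recognition phase: a new utterance is matched against the stored templates.
print(recognizer.recognize([0.7, 0.3]))  # prints "yes"
```

Because the templates come from one person's voice, a second speaker would simply run the training phase again with his or her own examples, which is what makes one adaptable system usable by many people.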
The ultimate objective is to recognize connected speech (normal spoken sentences) independently of the speaker. The present "state of the art" is demonstrated by systems which now can recognize about 100 words following a "training phase" and are therefore termed speaker dependent. The words, moreover, must be spoken with reasonable time spacing between them. The systems thus are called isolated word recognizers. The best systems available reduce the required pause between words to about 30 milliseconds (0.03 seconds). This comes very close to connected speech, but doesn't quite make it.
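Why the pause matters can be seen in a small sketch of how a recognizer might find word boundaries. It assumes the sound has been chopped into 10-millisecond frames with a loudness (energy) value for each; the frame length, the silence threshold, and the function name are illustrative assumptions, and only the 30-millisecond gap comes from the figure quoted above.

```python
def split_isolated_words(energies, frame_ms=10, silence_threshold=0.1, min_gap_ms=30):
    """Return (start, end) frame ranges, end exclusive, one per detected word.

    A word ends only when the silence after it lasts at least min_gap_ms;
    shorter dips in energy are treated as part of the same word.
    """
    min_gap_frames = -(-min_gap_ms // frame_ms)  # ceiling division
    words = []
    start = None        # first frame of the word in progress
    last_voiced = None  # most recent frame above the silence threshold
    for i, energy in enumerate(energies):
        if energy > silence_threshold:
            if start is None:
                start = i
            last_voiced = i
        elif start is not None and i - last_voiced >= min_gap_frames:
            words.append((start, last_voiced + 1))
            start = None
    if start is not None:
        words.append((start, last_voiced + 1))
    return words


# Two words separated by 30 ms of silence; the second has a 10 ms dip inside it.
frames = [0.0, 0.5, 0.6, 0.0, 0.0, 0.0, 0.7, 0.8, 0.0, 0.9, 0.0]
print(split_isolated_words(frames))  # prints [(1, 3), (6, 10)]
```

In connected speech the silences between words shrink or vanish altogether, so a boundary finder of this kind has nothing to hold on to. That is one way to see why closing the last 30 milliseconds is harder than it sounds.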
These present limitations notwithstanding, the US market for speech recognition modules was $10 million in 1980 and is expected to be at least 10 times that by 1984, reaching about $1 billion by 1988. Here again, VLSI has already played a significant role in the development of "listening chips." One company, Interstate, is planning to introduce the first speech recognition device on a single chip later this year. Although limited to only eight words with an estimated 85 percent correct recognition capability, it will cost only $10!
There are also several effective applications already in place for speech recognition. Computer graphic terminals are being marketed which operate using a limited number of voice commands. In industry, automatic sorting systems are in place using voice control, and robot arms can be operated under voice control when the operator has both hands occupied. For home applications, a telephone message service recognizes spoken identification codes and then recites your messages, and remote speech control of color TV sets has been demonstrated.
The last category of speech processing involves transmission of voice from place to place. This usually occurs over telephone lines and satellite channels. In large part, this application has been researched and developed by a few select companies over the years. The primary mode of transmission is digital -- that is, the voice signals are represented by numerical codes made up of the ones and zeros of the binary number system. This is done for reasons of efficiency, multiplexing (more than one conversation on a common set of wires), and power reduction and also because digital signals are easily regenerated.
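To make "numerical codes made up of ones and zeros" concrete, here is a minimal sketch of turning voice samples into binary digits. It assumes samples scaled between -1.0 and 1.0 and evenly spaced levels with 8 bits per sample; these are simplifying assumptions, since telephone systems typically space the levels unevenly (companding) to favor quiet sounds, and the function names are hypothetical.

```python
def quantize(sample, bits=8):
    """Map a sample in [-1.0, 1.0] to the nearest integer code in [0, 2**bits - 1]."""
    levels = 2 ** bits
    clipped = max(-1.0, min(1.0, sample))
    return int((clipped + 1.0) / 2.0 * (levels - 1) + 0.5)


def to_bits(code, bits=8):
    """Write an integer code as a fixed-width string of ones and zeros."""
    return format(code, f"0{bits}b")


def encode(samples, bits=8):
    """Encode a run of samples as one continuous stream of binary digits."""
    return "".join(to_bits(quantize(s, bits), bits) for s in samples)


print(quantize(0.0))        # prints 128
print(to_bits(128))         # prints 10000000
print(encode([-1.0, 1.0]))  # prints 0000000011111111
```

Once the voice is a string of ones and zeros, a repeater along the line only has to decide whether each digit is a one or a zero and send out a clean copy, which is why digital signals are so easily regenerated.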
Conventional digital transmission, known as pulse code modulation (PCM), requires transmitting between 48,000 and 64,000 binary digits (ones and zeros) per second. However, researchers over the years have been able to transmit voice over these same channels using special techniques which require only one-tenth as much code, or 4,800 binary digits (bits) per second. In some cases, they have reduced this requirement to as low as 1,200 bits per second. Yet the signals retain the intelligibility and recognizability required by the customer. This has led to more conversations being carried over the same wires than ever before. The technique has become particularly appealing with the advent of optical fibers whose ability to carry many messages on a laser beam has further increased the capacity for voice traffic.
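The arithmetic behind these figures is easy to check. The sketch below assumes the usual telephone practice of taking 8,000 samples per second; the choice of 8 and 6 bits per sample is an assumption made to reproduce the 64,000 and 48,000 figures quoted above, and the function names are hypothetical.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample):
    """Bits per second needed to send every sample as a fixed-size code."""
    return sample_rate_hz * bits_per_sample


def conversations_per_channel(channel_bps, voice_bps):
    """How many whole conversations fit in a channel of the given capacity."""
    return channel_bps // voice_bps


high = pcm_bit_rate(8000, 8)  # 64,000 bits per second
low = pcm_bit_rate(8000, 6)   # 48,000 bits per second
print(high, low)

# One-tenth of the lower PCM rate is the 4,800 figure.
print(low // 10)  # prints 4800

# Conversations that fit where a single 64,000-bit PCM conversation went before.
print(conversations_per_channel(high, 4800))  # prints 13
print(conversations_per_channel(high, 1200))  # prints 53
```

On these assumptions, a channel that once carried a single conversation could carry about a dozen at 4,800 bits per second and about fifty at 1,200, before allowing for any overhead a real system would need.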
Overall, voice is by far our best means of communication. It is natural to want to communicate with our machines, as well as with one another, over long distances and using our own language for voice commands. And it is reasonable for us to ask these machines to return the favor. Strongly supported by the recent VLSI technology advances, this is but the beginning of what promises to be one of the most rapidly developing fields in technology over the next 10 years.