Monday, Apr. 01, 1985

His Master's (Digital) Voice

By Philip Elmer-DeWitt

When Phyllis Weber learned earlier this year that a two-year-old child had died of brain disease in a San Francisco hospital, she immediately dialed a 24-hour alert number in Pittsburgh. The child's liver was still in good condition but would quickly deteriorate, and Weber, who is director of the Northern California Transplant Bank, had only a few hours to find someone who could use it. The voice she reached at the headquarters of the National Association of Transplant Coordinators (NATCO) wasted no time getting to the particulars:

"Age of donor?" it asked.

"Two years," Weber replied.

"What is the baby's weight?"

"Nineteen pounds."

"What was the blood type?"

"A-positive."

"Do you want the entire country searched for need?"

"Yes."

After a few moments, the good news came over the line: "The University of Minnesota has a possible recipient. Please dial the following number . . ."

Because of that short telephone conversation, Weber was able to locate a child in Minnesota who needed a liver-transplant operation. Even more remarkable, however, is the fact that the operator who took Weber's call was a machine. In fact, the NATCO alert line is run entirely by a computer that can both talk and listen.
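A scripted exchange of this kind is easy to model. The Python sketch below is purely illustrative, not a description of NATCO's actual software: the questions echo the article, but the recipient registry and the matching rule are invented, and the real system took spoken rather than typed answers.

```python
# A purely illustrative sketch of a scripted alert line; the questions
# echo the article, but the registry and matching rule are invented.

QUESTIONS = [
    ("age", "Age of donor?"),
    ("weight", "What is the baby's weight?"),
    ("blood_type", "What was the blood type?"),
]

# Hypothetical registry of waiting recipients.
RECIPIENTS = [
    {"center": "University of Minnesota", "blood_type": "A-positive"},
]

def run_alert_line():
    """Ask each scripted question, then search the registry for a
    recipient with a matching blood type."""
    donor = {}
    for field, prompt in QUESTIONS:
        donor[field] = input(prompt + " ")  # spoken, in the real system
    for recipient in RECIPIENTS:
        if recipient["blood_type"] == donor["blood_type"]:
            return f"{recipient['center']} has a possible recipient."
    return "No recipient found."

if __name__ == "__main__":
    print(run_alert_line())
```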

Automatic speech recognition, the technology that enables computers to respond to spoken commands, is old hat to fictional electronic brains like HAL in 2001: A Space Odyssey, but still a primitive art in the real world. Computers are not yet discerning enough to cope with the ambiguities of spoken language or with a wide range of accents and tonal qualities. Making sense out of human discourse, says Dataquest Analyst Kenneth Lim, "is quite possibly the most difficult thing for a computer to do, other than actually thinking."

But computers, and their designers, are rising to the challenge. Hundreds of office and factory workers are already using simple voice-control computer systems to do everything from dialing telephones to controlling assembly lines. At Chicago's O'Hare Airport, for example, United Airlines baggage handlers call out the names of airports as they toss suitcases on a computer-driven conveyor belt. A voice-recognition system, responding to their commands, dumps each piece of luggage into a tray marked for the appropriate airport and sends it rolling toward its destination.

At California's Edwards Air Force Base, a more sophisticated system is being readied for use by test pilots who this fall will fly F-16 jets with voice-control equipment through simulated air and ground attacks. Rather than punching keys and flipping switches, pilots will bark out orders like "Arm the missiles!" Computers will then do the work, leaving the airmen free to concentrate on their targets.

Detroit, which has been producing talking cars for several years ("Your fuel is low"), is now testing models that listen. Several automakers have built experimental cars equipped with voice-activated door locks, starters and windshield wipers. Says Jerry Rivard, a chief engineer at Ford: "The technology is here today."

Less complex versions of that technology have begun to show up at electronics stores. Adventurous consumers can purchase voice-controlled personal computers, speech-activated video games, even videotape-editing machines that understand commands like "cut" and "splice."

The voice-recognition systems in most of these devices can be compared to a human searching through a key ring for duplicate pairs of keys; by aligning two keys at a time, he can quickly locate any pairs that have identical sawtooth patterns. In the electronic equivalent, a spoken command is transformed into a varying electrical current that can be represented by a wave with a characteristic set of peaks and troughs. This wave form (see diagram) is converted into a template, a pattern of zeros and ones that the computer can digest and store. By prerecording a number of key words, a user can build up a library of digital templates, each corresponding to a particular computer instruction. Then, whenever he utters a command, the computer compares the incoming pattern with the templates stored in its memory. When it finds a pattern that matches its master's voice, the machine executes the proper command.
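The template-matching idea can be sketched in a few lines of code. The Python below is a toy illustration, not any vendor's actual system: it quantizes a wave form into a coarse pattern of zeros and ones, then picks the stored command whose pattern differs from the incoming one in the fewest places, much like aligning two keys. All data and names are invented.

```python
# A minimal sketch of the template-matching idea, using invented data;
# a real system would digitize live microphone input instead.

def make_template(samples, frames=16):
    """Quantize a wave form into a coarse pattern of zeros and ones:
    1 where a frame's energy is above the overall mean, else 0."""
    frame_len = max(1, len(samples) // frames)
    energies = [sum(abs(s) for s in samples[i * frame_len:(i + 1) * frame_len])
                for i in range(frames)]
    mean = sum(energies) / len(energies)
    return [1 if e > mean else 0 for e in energies]

def best_match(pattern, library):
    """Return the stored command whose template differs from the
    incoming pattern in the fewest places, like matching two keys."""
    def distance(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(library, key=lambda cmd: distance(pattern, library[cmd]))

# The prerecorded "key ring": one template per spoken command.
library = {
    "cut":    make_template([0, 5, 9, 8, 2, 1, 0, 0] * 4),
    "splice": make_template([1, 1, 2, 6, 9, 9, 7, 3] * 4),
}

incoming = make_template([0, 4, 9, 7, 3, 1, 0, 0] * 4)  # a fresh utterance
print(best_match(incoming, library))  # -> cut
```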

Ingenious as it is, the template-matching system has some drawbacks. It must be preprogrammed for each individual speaker, is limited in vocabulary (often to 100 words or fewer) and is rather finicky. A speaker must enunciate clearly and pause after each word. If he catches cold, the computer may ignore his commands, and if he slurs his words, he may be misunderstood.

Try as they might, computer scientists seemed unable to overcome these deficiencies--until Victor Zue came along. Zue is a Chinese-born M.I.T. scientist who decided to teach himself to read spectrograms (computer-enhanced versions of the electrical wave forms of speech) as if they were words. This was no easy task. While spectrograms made by one person repeating the same word look alike, those made by different speakers saying the same word can differ considerably. Zue discovered, however, that no matter how unlike two spectrograms of the same word appear, they share certain features. For example, the s in stop will appear as a dark rectangular wedge, no matter who is talking. Zue eventually cataloged hundreds of these identifying features, learning to recognize syllables, words, even whole sentences and nonsense phrases.

When news of Zue's accomplishment reached speech scientists at Pittsburgh's Carnegie-Mellon University, it was greeted with great excitement. The researchers knew that if a human could read a spectrogram, then a computer could too. In 1979 they invited Zue to spend a couple of days in their Pittsburgh laboratories. "I had no idea what was up," says Zue. The invitation turned out to involve 48 hours of rigorous testing with hundreds of voice spectrograms. At one point, the Carnegie-Mellon team tried to trip up Zue with the phrase "A stitch in dime saves nine," expecting him to misread dime as time. But Zue, having grown up in China, had never heard the old maxim. He passed the test and made believers of a generation of speech scientists.

Zue's discovery showed researchers how they could dispense with prerecorded templates. Now they could program their computers to identify the shapes and patterns that Zue had recognized in the spectrograms. That immediately made the machines more versatile. Rather than trying to match every peak and trough in the wave forms of someone's voice, they could search for only those acoustic features that are universal in certain words, no matter who speaks them. Advanced word-recognition systems using this technique are already in the hands of the National Security Agency, the top-secret Government bureau that monitors global communications networks. Eavesdropping on overseas telephone calls, NSA's supercomputers can pick out key words from among millions of simultaneous phone conversations, then isolate and tape-record any suspicious call for further investigation by human analysts.
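The shift from template matching to feature spotting can likewise be sketched in code. The toy Python below assumes, as a deliberate simplification, that a high zero-crossing rate marks an s-like hiss no matter who is speaking; real systems are far more elaborate, and the test signals here are synthetic stand-ins.

```python
import math
import random

# A toy illustration of feature-based recognition. It assumes, as a
# simplification, that a high zero-crossing rate marks an s-like hiss
# regardless of speaker; the test signals below are synthetic.

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs that change sign: hissing
    fricatives like the s in 'stop' score high, vowels score low."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / (len(frame) - 1)

def has_s_feature(samples, frame_len=80, threshold=0.4):
    """Scan a wave form frame by frame for the speaker-independent
    fricative cue, instead of matching a prerecorded template."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    return any(zero_crossing_rate(f) > threshold for f in frames)

random.seed(1)
hiss = [random.uniform(-1, 1) for _ in range(400)]                 # s-like noise
vowel = [math.sin(2 * math.pi * 3 * t / 400) for t in range(400)]  # vowel-like tone

print(has_s_feature(hiss + vowel))  # True: an s-like segment is present
print(has_s_feature(vowel))         # False: no fricative anywhere
```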

The commercial possibilities of the advanced systems seem boundless. Kurzweil Applied Intelligence in Waltham, Mass., for example, is testing an automatic typewriter that will print out any of 10,000 spoken words. That development will come none too soon for James Ickes, 33, of Redondo Beach, Calif., who was paralyzed from the neck down in a football accident 14 years ago. Now he can use a voice-activated computer to dial his telephone, operate a ham radio and compose his mail. He has even started writing his autobiography, dictating it one letter at a time. Cumbersome as this procedure is, Ickes has no complaints. "Previously, I vegetated," he says. "Now I have goals that keep me busy all day."

CHART: [Diagram of a speech wave form, referenced above; chart text not available]

With reporting by Marion Lewenstein/San Francisco and Damian Musello/Boston