Hearing voices

Evolving science of vocal tones catches up to what baby knows

By Mary Wiltenburg Staff writer of The Christian Science Monitor

February 13, 2003

They can say they want to live. But for Kim Cates, founder of a Boston-based suicide hotline, the question of whether her callers are a danger to themselves really comes down to how they say it.

"Because stuff comes out that they don't even know themselves," she says. When they're too upset to be lucid, their tone "gives them away."

Though the stakes are rarely this high, we all make such judgments about strangers based on their voices. Every conversation we have carries a subtext that would be invisible to someone reading its script: the uptilt to a question, the long sneer of sarcasm, or the quaver of uncertainty. Only 7 percent of the meaning of what people say comes across in the words they choose, says psychologist Albert Mehrabian, who has spent the past four decades researching communication. More than five times as important is what their voices convey.

The study of these vocal cues has lately come to the fore because of a growing commercial demand for speech-recognition software. Used for everything from taking basic dictation to unlocking sophisticated security systems, these programs rely on an understanding of what most of the linguists, engineers, computer scientists, and speech therapists involved in its study term "prosody": the intonation, stress, and rhythm that make up the music of a voice.

Even now, some computer systems well-attuned to prosodic cues can speak and "listen" in an eerily "natural" way. Last year, Amtrak replaced a touch-tone phone system, which had driven callers crazy, with "Julie," a software package whose designers and users talk about as though "she" were human. "She's been very popular," says Amtrak spokeswoman Karina VanVeen. "Some people don't realize she's a computer 'til halfway into the call."

But even before computer research forced the issue, the rest of us were subliminally studying prosody. Most kids begin to learn this language - more vital than the vocabulary it underlies - as they're learning to talk, by mimicking the adults around them. From there, it runs its mostly unconscious course right into adulthood, every year adding increasingly complex layers of meaning to conversations with family, friends, and co-workers - and worlds of conjecture to phone conversations with strangers.

"We're always listening with a third ear," says professor of speech and hearing sciences Moya Andrews, "for anger, for the subtext of a conversation, for a sympathetic voice."

Before she went blind a quarter-century ago, Cheryl Linnear painted portraits. Even today, she says, when she meets a new person and hears their voice for the first time, she pictures them: "Do they have those lines on their forehead that mean they frown a lot, or those worry lines around their mouth? You learn a lot by looking at people."

But over the years, Ms. Linnear says, she's come to regard those visual cues as secondary to her understanding of people - and not just because she can't see them. If you're really listening, she explains, "it's like the voice gets inside the soul - it leads you there a lot more than if you just looked at someone."

Speech therapists insist that visual cues are for most people a major component of communication. (In his book "Silent Messages," Dr. Mehrabian finds that gesture and facial expression account for 55 percent of the meaning of speech, prosodic cues for 38.) But all agree that a sensitivity to vocal cues is critical to emotional understanding.

"Think, for example, about how many different ways you can say 'I love you,' " says Dr. Andrews. "You can say it scornfully, you can say it playfully. So many different ways, and it all depends on vocal behaviors that are not included in the text."

Don't use that tone with me

When the first computer speech-simulation programs came out in the 1950s and '60s, they were without even a nod to prosodic subtleties. Every syllable had the same length, emphasis, and tone: The result was "that flat robotic monotone from early sci-fi movies," remembers Robert Ladd, professor of linguistics at the University of Edinburgh in Scotland.

Even efforts in the '70s and '80s to model computerized voices more closely on human speech resulted in what Stephen Springer, a design director at SpeechWorks, a Boston-based speech-recognition software company, calls "the funny-sounding computer systems that always sounded" - he imitates, with heavy accent - "like a drunken Swede when they talked to you." ("No offense to your Swedish readers," he adds, "that's kind of industry shorthand.")

That more-or-less featureless monotone, known professionally as "flat affect," is often also the butt of complaints about telemarketers: Not only do they call just at suppertime, but they sound half dead.

"If you're at dinner and the phone rings, and you pick it up and hear someone reading off a joyless script," says Mr. Springer, "your mind turns immediately to 'How can I get this person off the phone?' You're already steeling yourself, and they've maybe said six words to you. They're not even bad words."

Verbal cues

But parents and teachers of kids diagnosed with neurological conditions like autism and Asperger's Syndrome know in a serious way the communicative cost of an underdeveloped sense of speech prosody. People with severe autism operate in a sort of linguistically sealed environment, unable to decode or produce emotional voice cues; their voices often come across as robotic or monotone.

"And in school," asks Patricia Prelock, autism and language-learning disabilities specialist at the University of Vermont, "if the teacher says, 'It's awfully noisy in here,' and her voice is angry or upset, how is the child who doesn't hear those emotions going to know to be quiet?"

The number of cases of autism in the US is ten times what it was a decade ago, so there's a great deal of study now being done about how autistic children's brains short-circuit prosody. Researchers hope their work will yield better speech and language therapy techniques for people with a host of communication disorders. In most cases, they say, prosody can indeed be taught.

But in other instances, particularly reacting to sounds and voices, the body seems to have a mind of its own. We're programmed, for instance, to react to a 20-hertz rumble (roughly the frequency of a distant elephant stampede) with terror, and to certain musical frequencies with deep sympathy.

"That was Martin Luther King, in his 'I Have a Dream' speech," says Edward Komissarchick, vice president of BetterAccent, a softwaremaker that helps non-native speakers map their vocal patterns to more closely mimic those of native English speakers. In school, he read a study that analyzed tapes of Dr. King's famous speech, and compared them to recordings of King's regular speaking voice.

When he delivered the speech, the pitch of King's voice was much higher than that of his normal voice - a frequency usually only heard in music. "When he said, 'I have a dream,' he didn't speak," Dr. Komissarchick says. "He sang. That's how he created the magic of that."

Don't say it, sing it

In fact, says Andrews, whether or not we realize it, the voice always is a performance. For 25 years, she directed Indiana University's speech and language clinic, which worked closely with the nearby Kinsey Institute for Research in Sex, Gender, and Reproduction. Over the years, in addition to stuttering kids and shy corporate execs, Andrews coached many transgender clients in a particular prosodic application: making the vocal transition from male to female.

That meant teaching them to control their voice pitch and mimic typically female voice patterns - a more musical cadence, an upswing at the end of a sentence. For transgender people, Andrews explains, "usually the voice is the big giveaway, in terms of 'passing'.... When they talk, if it sounds really masculine, heads usually turn, and that's what they're really afraid of."

As are we all, she says: "Like the teen-ager who moves from Georgia to California and boom, drops her accent, at some level, we're always trying to fit in."
