AWS Machine Learning Blog

Modify the Timbre of Amazon Polly Voices with the New Vocal Tract SSML Feature

Today, the Amazon Polly team is pleased to announce the release of a new Speech Synthesis Markup Language (SSML) feature that allows developers to modify the timbre of any Text-to-Speech (TTS) voice. This is a compelling feature for customers who want to customize an existing voice in the Amazon Polly portfolio to make it sound more like the specific persona that they are seeking for their use case. It’s especially useful for customers with scenarios that involve many different voices because the timbre feature makes it easy for them to customize multiple voice personas from each of the available Amazon Polly voices.

What is timbre?

Timbre describes the perceived color or quality of a sound, independently of pitch or loudness. It’s often used in music, for example to point out the difference between brass instruments and string instruments, or to describe the more subtle distinction between a viola and a violin. Timbre is the perceived attribute that sets each instrument apart, even when they’re all playing the same note at the same volume. Similarly, in the context of speech, timbre is what differentiates one voice from another, even when their pitch (that is, their fundamental frequency) and loudness (amplitude) are the same.

The sound of each individual’s voice is unique due to a variety of factors, including that person’s physiology, as well as the way in which they produce sound. The size and shape of a person’s vocal folds, vocal tract, and even their overall body size and shape, play an important part in determining their typical speech quality. The ways in which a person positions the tongue, tightens or relaxes the muscles, or applies air pressure are just some of the techniques by which pitch, volume, and timbre of the voice can be modified. Trained impersonators learn to control these movements to such a degree that they are even able to alter their voices to make themselves sound like somebody else.

Vocal tract versus pitch

An important physiological feature that contributes to speech timbre is the vocal tract, an air-filled cavity that spans from the top of the vocal folds up to the edge of the lips. A variety of muscles make it possible to change the shape of the vocal tract by making it longer, shorter, wider, or narrower. These changes cause the resulting speech sounds to be amplified or filtered out.

Pitch is an auditory attribute that causes a sound to be perceived as higher or lower. In the context of speech production, pitch is determined by the frequency of vocal fold vibration. Women generally have shorter vocal folds that vibrate faster (~180-200 cycles per second) than those of men, whose vocal folds are, on average, longer and vibrate more slowly (~110 cycles per second). Similarly, the average vocal tract length is shorter for women than it is for men (~14 cm vs. ~17 cm).
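
To get a rough feel for why vocal tract length affects timbre, the following sketch uses a common textbook idealization: the vocal tract is treated as a uniform tube, closed at the vocal folds and open at the lips, whose resonances fall at odd multiples of c / 4L. This is an illustration only and is an assumption of this sketch. It does not describe how Amazon Polly implements the feature, and the speed of sound and the function name are chosen here for the example.

```python
from typing import List

# Assumed value: speed of sound in dry air at roughly 20 degrees Celsius.
SPEED_OF_SOUND_M_PER_S = 343.0


def tube_resonances_hz(length_cm: float, count: int = 3) -> List[float]:
    """Resonances of an idealized uniform tube closed at one end.

    The n-th resonance is (2n - 1) * c / (4 * L). Real vocal tracts are
    not uniform tubes, so treat the numbers as rough orientation only.
    """
    length_m = length_cm / 100.0
    return [
        (2 * n - 1) * SPEED_OF_SOUND_M_PER_S / (4.0 * length_m)
        for n in range(1, count + 1)
    ]


# Average lengths quoted in the text: ~17 cm and ~14 cm.
longer_tract = tube_resonances_hz(17.0)
shorter_tract = tube_resonances_hz(14.0)
```

Under these assumptions, every resonance of the shorter tube is higher than the corresponding resonance of the longer one, which is consistent with a shorter vocal tract sounding "smaller" even when the pitch is unchanged.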

There is a natural correlation between vocal fold length and vocal tract length: when one increases, the other tends to increase as well. The timbre feature allows developers to change the size of the vocal tract while retaining the ability to control pitch separately.

Vocal tract and speech synthesis

With the vocal-tract-length SSML tag, you can now control the timbre of the synthesized speech by changing the length of the speaker’s vocal tract. This can sound like a change in the speaker’s body size. When you increase the vocal tract length, the speaker sounds bigger. When you decrease it, the speaker sounds smaller. This tag can be used with any of the voices in the Amazon Polly Text-to-Speech portfolio.

Here’s how you can modify the length of the speaker’s vocal tract:

  • +n% or -n%: adjusts the vocal tract length by a relative percentage change in the current voice. For example, +4% or -2%.
  • n%: adjusts the vocal tract length to an absolute percentage value of the current voice. For example, 104% or 98%.
  • Vocal tract length can be increased up to +100%, and down to -50%.
  • To reset the vocal tract length to the default value for the current voice, use <amazon:effect vocal-tract-length="100%">.
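
The rules in the list above can be sketched as a small validation helper. This is a minimal sketch, not part of any AWS SDK: the function name is hypothetical, and two things are assumptions made here rather than statements from the text, namely that the absolute form shares the same limits (that is, 50% to 200%) and that decimal percentages are accepted.

```python
import re

# Limits quoted in the list above, expressed as a change relative
# to the default voice: down to -50% and up to +100%.
MIN_RELATIVE = -50.0
MAX_RELATIVE = 100.0


def normalize_vocal_tract_length(value: str) -> float:
    """Return the requested change relative to the default voice, in percent.

    Accepts the relative form ("+4%", "-2%") and the absolute form
    ("104%", "98%"). "100%" means the default voice, so it maps to 0.0.
    """
    match = re.fullmatch(r"([+-]?)(\d+(?:\.\d+)?)%", value.strip())
    if match is None:
        raise ValueError(f"not a percentage value: {value!r}")

    sign, number = match.group(1), float(match.group(2))
    if sign == "+":
        relative = number
    elif sign == "-":
        relative = -number
    else:
        # Absolute form: 100% is the unmodified voice.
        relative = number - 100.0

    if not MIN_RELATIVE <= relative <= MAX_RELATIVE:
        raise ValueError(
            f"{value!r} is outside the supported range of "
            f"{MIN_RELATIVE:+.0f}% to {MAX_RELATIVE:+.0f}%"
        )
    return relative
```

With this helper, "+4%" and "104%" describe the same adjustment, as do "-2%" and "98%". Checking values before they are placed into SSML makes it easier to catch out-of-range settings early.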

The following example shows how the vocal tract length can be modified, using Joanna’s voice:

<speak>
This is my original voice, without any modifications.
<amazon:effect vocal-tract-length="+15%"> Now, imagine that I am much bigger. </amazon:effect>
<amazon:effect vocal-tract-length="-15%"> Or, perhaps you prefer my voice when I'm very small? </amazon:effect>
You can also control the timbre of my voice by making more minor adjustments.
<amazon:effect vocal-tract-length="+10%"> For example, by making me sound just a little bigger. </amazon:effect>
<amazon:effect vocal-tract-length="-10%"> Or instead, making me sound only somewhat smaller. </amazon:effect>
</speak>
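
One way to turn SSML like this into audio is to send it to Amazon Polly's SynthesizeSpeech operation. The sketch below uses the AWS SDK for Python (boto3) and assumes that the SDK is installed and that AWS credentials and a region are already configured. The helper names and the shortened SSML string are chosen here for illustration.

```python
def build_polly_request(ssml: str, voice_id: str = "Joanna") -> dict:
    """Keyword arguments for Amazon Polly's SynthesizeSpeech operation."""
    return {
        "Text": ssml,
        "TextType": "ssml",  # tell Polly the text is SSML, not plain text
        "VoiceId": voice_id,
        "OutputFormat": "mp3",
    }


def synthesize_to_file(ssml: str, path: str, voice_id: str = "Joanna") -> None:
    """Call Amazon Polly and write the returned audio stream to `path`."""
    # Imported here so that build_polly_request works without the SDK.
    import boto3

    polly = boto3.client("polly")
    response = polly.synthesize_speech(**build_polly_request(ssml, voice_id))
    with open(path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())


# A shortened version of the SSML sample, using Joanna's voice.
SSML = (
    "<speak>This is my original voice, without any modifications. "
    '<amazon:effect vocal-tract-length="+15%">'
    "Now, imagine that I am much bigger."
    "</amazon:effect></speak>"
)
```

Calling synthesize_to_file(SSML, "joanna_bigger.mp3") would then write the synthesized audio to a local file. The output file name is only an example.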

Combining multiple tags

You can combine the vocal-tract-length SSML tag with any other SSML tag that is supported by Amazon Polly. Because vocal tract length and pitch are closely connected in nature, you might get the best results by changing the vocal tract length together with the pitch (by applying the <prosody> tag with its pitch attribute).

<speak>
The pitch and timbre of a person's voice are connected in human speech.
<amazon:effect vocal-tract-length="-15%"> If you are going to reduce the vocal tract length, </amazon:effect>
<amazon:effect vocal-tract-length="-15%"> <prosody pitch="+20%"> you might consider increasing the pitch as well. </prosody></amazon:effect>
<amazon:effect vocal-tract-length="+15%"> If instead you choose to lengthen the vocal tract, </amazon:effect>
<amazon:effect vocal-tract-length="+15%"> <prosody pitch="-10%"> you might also want to lower the pitch. </prosody></amazon:effect>
</speak>
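
If you build SSML like this programmatically, the nesting can be produced by a small helper. The following is a minimal sketch with hypothetical function names. It nests the prosody tag inside the amazon:effect tag, as in the sample above, and escapes the spoken text so that characters such as & do not break the markup.

```python
from xml.sax.saxutils import escape


def with_timbre_and_pitch(text: str, vocal_tract_length: str, pitch: str) -> str:
    """Wrap `text` in a prosody tag nested inside a vocal-tract-length effect."""
    return (
        f'<amazon:effect vocal-tract-length="{vocal_tract_length}">'
        f'<prosody pitch="{pitch}">{escape(text)}</prosody>'
        "</amazon:effect>"
    )


def speak(*fragments: str) -> str:
    """Join already-tagged fragments into one SSML document."""
    return "<speak>" + " ".join(fragments) + "</speak>"


# Shorter vocal tract with raised pitch, then longer tract with lowered pitch.
ssml = speak(
    with_timbre_and_pitch("You might consider increasing the pitch.", "-15%", "+20%"),
    with_timbre_and_pitch("You might also want to lower the pitch.", "+15%", "-10%"),
)
```

Keeping the two settings together in one call makes it harder to forget a closing tag when many combinations are generated.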

We encourage you to experiment with different combinations of vocal-tract-length and pitch settings to find the voice quality that best suits your needs. To assist you in this exploration, we’re providing the following matrix of samples that you can download and listen to. You’ll note that these samples range from very lifelike speech to more character-like speech.
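
For your own experiments, a grid of settings can be generated in a few lines. This is a sketch with a hypothetical function name. The setting values are simply the ones that appear in the examples in this post, not the values used in the sample matrix.

```python
from itertools import product
from typing import Dict, Tuple
from xml.sax.saxutils import escape

# Values taken from the SSML examples in this post.
VOCAL_TRACT_LENGTHS = ["-15%", "-10%", "+10%", "+15%"]
PITCHES = ["-10%", "+20%"]


def sample_matrix(sentence: str) -> Dict[Tuple[str, str], str]:
    """Map each (vocal-tract-length, pitch) pair to a complete SSML document."""
    matrix = {}
    for vocal_tract_length, pitch in product(VOCAL_TRACT_LENGTHS, PITCHES):
        matrix[(vocal_tract_length, pitch)] = (
            "<speak>"
            f'<amazon:effect vocal-tract-length="{vocal_tract_length}">'
            f'<prosody pitch="{pitch}">{escape(sentence)}</prosody>'
            "</amazon:effect>"
            "</speak>"
        )
    return matrix


samples = sample_matrix("This is one of many possible voices.")
```

Each value in samples is a standalone SSML document, so you can synthesize every combination and compare the results by ear.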

Vocal-Tract-Length and Pitch Sample Matrix (.ppt)

If you have any questions, please leave them in the comments.