Vocal resonances and broad band excitation

This site introduces a technique developed in this lab for measuring some important acoustic properties of the vocal tract non-invasively and in real time, while the owner of the vocal tract is speaking or singing. We use it as a research tool, but we have also demonstrated its use as a speech trainer. You may wish to read Introduction to vocal tract acoustics before continuing.

Existing technologies used in speech pathology and speech trainers to provide visual feedback from the speech sound are inherently limited in precision and practicality. Even the most advanced speech recognition systems still mistake words, which shows how limited they are as precise measures of pronunciation. The basic problem is that the speech signal alone does not contain enough information to allow us to work out, quickly and precisely, the configuration of the vocal tract. This is not a problem for understanding speech, but it may be a problem in learning precise pronunciation. Our approach is therefore to introduce a signal with more information in the frequency domain.

Our technology is called Real-time Acoustic response by Vocal tract Excitation or RAVE. In model experiments using the laboratory prototype, we have shown that one or two hours' training using visual feedback of some key features of the acoustical response of a subject's vocal tract improves the accuracy and intelligibility of pronunciation of foreign phonemes by monolingual adults.

How it works:

We inject into the vocal tract an acoustic current which is synthesised to give high resolution frequency information over the frequency range of interest. We then measure the impedance of the vocal tract in parallel with the external field using the response to this excitation signal.
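As a rough illustration of what a signal "synthesised to give high resolution frequency information" might look like, the sketch below builds one period of a sum of equal-amplitude sinusoids on a regular frequency grid. It is not the lab's synthesis code: the 200 to 4500 Hz range, the 12 kHz sample rate and the quadratic phase rule are all assumptions made for the example (the 5 Hz spacing is the value quoted for the schematic later on this page), and the function names are hypothetical.

```python
import math

def multisine(f_low_hz, f_high_hz, spacing_hz, sample_rate_hz):
    """One period of a sum of equal-amplitude sinusoids on a regular grid.

    Every component is a multiple of spacing_hz, so the waveform repeats
    every 1 / spacing_hz seconds and can be looped. The sample rate is
    assumed to be an integer multiple of spacing_hz.
    """
    first = math.ceil(f_low_hz / spacing_hz)
    last = math.floor(f_high_hz / spacing_hz)
    freqs = [k * spacing_hz for k in range(first, last + 1)]
    n_samples = round(sample_rate_hz / spacing_hz)
    n_comp = len(freqs)
    # Assumption: quadratic phases, chosen only so that the components
    # do not all peak at the same instant.
    phases = [-math.pi * k * (k + 1) / n_comp for k in range(n_comp)]
    signal = []
    for n in range(n_samples):
        t = n / sample_rate_hz
        signal.append(sum(math.sin(2 * math.pi * f * t + p)
                          for f, p in zip(freqs, phases)))
    return freqs, signal

def component_amplitude(signal, freq_hz, sample_rate_hz):
    """Amplitude of one frequency component, by direct correlation."""
    re = sum(x * math.cos(2 * math.pi * freq_hz * n / sample_rate_hz)
             for n, x in enumerate(signal))
    im = sum(x * math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
             for n, x in enumerate(signal))
    return 2.0 * math.hypot(re, im) / len(signal)

# Hypothetical parameters, for illustration only.
freqs, excitation = multisine(200.0, 4500.0, 5.0, 12000.0)
print(len(freqs), "components,", len(excitation), "samples per period")
```

Checking `component_amplitude(excitation, 1000.0, 12000.0)` should give a value very close to 1, while a frequency outside the band, such as 100 Hz, should give a value very close to 0. That is the sense in which such a signal carries evenly spread information across the range of interest.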

graph showing voice harmonics and resonances independently

In this figure, I pronounce the vowel in 'heard'. The sharp vertical peaks are the harmonics of my voice. The broad signal shows the response of my vocal tract to the acoustic current signal being injected at the lips.

For this vowel, my vocal tract behaves rather like a cylinder about 170 mm long, nearly closed at the vocal folds and open at the mouth. A cylinder of length L, closed at one end, has resonances at f0 = v/(4L), 3f0, 5f0 etc, where v is the speed of sound. (See pipes and harmonics.) So we see resonances at about 0.5, 1.5, 2.5, 3.5 and 4.5 kHz, which appear as the peaks in the smooth curve in this figure. When I pronounce the vowel in 'had', I open my mouth wider, so the tract is no longer cylindrical, but flared at the open end, a bit like the flare and bell on a brass instrument. One of the effects of this shape in a brass instrument is to raise the frequencies of the resonances, especially the lower ones. (In a related example, conical pipes have resonances at higher frequencies than do cylindrical ones. See this link for an explanation.)
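The closed-cylinder estimate above is easy to check numerically. The short sketch below evaluates the odd multiples of v/(4L) for L = 170 mm. The speed of sound used, 343 m/s (dry air at about 20 °C), is an assumption; the warm, humid air in a real vocal tract gives a slightly higher value. The function name is hypothetical.

```python
def closed_pipe_resonances(length_m, speed_of_sound_m_s=343.0, count=5):
    """Resonances of an ideal cylinder closed at one end.

    These are the odd multiples of f0 = v / (4 L).
    """
    f0 = speed_of_sound_m_s / (4.0 * length_m)
    return [(2 * n + 1) * f0 for n in range(count)]

# 170 mm tract length from the text; 343 m/s is an assumed speed of sound.
resonances = closed_pipe_resonances(0.170)
print([round(f) for f in resonances])
```

With these values the first resonance comes out just above 500 Hz and the others at three, five, seven and nine times that, consistent with the peaks near 0.5, 1.5, 2.5, 3.5 and 4.5 kHz.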

From this response we can readily determine the resonances of the vocal tract, independently of the speech signal. The resonant frequencies are interesting for fundamental acoustical phonetic research but, if we extract them in real time, they can be used to drive a cursor for speech training. This is how we do it in the real-time version.

diagram showing how to extract vocal tract resonances in real time

Schematic diagram. (a) The spectrum of the speech signal alone. This male voice has harmonic partials spaced at the pitch frequency, 126 Hz. (b) The injected signal has frequencies spaced at 5 Hz, whose amplitudes are calibrated (in this case) using the radiation field outside the speaker's mouth. (c) The sum of the speech signal and the broad band signal (including the effects of the resonances) goes from the microphone to the ADC. The speech signal is used to measure pitch and amplitude; then the harmonic components below 1 kHz are removed. (d) The resonances are detected from the remaining interpolated signal. Similarly, the broad band signals may be removed to leave just the speech harmonics. In the real-time version of the device used for speech training, the resonance frequencies are used to position the cursor on the vowel plane (see below). Notice that the signal:noise ratio in these figures is lower than in the preceding figure. This is a consequence of making the measurements rapidly.
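Steps (c) and (d) can be sketched on a synthetic magnitude spectrum. This is only a toy illustration of the idea, not the device's processing: the Lorentzian resonance shape, its 50 Hz half-width, the harmonic level, the one-bin guard band, the linear interpolation and the resonance frequencies (taken from the cylinder estimate earlier on this page) are all assumptions, and every function name is hypothetical.

```python
SPACING_HZ = 5.0   # spacing of the injected components (from the caption)
PITCH_HZ = 126.0   # pitch of the voice (from the caption)

def lorentzian(f_hz, centre_hz, half_width_hz):
    return 1.0 / (1.0 + ((f_hz - centre_hz) / half_width_hz) ** 2)

def synthetic_spectrum(resonances_hz, f_max_hz=3000.0,
                       half_width_hz=50.0, harmonic_level=5.0):
    """Broad tract response on the 5 Hz grid, plus narrow voice harmonics."""
    n_bins = int(f_max_hz / SPACING_HZ) + 1
    spectrum = [sum(lorentzian(i * SPACING_HZ, r, half_width_hz)
                    for r in resonances_hz) for i in range(n_bins)]
    k = 1
    while k * PITCH_HZ <= f_max_hz:
        spectrum[round(k * PITCH_HZ / SPACING_HZ)] += harmonic_level
        k += 1
    return spectrum

def remove_harmonics(spectrum, up_to_hz, guard_bins=1):
    """Drop bins at and next to each voice harmonic, then interpolate linearly."""
    removed = set()
    k = 1
    while k * PITCH_HZ <= up_to_hz:
        centre = round(k * PITCH_HZ / SPACING_HZ)
        for i in range(centre - guard_bins, centre + guard_bins + 1):
            if 0 < i < len(spectrum) - 1:
                removed.add(i)
        k += 1
    cleaned = list(spectrum)
    for i in sorted(removed):
        lo = i - 1
        while lo in removed:
            lo -= 1
        hi = i + 1
        while hi in removed:
            hi += 1
        t = (i - lo) / (hi - lo)
        cleaned[i] = spectrum[lo] + t * (spectrum[hi] - spectrum[lo])
    return cleaned

def find_resonances(spectrum, threshold=0.5):
    """Frequencies of local maxima that rise above the threshold."""
    return [i * SPACING_HZ for i in range(1, len(spectrum) - 1)
            if spectrum[i] > threshold
            and spectrum[i] > spectrum[i - 1]
            and spectrum[i] > spectrum[i + 1]]

raw = synthetic_spectrum([500.0, 1500.0, 2500.0])
# Assumption: this toy voice has strong harmonics over the whole band, so
# they are removed everywhere; pass 1000.0 to use the cutoff in the caption.
cleaned = remove_harmonics(raw, up_to_hz=3000.0)
print(find_resonances(cleaned))
```

Running `find_resonances` on `raw` instead picks out every voice harmonic as a peak, which is why the harmonics have to be removed before the resonances are located. Where a harmonic happens to sit on top of a resonance, the interpolation can shift the detected peak by a bin or so.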

How it looks:

screen dump of real time display

This is a screen dump of the feedback display in the current speech trainer device, set up with targets from Australian English. The background ellipses are measurements of the vowels of 33 Australian men, with the mean values for each vowel at the centre of its ellipse. The semi-axes are the standard deviations in R1 and R2. These or other areas can be used as targets in speech training. A cursor on the monitor (the cross at (1190,530)) shows the current configuration of the subject's own vocal tract. Initially, subjects 'steer' the motion of the cursor by consciously controlling jaw and tongue position. Speakers of the language displayed can 'aim' towards one of the vowels shown. After some practice, however, it becomes nearly as automatic as using a joystick or a mouse: one just 'makes it go' where one wants, without thinking of the muscular details. In other words, a visual feedback loop is unconsciously used to train articulation.
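One way to picture the target ellipses is as regions of the (R1, R2) plane within one standard deviation of a vowel's mean, with the axes of each ellipse aligned to R1 and R2. The sketch below tests whether a cursor position falls inside such a target and finds the nearest one. The class, the function and all the numbers are hypothetical and are not taken from the Australian English data shown in the display.

```python
import math
from dataclasses import dataclass

@dataclass
class VowelTarget:
    label: str
    mean_r1_hz: float
    mean_r2_hz: float
    sd_r1_hz: float   # semi-axis of the ellipse along R1
    sd_r2_hz: float   # semi-axis of the ellipse along R2

    def normalised_distance(self, r1_hz, r2_hz):
        """Distance from the mean, in units of standard deviations."""
        return math.hypot((r1_hz - self.mean_r1_hz) / self.sd_r1_hz,
                          (r2_hz - self.mean_r2_hz) / self.sd_r2_hz)

    def contains(self, r1_hz, r2_hz):
        """True if the point lies on or inside the target ellipse."""
        return self.normalised_distance(r1_hz, r2_hz) <= 1.0

def nearest_target(targets, r1_hz, r2_hz):
    return min(targets, key=lambda t: t.normalised_distance(r1_hz, r2_hz))

# Hypothetical targets and cursor position, for illustration only.
targets = [
    VowelTarget("vowel A", 500.0, 1500.0, 50.0, 150.0),
    VowelTarget("vowel B", 750.0, 1700.0, 70.0, 160.0),
]
cursor_r1_hz, cursor_r2_hz = 530.0, 1400.0
best = nearest_target(targets, cursor_r1_hz, cursor_r2_hz)
print(best.label, best.contains(cursor_r1_hz, cursor_r2_hz))
```

Scaling each axis by its own standard deviation means that a miss of 50 Hz in R1 counts for more than a miss of 50 Hz in R2 when the spread in R2 is larger, which matches the elongated shape of the ellipses.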

Does it work?

For a report of a trial experiment using a prototype system as a language trainer, see our papers:


Joe Wolfe / J.Wolfe@unsw.edu.au
phone 61-2-9385 4954 (UT + 10, +11 Oct-Mar)