Device turns gestures into song

Researchers at the University of British Columbia demonstrate a gesture-controlled artificial speech system that's good enough to sing.



Researchers have created a system that converts hand gestures into speech, and yes, into song as well. Although the system isn't yet ready for a shot at "American Idol," its name — Digital Ventriloquized Actor, or DiVA — gives you an idea where the technology is going.


"It is a singing synthesizer," said Sidney Fels, director of the University of British Columbia's Media and Graphics Interdisciplinary Center, or MAGIC. Fels explained how DiVA does its thing today in Vancouver at the annual meeting of the American Association for the Advancement of Science.

With gestures of the right hand, DiVA's operator controls the pitch and the character of the sounds. Closed-hand gestures produce consonants. Open-hand gestures produce vowels. Meanwhile, the left hand is fitted with finger contacts that create stop sounds such as buh. "We designed a gestural space that mimics the vocal tract," Fels explained.
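For readers who think in code, the mapping Fels describes can be sketched roughly as follows. This is only an illustration, not DiVA's actual software: the `Frame` fields, the 0.5 openness threshold, the 110-440 Hz pitch range and the `thumb_index` contact name are all assumptions made up for the example. Only the open/closed split and the "buh" stop sound come from the description above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical contact-to-stop table; only "buh" is mentioned in the article.
STOP_SOUNDS = {"thumb_index": "buh"}

# Assumed cutoff between a "closed" and an "open" right hand.
OPEN_THRESHOLD = 0.5


@dataclass
class Frame:
    """One snapshot of (hypothetical) glove sensor readings."""
    right_openness: float               # 0.0 = closed fist, 1.0 = fully open (assumed scale)
    pitch_control: float                # 0.0..1.0, assumed normalized right-hand pitch input
    left_contact: Optional[str] = None  # name of a touched left-hand contact, if any


def pitch_hz(control: float, low: float = 110.0, high: float = 440.0) -> float:
    """Map a 0..1 control value onto an exponential (musical) frequency scale."""
    control = min(max(control, 0.0), 1.0)  # clamp out-of-range readings
    return low * (high / low) ** control


def classify(frame: Frame) -> Tuple[str, float]:
    """Return (sound class, pitch in Hz) for one frame of glove data."""
    pitch = pitch_hz(frame.pitch_control)
    # A left-hand finger contact takes priority and triggers a stop sound.
    if frame.left_contact in STOP_SOUNDS:
        return ("stop:" + STOP_SOUNDS[frame.left_contact], pitch)
    # Otherwise the right hand decides: open hand = vowel, closed hand = consonant.
    kind = "vowel" if frame.right_openness >= OPEN_THRESHOLD else "consonant"
    return (kind, pitch)
```

Calling `classify(Frame(right_openness=0.9, pitch_control=0.5))` would report a vowel in the middle of the assumed pitch range. A real system would blend continuously between sounds rather than sorting each frame into a single bucket.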

The result is eerie: In the video above, you'll see a singer accompanying herself with the DiVA's voice. (I'm not ready to put it on my playlist just yet.) And in a series of videos, DiVA operator Sageev Oore synthetically sings the alphabet song and recites Dr. Seuss' "Green Eggs and Ham" verse as if he were playing two characters. (Which is kind of like Gollum talking to himself in the "Lord of the Rings" movies.)

If DiVA goes commercial, it could provide a new way for people with speech disabilities to make themselves heard. But why go to all that trouble when there are other speech synthesizers out there, including the electronic voice made famous by physicist Stephen Hawking?

"The problem with that is, you won't be able to sing. You won't be able to be expressive," Fels said.

One of the intended applications for the technology is to create new types of singing musical instruments that can be played in real time. Fels said there have been five compositions written for DiVA so far, played by musicians trained to use the device. "It takes about 100 hours for a performer to learn how to speak and use the system," Fels said in a news release.

The gloves, the volume-control foot pedals, the magnetic-sensor system and other components that bring DiVA to life can get rather unwieldy. "It's a backpack full of equipment," Fels told journalists. "I wouldn't walk around the restaurant and order sushi with it." But Fels and his MAGIC team are developing a version that can be operated with a computer tablet.

That hints at what may be more important applications in the longer run. The DiVA project got started as a way to teach people how to control a complex system with gestures and give them auditory feedback to let them know when they're doing the gestures right.

"Other possible applications for this discovery are interfaces to make certain tasks easier, such as controlling cranes or other heavy machinery," Fels said. It's also conceivable that gesture-based training might offer an alternative way to learn and practice foreign languages, particularly Asian dialects that depend on precise tonal control.

Gesture-controlled input devices ranging from Nintendo's Wii to Microsoft's Kinect have already revolutionized the gaming industry. Will DiVA, or other devices like it, open up a whole new frontier for the field? Does the future belong to gestures? Feel free to weigh in with your comments below.


Since I mentioned Kinect, I should note that msnbc.com is a joint venture involving Microsoft as well as NBC Universal.

Alan Boyle is science editor at msnbc.com. Connect with the Cosmic Log community by "liking" the log's Facebook page, following @b0yle on Twitter or adding the Cosmic Log Google+ page to your circles. You can also check out "The Case for Pluto," my book about the controversial dwarf planet and the search for new worlds.

Discuss this post

While I like the new application of this type of device for speech, the idea of the DiVA to me is not a new discovery. This reminds me of the Theremin (etherphone), a musical instrument patented in 1928.

The Theremin is different in that it produces sounds similar to this device's (minus the consonant and vowel sounds), but without any gloves, just using gestures. "The theremin is almost unique among musical instruments in that it is played without physical contact. The musician stands in front of the instrument and moves his or her hands in the proximity of two metal antennas. The distance from one antenna determines frequency (pitch), and the distance from the other controls amplitude (volume)." (http://en.wikipedia.org/wiki/Theremin)

I've included two vintage videos of a Theremin so you can see it in action. Note the gestures that are needed to get the notes just right.

The Swan:

http://www.youtube.com/watch?v=YM0TsKH3HBQ&feature=related

Moonlight:

http://www.youtube.com/watch?v=tlESVpRxIhk&feature=related

  • 2 votes
Reply#1 - Sat Feb 18, 2012 7:50 PM EST

You can think of this as a theremin that talks ... which is more complex than your run-of-the-mill theremin. That being said, even a run-of-the-mill theremin is pretty cool. In fact, the effect is used in the Cosmic Log theme song, written by yours truly and performed by "rocker scientist" James Emley:

http://msnbcmedia.msn.com/i/MSNBC/Components/CSS/Audio/2011/cosmic.wav

  • 3 votes
Reply#2 - Sat Feb 18, 2012 8:12 PM EST

I was just going to say that it sounds like a tarted-up theremin. A friend of mine is quite the theremin prodigy - he can almost make it talk.

    #2.1 - Sun Feb 19, 2012 7:58 AM EST

    this is cool, but i don't understand why nobody really mentions how far-reaching this can be...this could be a boon for deaf people like me...i'm tired of always asking for pen and paper just to write down something like "two coffees, please...one with cream and sugar, the other black"...etc. etc. of course it's too unwieldy now but maybe someday they'll improve on it so much that it could be woven into our shirtsleeves or gloves or whatever...

      Reply#3 - Sun Feb 19, 2012 1:29 AM EST

      Kind of reminds me of the device used by the gorilla in the movie "Congo" that enabled her to synthetically verbalize her sign language gestures. Just goes to show even mediocre sci-fi films can occasionally be prescient.

        Reply#4 - Sun Feb 19, 2012 7:45 AM EST

        I would think that it would, as the article states, be a real boon to people trying to learn foreign languages. Having tried unsuccessfully to learn Vietnamese and, with limited success, trying to teach Japanese speakers the difference between "l" and "r", I can see that inflection could be more easily taught if you had external indicators, such as hand positions, that assisted you. Maybe hands can learn what ears can miss.

          Reply#5 - Sun Feb 19, 2012 11:10 AM EST

          A more direct and, in my opinion, useful application, would be to use it to interpret sign-language. Imagine a cell-phone size device connected to a glove that tells us non sign-readers what the signer is saying.

          • 1 vote
          Reply#6 - Sun Feb 19, 2012 1:39 PM EST

          now that's an app!

            #6.1 - Sun Feb 19, 2012 1:44 PM EST

            Well, it has been shown that the Kinect can be hacked to read sign language with at least 98 percent accuracy. So this may be a better hands-free option than having to wear a device or glove. If the Kinect can read sign language, then it is just a matter of having the generated text read aloud by any text-to-speech application.

            That said, the practical trade-off is hard to judge: the DiVA is a glove device that must be manufactured, purchased and worn by a person, whereas a sensor system like the Kinect could be applied to anyone without special equipment but has to be a certain distance away from the individual to be able to "see" the entire field of view.

            For the application mentioned above, a device that can read and interpret sign language for any observer, I would think the Kinect (the new commercial near-field one that MS just released) could be shrunk down and placed in a cell phone instead of the regular cell phone camera, so that the phone has the Kinect's structured-light system that can assign distances to pixels. Then you would have a portable device that identifies the signs and relays them to someone, either aloud or as text on the screen, exactly as the person is signing them.

            Actually, Microsoft, if you are reading this...why not? Take all the Kinect sensors and shrink them down into a cell phone (hey, it is a good use of a Windows phone and would definitely make your phones stand out against Apple and Android phones). The apps that you could make, from a cell phone that can read sign language, to one that can scan objects or even be used as a radar type system for object avoidance for people that are blind would be great.

            Kinect used for Sign Language:

            http://www.youtube.com/watch?v=qFH5rSzmgFE

            http://www.msnbc.msn.com/id/42419609/ns/technology_and_science-games/t/new-hack-lets-xbox-kinect-read-sign-language/#.T0GoVly42hQ

            Kinect used to Navigate for Visually Impaired:

            http://www.youtube.com/watch?v=l6QY-eb6NoQ

            Kinect for Object Identification:

            http://www.youtube.com/watch?v=fQ59dXOo63o

            Now imagine all the Kinect sensors placed into a cell phone...

            Oh, the Things you could do!

            • 2 votes
            #6.2 - Sun Feb 19, 2012 9:13 PM EST

            very interesting...i recently had an idea for using kinect as a sign language game, or as a way to teach sign language...nice to see the concept at work...

              #6.3 - Sun Feb 19, 2012 9:25 PM EST

              Oh, I just read Joe420er's post. I agree totally that if the development of this product were to reach production, this would be a useful application. Who knows, maybe the demand for such a device would be "profitable" and help pay for future-generation applications.

                Reply#7 - Sun Feb 19, 2012 1:47 PM EST

                There was another device called the Voder, unveiled in 1939, which was an early speech synthesizer. It used pedals and switches to create vowels, consonants and inflections.

                  Reply#8 - Sun Feb 19, 2012 11:47 PM EST