Talk with your hands
April 18, 2006 7:06 AM
In the US there are three major forms of manual communication: ASL (American Sign Language), PSE (Pidgin Signed English, or Contact), and SEE (Signing Exact English). Translating from English to any one of these is hard enough. That's not stopping this team from taking on the added challenge of machine translation. I can't imagine them doing half as well as this man's efforts at live-translating rap, switching between all three variants (video, with voice over).
That guy was hilarious. I almost fell out of my chair when he started signing English.
posted by Baby_Balrog at 7:36 AM on April 18, 2006
Machine translation of ASL is a nice idea, but I doubt it will work. For one thing, no matter how sophisticated the face cues are, they cannot match human facial expression, which is essential to communication in ASL. Signs have different meanings based on what are called non-manual markers (NMMs), including mouth shape, facial expression, body posture, and so forth.
Take the word "aggressive" for example. When signed properly in ASL, it includes subtle shifts in facial expression, including a narrowing of the eyes and a slight frown at the lips. Unless, that is, you want to indicate someone was very aggressive - in which case the eyes would widen. But how can a machine know what emotional gravity is being put behind the words?
Their approach to computer generated fingerspelling is inadequate, as it does not seem to allow for lexicalized spelling or easy transitions between letters.
It's a fascinating idea, but it's not going to replace human interpreters. (For the record, I am one of those human interpreters.)
By the way, the rap video is by Keith Wann, a well-known ASL comedian.
posted by etoile at 7:53 AM on April 18, 2006
Two thumbs up!
posted by Zombie Dreams at 10:39 AM on April 18, 2006
(For the record, I am one of those human interpreters.)
Yeah, that was my guess as soon as you started talking about facial cues. My wife used to make her living as an ASL interpreter in the days before carpal tunnel syndrome put an end to all that.
posted by Doohickie at 12:36 PM on April 18, 2006
Today I got to interpret the first few minutes of "Baby Got Back" from English to ASL in a high school. I was ready to go the distance, but the tape the student was playing cut off. It's not that I'm that good an interpreter; it just so happens I know all the words and had practiced it at home. Nonetheless, I must stress it was an interpretation and not a 1:1 translation. Sir Mix-a-Lot would have laughed. I did.
posted by eccnineten at 5:14 PM on April 18, 2006 [1 favorite]
ShortEssayFilter:
As it turns out, etoile, there is a fairly sophisticated way to formally describe facial expressions: the Facial Action Coding System, based on the fairly-well-known musculature of the face. Certainly it may be harder to synthesize those images than it would be for handshapes, but it looks possible.* Anyway, a good description language such as FACS goes a long way toward solving the rendering problem, and the rest will probably be taken care of, in the long run, by faster processing and vastly increased storage. (Much the same way the crude efforts of early voice synthesis [.mp3 link] have given way to sophisticated "voice fonts" duplicating well-known speakers.)
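(To put some flesh on that: here's a rough Python sketch of how a sign's non-manual markers might be stored as FACS Action Units. The AU numbers are standard FACS codes, but the mapping to the "aggressive" sign and the intensity values are purely my own illustration, not anything from the linked project.)

# Hypothetical encoding of non-manual markers as FACS Action Units.
# AU numbers are real FACS codes; intensities run 0.0-1.0.
# The association with any particular ASL sign is illustrative only.
AGGRESSIVE_NEUTRAL = {
    7: 0.6,   # AU7  lid tightener (narrowed eyes)
    15: 0.3,  # AU15 lip corner depressor (slight frown)
}
AGGRESSIVE_INTENSE = {
    4: 0.4,   # AU4  brow lowerer
    5: 0.7,   # AU5  upper lid raiser (widened eyes)
    15: 0.5,  # AU15 lip corner depressor
}

def pick_nmm(intensifier_score):
    # Choose a marker bundle from some upstream estimate of emphasis.
    return AGGRESSIVE_INTENSE if intensifier_score > 0.5 else AGGRESSIVE_NEUTRAL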
The translation problem (But how can a machine know what emotional gravity is being put behind the words?) is, to my mind, more interesting. And hardly unique to ASL: If I translate 'He broke it' from English into Hebrew, do I use the light shavar 'break' or the intensive shibber 'shatter'? (Credit where credit is due: This example from J. Weingreen, A Practical Grammar for Biblical Hebrew, 2nd ed., p. 105.) When I translate 'Sit down!' into Russian, do I use a bare imperative, or do I add a pronoun for more force? For that matter, do I use the familiar singular or the formal/plural? How does a computer know?
Setting aside for the moment the philosophical question of whether computers can produce 'true' language without ethical agency (slithy popup!), one solution is to check for collocates. As another problem for an example, English 'fish' can be Spanish pez (fish as animal) or pescado (fish as meat; this example from the just-linked essay; compare to 'cow'/'beef'). Given a corpus, you (or rather, your computer) can find words that characteristically occur near 'fish'; given translations, you/it can determine whether they reliably predict a particular translation. In our example, the British National Corpus tells me 'fish' collocates with 'swimbladder' and 'jawless', which would probably indicate pez, as well as with 'gefilte' and 'tunny', which would come down favoring pescado. (That's "tuna" for Americans, BTW.) Or, returning to your original example, it seems like 'very' or extremes of spoken volume are probably predictors of intensified signs, whereas hedges such as 'kind of' or 'seems' probably predict the reverse. This is information that, assuming it were verified, could be accessible to a machine translation system, and could inform its selection of output gestures. Granted, no single collocate test is perfect—it's in the nature of statistical and exemplar-based models to be imperfect—but if you've got enough of them, you can often be pretty confident. (Compare to statistical spam filters: They don't need a [paid human classifier of phenomena] to know which way the wind blows.)
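(A toy version of that collocate test, in Python, with counts I made up on the spot; a real system would harvest them from an aligned corpus, and the names here are mine, not any actual toolkit's:)

from collections import Counter

# Made-up collocate counts standing in for real corpus statistics.
COLLOCATE_VOTES = {
    "pez":     Counter({"swimbladder": 12, "jawless": 7, "tank": 5}),
    "pescado": Counter({"gefilte": 9, "tunny": 4, "grilled": 11}),
}

def choose_translation(sentence, candidates=COLLOCATE_VOTES):
    # Score each candidate by how many of its collocates show up in the context.
    words = sentence.lower().split()
    scores = {cand: sum(votes[w] for w in words) for cand, votes in candidates.items()}
    return max(scores, key=scores.get)

print(choose_translation("the jawless fish inflated its swimbladder"))    # -> pez
print(choose_translation("she served the grilled fish with gefilte loaf"))  # -> pescado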
That said, I have to agree that this kind of thing isn't likely to put human translators out of business anytime soon, much less simultaneous interpreters. Machine translation of formal written texts in well-defined contexts is a bleeding-edge technology, and never mind asking a computer to translate rhyming lyrics, conversations, or poetry—things that even talented human translators do in many different ways.
* One rendering scheme that comes to mind: Represent each facial Action Unit as a sum of eigenfaces, and see whether you can get a good image of simultaneous Action Units by a linear combination of their eigenface descriptions. I have no idea how likely this is to work, mind you.
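(In code, assuming you already had an eigenface basis and a coefficient vector per Action Unit, the combination step is just a weighted sum; random arrays stand in for real data here, purely to show the shape of the computation:)

import numpy as np

rng = np.random.default_rng(0)
eigenfaces = rng.normal(size=(50, 64 * 64))  # 50 basis faces, flattened 64x64 images
mean_face = rng.normal(size=64 * 64)

# Each Action Unit as a coefficient vector over the eigenface basis (stand-in values).
au_coeffs = {
    5:  0.1 * rng.normal(size=50),   # AU5  upper lid raiser
    15: 0.1 * rng.normal(size=50),   # AU15 lip corner depressor
}

def render_expression(intensities):
    # Linearly combine per-AU coefficients, then reconstruct a face image.
    combined = sum(weight * au_coeffs[au] for au, weight in intensities.items())
    return (mean_face + combined @ eigenfaces).reshape(64, 64)

frame = render_expression({5: 0.7, 15: 0.5})  # widened eyes plus a frown, perhaps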
posted by eritain at 3:52 PM on April 19, 2006
Thank you so much for that detail, eritain! I can see how it would eventually be possible to have an electronic interpreter, at least from a technological standpoint. The cultural standpoint, though, will probably take more time to come around. SigningAvatar is just what it sounds like, but it isn't readily understood on the first try as a human interpreter is. It is possible to understand what SigningAvatar says, but it takes getting used to and thus isn't effective for real-world applications...at least not yet.
Another interesting piece of technology, by the way, is RALPH. This is a robotic hand that deaf-blind persons can use to communicate through the Rochester Method (in which everything is fingerspelled).
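(To illustrate what driving such a hand involves in software terms, a tiny Python sketch of Rochester-Method-style output, spelling a message one letter at a time. The send_handshape function is entirely made up; I have no idea what RALPH's actual control interface looks like.)

import time

def send_handshape(letter):
    # Stand-in for a real device driver; here it just prints the command.
    print("handshape ->", letter)

def fingerspell(message, letter_pause=0.4, word_pause=0.8):
    # Spell everything out letter by letter, with a brief hold between words.
    for word in message.upper().split():
        for letter in word:
            if letter.isalpha():
                send_handshape(letter)
                time.sleep(letter_pause)
        time.sleep(word_pause)

fingerspell("hello world")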
posted by etoile at 5:45 AM on April 20, 2006