Essays in English yield information about other languages
October 1, 2014 9:06 AM
Essays and longer texts written in English can provide interesting insights into the linguistic background of the writer, and about the history of other languages, even dying languages, when evaluated by a new computer program developed by a team of computer scientists at MIT and Israel’s Technion. As told on NPR, this discovery came about by accident, when the new program classified someone as Russian when they were Polish, due to the similarity in grammar between the languages. Researchers realized this could allow the program to re-create language families, and could be applied to people who currently may not speak their original language, allowing some categorization of dying languages. More from MIT, and a link to the paper (PDF, from the 2014 Meeting of the Association for Computational Linguistics).
This seems like the logical conclusion of the Noam Chomsky school of linguistics, which says that everything you need to know about all the languages of the world can be learned by studying English syntax, because they're all basically the same as English below the surface, with some small tweaks.
This just takes that to the next level: instead of doing all that tricky and time-consuming historical linguistics work, you just take people from around the world, get them to learn English, and have computers analyze how they write English.
(Sorry, it *is* very clever and it's interesting and cool that this can be done, I just couldn't resist the snark.)
posted by edheil at 9:42 AM on October 1, 2014 [1 favorite]
Unfortunately, this is another case of the popular media overselling the research. I have no doubt that it's an interesting computational project, but the chances of us being able to classify dying languages this way are slim to none.
There's a host of problems just with the type of data that you would need. Speakers of dying languages are often natively bilingual in a dominant language; they don't make second-language mistakes. They're often disadvantaged and underrepresented online and in print. The linguistic situation in their communities is probably complex, meaning that knowing the linguistic background of a writer from that community requires detailed individual knowledge.
But more fundamentally, even given perfect data, this is classification based on typological or grammatical features, which is generally very unreliable. Many features are shared by languages due to chance or areal proximity, rather than because of an actual genealogical relationship. This could tell you that languages X and Y have a lot in common, but that's not really a basis for classification.
The problem with the NPR coverage is compounded by the fact that computational linguists and linguists don't talk to each other as much as you might think.
posted by Kutsuwamushi at 9:44 AM on October 1, 2014 [9 favorites]
edheil, from the MIT article, the project started as an effort to create software that would identify the original language of someone writing in English, and then provide them language-specific guidance for correcting their writing, but that's where they found the bug that became the feature.
Kutsuwamushi, I agree that NPR oversold the program. Also, that coverage doesn't get into some details that are included in the MIT article, such as that one goal of the project is to help populate the World Atlas of Language Structures (WALS), which is far from complete at this time. It's not a magical panacea for capturing information on dying languages, but another tool to aid in documentation.
posted by filthy light thief at 9:50 AM on October 1, 2014
I agree that NPR oversold the program.
Not only did they oversell it, they either ignored or didn't understand some key claims of the authors, including: "We do not compare our clustering results to genetic groupings, as to our knowledge, there is no firm theoretical ground for expecting typologically based clustering to reproduce language phylogenies." (p.23)
The researchers themselves reject the idea of using this research to recreate language families.
one goal for the project is to help populate the World Atlas of Language Structures (WALS)
But this, on the other hand, is an example of a problem that isn't with the coverage but with the authors, who, as computational linguists, make some claims that linguists might raise an eyebrow at.
I'm extremely skeptical that (a) they will have the type of data needed, and (b) that their results will be robust enough for inclusion. And that's given that they are discussing features of languages whose classification is already known. Since classification is one of the criteria for inclusion on WALS, a method that--as the authors state!--cannot classify languages cannot be used to add more languages to the database.
There is also a lot to be said about how not all features will be represented equally in second-language mistakes, but that's for someone who has more of a background in second-language acquisition than I do.
posted by Kutsuwamushi at 10:12 AM on October 1, 2014 [2 favorites]
It would be interesting to run this on, say, Latin texts written by Greek authors, or the Gospels, which were Greek texts presumably written by native speakers of Aramaic.
posted by empath at 11:15 AM on October 1, 2014 [2 favorites]
posted by acb at 9:24 AM on October 1, 2014