Tantalizing hints of the world beyond the virtual
July 8, 2011 2:13 PM

Languages of the World (Wide Web) — Google researchers graph cross-language links on the web, and "see a surprisingly clear map of Europe and Asia"
posted by blasdelf (24 comments total) 22 users marked this as a favorite
 
This is mindblowing, thanks for posting. In the total absence of links between Russia and "the West" you can still see, if you squint charitably, a relic of the historic fracture between the Eastern and Western Roman empire.
posted by eeeeeez at 2:43 PM on July 8, 2011


Yeah, it validates the idea that Russia, Turkey and Japan pretty much inhabit their own isolated internet spaces. Curiously, though, there was no South Korea on the map.
posted by KokuRyu at 2:54 PM on July 8, 2011 [1 favorite]


Yeah, the gaps are more interesting than the links here. Another telling one is the non-connection between Greek, Turkish and Albanian.
posted by nebulawindphone at 3:15 PM on July 8, 2011 [1 favorite]


Malay and Indonesian are counted as separate languages?

I was also somewhat surprised to see Urdu and Hindi (apparently) have little link to one another; though perhaps this is because they use different scripts.
posted by dhens at 3:16 PM on July 8, 2011


"However, only 38 percent of the pages and 42 percent of sites in our set are English, while it attracts 79 percent of all out-language links from other languages." And fifty-five percent of the off-site links from English pages go to pages in other languages. So Lingua Anglica FTW.

I wonder how (or if) that will change as the Chinese and Hindi webs grow larger.
posted by Kevin Street at 3:29 PM on July 8, 2011


Interesting analysis.

To me the single most surprising thing was that Hindi was such a high-introversion language, both in absolute terms and as an outlier to the size/introversion correlation. If anything I'd expect the opposite given how many vibrant minority languages there are in India and how prominent English is in the country.
posted by strangely stunted trees at 3:59 PM on July 8, 2011


It might be a historical accident, so to speak. If the first sites in India were in English (so they could link to the wider Internet), the English-Indian web may have grown to cover the most common things people go online to do, like communication, posting LOLcats and so on, and captured a majority of users. Then the Hindi web developed later as something much more explicitly cultural and nationalistic, with less need to link to other places.
posted by Kevin Street at 4:10 PM on July 8, 2011 [1 favorite]


But the person who made this knows where all the countries in Europe and Asia are. Did they drag around the circles to make the fit look good? Or are the locations found automatically? (I'm leaning towards the first; there are other drawings of this graph that look just as nice on paper but don't quite resemble real geography as much.) That being said, this is a bit of a nitpick.

Can anyone find a link to that paper from a few years ago where someone did principal component analysis on the genes of Europeans and got something that looked a lot like a map of Europe?
posted by madcaptenor at 4:22 PM on July 8, 2011 [1 favorite]


Hindi may not grow much larger, as India is a multi-ethnic society, and learning English is commonplace. There are, IIRC, 21 mother tongues, as well as the two national languages, English and Hindi. Since there are already ~450 million people who speak English as their first and only language, most of them in first world countries with large percentages of their population online, the path of least resistance is to learn English. It's the network effect in action.

I was surprised to see German as the overwhelming (non-english) destination of choice, tho... I would have thought France would have a larger showing.
posted by Slap*Happy at 4:41 PM on July 8, 2011


But the person who made this knows where all the countries in Europe and Asia are. Did they drag around the circles to make the fit look good? Or are the locations found automatically? (I'm leaning towards the first;

I had the same thought myself
posted by KokuRyu at 5:02 PM on July 8, 2011


I was surprised to see German as the overwhelming (non-english) destination of choice

German porn is pretty awesome.
posted by KokuRyu at 5:02 PM on July 8, 2011 [2 favorites]


German porn is pretty awesome.

I tried looking, but it all seemed pretty scheiße.
posted by jaduncan at 6:49 PM on July 8, 2011 [1 favorite]


The Hindi, Spanish, and Russian speakers all appear to act like they're an apex language, as it were.

Interesting that the Serbs, Croats, and Germans have some interlinking, but the Turks, Greeks, and Armenians have none.
posted by rodgerd at 7:17 PM on July 8, 2011 [2 favorites]


Here's an interesting (but brief) blog post by a linguist musing about the data on Indian languages. He gives roughly the same explanation for high introversion in Hindi websites that Kevin Street does -- that those Hindi-speakers most likely to create websites will also be English-speakers and choose English over Hindi to reach a greater audience -- but is more interested in anomalies like the strong Nepali-Marathi link.

Since Nepali and Marathi are spoken on opposite ends of the subcontinent, very little connects them aside from "the fact that they both are written in Devanagari script (also used for Hindi). Gujarati, Punjabi, and Bengali, on the other hand, are each written in their own scripts (distinct from Devanagari). So I wonder if there is any possibility that the script is creating 'false hits' when the off-site link connections for Nepali and Marathi are being computed."
posted by villanelles at dawn at 8:46 PM on July 8, 2011


"see a surprisingly clear map of Europe and Asia"

Shenanigans. The reason that graph looks like a map is because somebody has hand-positioned the nodes to correspond roughly with country locations. I guarantee you it did not come out of a completely automatic graph layout algorithm looking like that.

So the take-away here is that geographic and cultural proximity of languages leads to higher incidence of inter-language hyperlinks. Which is interesting, but not too surprising.
posted by qxntpqbbbqxl at 10:31 PM on July 8, 2011


KokuRyu: "I was surprised to see German as the overwhelming (non-english) destination of choice

German porn is pretty awesome."

You like grannies, too, eh?
posted by symbioid at 12:27 AM on July 9, 2011


And how many of the Russian pages are on LJ?
posted by symbioid at 12:29 AM on July 9, 2011


Polish affiliation for Esperanto is an endearing historical trait.
posted by Meatbomb at 1:16 AM on July 9, 2011


Polish affiliation for Esperanto is an endearing historical trait.

well yes, since zamenhof was born and raised in present-day poland. they have a cute grotesque statue of him in białystok, right next to a remarkable rolling marble ball fountain.

that hindi shows up so tiny is unremarkable. non-english computer keyboards are rare in india, where the vast majority of people with computer skills are somewhat proficient in english. this is compounded by the messy and vast linguistic space that exists. romanized versions of the regional languages abound.
posted by beshtya at 2:00 AM on July 9, 2011


Some of those German hits could be links to academic papers, which are frequently written in German (or English, or French).

The guys who wrote up the language-link results forgot to consider, as an explanation for "surprising" links like Hindi->Swahili, the large number of mother-tongue speakers who are seeking (or once sought) economic opportunity in another country.
posted by subdee at 4:36 AM on July 9, 2011


Shenanigans. The reason that graph looks like a map is because somebody has hand-positioned the nodes to correspond roughly with country locations.

This is obvious from the many unnecessary edge crossings. I don't see why, for example, Albanian is not equidistant from German, Italian and French, minimizing distances and crossings and making a generally neater graph. Mongolian would also be better placed near the top, just off Polish, or near Serbian, after moving Armenian and Georgian further up.
posted by Dr Dracator at 10:38 AM on July 9, 2011


"I was also somewhat surprised to see Urdu and Hindi (apparently) have little link to one another..."

Hmm. I thought they were pretty much unrelated linguistically - boy was I wrong. Meanwhile, Wikipedia does have a very interesting article about Urdu and its history. [slinks away]
posted by sneebler at 11:46 AM on July 9, 2011


It would actually be difficult to chart out the Hindi web and perhaps other Indian languages, as technical and higher education in India are almost exclusively in English.

Most Indians don't type Hindi in the Devanagari script, either because they don't know how to or because they don't want the hassle of having to learn a whole new keyboard setup, so a lot of Hindi ends up arbitrarily transliterated into Roman script, interspersed with English. From what I know, other Indian languages share a similar fate.

I'm guessing the article only took the Hindi web proper, which would explain the surprisingly high level of introversion there.
posted by Senza Volto at 3:42 AM on July 10, 2011 [1 favorite]


Also, that Malay and Indonesian are written in the Roman script would probably explain why they are not highly introverted. IIRC, English is taught and used widely in Malaysia and Indonesia on a similar level to India.
posted by Senza Volto at 3:45 AM on July 10, 2011

