Second-Class Languages
March 22, 2015 8:00 AM   Subscribe

 


The most recent update to the Unicode standard included the entire syllabary of Linear B, an ancient Mycenaean script that went undeciphered in the modern era until the 1950s.

Nobody wants Menelaus angry with them.
posted by GenjiandProust at 8:07 AM on March 22, 2015 [13 favorites]


I find it pretty shocking that Bengali isn't fully implemented in Unicode. This is not an obscure language.
posted by mr_roboto at 8:15 AM on March 22, 2015 [13 favorites]


There was a pretty interesting discussion about this article on /r/programming a few days back. Also some good commentary on Hacker News, including back and forth between various Unicode experts.

It turns out that the article misrepresents how the UTC works and how Bengali was added to Unicode. There is even disagreement among native speakers about whether the character can be written correctly and how it should be implemented.

Also, emoji were developed in Japan by non-westerners.
posted by humanfont at 8:19 AM on March 22, 2015 [25 favorites]


I find it pretty shocking that Bengali isn't fully implemented in Unicode.

It is; it's just that this guy disagrees with how it is implemented. He's also full of himself. In the Hacker News thread, he uses an "are you even a native speaker" argument against a native speaker who disagrees with him.
posted by effbot at 8:25 AM on March 22, 2015 [18 favorites]


Nobody wants Menelaus angry with them.


Ah, Menelaus isn't so tough.

Now- Achilles, on the other hand...
posted by TheWhiteSkull at 8:33 AM on March 22, 2015 [1 favorite]


I appreciated the goal of this article, but like folks say there's a technical argument that's not just about cultural imperialism. Combining forms are just fine in the Unicode standard; you don't always need uniquely allocated characters, although then the operating system does need to support them correctly. This particular case of Bengali is remarkably complicated, it turns out. Also, the character ৎ he's complaining about actually was implemented back in 2005, well before the pile of poo 💩 that makes for such a great (if misleading) headline.
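To make "combining forms" a bit more concrete, here's a rough sketch in Python (just the standard unicodedata module, and using é rather than anything Bengali-specific, since the Bengali case is the complicated one):

    # The same visible letter can be one precomposed code point or a base letter
    # plus a combining mark; normalization converts between the two forms.
    import unicodedata

    precomposed = "\u00e9"    # é as a single code point: LATIN SMALL LETTER E WITH ACUTE
    combining   = "e\u0301"   # é as 'e' followed by COMBINING ACUTE ACCENT

    print(precomposed == combining)                                # False: different code points
    print(unicodedata.normalize("NFC", combining) == precomposed)  # True once composed
    print(unicodedata.normalize("NFD", precomposed) == combining)  # True once decomposed

    # And the character in question really does have its own code point:
    print(hex(ord("ৎ")), unicodedata.name("ৎ"))                    # 0x9ce BENGALI LETTER KHANDA TA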

If you want to get mad about cultural bias in Unicode, look to Chinese/Japanese/Korean unification. Or the long tortured process of adding all the Chinese words to the standard. Or the way that UTF-8 encoding for Chinese is remarkably inefficient.

If you want to get mad about cultural bias in character handling in general, get mad at all the software that still thinks ASCII-only is somehow acceptable. ASCII doesn't even work for American English; Española is a real town in America, you know? And it certainly doesn't work for any other European language. A particular pet peeve of mine is the term "special characters", as if somehow the letters of anything other than a restricted form of American English were "special".
posted by Nelson at 8:34 AM on March 22, 2015 [17 favorites]


I wrote my son an email about going to the state fair and he wrote a response full of poop emoji. I said "why did you send me this?" and he said "i thought it was cool they had an emoji for funnel cake."
posted by escabeche at 8:54 AM on March 22, 2015 [44 favorites]


Yeah, the one legitimate complaint in this article was fixed in Unicode a full decade ago. The rest appears to boil down to "I don’t want to have to type a character in my language using Unicode combining characters". To which the universal response is: that’s how Unicode works. It’s how Unicode has always worked. What exactly is the difficulty here?

Frankly, there’s a list of problems in computing which can be laid at the door of the predominantly white, male, western focus of the dominant software companies, but Unicode is not one of them.

The "are you even a native speaker?" attempted put-down in the HN thread was particularly choice.
posted by pharm at 8:54 AM on March 22, 2015 [5 favorites]


Worth reading for the information that MOBY DICK IN EMOJI exists.
posted by Johnny Wallflower at 9:05 AM on March 22, 2015 [4 favorites]


From the Hacker News link, I thought this was interesting:
It seems to me that the high-level issue here is that Unicode is caught between people who want it to be a set of alphabets, and people who want it to be a set of graphemes.

The former group would give each "semantic character" its own codepoint, even when that character is "mappable" to a character in another language that has the same "purpose" and is always represented with the same grapheme (see, for example, latin "a" vs. japanese full-width "a", or duplicate ideograph sets between the CJK languages.) In extremis, each language would be its own "namespace", and a codepoint would effectively be described canonically as a {language, offset} pair.

The latter group, meanwhile, would just have Unicode as a bag of graphemes, consolidated so that there's only one "a" that all languages that want an "a" share, and where complex "characters" (ideographs, for example, but what we're talking about here is another) are composed as ligatures from atomic "radical" graphemes.

I'm not sure that either group is right, but trying to do both at once, as Unicode is doing, is definitely wrong. Pick whichever, but you have to pick.
Does anyone know if this is an accurate assessment? It seems like a root philosophical decision that should have been made at the start....
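To make the quoted example concrete, a small Python sketch (standard unicodedata module): the two "a"s are kept as distinct code points, yet "compatibility" normalization will fold them together, which is pretty much the tension described.

    # Full-width "a" is a separate code point from ASCII "a"; NFKC compatibility
    # normalization folds it back -- roughly the grapheme-centric view, while
    # keeping them distinct is the alphabet/namespace-centric one.
    import unicodedata

    ascii_a     = "\u0061"   # LATIN SMALL LETTER A
    fullwidth_a = "\uff41"   # FULLWIDTH LATIN SMALL LETTER A

    print(ascii_a == fullwidth_a)                                 # False
    print(unicodedata.name(fullwidth_a))                          # FULLWIDTH LATIN SMALL LETTER A
    print(unicodedata.normalize("NFKC", fullwidth_a) == ascii_a)  # True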
posted by GenjiandProust at 9:06 AM on March 22, 2015 [7 favorites]




If you want to get mad about cultural bias in character handling in general, get mad at all the software that still thinks ASCII-only is somehow acceptable.

This is from 2003, was years overdue then, and yet even in 2015 I still occasionally have to nudge people to read it:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
posted by swr at 9:28 AM on March 22, 2015 [9 favorites]


and from that very Joel on Software article: "If a letter's shape changes at the end of the word, is that a different letter? Hebrew says yes, Arabic says no. Anyway, the smart people at the Unicode consortium have been figuring this out for the last decade or so, accompanied by a great deal of highly political debate, and you don't have to worry about it. They've figured it all out already."
posted by Lanark at 9:30 AM on March 22, 2015


Another thing is that to a great degree Unicode is just a bunch of pre-existing encoding schemes smooshed together without especial rhyme or reason. For example (from 2009/Unicode 6.0):
a group of five characters representing specific cultural icons (Mount Fuji, Tokyo Tower, Statue of Liberty, Silhouette of Japan and Statue of Moyai) have been vigorously opposed because they give the appearance of setting a precedent for encoding hundreds of other characters representing cultural or nationalistic icons, such as the Great Wall of China, the Pyramids of Giza, the Eiffel Tower, Tower Bridge, Mount Kilimanjaro, etc. etc. Some of us would have preferred to encode generic versions of these characters (e.g. Snow-Capped Mountain instead of Mount Fuji), but Google insisted that these characters had specific semantics that generic versions of the characters would not be able to represent, so in the end they were accepted as is. Note however, that they are not precedents for encoding other characters representing cultural icons, as they were not encoded because of the importance of the objects these characters represent, but for interoperability reasons (cross-mapping to existing emoji codes). Of course, if mobile phone vendors start adding emoji for the Great Wall of China, etc. then ....
posted by XMLicious at 9:35 AM on March 22, 2015 [1 favorite]


...fucking emojis.....fuck fuck fuck fuck.

have a nice day.
posted by mule98J at 9:39 AM on March 22, 2015 [4 favorites]


the practical implementations of bengali on both windows and OS X are abysmal. avro keyboard works well enough on windows, but their OS X port feels a bit lacking and randomly switches between english and bengali...

what good is installing a language or having it built into Unicode if it's a pain to type it? ;/
posted by raihan_ at 9:41 AM on March 22, 2015


The Oral History of the Poop Emoji
posted by at 9:09 AM on March 22


Huh. I thought that emoji actually represented the Gadsden flag.

"Don't tread on me"
posted by fredludd at 9:51 AM on March 22, 2015


"Don't tread on me"

Technically, the poop emoji also has that covered.
posted by GenjiandProust at 9:59 AM on March 22, 2015 [23 favorites]


Maybe he should just change his name to ┻━┻ ︵ヽ(`Д´)ノ︵ ┻━┻ .
posted by delfin at 10:02 AM on March 22, 2015 [1 favorite]


"Don't tread on me"

I took a moment to look for T-shirts with the poop emoji and the Gadsden motto, but no luck.
posted by fredludd at 10:05 AM on March 22, 2015 [5 favorites]


I do sort of agree that the input interface is really more important -- to some degree, I don't think it matters how the back end "assembles" the word as long as you can type in a reasonably "natural" way. I have used a couple of electronic Japanese dictionaries, and I like the one on my phone well enough, but if I can't guess the pronunciation of an unfamiliar compound in a couple of tries, I am reduced to an annoying process of assembling the kanji out of component parts which seem a bit randomly assigned. And, if the kanji I am looking for is archaic (or even in one of those faux seal script fonts), I am really out of luck.

I can understand how the writer in the original article could be annoyed at having to assemble his name in a non-intuitive way every time he needs to write it -- that has to be alienating -- but surely some of the problem is front-end?
posted by GenjiandProust at 10:06 AM on March 22, 2015 [1 favorite]


I am shocked, shocked, that a Model View Culture article would turn out to be an inflammatory oversimplification of a complex problem.
posted by strangely stunted trees at 10:15 AM on March 22, 2015 [13 favorites]


I don't want texts from scat freaks anyway, no matter where they are from
posted by thelonius at 10:23 AM on March 22, 2015 [2 favorites]


Unicode is just a bunch of pre-existing encoding schemes smooshed together without especial rhyme or reason.

That's really not true. The text quoted in support of that claim is a great example of the reasoning process at work. One of the goals of Unicode is to allow lossless round trips between all the other historical encodings of written language and Unicode. So the argument for why Mount Fuji U+1F5FB 🗻 is specifically in Unicode as "Mount Fuji" and not some generic mountain is that in whatever encoding they based this part of Unicode off of, it meant specifically that one mountain. And that for other reasons that character set was important / common enough to merit a treatment in Unicode. You may disagree with the specific decision of whether it was correct to include this character, but there is a principled basis for the argument.
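(You can check this straight from the character database; a trivial Python example:)

    import unicodedata
    print(unicodedata.name("\U0001F5FB"))   # MOUNT FUJI
    print(unicodedata.name("\U0001F5FC"))   # TOKYO TOWER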

I'd love to be able to put my hands on a cross-referenced historical treatment of every single Unicode character that explains where the character came from and why it's in Unicode in this way. The information is mostly there, in the formal proposals and then secondarily the discussions and critiques about those proposals. But I don't think it's all collected in one referenceable place. There's a huge amount of linguistic scholarship that went into creating Unicode, it's fascinating stuff.
posted by Nelson at 10:40 AM on March 22, 2015 [6 favorites]


I don't think it matters how the back end "assembles" the word as long as you can type in a reasonably "natural" way.

Unicode Khmer doesn't strictly speaking require people to key in characters in an unnatural order, but pretty much every input system I've seen thus far requires them to do so. It's one of the reasons that so many people are still using Limon or Ekreach or etc etc etc custom-encoding fonts.

Most of the world's most complex writing systems have glitches like this: Unicode-plus-input-method-plus-OpenType-implementation chaos. Or they have "backend assembly" problems where you key them correctly and then some of your platforms render your stacks in one way and others render them in another way. (MS Word versus Adobe InDesign is the classic example here.)

If you try to do the thing you suggest - write an input method that lets people type in an "intuitive" way - then you wind up writing a whole hell of a lot of code to dodge around the limitations imposed by encoding + badass font tech + intended platforms. When, really, one would expect a universal encoding + badass font format tech to make it a Write Once, Render Everywhere The Same Way situation. I would expect it, if the universal encoding system had been under development since the late 1980s.

I'm okay with the status quo, as it pays my bills, but if my birth tongue was written in a complex script I would be Very Angry.
posted by BrunoLatourFanclub at 10:50 AM on March 22, 2015 [3 favorites]


write an input method that lets people type in an "intuitive" way - then you wind up writing a whole hell of a lot of code to dodge around the limitations imposed by encoding + badass font tech + intended platforms.

Perhaps, but the old ASCII notion that one key press = one code point = one glyph in one character cell has been obsolete for ages, so you have to write a whole hell of a lot of code anyway. Or use a library.
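For instance, here's a rough Python sketch of why the counts stopped lining up (the grapheme clustering uses the third-party regex package, since the standard re module can't do it):

    import regex  # third-party "regex" package; stdlib "re" has no \X

    nd  = "n\u0303"             # 'n' + COMBINING TILDE, renders as ñ
    han = "\u1100\u1161\u11a8"  # Hangul jamo G + A + G, renders as the syllable 각

    for s in (nd, han):
        print(len(s),                          # code points
              len(s.encode("utf-8")),          # bytes in UTF-8
              len(regex.findall(r"\X", s)))    # user-perceived characters (grapheme clusters)
    # ñ : 2 code points, 3 bytes, 1 grapheme
    # 각 : 3 code points, 9 bytes, 1 grapheme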
posted by effbot at 11:37 AM on March 22, 2015


I wrote my son email about going to the state fair and he wrote a response full of poop emoji. I said "why did you send me this?" and he said "i thought it was cool they had an emoji for funnel cake."

Ahem
posted by kmz at 11:48 AM on March 22, 2015 [5 favorites]


So the argument about why Mount Fuji U+1F5FB 🗻 is specifically in Unicode as "Mount Fuji" and not some generic mountain is that in whatever encoding they based this part of Unicode off of, it meant specifically that one mountain.

This is what I mean, sorry if my wording was vague: not that the process of smooshing them together itself has no rhyme or reason or rules or formality, but that much of the time it's a snowballing of decisions like this made by disparate parties all over the world at different times under different requirements and motivations.

In some cases they were decisions concerning computer systems that were extremely short-lived and evanescent even by our own standards only a few decades into the era of computers. I'm sure there's always a reason for inclusion of any given glyph at least in the form of backwards compatibility with some hardware or application or font system that was used for a few years during the twentieth century which a particular member of the consortium cares about; it's just that as far as understanding how you might implement Unicode or how you might use it, you have to realize that—for good or bad—those sorts of things are put on a level not too distant from the particulars of mathematical notation or accuracy in capturing the variations in writing systems that have been used across hundreds or thousands of years, the "huge amount of linguistic scholarship" as you mention there.
posted by XMLicious at 11:48 AM on March 22, 2015 [1 favorite]


Oh yeah we're in total agreement then. Another way to put this is Unicode is by design a big hodge-podge of historical encodings of writing systems. There's a lot of weird freaky stuff in the fringes of Unicode. But it mostly works; no one really cares too much about the details of which Emoji got encoded where. It's when you can't write your name easily in your native language that it feels much more personal.
posted by Nelson at 11:55 AM on March 22, 2015 [2 favorites]


"are you even a native speaker?"

bro, do you even lift?
posted by desjardins at 12:22 PM on March 22, 2015 [4 favorites]


Even crazier than just Unicode itself are the encoding schemes for various data storage. A web form on a specific browser talking to a specific web server might encode the characters in all manner of crazy ways. Then the relational database clob and nvarchar/varchar/text fields can be configured in many different ways, some of which might need to also be specified in the database middleware/driver. For example, was the database configured for US-ASCII 7-bit encoding, 8-bit Latin-1 characters, UTF-8, UTF-16, etc.? What did the browser send: an HTML-escaped character reference preceded by an ampersand, or the actual encoded bytes of a UTF-8 character? It can get really crazy when users start copying and pasting text from other webpages. You might end up with data from the user that mixes escaped characters and raw bytes.
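A small Python sketch of the kind of mismatch I mean (a made-up minimal example, not any particular stack):

    # The same "é" travelling as an HTML character reference, as UTF-8 bytes, as
    # Latin-1 bytes, and the mojibake you get when the layers disagree.
    s = "é"

    print(s.encode("ascii", "xmlcharrefreplace"))   # b'&#233;'   (HTML-escaped form)
    print(s.encode("utf-8"))                        # b'\xc3\xa9'
    print(s.encode("latin-1"))                      # b'\xe9'
    print(s.encode("utf-8").decode("latin-1"))      # 'Ã©'  (UTF-8 bytes read as Latin-1)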
posted by humanfont at 12:25 PM on March 22, 2015


The Swedish alphabet has three extra characters compared to the English alphabet, and they're written as "Å", "Ä" and "Ö". All three of them are vowels, and are unique letters unto themselves, they are not just "A" and "O" with diacritics. But if you look up the characters in Unicode, you find out that (for instance) "Ö" is "Latin Capital Letter O with diaeresis", even though it's emphatically not. A diaeresis is a diacritical mark used to indicate that two vowels should be pronounced in different syllables and not as a diphthong, like at the end of the names "Zoë" and "Chloë", and when the New Yorker snootily writes "coöperation". That has nothing to do whatsoever with what "ö" means in Swedish, where it's an entirely different thing.

So, when I type my Swedish "ö", do I feel the crushing yolk of English imperialism just because the Unicode consortium has decided that the name of the character that comes out when I press the "ö" key on my keyboard (yes, we have special keyboard layouts with "å", "ä" and "ö" added) is called "Latin small letter o with diaeresis" instead of "Swedish small letter ö"? No, of course I don't! I don't care what the character is called in Unicode, nor do I care at all where it appears in the list. What I do care about is that I have a keyboard that has keys for the extra Swedish alphabet characters, and that when I press them, the right letters appear. The technical details of how exactly those letters are encoded are totally uninteresting to me.

If Unicode can render this character, which apparently it absolutely can, what does it matter if it's a combination of one, two or three codepoints? Who gives a damn? If it's hard to write the character, blame keyboard manufacturers, not Unicode, clearly the problem lies with them.
posted by gkhan at 1:19 PM on March 22, 2015 [6 favorites]


Here's a fun one:

1. Create a git repository on a Mac, and check in a few files that contain various Unicode characters such as ü in their file names.
2. Check out said git repository on a Linux box, make some changes to those files, and try to commit your changes.
3. ???
4. Panic.

Yes, the underlying issue has been fixed, but there is indeed some weird stuff on the fringes of Unicode that occasionally makes things like string comparisons completely fall apart.

But, really, it's truly remarkable how much of Unicode "just works" on modern platforms. I consider Unicode to be one of the biggest accomplishments of modern computing.
posted by schmod at 1:23 PM on March 22, 2015 [1 favorite]


The larger point stands though. For historical reasons, Unicode privileges Latin letters. Imagine the Latin alphabet didn't exist in Unicode and you were in the process of adding it. Given how non-Latin languages have been added (Han unification etc.), adding Latin now would likely result in things like uppercase characters being encoded as lowercase plus a combining character, and other things like that.
posted by R343L at 1:36 PM on March 22, 2015 [1 favorite]


So, when I type my Swedish "ö", do I feel the crushing yolk of English imperialism?

Eggsactly! Brits, however, must submit to your yoke when they have to call a kålrot a "swede." Here in America, we proudly use the freedom-drenched term "rutabaga."
posted by Johnny Wallflower at 1:38 PM on March 22, 2015 [6 favorites]


During the Apple WWDC rollout of their new Swift programming language, they showed how one could write a function using an emoji character. Since then I've had nightmares about the unmaintainable code that is going to one day find its way to my door. Some monster right now is surely writing some crucial set of iOS apps with an internal lib composed of emoji-named functions and variables for lols.
posted by humanfont at 2:13 PM on March 22, 2015 [6 favorites]


The Swedish alphabet has three extra characters compared to the English alphabet, and they're written as "Å", "Ä" and "Ö".

Sort of previously.
posted by GenjiandProust at 2:16 PM on March 22, 2015


Johnny Wallflower: "Eggsactly! Brits, however must submit to your yoke when they have to call a kålrot a "swede." Here in America, we proudly use the freedom-drenched term "rutabaga.""

"Rutabaga" comes from Swedish, so THERE IS NO ESCAPE.
posted by Joakim Ziegler at 3:00 PM on March 22, 2015


humanfont: "Some monster right now is surely writing some crucial set of iOS apps with an internal lib composed of emoji named functions and variables for lols."

This will make obfuscated code contests a lot more fun, though.
posted by Joakim Ziegler at 3:01 PM on March 22, 2015


Somehow, I'm almost afraid to check whether emacs has implemented keyboard shortcuts that require emoji...
To save your file, press Ctrl-Meta-Up-🐕 while tilting your smartphone 30° to the left
posted by schmod at 3:36 PM on March 22, 2015 [3 favorites]


But if you look up the characters in Unicode, you find out that (for instance) "Ö" is "Latin Capital Letter O with diaeresis", even though it's emphatically not.

And of course, it can be represented internally as either that code point, or as two code points (LATIN CAPITAL LETTER O + COMBINING DIAERESIS) which in turn is represented as different byte patterns depending on encoding. And to type it, you might press Ö or ¨+O or combine+"+O or long press-O+Ö or some other variation, depending on what keyboard you're using.

Also, what lots of people are missing (especially people ranting about the Han unification) is that Unicode encodes scripts, not languages; just knowing the code point doesn't tell you how to render something. Default to the wrong font, and multilingual users may get a headache. And you may get bug reports from your Japanese beta testers where they complain that those strange scribbles on the screen look like Greek to them, even if it all looks like Japanese to you...

"Rutabaga" comes from Swedish, so THERE IS NO ESCAPE.

From a Swedish dialect, to be precise, where it supposedly was called "rotabagge" at some point in time. In ordinary Swedish, it's "kålrot" (cabbage root). Swedes probably only know the dialectal word these days because it made it into American English, so thanks for keeping old Swedish dialects alive!
posted by effbot at 3:46 PM on March 22, 2015 [1 favorite]


The "Han unification project" he describes sounds absolutely nuts. Crazy.

I also thought his attitude towards non-native speakers was pompous at best.
posted by Nevin at 4:05 PM on March 22, 2015


Han Unification was a bad idea, and it happened because they thought they could fit the world into 16 bits if they did it.

They were trying to compete with the original ISO-10646 proposal, which would essentially have allocated every script a 16-bit block, within which it could do whatever it wanted, with a few more bits at the front to specify the script. The Unicode committee wanted to limit the total size of any character to 16 bits for the same reason the ASCII committee originally wanted to limit its printing characters to 6 bits: because they thought nobody would implement a code that used a weird number of bits that was slightly larger than any computer's natural word size.

They were wrong: instead, we've got UTF-8, and nobody cares how many bits any particular character is any more, except when you're writing code to pack and unpack it. But it happened and it's too late to undo it now.
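A rough Python illustration of the variable-width point: UTF-8 spends one to four bytes per code point, which is also why it's comparatively bulky for CJK text, as mentioned upthread.

    for ch in ("a", "ö", "漢", "💩"):
        print(ch, hex(ord(ch)), len(ch.encode("utf-8")), "bytes in UTF-8")
    # a   0x61      1 bytes in UTF-8
    # ö   0xf6      2 bytes in UTF-8
    # 漢  0x6f22    3 bytes in UTF-8
    # 💩  0x1f4a9   4 bytes in UTF-8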
posted by enf at 4:19 PM on March 22, 2015


To save your file, press Ctrl-Meta-Up-🐕

What's ☝🐕? Oh, right.
posted by ambrosen at 5:16 PM on March 22, 2015 [3 favorites]


Yeah, I had a hand in attempting to get a standardized Unicode situation for Classical Mongolian (itself related to Classical Tibetan, the written/printed style of Tibetan Buddhist sutras and historical accounts), while the subject matter experts themselves were in at least two, umm, strongly disagreeing camps. The NGO I was with desperately needed an intelligible encoding for the transcriptions and translation tools we were creating. Last I checked, the issue was still not resolved.

Classical Tibetan texts have been printed since before Gutenberg's day. They had both full page ("image") and individual character ("type") options. And we still can't decide on an electronic encoding/typing script?!
posted by Dreidl at 5:43 PM on March 22, 2015 [2 favorites]


Going back to the Han unification concept, it just seems to be the ultimate heights of hubris and arrogance. In East Asia, central governments or authorities take it upon themselves to standardize their country's take on Han script (itself an exercise in arrogance, if not exactly hubris).

Japan, Taiwan, S Korea, Vietnam and of course China all have their own take on Han characters. In many cases (notably simplified Chinese), the radicals (the building blocks of the Han writing system) have been totally transformed between writing styles.
posted by Nevin at 9:32 PM on March 22, 2015


Unicode isn't absolutist about Han unification, and indeed isn't really doing it anymore. It was a technical compromise to fit Chinese writing into the 65,536 characters of the Basic Multilingual Plane, which I think is still the only part of Unicode supported by Windows XP and some popular graphics frameworks. Browsers all use UTF-8 these days, though, so they don't have to be so constrained, and you can get a bunch more ideographs in (so far) four "Extension" blocks, plus one block for just punctuation, one for strokes, and one for radicals.
posted by LogicalDash at 5:46 AM on March 23, 2015 [1 favorite]


I guess it's still "unified" in the sense that there aren't separate blocks for different Han languages? That would pose its own political problems anyway because you'd have to decide which language "owns" a given character, or else duplicate it when you know perfectly well it's the same, and cause exciting new compatibility issues wrt. sorting, glyph selection, and input methods.

Only there ARE separate blocks for hiragana, hangul, and bopomofo.

What I'm getting at is that the problems Nevin is referring to have been addressed, perhaps not in the ideal way, because there is no ideal way.
posted by LogicalDash at 5:54 AM on March 23, 2015


Yeah anything that supports only the Unicode BMP is broken by design. One of the nice things about the popularity of emoji is that they live on a supplemental plane, so it forces out all the bad old software that assumes every Unicode character fits in 16 bits.
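A quick Python check of that claim: the poo emoji sits above U+FFFF, so in UTF-16 it takes a surrogate pair, which is exactly what trips up software that assumes one 16-bit unit per character.

    poo = "\U0001F4A9"
    print(hex(ord(poo)))                    # 0x1f4a9 -- outside the BMP
    print(poo.encode("utf-16-be").hex())    # d83ddca9 -- a surrogate pair, two 16-bit units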
posted by Nelson at 7:08 AM on March 23, 2015 [1 favorite]


During the Apple WWDC rollout of their new Swift programing language they showed how one could write a function using an emoji character.

I can still feel the raw, empty socket in my soul where I had some hope before that moment, as if I'd touched the spot where a tooth used to be with my tongue.
posted by phearlez at 12:48 PM on March 23, 2015 [1 favorite]


What's New in Unicode 8.0: a readable overview of what's coming in the June 2015 version of Unicode. Posting it here because the blog post documents some of the meticulous detail that goes into Unicode proposals. See, for example, the Proposal to encode Gujarati Letter ZHA, which documents the need for a new character to capture a sound that's not common in Gujarati but is used to write Parsi loan words for Zoroastrian texts. This work is a great example of how the choice of characters in Unicode is not in any way arbitrary but the result of a lot of careful scholarship.
posted by Nelson at 4:07 PM on April 1, 2015


The taco emoji is used by many of the most erudite and urbane scholars, which is undoubtedly why, as Andrew notes there, it was added with "unseemly haste".
posted by XMLicious at 5:52 AM on April 2, 2015


Yeah, U+1F32E TACO is kind of dumb. FWIW, in general the set of emoji is decided by what was previously encoded in proprietary encodings. For emoji, that's mostly been Japanese phones. At least I know the original emoji set was pretty carefully vetted and haggled over; Unicode TR51 has references to their origins. The notorious PILE OF POO, for instance, was F6CE in KDDI Shift-JIS and F99B in SoftBank Shift-JIS. To the extent Unicode is supposed to envelop all the Shift-JIS variations, Unicode needs a PILE OF POO.

But I think for Emoji this historical justification is not the only thing any more and someone's inventing some stuff. Particularly the new skin tone modifiers, I'm not aware of any precedent for them. I'm going out on a limb a bit though. If anyone closer to the Unicode Consortium knows more I'd be glad to be corrected.
posted by Nelson at 7:39 AM on April 2, 2015


(fwiw, I've been having serious discussions with members of the consortium about adding a :facepalm: emoji. I think the world needs one.)
posted by effbot at 8:01 AM on April 2, 2015



