Everything you didn't want to know about Unicode
May 26, 2015 5:17 PM

 
Kind of insane does not do unicode justice.
posted by AlexiaSky at 5:40 PM on May 26, 2015 [14 favorites]


I could write at length about this, but I need to run out to dinner.

Suffice it to say, Unicode is arguably the greatest (and least appreciated) accomplishment of the information age.

The subset of use cases that it covers is remarkably thorough, and support for it is nearly ubiquitous worldwide. Every computing device in every country "speaks" the same character set. This kind of worldwide accord may very well be unprecedented in the documented history of mankind.

(Yes. There have been globalization-related drawbacks, e.g. Han Unification, which others in this thread are certain to point out – as is MetaFilter's way. However, many of the edge cases that weren't captured in the original spec have gradually been added in subsequent revisions...)

The importance of Unicode makes me cringe whenever I see articles along the lines of "10 Cute Emoji that Apple should add tomorrow!" Language should certainly be allowed to evolve, but Unicode is too damn important to leave in the hands of a single vendor, or to modify hastily.

Also, don't forget the gargantuan effort that went into creating the CLDR. No software developer has an excuse not to internationalize their application, thanks to the wealth of locale data contained within the CLDR.

and now I'm late for dinner
posted by schmod at 5:41 PM on May 26, 2015 [57 favorites]


but it turns out I actually really did want to know all this
posted by postcommunism at 5:41 PM on May 26, 2015 [8 favorites]


I am very happy now that understanding Unicode is no longer part of my job description.

And yet, all the headaches are still better than life without Unicode.
posted by Banknote of the year at 5:42 PM on May 26, 2015 [9 favorites]


It isn't really Unicode that is crazy. Human writing systems are quirky and diverse. Any universal encoding scheme is going to be complex as a result of the underlying systems it wants to contain.
posted by humanfont at 5:43 PM on May 26, 2015 [16 favorites]


Any universal encoding scheme is going to be complex as a result of the underlying systems it wants to contain.

DING DING DING DING DING.

When the result (the set of all written human languages, evolved and invented) is complicated, then the solution will be complicated as well. The reason that we had problems before is that we put simplistic solutions like ASCII into place and then tried to bash languages that didn't fit into it.

Unicode isn't perfect -- if we could jump back 15 years and do it again, we'd do a few things differently (like not bothering with UTF-1, a classic example of "Seemed like a good idea at the time"). But Unicode, by and large, works well and, more importantly, extends well, even when we add things like emoji. Which you may think is dumb, but I'd wager about 1/6th of the world disagrees with you. Hell, emoji may be the closest to a universal language we've ever had, which is saying something about humanity. Probably something really horrible, to be honest.

We're on the verge of version 8.0, due out in a couple of weeks. Wonder how many new scripts we'll get. 7.0 last June gave us 23. The Unicode Consortium is aiming for a yearly release "until it's done" -- that is, until all known human written languages are encoded, then updates whenever languages evolve (read, emoji updates, really.)
posted by eriko at 6:04 PM on May 26, 2015 [19 favorites]


The "Redundant Codepoints" section doesn't make sense to me. Surely you'd want to be able to distinguish the retroflex click from the exclamation mark for non-display uses of the characters, like with screen readers, right?
posted by ddbeck at 6:14 PM on May 26, 2015 [4 favorites]


I'm not sure "Unicode makes string reversal hard!" is a particularly compelling example: when is it ever actually useful to reverse a string?
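For reference, the failure the article is pointing at is easy to reproduce. A minimal Python sketch (the function names are made up for illustration, and the second one only handles combining marks, not full grapheme clusters such as emoji ZWJ sequences):

```python
import unicodedata


def naive_reverse(text: str) -> str:
    """Reverse code point by code point, which detaches combining marks."""
    return text[::-1]


def reverse_keeping_marks(text: str) -> str:
    """Reverse while keeping each combining mark attached to its base character.

    This is a simplification: real grapheme segmentation follows more rules
    than "non-zero combining class sticks to the previous character".
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # glue the mark onto the preceding base
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


word = "cafe\u0301"  # "café" spelled as e + COMBINING ACUTE ACCENT

print(ascii(naive_reverse(word)))          # the accent ends up first, with no base
print(ascii(reverse_keeping_marks(word)))  # the accent stays on the "e"
```

Whether you ever need this is a separate question, but it shows that "reverse the array of characters" stops being a one-liner.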
posted by We had a deal, Kyle at 6:21 PM on May 26, 2015 [1 favorite]


!msivitpircserp sselesu ruoy dna uoy kcuf
posted by lalochezia at 6:25 PM on May 26, 2015 [19 favorites]


People who don't understand Unicode are doomed to complain about it. It's like a next level *nix in that regard.
posted by boo_radley at 6:27 PM on May 26, 2015 [3 favorites]


The only really useful example I can come up with for string reversal off the top of my head is "for suffix searching". Apache Lucene, for example, supports wildcard searches out of the box, but not ones where the leading character is a wildcard (that is, you can do "foo*" but not "*oof"). You can cover simple suffix searches by building an index where the reverse of the string is stored as a separate field and convert suffix searches to prefix searches against the reversed field (using the reversed search term as well).
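A rough sketch of the idea in plain Python, using a sorted list instead of a real search index. This is not Lucene's API; `SuffixIndex` and `ends_with` are hypothetical names, and the reversal here is the naive code-point kind, so it assumes terms without combining marks:

```python
import bisect


class SuffixIndex:
    """Answer "*suffix" queries by doing prefix lookups on reversed terms."""

    def __init__(self, terms):
        # Store every term reversed, sorted so prefix matches are contiguous.
        self._reversed_terms = sorted(term[::-1] for term in terms)

    def ends_with(self, suffix):
        prefix = suffix[::-1]  # reverse the query the same way
        start = bisect.bisect_left(self._reversed_terms, prefix)
        matches = []
        for candidate in self._reversed_terms[start:]:
            if not candidate.startswith(prefix):
                break  # past the contiguous run of matches
            matches.append(candidate[::-1])  # un-reverse for the caller
        return matches


index = SuffixIndex(["foo", "shampoo", "bar", "kazoo"])
print(sorted(index.ends_with("oo")))
print(index.ends_with("ar"))
```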
posted by axiom at 6:28 PM on May 26, 2015 [7 favorites]


Hell, emoji may be the closest to a universal language we've ever had, which is saying something about humanity. Probably something really horrible, to be honest.

I was walking the beach the other day and saw someone had carved a poop emoji into the sand. I &#x1F4A9; you not.
posted by RobotVoodooPower at 6:35 PM on May 26, 2015 [7 favorites]


Hell, emoji may be the closest to a universal language we've ever had,

That may be true but it isn't saying much. My siblings and I have an endless group text chain going which is full of jokes and photos and updates about our lives. It is full of emoji, but it is basically just fancy punctuation; even among close relatives who grew up together, figuring out what someone was trying to communicate through a string of pictures is difficult at best.
posted by Mars Saxman at 6:40 PM on May 26, 2015 [1 favorite]


*flips table*
posted by clvrmnky at 6:43 PM on May 26, 2015 [1 favorite]


I can understand why you might want separate Unicode entries for characters and diacritics; I have no idea why there isn't a standard autocorrect option to convert them into the correct merged character. Vietnamese was the bane of my existence at my old job for this very reason.
posted by ivan ivanych samovar at 6:47 PM on May 26, 2015


My favorite is when Unicode went to the Island of Magic

What
posted by prize bull octorok at 6:48 PM on May 26, 2015


One of my many jobs is testing unicodey stuff. Which is why I have a VM whose name is ☺. Besides, it makes me happy.
posted by miyabo at 7:02 PM on May 26, 2015 [6 favorites]


Yeah, but you flipped it left-to-right, soooooo...
posted by boo_radley at 7:08 PM on May 26, 2015


¯\_(ツ)_/¯
posted by infinitewindow at 7:09 PM on May 26, 2015


"Glitchr" is a Twitter account which is trying to use Unicode idiosyncrasies to hammer the Twitter user interface.
posted by Chocolate Pickle at 7:10 PM on May 26, 2015 [3 favorites]


How have I never thought to use UTF-8 codepoints as VM names? I guess I'm subconsciously ASCIIist. Perhaps because I still have PTSD from the early days of Unicode implementation in Linux. It really used to be a terrible clusterfuck. 💩 in no way captures what an enormous pain in the ass it was early on. Not because the idea was bad, but because the spec is so unwieldy that early implementations were uniformly terrible and mostly broken.
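(For the pedants: the code point and its UTF-8 bytes are different things. A quick Python sketch using only the standard library, with variable names chosen just for illustration:)

```python
# One code point, several byte-level spellings.
poo = "\U0001F4A9"  # PILE OF POO

code_point = f"U+{ord(poo):04X}"       # the abstract number Unicode assigns
utf8_bytes = poo.encode("utf-8")       # four bytes in UTF-8
utf16_bytes = poo.encode("utf-16-be")  # a surrogate pair in UTF-16

print(code_point)
print(utf8_bytes.hex())
print(utf16_bytes.hex())
print(len(poo))  # Python strings count code points, not bytes
```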
posted by wierdo at 7:13 PM on May 26, 2015 [2 favorites]


Back in the old days, I used isomorphic code points to create a spoof quonsar account and subsequent pair of comments -- to point out that it could be done and because I couldn't resist. I immediately contacted both quonsar and Matt, though. And Matt closed that hole.

I'm not sure what the limitations are now. In the ten years since then, I assume that there are somewhat standard approaches and libraries that allow people to create unicode userids as they like while guarding against lookalikes.
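As a very crude illustration of one such guard, here is a Python sketch that flags userids mixing scripts. It is an assumption-laden toy: it guesses the script from the first word of each character's Unicode name, the function names are invented, and it will not catch a lookalike written entirely in one script. A real system would more likely lean on the Unicode confusables data.

```python
import unicodedata


def scripts_used(userid: str) -> set:
    """Guess the scripts in a userid from Unicode character names (a heuristic)."""
    scripts = set()
    for ch in userid:
        if ch.isalpha():
            # e.g. "LATIN SMALL LETTER O" -> "LATIN"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def looks_suspicious(userid: str) -> bool:
    """Flag userids whose letters come from more than one script."""
    return len(scripts_used(userid)) > 1


real = "quonsar"
spoof = "qu\u043ensar"  # CYRILLIC SMALL LETTER O in place of the Latin "o"

print(looks_suspicious(real))
print(looks_suspicious(spoof))
```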

As someone who worked in web software in the late 90s (the dominant CMS at the time, used around the world by many of the largest websites) but before unicode was widely adopted and right when internationalization became a requirement, I absolutely love unicode. As others have said, it can be a nightmare, but it's paradise compared to the hell that preceded it.
posted by Ivan Fyodorovich at 7:14 PM on May 26, 2015 [3 favorites]


💩
posted by sammyo at 7:14 PM on May 26, 2015 [4 favorites]


I have a mental block when it comes to programming. I can see that all of the brackets and punctuation are piss-poor representations of what we want to express programmatically, because we're stuck in VT100-land forever as programmers. It confuses and pisses me off. I know BASH and Python fairly well, as I can't do my job without them, but every other language gives me the hate-fits at how poorly its syntax flows.

There's one that's different.

I've learned APL.

∇ LolButts
[1] ButtVector ← "BUTTS"
[2] LulzVector ← "LOL"
[3] 2 2 ⍴ LulzVector ButtVector
[4] → 1
[5] ∇

LolButts


That there? Unicode. You can copy and paste it from this web page into a GNU APL workspace on your Mac or Linux box, and it will run. The tyranny of ASCII is at an end.

Walk with me a bit further, as in discovering APL, I also discovered LISP, and Racket, its playful little chum.

S-Expressions. The brackets in Racket are meaningless - they're simply syntactic sugar, and are all interchangeable with "( )" - if you think on this a moment more, even the parens are interchangeable with any visual delineator. It could be text color, or the background beneath it acquiring a texture...

Now think of it on its side. If, in Racket, "[" is the same as "{" is the same as "(" - why can't it also be the same as "⏜ "? Why can't your code editor place unicode characters downward as you craft your S-Expression?

Because you have a text editor rather than a code editor. It may look and act all fancy, but it's just a gussied up VT100 terminal, making unfortunate assumptions about your language and literacy.

Eventually, people will twig to the awesome potential of Unicode as a tool for thinking programmatically in your mother tongue, and ditch the shift-number-key line noise for something more clearly expressive.

Until then, onwards to the next Project Euler problem! As thinking in APL has me thinking in Math.

Thanks, Unicode!
posted by Slap*Happy at 7:27 PM on May 26, 2015 [18 favorites]


I really do like me some Unicode but every now and then it sneaks into some unexpected places. Take GDS2. The specification was written nearly 40 years ago, before some members of the Unicode consortium were born I'd imagine, and defines strings as a series of bytes. Usually rendered in ASCII.

A vendor sent us some cells and a subset of them were marked with a special string. Somehow, they inserted this string in Unicode (I have no idea how, as I didn't think any editor supported Unicode). If you search for this string in ASCII you don't find anything. If you search for the exact same string as a Unicode string copied from an email it works. Mysterious, until you look at the cell in question and see a string of gibberish... "Is that... Unicode?!?"
posted by flyingfox at 7:52 PM on May 26, 2015 [1 favorite]


Let alone human languages, Unicode is boldly going where no one has ever gone before and implementing Vulcan gestures... 🖖

(That symbol is not even rendering for me in my browser window, but visitors from the glorious unicode future will no doubt appreciate it. I hope.)
posted by RedOrGreen at 8:05 PM on May 26, 2015 [5 favorites]


My browser didn't render it so I checked from my iphone... Not yet. I'll check back whenever I want to verify whether I'm in the glorious unicode future yet.
posted by isthmus at 8:15 PM on May 26, 2015 [1 favorite]


I'm running a recent version of Chrome and I see a Vulcan salute so I can confirm that the glorious unicode future has indeed arrived, at least here in my Philadelphia rowhome.
posted by Tomorrowful at 8:20 PM on May 26, 2015 [1 favorite]


Slap*Happy - it may please you to know that JavaScript supports UTF-8 variable names.
posted by djb at 8:23 PM on May 26, 2015


My browser didn't render it

Works on Safari, Firefox and Chrome on my gnarly old iMac. Maybe an upgrade from Netscape 1.0 on your Thinkpad 820 running AIX 4.3 is in order?
posted by Slap*Happy at 8:24 PM on May 26, 2015 [2 favorites]


So I just learnt that (simplified) Chinese can be written either left to right or right to left to right, even in (Roman-transliterated) Pinyin; hence, a vehicle belonging to China Post could be stencilled both China Post and Tsop anihc on the same vehicle. That's the level of complexity Unicode is trying to resolve here.

Then there's how Unicode treats Brahmi scripts, by rearranging typed characters to form glyphs.

And of course, my favourite: there are the zero-width non-joiner and the zero-width joiner, characters whose only task is to 'separate' other characters.
posted by the cydonian at 8:26 PM on May 26, 2015 [2 favorites]


Those "redundant codepoints" are how I get around file restrictions when I cussedly really want one of those characters the OS doesn't want you to have. The realization that I could because of Unicode support brought an odd giddiness.

Shapecatcher is a fun toy if you like poking about unfettered unicode possibilities or need some inroads on your unicode artwork.
posted by Ogre Lawless at 8:27 PM on May 26, 2015 [1 favorite]


Slap*Happy: "Eventually, people will twig to the awesome potential of Unicode as a tool for thinking programmatically in your mother tongue, and ditch the shift-number-key line noise for something more clearly expressive."

You should try a programming language that supports subscripts and superscripts in the notation.
posted by LastOfHisKind at 8:36 PM on May 26, 2015


Barely scratches the surface of unicode's complexity. Call me when you've finished reading up on normalization forms, case mappings, derived properties, regular expressions, collating algorithms, segmentation rules, script property inheritance, pluralization and calendar data from CLDR, the confusables list, han unification criteria, the punycode nameprep algorithm ...
posted by ead at 8:47 PM on May 26, 2015 [4 favorites]


You should try a programming language that supports subscripts and superscripts in the notation.

No, because then I'll get all spun up about how terrible orthodox math notation is, and start getting all sway-toothed street-preacher about Iverson and how he table-flipped the whole magilla for his own notation optimized for munging multidimensional arrays, and how fun it was to learn it...
posted by Slap*Happy at 8:48 PM on May 26, 2015 [2 favorites]


"Those 'redundant codepoints' are how I get around file restrictions when I cussedly really want one of those characters the OS doesn't want you to have."

Heh, I do that, too.
posted by Ivan Fyodorovich at 8:50 PM on May 26, 2015


Slap*Happy: "I've learned APL. "

Here's a fun thing in C#
posted by boo_radley at 9:15 PM on May 26, 2015 [4 favorites]


Tangentially-related to the APL discussion upthread: FiraCode: A monospaced font with programming ligatures.

Makes stuff like <= automagically appear as ≤ without modifying the underlying text.
posted by schmod at 9:31 PM on May 26, 2015 [7 favorites]


So I'm not the only one here who's used duplicate code points to troll a coworker into doing an @all in a large HipChat room, right?
posted by invitapriore at 10:00 PM on May 26, 2015


c#, you weird motherfucker:
            const string ಠ = "ok";
            const string ಠ_ಠ = "also fine";
posted by boo_radley at 10:17 PM on May 26, 2015


there are the zero-width non-joiner and the zero-width joiner, characters whose only task is to 'separate' other characters.

Which makes little sense to some monolingual from the Anglosphere, but there are many words in e.g. Persian that you simply can't spell correctly without the zero-width non-joiner. So I'm here to take umbrage over your use of the word "only."

*flips table*

And then:

I absolutely love unicode. As others have said, it can be a nightmare, but it's paradise compared to the hell that preceded it.

Okay.

*rights table, carefully rearranges flatware and smooths tablecloth*

There.
posted by BrunoLatourFanclub at 10:24 PM on May 26, 2015 [3 favorites]


there are many words in e.g. Persian that you simply can't spell correctly without the zero-width non-joiner

Can you explain this to me like I'm five? If it is zero width, it renders invisibly, no?
posted by wierdo at 11:47 PM on May 26, 2015 [1 favorite]


Well, Persian is basically a cursive script, right? The letters flow together. Well, most of them do, just as in the cursive you (probably) learned in grade school. Think of the cursive capital D. It doesn't have a ligature-thingy to connect it to the next letter, does it? Most cursive letters do, but the capital D doesn't.

Now imagine that you were using your computer, and it insisted on connecting a capital D with any following letters with that cursive ligature-thing. Wouldn't you like some way to tell your computer "Do not join these two characters with a line?"

Here is an explanation and an image of what it looks like in practice
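And if you want to see that "invisible" does not mean "not there", here is a small Python sketch. The Persian word is my own example (the usual spelling of "I want"), not something taken from the linked explanation:

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

# The prefix "mi" followed by the verb stem, written without and with ZWNJ.
prefix = "\u0645\u06cc"
stem = "\u062e\u0648\u0627\u0647\u0645"

joined = prefix + stem             # the letters connect across the boundary
separated = prefix + ZWNJ + stem   # the boundary stays visually open

print(unicodedata.name(ZWNJ))        # it is a real, named code point
print(unicodedata.category(ZWNJ))    # "Cf": a format character
print(len(joined), len(separated))   # the second string is one code point longer
print(joined == separated)           # so the two spellings are different strings
```

The character takes no horizontal space itself, but it changes which letter shapes the renderer picks on either side of it.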

posted by BrunoLatourFanclub at 12:11 AM on May 27, 2015 [9 favorites]


boo_radley: Here's a fun thing in C#

It's also a good example of the perils of doing things like that, since when I first skimmed the code I expected σ would be a wrapper around Select() the way Σ is a wrapper around Sum(). Of course if they were really committed to Unicode they could rename the language C♯.

Until then I'm going to keep calling it see-hash.
posted by traveler_ at 1:03 AM on May 27, 2015 [3 favorites]


right to left to right

For those of you who didn't already know, the technical term for this kind of writing in alternating directions is "boustrophedon", a Greek word that means roughly "in the manner of a turning ox". (An ox plows a field in one direction, turns, plows in the other direction...and so on.) I'd say I found that kind of delightful if I didn't want to sound like someone wearing a tweed jacket with leather elbow patches.
posted by Mr. Bad Example at 5:23 AM on May 27, 2015 [3 favorites]


  Until then I'm going to keep calling it see-hash.

Coctothorpe has a certain ring to it.

Hat tip to ead on how “wah wah wah Unicode is hard and stupid” is just for dabblers. Developing and maintaining multilingual collation systems for several European languages (f'rinstance, Welsh sorts a bit like this …) before Unicode made me the man I am today wibble.
posted by scruss at 5:25 AM on May 27, 2015 [1 favorite]


The idea of retrofitting all my regexes to cover the jillion characters of Unicode gives me sweats.
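To give a taste of what the retrofit looks like, here is a small sketch with Python's standard `re` module (other regex engines behave differently, so treat this as one data point, not a rule):

```python
import re

text = "na\u00efve caf\u00e9"  # "naïve café" with precomposed letters

ascii_only = re.compile(r"[A-Za-z]+")       # the classic ASCII-era pattern
unicode_aware = re.compile(r"\w+")          # \w is Unicode-aware for str patterns
forced_ascii = re.compile(r"\w+", re.ASCII)  # opt back in to ASCII-only \w

print(ascii_only.findall(text))     # splits the words at every accented letter
print(unicode_aware.findall(text))  # keeps the words whole
print(forced_ascii.findall(text))   # same breakage as the explicit ASCII class
```

Even the Unicode-aware version is not the end of the story: text that spells the accents with separate combining marks can still be split, which is one reason normalizing first is a common habit.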

And the prospect of bad people using unicode characters to obscure XSS attacks in order to bypass filters actually worries me.
posted by wenestvedt at 6:21 AM on May 27, 2015 [1 favorite]


Any universal encoding scheme is going to be complex as a result of the underlying systems it wants to contain.

Yeah. And that's how it's supposed to be. See, Genesis 11:5-9.
posted by The Bellman at 7:38 AM on May 27, 2015


wenestvedt -- the number of programmers who understand regex and unicode is a small enough set it might as well be empty. I do find unicode vulnerabilities to be always very interesting, however.
posted by k5.user at 8:00 AM on May 27, 2015 [1 favorite]


👍 because this led me to re-read The Oral History of the Poop Emoji, which led me to Poop-Boy, which led me to Soft-Serve Ice Cream Boy, which led me to discover that this "recently-discovered" resemblance (💩➡️🍦) was used as the basis for the plot of a 30-year-old manga.
posted by designbot at 8:06 AM on May 27, 2015


I can understand why you might want separate Unicode entries for characters and diacritics; I have no idea why there isn't a standard autocorrect option to convert them into the correct merged character.

I certainly feel your pain, but these days, this is mostly an implementation detail, as Unicode does provide guidelines for determining character equivalence. The guidelines are complex, and do not necessarily lend themselves to performant implementations, so it's pretty common to still see systems that struggle with this.

If you're building a search engine, it makes the most sense to normalize unicode strings when you're creating your index, which allows you to look up and compare strings by their byte values (which is much faster than running the equivalence algorithm on every single string). As long as you can guarantee that all of your input has been normalized, these "fast" string comparisons are safe.
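A minimal sketch of that in Python, assuming NFC as the chosen form and a plain dict standing in for the index (`index_key`, `build_index` and `lookup` are names invented for this example):

```python
import unicodedata


def index_key(text: str) -> str:
    """Normalize once, so later comparisons can be plain equality checks."""
    return unicodedata.normalize("NFC", text)


def build_index(documents):
    """Map normalized titles to document ids."""
    return {index_key(title): doc_id for doc_id, title in documents}


def lookup(index, query):
    """Normalize the query the same way before the (fast) exact lookup."""
    return index.get(index_key(query))


composed = "caf\u00e9"     # é as one precomposed code point
decomposed = "cafe\u0301"  # e followed by COMBINING ACUTE ACCENT

index = build_index([(1, composed)])

print(composed == decomposed)     # the raw strings differ
print(lookup(index, decomposed))  # but the normalized lookup still finds doc 1
```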

However, things start to fall apart if you aren't careful, or are interchanging data between two different systems. If a Mac "autocorrects" a unicode filename on a Windows file-share, the Windows system is going to have a very difficult time communicating with the Mac, as the normalization typically isn't reversible. Mac OS X has historically been known for making this kind of destructive change on Windows file-shares and Git repositories. It's extremely unpleasant to deal with.

tl;dr: Normalization is powerful, but dangerous. "Autocorrecting" characters while you type or input your own data can ensure some consistency, and lets you make some performance optimizations. If you're searching or comparing data, you almost definitely want to (internally) normalize the data first. Normalizing other people's data is almost definitely a bad idea if you ever need to give that data back to them, but bad people sometimes still do this anyway.
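One way to have it both ways, sketched in Python: keep the string exactly as you received it, and store the normalized form next to it purely as a comparison key. `StoredName` and `store` are hypothetical names, and NFC is just the form assumed here.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredName:
    original: str  # exactly what the other system gave us; hand this back
    key: str       # normalized form, used only for searching and comparing


def store(name: str) -> StoredName:
    return StoredName(original=name, key=unicodedata.normalize("NFC", name))


angstrom = "\u212b"  # ANGSTROM SIGN, which NFC replaces with U+00C5
record = store(angstrom)

# Converting the key to another normalization form does not recover the
# original code point, which is why the untouched copy is kept.
round_trip = unicodedata.normalize("NFD", record.key)

print(ascii(record.original), ascii(record.key), ascii(round_trip))
```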
posted by schmod at 9:34 AM on May 27, 2015 [3 favorites]


the number of programmers who understand regex and unicode is a small enough set it might as well be empty.

lolololol

We get around the block by the simple expedient of not being programmers.
posted by BrunoLatourFanclub at 2:09 PM on May 27, 2015


What's really insane is Perl seems to have the most support for Unicode and handling it correctly is still completely and utterly impossible.
posted by alex_skazat at 2:36 PM on May 27, 2015 [2 favorites]


Which makes little sense to some monolingual from the Anglosphere, but there are many words in e.g. Persian that you simply can't spell correctly without the zero-width non-joiner. So I'm here to take umbrage over your use of the word "only."

Well, you can't spell most non-native words in my mother tongue either without ZWNJ. :)

Speaking of the poop emoji in Unicode, it seems as if a lot of languages are incomplete if they're not from richer countries.

Oh boy, where do we start.

So it's complete nonsense to say that Unicode treats Indic characters the same as 'Asian' characters; Unicode's Indic character set is entirely based on the original Indian Standard Code for Information Interchange set, except for two critical differences:

1) ISCII has its own version of ZWJ's and ZWNJ's, and

2) (this is crucial) ISCII was originally meant to be a transliteration standard _between_ Indic languages, as opposed to a separate standard for all 19 (or so) of India's officially recognized scripts (there are fewer officially recognized scripts than languages, for those keeping score). Consequently, the original set essentially captured the character set for the script used for Hindi, Devnaagri, and sort of 'extended' it for all other Indic scripts. Think of it as doing the character set for Roman script first, and then 'extending' it to, say, Cyrillic. Essentially, ISCII didn't have separate character-encodings for each language; it had one character code for a proto-character, but changed the glyph depending on the language selected. Unicode obviously doesn't have that.

But here's the other bit about Unicode that you must understand. It's encoding for characters, and not glyphs. In English, there's usually no difference between a character and a glyph, but in most other languages there is; specifically in the author's mother tongue, Bengali, Unicode captures the entire alphabet, but not how you can 'mix' consonants and vowels together. That's really dependent on the rendering engine and the font.

So saying things like Bengalis were forced to make similar orthographic contortions just to write a simple email: ত + ্ +[zwnj] = ‍ৎ is quite a bit ignorant because that's precisely how it was set in the Indian-government-designed ISCII character set as well; it's not a political conspiracy to put down the brown man, it's essentially by design.
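To make the character-versus-glyph point concrete, here is a small Python sketch that lists the code points involved. It only inspects the standard character database; `describe` is a made-up helper, and how any of these get drawn is still up to the font and rendering engine:

```python
import unicodedata


def describe(text: str) -> list:
    """List each code point in a string with its official Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


typed = "\u09a4\u09cd"  # the consonant and the vowel-killing sign from the example
single = "\u09ce"       # the dedicated code point for the same letter form

for line in describe(typed):
    print(line)
for line in describe(single):
    print(line)
```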

The real problem here isn't the standard, but the implementation; for instance, it took Apple 10 years after Microsoft to incorporate Indian languages other than Hindi into OS X. In fact, it still doesn't have keyboards for any Indic script other than Devnaagri; which is the situation with WindowsPhone as well, despite the fact that the old Win CE supported all Indic scripts. Likewise, the problem here isn't that you'd have to type three characters to render the glyph ৎ, but that there's no physical key where you can "type" the ZWNJ character, like you can on the virtual keyboard in my Android phone.
posted by the cydonian at 6:45 PM on May 27, 2015 [9 favorites]


One of the awesomest projects I ever worked on was building Noto - a family of fonts to display all the languages encoded in Unicode. The idea behind the font is to have visual harmony between the scripts.
posted by thaths at 8:39 PM on May 27, 2015 [2 favorites]


Here's something I'd like to see in unicode alot.
posted by snuffleupagus at 11:32 PM on May 27, 2015 [1 favorite]


So there's an iOS/OS X unicode bug in the news right now.

New iPhone bug reportedly crashes users' phones with a text

iPhone Arabic Text Bug Can Flatline Apple Macs Too:
In the case of the Apple bug, a specific sequence of unicode glyphs aren’t understood by either iOS or Mac OS X. When the phone or computer can’t decide what to do, it caves and turns off.
posted by snuffleupagus at 11:39 PM on May 27, 2015 [2 favorites]


That's a really weird way to recover from an error.
posted by Chocolate Pickle at 2:40 PM on May 28, 2015 [4 favorites]




By the way, this thread was super timely for me because I've spent the last two weeks implementing the Unicode Bidirectional Algorithm which is even more complicated than you would think. Today I'm finally able to see the light at the end of the tunnel. This is what my life looks like right now. Where did I go wrong?
posted by mbrubeck at 7:48 PM on June 8, 2015 [1 favorite]


Where did I go wrong?

C++. Re-implement in Racket or Pharo.
posted by Slap*Happy at 9:41 PM on June 8, 2015 [1 favorite]


It's in Rust! Which looks a lot like C++ on the surface, but is much nicer to work with. Thankfully.
posted by mbrubeck at 10:11 PM on June 8, 2015 [1 favorite]



