Zalgo-text would be kinder
October 23, 2015 11:34 AM   Subscribe

mimic - [ab]using Unicode to create tragedy. Replace a semicolon (;) with a Greek question mark (;) in your friend's C# code and watch them pull their hair out over the syntax error.

For more homoglyph lookup fun
Cloudflare deals with homoglyph URL attacks.
Spotify gets bitten as well in user account names.
posted by CrystalDave (83 comments total) 35 users marked this as a favorite
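
A minimal sketch, in Python, of the kind of homoglyph scan that would catch this prank; the character table below is a tiny illustrative sample, not mimic's actual list.

# Flag characters that render like ASCII punctuation but have different code points.
SUSPECTS = {
    "\u037e": "GREEK QUESTION MARK (looks like ';')",
    "\u00a0": "NO-BREAK SPACE (looks like ' ')",
    "\u2212": "MINUS SIGN (looks like '-')",
    "\u201c": "LEFT DOUBLE QUOTATION MARK (looks like '\"')",
}

def scan(source):
    for lineno, line in enumerate(source.splitlines(), 1):
        for col, ch in enumerate(line, 1):
            if ch in SUSPECTS:
                print(f"line {lineno}, col {col}: {SUSPECTS[ch]}")

scan("int x = 1\u037e\n")   # -> line 1, col 10: GREEK QUESTION MARK (looks like ';')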
 
News you can use from dickmoves.com.
posted by The Tensor at 11:37 AM on October 23, 2015 [7 favorites]


Mean-spirited.
posted by Melismata at 11:38 AM on October 23, 2015


I thought curly quotes were awful (looking at you, WordPress), but this is truly mean.
posted by a lungful of dragon at 11:42 AM on October 23, 2015 [2 favorites]


Bastard. I have enough trouble getting my valid code to ever do anything.

Well, anything useful.
posted by rokusan at 11:48 AM on October 23, 2015 [1 favorite]


Hello Satan? Did you get the memo about Unicode we sent down?
posted by Twain Device at 11:52 AM on October 23, 2015 [4 favorites]


my very first thought was that there's probably something for vim to detect this sort of textual terrorism, and, yep, there it is, right in the readme:
vim-troll-stopper: vim plugin that alerts you by highlighting "troll" Unicode characters in red.
posted by You Can't Tip a Buick at 11:52 AM on October 23, 2015 [41 favorites]


Also try: remapping your C programmer friend's VT-220 to Danish (which you can do remotely and invisibly by embedding the right control codes in a file you get them to display). If they're a Vi user, they'll have a lot of macros in muscle memory, and those macros will go wildly wrong, because all those wrong characters will still do something. They just won't have any idea what.

I wasn't Mr Popular that day.
posted by Devonian at 11:53 AM on October 23, 2015 [25 favorites]


Joke's on you - I hacked your vim-troll-stopper so it doesn't alert you. bwahahaha..
posted by symbioid at 11:54 AM on October 23, 2015 [1 favorite]


In C code that would be an annoying half hour; in C++, a couple of days. Now, just a single one, implanted in just the right spot in properly obfuscated and obscurely refactored Lisp: woohoo, a couple of generations of dev staff sent to the loony bin before it was found!
posted by sammyo at 11:55 AM on October 23, 2015 [11 favorites]


Didn't someone use something like this to create a fake quonsar account back in the day?
posted by Navelgazer at 11:56 AM on October 23, 2015


Because what developers need are more bugs.
posted by swift at 12:08 PM on October 23, 2015 [2 favorites]


Thanks, Obama!
posted by amtho at 12:09 PM on October 23, 2015 [2 favorites]


What a fascinating way to get murdered.
posted by Artw at 12:11 PM on October 23, 2015 [64 favorites]


I work around this by printing all my code to PDF and OCR'ing it before compilation. If it looks like a semicolon then it's a semicolon, I say.
posted by phooky at 12:15 PM on October 23, 2015 [22 favorites]


they should have sent a poet
posted by griphus at 12:16 PM on October 23, 2015 [20 favorites]


And people wonder why I don't use a program editor that supports unicode.
posted by Bringer Tom at 12:16 PM on October 23, 2015 [2 favorites]


I do something like this. Unicode NO-BREAK SPACE U+00A0 is just the high bit set on an ordinary space U+0020. Most Unicode-capable editors will accept UTF8 and display this character without highlighting, as do most in-browser debuggers.

So I use the high bit on all non-indentation spaces to encode a watermark in Javascript code that I distribute to students. This approach is robust against re-indentation of code as well as global search-and-replace of symbols, plus a stop-bit parity-bit encoding with redundancy makes it resistant to editing.

Helps me track who gives code to whom. I call it a whitemark.
posted by rlk at 12:21 PM on October 23, 2015 [71 favorites]
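
A minimal sketch of the kind of whitemark rlk describes, in Python, assuming one payload bit per non-indentation space (ordinary space U+0020 = 0, NO-BREAK SPACE U+00A0 = 1); the parity/redundancy layer rlk mentions is left out, and the js_source and address in the usage comment are hypothetical.

NBSP = "\u00a0"

def embed(code, payload: bytes):
    # Swap ordinary spaces for no-break spaces to spell out the payload bits,
    # leaving indentation untouched so re-indenting doesn't disturb the mark.
    bitstream = iter("".join(f"{byte:08b}" for byte in payload))
    out = []
    for line in code.splitlines(keepends=True):
        body = line.lstrip(" ")
        indent = line[:len(line) - len(body)]
        chars = []
        for ch in body:
            if ch == " ":
                bit = next(bitstream, None)
                chars.append(NBSP if bit == "1" else " ")
            else:
                chars.append(ch)
        out.append(indent + "".join(chars))
    return "".join(out)

def extract(code):
    # Read the bits back out of the non-indentation spaces.
    bits = []
    for line in code.splitlines(keepends=True):
        for ch in line.lstrip(" "):
            if ch in (" ", NBSP):
                bits.append("1" if ch == NBSP else "0")
    s = "".join(bits)
    raw = bytes(int(s[i:i + 8], 2) for i in range(0, len(s) - 7, 8))
    return raw.rstrip(b"\x00")   # trailing unused spaces decode as zero bytes

# marked = embed(js_source, b"student@example.edu"); extract(marked) recovers it.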


The next level is to create trolly YACC scripts and sub the custom GCC into the buildchain.
posted by bonehead at 12:25 PM on October 23, 2015 [1 favorite]


So I use the high bit on all non-indentation spaces to encode a watermark in Javascript code that I distribute to students.

That's fascinating - do you literally encode information in the sequence of no break spaces or do you treat it like a fingerprint if they exist at all?
posted by Kikujiro's Summer at 12:27 PM on October 23, 2015


I've managed to do this to myself more than once, probably by hitting the compose key.
posted by joeyh at 12:31 PM on October 23, 2015


What a fascinating way to get murdered.

If Unicode had existed when I was in my early twenties and at the IT department of [redacted] working with other twenty-somethings, this sort of shenanigans would have happened, probably leading to a notorious, near-blows incident.
posted by D.C. at 12:38 PM on October 23, 2015


Helps me track who gives code to whom. I call it a whitemark.
rlk

That reminds me of how cartographers will intentionally put errors or false features into their maps to catch copycats and infringers.
posted by Sangermaine at 12:44 PM on October 23, 2015 [6 favorites]


That's fascinating - do you literally encode information in the sequence of no break spaces or do you treat it like a fingerprint if they exist at all?

I generally encode the email address of the person that I send it to.
posted by rlk at 12:44 PM on October 23, 2015 [12 favorites]


Thanks, Obama 0bama.
posted by rokusan at 12:57 PM on October 23, 2015 [20 favorites]


I've had this happen with unicode whitespace on some code I was working on. It was some copy/paste error from a word doc or something. That was a bad day.
posted by The Power Nap at 1:04 PM on October 23, 2015 [3 favorites]


rlk: I was going to ask if you've thought of a secret backup detection plan to detect students who read metafilter and use vim, but then I realized all two of them are probably doing pretty well in your class.
posted by You Can't Tip a Buick at 1:06 PM on October 23, 2015 [20 favorites]


rlk, please tell me you have a python script or something for encoding a string and implanting/extracting it in a given block of text in the way you describe.

Also-please to forward.
posted by rokusan at 1:14 PM on October 23, 2015


I've been screwed like this by dashes pasted out of a MS Word document, which are a different char code than what you get typing a "-" into a text file, and by the abovementioned non-breaking space, which won't even regex as a whitespace character in some languages, along with its archaic and seldom seen relatives like the form-feed character.
posted by Vulgar Euphemism at 1:17 PM on October 23, 2015 [2 favorites]
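
For what it's worth, a quick Python illustration of that no-break-space behaviour: the default Unicode-aware \s matches U+00A0, but an ASCII-only \s (as in some other languages' regex engines) sails right past it.

import re

s = "foo\u00a0bar"                            # no-break space between the words
print(re.split(r"\s+", s))                    # ['foo', 'bar']   (Unicode-aware \s)
print(re.split(r"\s+", s, flags=re.ASCII))    # ['foo\xa0bar']   (ASCII-only \s misses it)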


Lovely. I've used a similar trick to get bold and italics in Twitter (for which a friend called me an "obnoxious superhuman", which pleased me greatly) by using Mathematical bold and italic variants.

rlk's ruse is a sneaky version of Damian Conway's Perl module Acme::Bleach. It converts your code to spaces and tabs, yet it still runs. It's up there with requiring your code's config files to end with '~' as the last line just to annoy vi users.
posted by scruss at 1:23 PM on October 23, 2015 [8 favorites]
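
A rough sketch of the bold-on-Twitter trick scruss mentions, assuming the Mathematical Bold block (U+1D400 for capital A, U+1D41A for small a); these characters survive in plain-text contexts that strip real formatting.

def boldify(text):
    # Shift A-Z and a-z into the Mathematical Bold alphabet; leave everything else alone.
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(0x1D400 + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(0x1D41A + ord(ch) - ord("a")))
        else:
            out.append(ch)
    return "".join(out)

print(boldify("obnoxious superhuman"))   # renders as bold in most Unicode-aware clients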


If your code editor displays the article text as
semicolon (;) with a greek question mark (☒)
then you may have already been bitten by this in another form and adjusted your workflow accordingly. ;)
posted by introp at 1:27 PM on October 23, 2015 [4 favorites]


rlk, please tell me you have a python script or something for encoding a string and implanting/extracting it in a given block of text in the way you describe.

La.
posted by rlk at 1:28 PM on October 23, 2015 [30 favorites]


Man, and I thought changing their keyboard layout to DVORAK was evil.
posted by Ghostride The Whip at 1:31 PM on October 23, 2015 [1 favorite]


*cough* [:space:] *cough*
posted by ethansr at 1:40 PM on October 23, 2015


You know that feeling you get in the pit of your stomach when you go over a dip in the road at speed (a guy I know used to call it a 'sweet belly')?

I got that feeling when I read this. *shudder*
posted by Mooski at 1:53 PM on October 23, 2015 [4 favorites]


The first developer I worked with never ever copy-pasted anything. He had his list of reasons. Another reason to add to his list.
posted by clawsoon at 1:58 PM on October 23, 2015 [1 favorite]


Anglo privilege:
[^ -~]
posted by [expletive deleted] at 2:14 PM on October 23, 2015 [1 favorite]


I did not know there were so many different kinds of question marks.
posted by Ratio at 2:22 PM on October 23, 2015


La.

issynced doesn't seem to be used anywhere?
posted by kenko at 2:24 PM on October 23, 2015




Full width is fun for when you want to sound like everything is fine but suddALL HAIL THE GLOW CLOUD
posted by DoctorFedora at 3:09 PM on October 23, 2015 [16 favorites]


Does it mean I'm officially old, if this sounds so annoying and mean that it doesn't really even amuse me? I have no relationship to coding, but the thought of something that someone's been putting work into being fucked with in such an irritating way is making me weirdly pissed off.
posted by threeants at 3:17 PM on October 23, 2015 [2 favorites]


The fault, dear Mefites, lies not in the Unicode but in yourselves. Feel your privilege. Because if in 2015 you, your code or your program editor do not support Unicode and you do not feel a pinch, you are living in a very privileged bubble indeed.
posted by Autumn Leaf at 3:22 PM on October 23, 2015


This programming language which requires Unicode characters, please show it to me. Even Lua is 7-bit.

You do, of course, need Unicode resources to build user interfaces but it's been a dumb idea to put those in code for at least 10 years. You do need a Unicode capable editor but it is absolutely the wrong tool for editing code for exactly the reason we are seeing here.

Unicode was designed from the ground up to render a wide range of symbols which would look right to a diverse array of human eyes. Computer programs are written for computers, which want simplicity and consistency. A program which cannot survive being pasted into Notepad is not a program you want to be maintaining.
posted by Bringer Tom at 3:39 PM on October 23, 2015 [4 favorites]


The classic dirty trick to pull on a new unix user is to find their terminal logged in and execute this command:

touch \*

That's all. Just walk away, and listen for the screams.

I learned about that (the hard way) in the late 1970's. (Gad, I'm old.)
posted by Chocolate Pickle at 3:59 PM on October 23, 2015 [3 favorites]


Reminds me of Things to commit just before leaving your job. The "#define volatile" is just plain mean.
posted by changoperezoso at 4:04 PM on October 23, 2015 [3 favorites]


Bringer Tom: A program which cannot survive being pasted into Notepad is not a program you want to be maintaining.

Is this the same Notepad that, for a while, saved everything as UTF-16 by default? The same one that insists on polluting files with a BOM at the beginning?

One of the nice things about using Ruby as my preferred programming language is that, being invented in Japan, it has treated ASCII and Shift-JIS as full equals from the beginning, so adding Unicode support was relatively easy (emphasize “relatively”), and also that having a singular One True Encoding for the source code was never a consideration. Japanese people want to be able to write comments in Japanese, who knew?

We've reached the point where I can type ≠ just as easily as I can type !=, and have a pretty reasonable trust that people will be able to read that. And yet we have to use programming languages tied to an old telegraphy-based standard‽ Wouldn't it be lovely to be able to “nest “quoted” sections because “quotation marks” were grouping characters” just (like (parentheses) are) instead of "typewriter style"? Wouldn't it be nice to be able to call a bit of code verrückt if you thought it was?
posted by traveler_ at 4:19 PM on October 23, 2015 [6 favorites]
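
A small Python sketch of the BOM business traveler_ is complaining about: peek at the first bytes of a file to tell UTF-8-with-BOM and UTF-16 apart from plain ASCII/UTF-8 (the file name in the usage comment is hypothetical).

import codecs

def sniff_encoding(path):
    # Look for the byte-order marks that Notepad and friends like to prepend.
    with open(path, "rb") as f:
        head = f.read(4)
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"    # UTF-8 with a BOM; this codec strips it on read
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"       # the utf-16 codec uses the BOM to pick endianness
    return "utf-8"            # otherwise assume plain ASCII/UTF-8

# text = open("config.txt", encoding=sniff_encoding("config.txt")).read()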


AutoCAD allows users to tag every single object drawn with arbitrary text. This can be configured to be done automatically and the arbitrary text can be set programmatically. So if you control the installation you can set it to tag the objects with say a student's windows username.

And then if you are clever you can write a little script that can examine a drawing and display
  • the username of everyone who worked on a file
  • how many of the drawing objects and what percentage of the drawing each person contributed
  • how much time each person spent
  • move each person's contribution to its own layer and then colour-code the layers
All of which makes it really easy to tell when one student has "helped" another student to complete an assignment.
posted by Mitheral at 4:24 PM on October 23, 2015 [5 favorites]


This programming language which requires Unicode characters, please show it to me.

Java.
posted by stebulus at 4:25 PM on October 23, 2015


Java doesn't require Unicode... it uses Unicode as its native string encoding, but you can certainly write perfectly cromulent Java source that's pure ASCII.
posted by axiom at 4:33 PM on October 23, 2015


A program which cannot survive being pasted into Notepad is not a program you want to be maintaining.

Notepad is not a program you want to be pasting into, as noted above. Besides, being able to write math routines with e.g. λ and δ instead of lambda and delta is one of the finer joys in life. I don't use any software that deals with text in any way if it doesn't support Unicode. You can't call your program a text editor if it doesn't edit text that isn't in English. Sorry, TextMate circa whenever.
posted by hyperbolic at 4:34 PM on October 23, 2015 [1 favorite]


Java doesn't require Unicode... it uses Unicode as its native string encoding, but you can certainly write perfectly cromulent Java source that's pure ASCII.

It sounds like you're interpreting "Java doesn't require X" to mean "there exist valid Java programs that don't require X", or perhaps "the class of Java programs that don't use X is as powerful as the whole class". Under such interpretations, Java doesn't require the character '&', or for loops, or non-static methods. I interpret "Java doesn't require X" to mean "Java tools do not have to support X to be correct", and that is false. The language spec says Java source is Unicode.

Less academically: You can of course write in whatever subset of Java pleases you, but if you're hoping to make use of others' code, your tools had better support Unicode source.
posted by stebulus at 4:52 PM on October 23, 2015 [1 favorite]


Swift, to a great approximation of literally, requires that all developers at some point write:
for character in "🐶🐮".characters {
    print("\(character): Moof!")
}
posted by ~ at 5:07 PM on October 23, 2015 [10 favorites]


Is this the same Notepad that, for a while, saved everything as UTF-16 by default? The same one that insists on polluting files with a BOM at the beginning?

I don't know where you get your copies of Windows but none of the dozens of computers I use does this. Of course I haven't done much programming work in Win 7 or 8, so maybe that's what I'm missing, in which case it will be time to go editor hunting again. My most used workstation is an XP box which was built in 2003, and I keep it because of the 16-port PCI serial card. It will be a major PITA when it finally dies.

If you are doing math that requires symbols that aren't on a standard keyboard you aren't doing programming, you're doing math, which is really a different thing. And it's very likely that you're doing things with floats you should be doing with integers. That's a different rant though.

Allowing unicode characters in string literals is not the same as requiring unicode characters. The runtime may require unicode but the compiler absolutely does not and if you put your unicode strings in resource files, which makes a lot more sense anyway so you can implement language packs, you don't need them in the source.

In fact my rants kind of collide, because the reason unicode is bad for programming is pretty much the same reason floats are bad for computer math unless your problem absolutely cannot be tackled without them; they introduce ambiguity and imprecision where those things could have been avoided. Use integer math and ASCII unless you absolutely have no alternative. And when you do use floats or unicode, know your limitations and have tools to figure out where things went south, because you'll need them.
posted by Bringer Tom at 5:14 PM on October 23, 2015 [2 favorites]


You're speaking from an incredibly privileged position, demanding that everyone in the world who uses a non-Latin alphabet had better learn to do everything in English with a Latin character set before they attempt to program.

We're finally getting close to the point where the vast majority of people on earth who need non-Latin characters in their language are about to actually be able to use words from their own written language as variable and function names and as strings without jumping through ridiculous hoops, but Bringer Tom is here pretty much demanding we go back to the pure old days of 6-bit punch cards.
posted by Jimbob at 5:49 PM on October 23, 2015 [8 favorites]


Bringer Tom: "This programming language which requires Unicode characters, please show it to me."

Welcome to the wonderful world of Perl 6. Yes, each of these Unicode operators has an ASCII substitute that, as a programmer, you're free to use. But the Perl 6 parser itself does appear to be required to handle Unicode.
posted by mhum at 5:50 PM on October 23, 2015


Bringer Tom: "If you are doing math that requires symbols that aren't on a standard keyboard you aren't doing programming, you're doing math, which is really a different thing."

Also, my friend who was doing software development in APL for an actuarial firm may beg to differ.
posted by mhum at 5:53 PM on October 23, 2015


7-bit ASCII 4 Life.
posted by benzenedream at 6:01 PM on October 23, 2015


I don't know how universally true this is, but a trick I've learned lately when cutting and pasting is to do control-shift-V instead of control-V. It kills fancy formatting, and it's my new best friend.
posted by Devonian at 6:03 PM on October 23, 2015 [14 favorites]


Besides, being able to write math routines with e.g. λ and δ instead of lambda and delta is one of the finer joys in life.

Oh yay, anything we can do to encourage subject matter experts to crap out code with even terser variable names is a great pleasure. BTW, what does lambda mean in their specialized subfield? Is it the lambda you're familiar with, or is it something slightly different? Did they use a textbook with a completely different notation convention for undergrad? Good luck with that, they left the company two years ago.
posted by indubitable at 6:30 PM on October 23, 2015 [4 favorites]


  This programming language which requires Unicode characters, please show it to me

HeartForth! A stack-based language that uses emoji for its keywords and operators.
posted by scruss at 6:59 PM on October 23, 2015 [1 favorite]


You're speaking from an incredibly privileged position, demanding that everyone in the world who uses a non-Latin alphabet had better learn to do everything in English with a Latin character set before they attempt to program.

Well I recognize that that's a problem if you grew up speaking Japanese. You do remember how the Japanese solved that one, don't you? That was the correct solution.
posted by Bringer Tom at 7:37 PM on October 23, 2015 [1 favorite]


And to elaborate, if it was the Japanese who invented programming and I had to learn their language to do it, I'd be OK with it if I only had to learn a handful of symbols; most programming languages have less than a hundred "words" and very fixed simple syntax compared to human language. I'd be especially OK with it if English required thousands of easily confused symbols importing things like emotion that don't matter to a computer.
posted by Bringer Tom at 7:42 PM on October 23, 2015 [2 favorites]


Note that it has a --reverse switch, which is probably a hell of a lot easier and gets more things right than mucking about with iconv (which I have done).

For better or worse, Unicode is upon us, and having good tools for downconverting to the closest ASCII equivalent is going to be a huge part of our migration path.
posted by straw at 7:45 PM on October 23, 2015
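
One common way to approximate that downconversion in Python, for what it's worth; it keeps whatever has a decomposable ASCII base and silently drops the rest, so it's lossier than a curated substitution table like mimic's.

import unicodedata

def asciify(text):
    # Decompose accented characters (e.g. "ö" -> "o" + combining diaeresis),
    # then drop anything that still has no ASCII representation.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(asciify("Björk Guðmundsdóttir"))   # 'Bjork Gumundsdottir' - the ð is simply lost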


Oh and one more: One of my platforms happens to be a complex of embedded systems distributed by a company that is now international, the US manufacturer having been bought by a multinational about 10 years ago. They have operations in every industrial country on Earth, and they are moving from a pidgin QBasic/Visual Basic hybrid to Lua in their new stuff. And I recently got a doc dump of their programming guidelines; after guideline #2 ("use camelCase"), guideline #3 is "use English variable names." French is specifically offered as an example of How Not To Do It. FWIW.
posted by Bringer Tom at 7:48 PM on October 23, 2015 [1 favorite]


And to elaborate, if it was the Japanese who invented programming and I had to learn their language to do it, I'd be OK with it if I only had to learn a handful of symbols; most programming languages have less than a hundred "words" and very fixed simple syntax compared to human language.

Oh man, this is making me feel all nostalgic for the periodic whining on ruby-lang mailing lists about how there's core development discussion that happens in Japanese and isn't always translated to English.

Wait, is nostalgic the word I'm looking for here? Hmm.
posted by spaceman_spiff at 7:56 PM on October 23, 2015 [4 favorites]


Didn't someone use something like this to create a fake quonsar account back in the day?

Sort of
posted by TedW at 8:02 PM on October 23, 2015


If you want people to use a single syntax/set of keywords, something based on Latin characters, at least, is probably the only sensible default. Not only is it the character set used by all non-trivial programming languages so far (and languages are not created in a vacuum), and the character set used by the inventors of computer programming, it's also the most widely used by far: some 36% of the world's population uses Latin characters. The runner-up is Chinese, but that's only 18% of the world's population, and the vast majority of those people are concentrated in a single country.

What language the keywords are based on and so on, however, is a bit trickier. Of the languages that use Latin characters, English and Spanish are the biggest, and about the same size by number of native speakers, but English is far bigger by number of total speakers as a first or second language, around 850 million people in total. And the set of keywords you need to learn is typically quite small (maybe a hundred at the very most), and doesn't even depend on knowing the general meaning of those words with much nuance.

Finally, names of variables, functions, objects, and whatnot. Use whatever you want there. If your code is likely to be distributed somehow, read by people whose native language is different from yours, or you just want it to be more "standard", English is still the lingua franca, and that's unlikely to change, but it's no big deal.

I've seen a lot of quite broken English in identifiers and comments from people who are clearly not very proficient in English, but very good at programming, so I don't think programming languages generally being based on English is a very high barrier to entry in most cases. I'm not a native English speaker, but I learned to program in English using English resources and documentation, so I tend to think in English when I program. I would assume this is the case for a lot of people.
posted by Joakim Ziegler at 8:29 PM on October 23, 2015 [1 favorite]


Behold privilege at work. I rest my case.

If the Japanese had invented programming you would still only have to learn a handful of symbols, because back then every byte counted. If the Japanese had invented programming they would have faced the same early memory limitations. In fact the earliest computers did not use human-like languages at all: they were programmed in binary by setting switches. Although Assembler is conventionally rendered in English-like forms ("mov edx,len") this has no bearing on the underlying machine code. You could easily write a compiler to convert "foo zztop,bar" into the same instruction.

Japanese or Chinese might even have been better starting points than English for programming, because even using two bytes per character, by choosing your words carefully you might only need 2 to 4 bytes for a keyword (example: PRINT - 刷る).

This thread is about Unicode (specifically the mayhem created when Unicode characters appear in unexpected places), but the privilege extends much further. Programming languages are still in daily use that barf when asked to handle raw dates expressed as anything but "mm/dd/yy", choke when "," is the decimal marker, and insist that keywords must be selected from a specific subset of English. It's all translated (compiled) into machine code - so why do we still have programming languages that cannot gracefully handle regionalisation?

Regarding mayhem, I created a new project in C# and pasted an ερωτηματικό into the code window, overwriting the semicolon in "InitializeComponent();". It immediately red-underlined the ερωτηματικό, and the description of the error was "unexpected character ';'". Under normal circumstances I might not know why ";" was unexpected there, but my first instinct would be to simply delete the offending character and retype it as ";", which indeed corrected the error. So mayhem may not be quite the killer you think it is.
posted by Autumn Leaf at 9:17 PM on October 23, 2015 [1 favorite]
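
The same experiment is easy to reproduce outside C#; here is a rough Python equivalent, where newer parsers at least name the offending code point (the trailing character below is U+037E, the Greek question mark).

# Compile a one-liner whose trailing "semicolon" is really U+037E.
source = "x = 1" + "\u037e"
try:
    compile(source, "<pasted>", "exec")
except SyntaxError as err:
    print(err)   # recent Pythons: "invalid character ';' (U+037E)"; older ones are vaguer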


but the privilege extends much further. Programming languages are still in daily use that barf when asked to handle raw dates expressed as anything but "mm/dd/yy"

This is not a programming language problem, it is an environment and library problem.
posted by Bringer Tom at 9:25 PM on October 23, 2015 [1 favorite]


And of course dates are their own hell. I only figured out today that the Jira API was giving me bad dates in October because parts of the system are in the current Israeli time zone and parts of the system are in the pre-July 2013 Israeli time zone. Or something like that. Suffice it to say that the time zone in Israel is still +0300 but Jira tells me it's in +0200, and that it's bedeviled me for years.
posted by wotsac at 9:34 PM on October 23, 2015 [1 favorite]


AAAAARGH NOT DATETIME ISSUES NOOOOO!
posted by Artw at 10:10 PM on October 23, 2015 [1 favorite]


Bringer Tom, I think your own frustrations with your own particular programming experiences have led you into an idiosyncratic definition of "requiring unicode characters"—one that draws a bright line between unicode in string literals and… where else, exactly? Which level of unicode-iness are we talking about here? Non-ascii characters in mandatory keywords/operators? In optional keywords/operators? In parameter/variable names? In literals? In comments?

Because sometimes Björk Guðmundsdóttir would like her @author field to properly spell her name in the Javadoc. Sometimes I want to be able to throw together a quick script scanning through a bunch of files and be able to use characters like « and » in my regex literals. No separate "language pack" for a one-file script.

"You do remember how the Japanese solved that one, don't you?"

Yes, as I mentioned, they wrote a language that would equally accept Shift-JIS encoding as well as ASCII, and wrote comments in Japanese. If you're saying the "correct solution" is "everybody just use English for everything"—uffda, I'll counter with "everybody write their comments in Lojban". Or at least Esperanto, which also requires Unicode!

"BTW, what does lambda mean in their specialized subfield?"

I for one would consider this code more readable:
...
#  λ  measured wavelength, meters
#  L  slit width, meters
#  R  pattern distance, meters
...
return L / (λ * R)
Versus this “self-documenting best practices” code:
...
return slitWidthInMeters / (measuredWavelengthInMeters * patternDistanceInMeters)
The moment a calculation goes beyond one or two operators the value of CamelCaseFactoryFactoryObserverMethodDelegate names becomes negative. What is λ? Well, what is int counter? Nothing about writing understandable code has changed by introducing a larger alphabet for lexemes.

"If you are doing math that requires symbols that aren't on a standard keyboard you aren't doing programming, you're doing math, which is really a different thing."

One of the earliest programming languages was literally a FORmula TRANslator. If anything, programming that's not math in at least the coder's head is what's not doing programming. (End Dijkstra mode.)

"And when you do use floats or unicode, know your limitations and have tools to figure out where things went south, because you'll need them."

I actually agree, although when I need them it'll be because somebody’s been creating æ–‡å—化け because they used a non-Unicode-aware editor. We’ve lived in a world with Unicode for quite some time now. Most editors support it (and mine even highlights NBSP characters like these with a little purple underline so rlk’s neat trick would be really obvious in it). Making ASCII-only assumptions is almost universally unacceptable. These days I’m dropping occasional non-ASCII (and even non-BMP) characters to act as fail-fast canaries because if they do end up causing problems, somebody's probably broken something else too. And “I thought I could get away with ‘just use ASCII’ in this type of file” is often what broke it.

(The version of Notepad that was saving everything as UTF-16 was on my boss's NT 200? machine, some years back, which I discovered when he started using it to write input files for my ASCII-assuming program. This was in the early days of Unicode-ification when Microsoft (and Sun) assumed 16-bit was where everything was going and started migrating their systems that way. The version of Notepad that prepends a BOM by default is I think every version since then, which uses UTF-8 by default and managed to embrace-and-extend the Consortium by turning an unrecommended practice into a de facto standard.)
posted by traveler_ at 11:20 PM on October 23, 2015 [11 favorites]


Because sometimes Björk Guðmundsdóttir would like her @author field to properly spell her name in the Javadoc

She's sort of an interesting edge case though, since although she's an undeniably brilliant programmer, she's remarkably difficult to work with on free software projects, mainly because she just doesn't have as much time to devote to free software development as she thinks she does, especially when she's on tour.
posted by You Can't Tip a Buick at 7:12 AM on October 24, 2015 [1 favorite]


If anything, programming that's not math in at least the coder's head is what's not doing programming.

This is intensely frustrating in any scientific code, btw. It's suboptimal for all the reasons traveler_ mentions: harder to read, less clear to the subject matter experts who need to at least vet the code, and harder to teach to students.

I realize scientific and mathematical computing is something of an edge case, but the enforcement of ASCII on it has been a frustration of mine and many of my colleagues for decades. It would be very nice to see that change.
posted by bonehead at 7:27 AM on October 24, 2015 [1 favorite]


This programming language which requires Unicode characters, please show it to me.

Here you go.
posted by kenko at 11:06 AM on October 24, 2015


The classic dirty trick to pull on a new unix user is to find their terminal logged in and execute this command:

touch \*

Ok can anyone elucidate me as to what this actually does? I don't want to try it, obviously, and the symbols make it hard to google. I'm sort of guessing it might try to create a whole bunch of empty files, but if so I'm curious as to the naming scheme.
posted by iotic at 11:54 AM on October 24, 2015


> touch \*
>Ok can anyone elucidate me as to what this actually does?

It creates an empty file named *, which would lead someone to naively attempt to remove it with "rm *", deleting all the files within that directory.
posted by Condroidulations! at 12:07 PM on October 24, 2015 [5 favorites]


Now I'm wondering how the heck would you get rid of that file? Enclose the * in quotes?
posted by Mitheral at 12:19 PM on October 24, 2015


Aha, thanks - yeah of course. Yeah that's pretty mean :)
posted by iotic at 12:58 PM on October 24, 2015


Same way you made it, rm \*
posted by iotic at 1:00 PM on October 24, 2015 [1 favorite]


If you code in ruby, you can have fun like this with your code reviewer:
irb(main):008:0> 🐶 = [1, 2, 3]
=> [1, 2, 3]
irb(main):009:0> 🐶.each { |🐮| puts 🐮 }
1
2
3
posted by double block and bleed at 4:45 PM on October 24, 2015


Autumn Leaf: "why do we still have programming languages that cannot gracefully handle regionalisation?"

I'm not sure what you're asking here. Are you asking why legacy code still exists and has to be maintained? Because if you are, I suspect you might not have had much to do with actual practical computer engineering.

Or are you asking why newer programming languages that are still maintained, developed, and updated don't handle regionalization? Because then you're wrong, they pretty much all do, in different ways and with differing levels of gracefulness, depending on their set parameters, feature set (and the desire/requirement to not break existing code unless absolutely necessary).

I mean, yeah, date and time format handling is a pain, but it's not a pain because programming languages generally do it wrong, it's a pain because date formats in general are a pain, vary wildly, depend on a huge number of underlying assumptions about where different users might be, where the system is running, and everything from leap seconds to oddball and changing timezones, from daylight savings to that time in the 1500s when they just dropped a few days from the calendar, and so on. This is not a problem with programming languages, this is a problem with the reality they're trying to handle. It's hard. It's why pretty much all programming languages have well-developed libraries to work with this stuff, and when you decide to not use those and do it yourself, you can just assume you are wrong.

Unicode is the same way, really. It's trying to describe a hugely complicated set of data in a standardized way, with the addition of some problem domain definition problems and some engineers who think they know better than the people who actually use the languages, and so on, but it's basically the only attempt at doing what it does that has ever gotten anywhere, so it's the best we've got (and using it has become less of a pain).

I'm not sure how my comment about what character sets and languages the majority of the world's population use is "privilege at work", though. You're right, if the Japanese had invented programming, it's quite likely we'd be programming in Japanese, and that'd be fine too, although it'd quite factually inconvenience a lot more people worldwide than the actual status quo does. And assembly language is indeed very tiny and not really tied to English, although you could argue that most higher-level languages are to a much larger degree.

I'm just not sure if there's a good way to make something neutral (there have been attempts at languages that use symbols instead of keywords, but they turn out to be hard to remember for everyone, hard to talk about both in speech and in emails and forums, and generally haven't caught on), or something that can be localized as a language, which is doable, but which AFAIK no one has really done.

Localizing the keywords would probably be quite doable and uncomplicated, and you could auto-translate code from one language to another with few problems, but modern languages have huge standard libraries that would be a lot of work to localize, and where automated translation would potentially introduce conflicts and ambiguities. And that's before you get down to comment strings and documentation, which are increasingly popular to do inline as part of the source code.
posted by Joakim Ziegler at 1:15 AM on October 25, 2015 [1 favorite]


Why is Swift's String API So Hard? - basically they try to do Unicode right, and it sounds horrible to work with.
posted by Artw at 9:56 AM on November 7, 2015 [1 favorite]



