Falsehoods Programmers Believe About Plain Text
January 8, 2024 5:31 AM   Subscribe

Falsehoods Programmers Believe About Plain Text. All of these assumptions are wrong.

Top 5: Technical, Non Technical and Other - Full list = 77 items
    Non-technical
  • 1. The Latin alphabet has 26 letters.
  • 2. Ignoring case, the Latin alphabet has 26 letters.
  • 3. Ignoring case and accents, the Latin alphabet has 26 letters.
  • 4. Yes, but ignoring variants, the Latin alphabet has 26 base letters.
  • 5. Seriously, ignoring variants and combinations, the Latin alphabet has 26 base letters.
    Technical
  • 28. Characters are bytes (or ASCII + code page)
  • 29. Characters are two bytes (or UTF-16 code units).
  • 30. Characters are integers (or Unicode code points).
  • 31. Characters are the basic parts of a writing system (or graphemes).
  • 32. Characters in ::insert programming language:: are ::one of the above::
    Bonus: Regular expressions
  • 71. [a-zA-Z] will match any letter.
  • 72. [0-9] will match any numeral.
  • 73. [ \t\n\r] or \s will match any whitespace character.
  • 74. \p{L} or \p{Letter} will match any letter.
  • 75. \p{Lu} or \p{Uppercase_Letter} will match any uppercase character.
Any obvious missing ones?
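[The characters-vs-bytes items (28-31) and the regex one (71) are easy to make concrete. A quick Python 3 sketch, added here for illustration, not part of the original list:]

```python
import re

# One user-perceived character (grapheme): "e" + COMBINING ACUTE ACCENT
s = "e\u0301"  # displays as "é"

print(len(s))                      # 2 code points
print(len(s.encode("utf-8")))      # 3 bytes
print(len(s.encode("utf-16-le")))  # 4 bytes (2 UTF-16 code units)

# Falsehood 71 in action: [a-zA-Z] does not match every letter
print(re.fullmatch(r"[a-zA-Z]+", "café"))  # None
```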
posted by Faintdreams (99 comments total) 41 users marked this as a favorite
 
Really wish people building password complexity checks would recognize that “special character” does not mean “any one of @#$&*!, but ONLY those characters”
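[A hedged sketch of the difference between the two readings; these are hypothetical checks, not any particular site's code:]

```python
import re

def whitelist_check(pw: str) -> bool:
    # Falsehood: "special character" means exactly one of these six.
    return bool(re.search(r"[@#$&*!]", pw))

def any_special_check(pw: str) -> bool:
    # Any character outside ASCII letters and digits counts as special
    # (still ASCII-centric, but at least not a six-character whitelist).
    return bool(re.search(r"[^a-zA-Z0-9]", pw))

print(whitelist_check("correct-horse-battery-staple"))    # False: hyphens rejected
print(any_special_check("correct-horse-battery-staple"))  # True
```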
posted by caution live frogs at 5:52 AM on January 8 [25 favorites]


Are they being ambiguous on purpose or by accident? ie, what do they mean by plain text? Which particular variety of ASCII or Unicode are we talking about? Is the point that some programmers don't even know to ask that question?
posted by mrgoldenbrown at 5:59 AM on January 8 [10 favorites]


Really wish people building password complexity checks would recognize that “special character” does not mean “any one of @#$&*!, but ONLY those characters”

I've lost count of the times I've allowed my browser to generate a "secure password" for me, only to have the website shriek and berate me because the password contains a character it doesn't like. (Or, in many cases, because the password is too long.)
posted by Faint of Butt at 5:59 AM on January 8 [26 favorites]


Many solutions were suggested for this problem, but most of these were largely concerned with the movements of small electrical charges, which is odd, because on the whole, it wasn't the small electrical charges which were unhappy. And so the problem remained, and lots of the people were mean, and most of them were miserable, even the ones with native Unicode support. Many were increasingly of the opinion that they'd all made a big mistake moving from ASCII to Unicode in the first place, and some said that even ASCII had been a bad move, and that no-one should ever have invented writing.
posted by jedicus at 6:00 AM on January 8 [107 favorites]


The list is very clever. Not helpful. Just very clever.
posted by Cardinal Fang at 6:08 AM on January 8 [32 favorites]


As a programmer, I say "fuck this noise."
posted by Aardvark Cheeselog at 6:10 AM on January 8 [8 favorites]


Pedant fight? OK.

I think you’ll find the Latin alphabet had 23 letters, and no lowercase or other barbarian stuff. I think you’ll also find that you’re talking broadly about character sets rather than alphabets. And I think programmers are well aware of all this shit, thank you very much.
posted by Phanx at 6:11 AM on January 8 [19 favorites]


Now do calendars.
posted by CheeseDigestsAll at 6:15 AM on January 8 [18 favorites]


28. Characters are bytes (or ASCII + code page)

No, characters are 5 bits, and special characters (space, CR, LF) are symmetrical so that if you load the tape upside down formatting is preserved.

this is my favorite trivia about binary encoding of text and I'm sad that this didn't continue into ASCII and beyond
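[That symmetry is easy to check, assuming the commonly tabulated ITA2/Baudot-Murray code assignments; the numbers below are mine, not AzraelBrown's:]

```python
# ITA2 (Baudot-Murray) 5-bit codes as commonly tabulated (assumed here)
NUL, CR, SPACE, LF = 0b00000, 0b00010, 0b00100, 0b01000

def upside_down(code: int) -> int:
    """Reverse the five hole positions, i.e. read the tape flipped over."""
    return int(format(code, "05b")[::-1], 2)

# NUL and SPACE map to themselves; CR and LF swap with each other,
# so line formatting survives even though the letters become gibberish.
print(upside_down(SPACE) == SPACE)  # True
print(upside_down(CR) == LF)        # True
print(upside_down(LF) == CR)        # True
```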
posted by AzraelBrown at 6:15 AM on January 8 [26 favorites]


I think you’ll find the Latin alphabet had 23 letters, and no lowercase or other barbarian stuff.

Technically the entire Latin alphabet is barbarian stuff. It’s stolen from the Etruscans, who stole it in turn from the Greeks.
posted by jmauro at 6:15 AM on January 8 [9 favorites]


Or, in many cases, because the password is too long

Indeed. The OWASP recommendations note that some hashing algorithms work best with a limited input or have an actual input limit (e.g. 64 bytes with PBKDF2 using SHA-256; 72 bytes if you're stuck using bcrypt in a legacy system). But those limits are all much higher than most password fields I've seen, which are usually limited to ~8/12/16 characters.
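[Those byte limits also aren't character limits once Unicode is involved; a quick Python illustration:]

```python
# 72 characters, but well past bcrypt's 72-byte limit once UTF-8 encoded
pw = "pässwörd🔒" * 8

print(len(pw))                  # 72 code points
print(len(pw.encode("utf-8")))  # 112 bytes
```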

The list is very clever. Not helpful. Just very clever.

If you click "Show/Hide counter-examples and discussion." there's some helpful information.

And I think programmers are well aware of all this shit, thank you very much.

I've got bachelor's and master's degrees in computer science and have been developing software part or full-time since the late 90s, often working with natural language processing, and some of this was news to me or still good to be explicitly reminded about from time to time.
posted by jedicus at 6:15 AM on January 8 [32 favorites]


Assuming people don't automatically know things that you yourself know and being willing to explain these things even against the foolishly proud resistance of a child-like mind is among the most helpful of skills in IT, or indeed any technical field.
posted by seanmpuckett at 6:25 AM on January 8 [36 favorites]


I've got bachelor's and master's degrees in computer science and have been developing software part or full-time since the late 90s, often working with natural language processing, and some of this was news to me or still good to be explicitly reminded about from time to time.

Especially if it's a question in a job interview; it reminds me that it's time to terminate the interview, run out of the room screaming and waving my arms around, and look for a job somewhere else.
posted by Cardinal Fang at 6:26 AM on January 8 [5 favorites]


Ligatures are never used as distinct letters in an alphabet.
Digraphs are never used as distinct letters in an alphabet.
Trigraphs are never used as distinct letters in an alphabet.


I can guarantee you that none of these are assumptions that most programmers make about alphabets, as most programmers have no idea what the hell these are.

Same goes for a lot of the rest of the list. Fun though.
posted by Tell Me No Lies at 6:27 AM on January 8 [11 favorites]


I would love to see contra-examples for all of these.

[edited to add] OK, now I see the contra-example link at the bottom.
posted by adamrice at 6:31 AM on January 8


what do they mean by plain text?

To me, that's the core assumption that's being tested here. "Plain text" ain't plain and ain't always text either. Like handling dates, there's more subtlety to it than even people who think they understand it realise.
posted by bonehead at 6:32 AM on January 8 [3 favorites]


Are they being ambiguous on purpose or by accident? ie, what do they mean by plain text? Which particular variety of ASCII or Unicode are we talking about? Is the point that some programmers don't even know to ask that question?

I would love to see contra-examples for all of these.

and similar comments: see

If you click "Show/Hide counter-examples and discussion." there's some helpful information.

It was quite interesting and I learned a bit about different writing systems from around the world. Plus the idea that mathematical symbols are a grapheme writing system.
posted by eviemath at 6:35 AM on January 8 [4 favorites]


Ah, I see the edit substantively changed one of the comments I quoted.
posted by eviemath at 6:36 AM on January 8


Oh, clicking the "show counterexamples" makes the list much more interesting. I missed that on the first read.
posted by fantabulous timewaster at 6:36 AM on January 8 [4 favorites]


I can guarantee you that none of these are assumptions that most programmers make about alphabets, as most programmers have no idea what the hell these are.


It's a conditional probability thing.

Given that (you are a programmer) and (you are aware what digraphs are), you are most likely to write code that assumes ([they] are never used as distinct letters in an alphabet).

These lists are checklists of the incorrect assumptions you'll find when touching some programmer's text processing code, often your own from your past. The layers of wrongness you get are amazing, even after someone reaches the point of understanding code points and digraphs and the like.
posted by NotAYakk at 6:40 AM on January 8 [2 favorites]




I was just telling my mother the other day about the names and time lists like this so it is random that this came up now.
posted by jacquilynne at 6:43 AM on January 8


Now do calendars

Falsehoods programmers believe about time

(as some may know, "falsehoods programmers believe about X" is kind of a memetic blog post title for comments on the difficulties of interfacing with that pesky and messy Real World; I thiiiiink but am not sure the first was names, there are quite a few snowclones out there about networks, math, ... )
posted by look upon my works progress administration at 6:44 AM on January 8 [12 favorites]




The number of times I have filled out a form and seen an error like "please enter your name using letters only" and thought 'ugh, do they not know about hyphens, fine, I'll try without it' and just gotten the same error again, thinking 'maybe it's the space in the surname field?' only to find out that actually the website is a racist and will quite happily accept the hyphen as a letter(???) as long as I don't use the vowels actually in my fucking name.
posted by Dysk at 6:46 AM on January 8 [10 favorites]


78. Emoji are not text.
79. Every writing system is in unicode.
80. All communications have a written form.
81. Markup is not part of text.

Look, we can go on, but essentially text is limitation. When we used to talk about "plain text" we meant ASCII, and we understood that it was limiting. We ran into the walls all the time! This list appears to stem from a misconception that unicode can represent anything. It doesn't! There is meaning commonly conveyed in text that is dismissed as "markup" and left out of encodings as if it weren't part of the communication. Take italics! It's part of the information. I'm using it right now.

"Plain text" is a meaningless phrase if you're not defining it. There is no all-encompassing way of recording text. You have to define what you're capturing, and what you're excluding. It's okay to not capture everything. You just need to be clear about it.
posted by phooky at 6:48 AM on January 8 [15 favorites]


I've encountered patio11's similar list of assumptions about names and someone else's list of date/time based assumptions. I reacted to both of those positively, with a sense of "oh crap there's so much I don't know" whereas my reaction to this list was that it was clickbaity/pedantry. Some other commenters on this post seem to have had the same reaction as me. My best guess to the difference in reaction is that this list was not preceded with any context, while the other two were. patio11's in particular points out that common assumptions about names can be very harmful to individuals, which reframes things as "use this list to help people" rather than "here's a list of clever gotchas that I know and you don't".
posted by mrgoldenbrown at 6:54 AM on January 8 [1 favorite]


This list appears to stem from a misconception that unicode can represent anything

Unicode can represent "anything" very well! It can also represent "аոуtһіոg"!
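[For anyone whose font hides the joke: the second "anything" is built from look-alike letters from other scripts. The Python stdlib can tell them apart; a minimal example using just the first letter:]

```python
import unicodedata

latin_a, cyrillic_a = "a", "\u0430"  # both render as "a" in most fonts
print(latin_a == cyrillic_a)         # False
print(unicodedata.name(latin_a))     # LATIN SMALL LETTER A
print(unicodedata.name(cyrillic_a))  # CYRILLIC SMALL LETTER A
```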
posted by GCU Sweet and Full of Grace at 6:55 AM on January 8 [9 favorites]


From the expanded text of the list:
More evidence that mathematical notation is a different writing system: digits are read left-to-right even in right-to-left writing systems.
I think this is also an oversimplification. Different natural languages, like different computer architectures, have different conventions for whether you say "big end" or the "little end" of the number first. In modern English we say "thirty-two," but in German a transliteration is "two and thirty." Germans, in their left-to-right script, write digits in the same big-endian notation as English.

Arabic is a right-to-left script and also a little-endian language. An Arabic reader who sees the symbol 32 is reading the digits from right to left to get "two and thirty," while I read the same symbols left-to-right in order to understand the same idea.

Note also that modern Arabic has its own numerals, which are different from the "Arabic numerals" on my keyboard. Both sets are in common use.
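[Both digit sets are ordinary Unicode decimal digits, stored most-significant-first in logical order regardless of script direction, and Python's int() parses either; a quick check:]

```python
# U+0660..U+0669 are the ARABIC-INDIC DIGITs used in modern Arabic text
print(int("32"))            # 32
print(int("\u0663\u0662"))  # "٣٢" -> 32 as well
```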
posted by fantabulous timewaster at 6:56 AM on January 8 [3 favorites]



"Plain text" is a meaningless phrase if you're not defining it.


The thing is, all of us define it based on the limitations we decided to live with, and this list is a good introduction to the existence of people who define it differently based on different limitations.

And yes, it's an annoying list to look at, but better to be introduced to these this way than by support tickets coming from annoyed customers for a live product.
posted by ocschwar at 6:59 AM on January 8 [1 favorite]


Kevin Deldycke maintains a thorough, and I believe the canonical, list of Falsehoods Programmers Believe here. I'm proud to have made a small contribution to the cause.
posted by mhoye at 7:07 AM on January 8 [9 favorites]


"Plain text" is a meaningless phrase if you're not defining it.

That's true in general, but in computing it's long been understood as a term of art to mean "ASCII", though I think now that's sort of informally shifted to mean "UTF-8".

But, of course, because this isn't well defined or even 100% true on its own merits much less in its interaction with the world and Humans Are Difficult, this is where a lot of these misconceptions come from.
posted by mhoye at 7:12 AM on January 8 [3 favorites]


>”Plain text" is a meaningless phrase if you're not defining it.

The thing is, all of us define it based on the limitations we decided to live with, and this list is a good introduction to the existence of people who define it differently based on different limitations


Philosophically that is nice but up through the late 1980’s “plain text” had the very specific meaning of “ASCII” for programmers, and I could have written you a long list of things people constantly got wrong about it.

The author is obviously using a different definition of “plain text” and it would be great if they could point at the spec as well as showing where people violate it.
posted by Tell Me No Lies at 7:16 AM on January 8 [4 favorites]


Falsehoods programmers believe about names

Oh, this is a tough one I had to explain to a customer.

A program we wrote reads values from their database, one of which is the customer-service-rep name (it's more complicated than that but that's the gist of it), and then pre-populates PDFs for forms that need to be mailed out.

The customer was asking if we could populate the field with FirstName LastInitial ("Azrael B") instead of FirstName Lastname.

Their database holds the CSR's name as a single field -- and just from a cursory ad hoc query, it's full of FirstInitial MiddleName LastName, FirstName LastName MoreLastName, TwoPart FirstName LastName, and various combinations of such. Like, how is our program supposed to convert "Mary Ann Johnson Smith" to FirstName LastInitial?

As a proper software project manager, I didn't explicitly say no, instead I told her we could do it when they change their database to store the CSR as separate FirstName and LastName fields.
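[The underlying ambiguity fits in a few lines; a sketch with made-up names, not the actual system:]

```python
def first_name_last_initial(full_name: str) -> str:
    """Naive split: assumes every name is exactly 'First Last'."""
    parts = full_name.split()
    return f"{parts[0]} {parts[-1][0]}"

print(first_name_last_initial("Azrael Brown"))  # "Azrael B" -- looks fine
# But real data doesn't cooperate: should this be "Mary S", "Mary Ann J", or...?
print(first_name_last_initial("Mary Ann Johnson Smith"))  # "Mary S"
```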
posted by AzraelBrown at 7:18 AM on January 8 [13 favorites]


I've been out of the game for a few years, but it seems to me to break down into two problems: 1- what characters, and 2- parsing meaning from the stored text. I'll leave #2 alone.

I expect that #1 would be solved in a competent company by the selection of a proven library for text handling, or creating/modifying one. I recall such an effort in one company in the early 2000s when we adopted UTF-8.

Is there not a go-to text-handling library or two that already addresses the noted areas?
posted by Artful Codger at 7:28 AM on January 8


The customer was asking if we could populate the field with FirstName LastInitial ("Azrael B") instead of FirstName Lastname. Their database holds the CSR's name as a single field…

Hilariously, my Dad told me about this exact problem in the 1970's.
posted by panglos at 7:35 AM on January 8 [2 favorites]


I used to work with a guy who had the last name NAMEROW. When I first saw it, I thought it was some kind of database error.

I also enjoy the stories of people with the last name NULL.
posted by slogger at 7:36 AM on January 8 [9 favorites]


Unicode is such an improvement. A journeyman programmer can write straightforward code in a modern language and it will handle most text in most languages without any heroics. There's some complicated gotchas, Normalization (#53) is a real headache. And there's broken systems like MySQL's UTF-8. But overall it works a whole lot better than what came before. Compare for instance the 16 different ISO-Latin character sets.

My current bugaboo is properly typesetting an apostrophe. We mostly type the ASCII U+0027 ' APOSTROPHE but it gets displayed straight up and down which isn't really right. The usual fancy typesetting is U+2019 ’ RIGHT SINGLE QUOTATION MARK but it's right there in the name, that isn't an apostrophe at all. It's marked as "final punctuation" in Unicode but apostrophes occur in the middle of the word. U+02BC ʼ MODIFIER LETTER APOSTROPHE is sometimes used but that's defined as a letter, not punctuation, and is properly used more often for glottal stops. (But not to be confused with U+02BB ʻ MODIFIER LETTER TURNED COMMA, the ʻOkina of Hawaiian.) U+0027 is the right Unicode choice for an apostrophe according to the character tables, but most every text rendering library renders it wrong. The tragedy of modern Web typography.

Hilariously, I have no idea if you will see the various apostrophes I typed. A bunch of software "helpfully" translates one to the other behind the scenes, including something on my Windows machine when I simply copy and paste.
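[The three candidates really do sit in three different Unicode general categories, which is easy to confirm from the stdlib:]

```python
import unicodedata

for ch in ("\u0027", "\u2019", "\u02bc"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
# U+0027 Po APOSTROPHE
# U+2019 Pf RIGHT SINGLE QUOTATION MARK
# U+02BC Lm MODIFIER LETTER APOSTROPHE
```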
posted by Nelson at 7:37 AM on January 8 [6 favorites]


“Awesome Falsehoods”—jameslk, Github
A curated list of awesome articles about falsehoods programmers make about things which are simply untrue.
P.S. Which I now realize is a fork or duplicate of mhoye's link.
posted by ob1quixote at 7:41 AM on January 8 [1 favorite]


I did not think I was stepping onto a Hornets Nest with this post - but alas, I was in fact punting a Hornets Nest.
posted by Faintdreams at 7:45 AM on January 8 [1 favorite]


mrgoldenbrown It's a riff on the older Falsehoods Programmers Believe about names and it's a combination joke/serious look at how we think about "plain text", what that actually means, and why sometimes the unspoken and often not even thought of assumptions we have can trip us up.

The article is jokey, but it does talk about some real issues.

We tend to try to bash data into our ideas of how it should exist rather than dealing with data as it actually does exist. There's an odd sort of flattening and loss of info as data from multiple cultures that aren't Western European or Western European derived, gets jammed into software that wasn't designed to accept it as it actually exists.
posted by sotonohito at 7:48 AM on January 8 [4 favorites]


I used to work with a guy who had the last name NAMEROW.

Another customer has a client whose last name is literally "TEST", luckily we caught this before running any of our own tests (we usually create a client named "OURCOMPANY TEST"), which could have affected actual production data (not that lastname is a key value, but just the way our software handles people data, and yeah we probably should rethink our standard test name)
posted by AzraelBrown at 7:49 AM on January 8 [3 favorites]


AzraelBrown And storing names in two fields as FirstName, LastName is its own giant nightmare. What, exactly, is Maria Fernandez De Jesus De Santiago Gonzales De Los Angeles's "last name"?

Worse, if you're trying to build an ASCII email address for that person, which (if any) of those names do you drop? Which name does she go by in her day to day interactions?

And that's assuming that a person has a single canonical name.

In my early days of database design I sneered at the simpletons who made Name a single field rather than the obviously superior FirstName, LastName setup. Today I throw my hands up in despair and think maybe the optimal setup is something like how we do for phone numbers where you store names in a separate one to many table so you can have however many name fields it takes to hold all the names a person has.

Of course then we have to ask how we can automate it so that "De Los Angeles" and similar names are entered into a single row in that table rather than three.....

And of course never forget "Robert'); DROP TABLE Students;--", aka little Bobby Tables.
posted by sotonohito at 7:56 AM on January 8 [10 favorites]


As a former software engineer, I can go through that list and quietly shake my head in agreement. As a current user of software based forms, etc. I loudly shake my fist at the people who create these forms, etc. making a lot of these errors. Maybe someone needs to compile a list of what programmers believe about the world in general?
posted by njohnson23 at 8:05 AM on January 8


For bonus points match each entry to a bug report.
posted by Artw at 8:06 AM on January 8 [4 favorites]


The plainest usable text I know of is 7-bit US-ASCII with only 128 code points...
posted by jim in austin at 8:06 AM on January 8 [1 favorite]


Then do dates.
posted by Artw at 8:06 AM on January 8 [1 favorite]


From one perspective, a lot of the ones about "alphabets" are really the fault of trying to represent writing as a linear sequence of characters that correspond one-to-one with the letters in that alphabet.

Within that paradigm, e and é need to be different characters, even though when I write them by hand, I think of an é as an e with an accent aigu.
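[Unicode actually supports both views, with normalization to convert between them; a stdlib sketch:]

```python
import unicodedata

nfc = "\u00e9"   # é as a single precomposed code point
nfd = "e\u0301"  # e + COMBINING ACUTE ACCENT

print(nfc == nfd)                                # False: different code points
print(unicodedata.normalize("NFD", nfc) == nfd)  # True
print(unicodedata.normalize("NFC", nfd) == nfc)  # True
```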
posted by RobotHero at 8:13 AM on January 8


I always have to remind myself of this one:

Cookie Monster is not a letter of the alphabet.
posted by RonButNotStupid at 8:28 AM on January 8 [9 favorites]


Remember to always practice safe text.
posted by I-Write-Essays at 8:28 AM on January 8 [2 favorites]


I'm still mad that I can't do non-lining numerals in plain text because the Unicode Consortium say it's just a text style. Yet I can do Arabic numerals in a circle (①②③ ...), in parenthesis (⑴⑵⑶ ...), with a full stop (⒈⒉⒊ ...), in negative circled sans, etc etc etc — the great injustice of our age.
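[Those variants even carry numeric values in the Unicode character database, though int() refuses them because they aren't category Nd; a quick check:]

```python
import unicodedata

print(unicodedata.numeric("\u2460"))   # ① CIRCLED DIGIT ONE -> 1.0
print(unicodedata.category("\u2460"))  # "No", so int("①") raises ValueError
```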
posted by scruss at 8:29 AM on January 8 [7 favorites]


EBCDIC! [8 mins] David Brailsford muses on the parallel Betamax universe which didn't make it.
posted by BobTheScientist at 8:29 AM on January 8 [6 favorites]


This is what comes of trying to make a universal machine to deal with all varieties of human expression.

It's like the Great Commission, except for ASCII, and like the Great Commission its inner logic leads it to destroy all the alternatives despite its putative desire to universally welcome all.
posted by clawsoon at 8:34 AM on January 8 [1 favorite]


I implemented a shift from first + middle + last name in the system I work with to a combination of full name (whatever you would consider as such, as many words as needed) and nickname (how you want to be addressed in emails, again free-form).

People are flummoxed by this, sometimes putting their first name in one and the last in the other, sometimes repeating themselves, sometimes utterly failing to read and/or comprehend anything and putting a phone number or somesuch in one. I've stubbornly stuck to my guns with the newer implementation because it hopefully un-others some small silent group of people among the userbase, but I still sigh to myself every time I see one where the user has misused the name fields.
posted by axiom at 8:47 AM on January 8 [7 favorites]


This is the very reason I do all my programming with semaphore flags.
posted by BigHeartedGuy at 8:58 AM on January 8 [2 favorites]


What fortunate timing on this post! Just this morning, a new user story appeared in my team's backlog. Seems one of the teams consuming our APIs just heard of meticulously researched UTF-16 encoding, and decided to make it their new way of transmitting string data, so they added a story to our queue for us to add support for it. You have to jump through some extra hoops to assign estimates to stories in other teams' queues, but by gum, they went ahead and ballparked it at less than a week's worth of work anyway.

Now, instead of having to make a grim scarecrow out of the flayed skin on one of them, I can instead just print this article out and staple paper copies of it to each of the offenders' foreheads.
posted by Mayor West at 8:59 AM on January 8 [7 favorites]


I've lost count of the times I've allowed my browser to generate a "secure password" for me, only to have the website shriek and berate me because the password contains a character it doesn't like.

That is annoying but clear error messages make it somewhat tolerable. What isn't tolerable is that I seem to have run into a handful of instances where the tool that allows a user to set or reset a password is different from the tool that actually allows one to log in. In other words, there have been a few times when I've been able to successfully set a long, complex password but cannot use that password to log in. Through trial-and-error I have been able to figure out that the password setting or resetting tool accepts characters or password lengths that the log in tool doesn't - I have to iteratively shorten and weaken my password until I finally find one that works for both tools. And, of course, there are no helpful error messages or clues. Very, very frustrating that this happens when I'm doing the "right" thing by using long, complex (and unique, of course) password!

I also have some frustration that the software I use to stream music to different devices in my home doesn't recognize some Unicode characters. For example, Bon Iver's album 22, A Million has several song titles that have Unicode characters e.g., 22 (OVER S∞∞N), 21 M♢♢N WATER. Those songs simply don't appear in my music server's catalog. The titles of several KEYGEN CHURCH albums are entirely composed of Unicode characters (e.g., ░​█​░​█​░​░​█​░​█​░​█​░) so those albums don't appear in my music server's catalog, either.
posted by ElKevbo at 9:00 AM on January 8 [7 favorites]


I also have some frustration that the software I use to stream music to different devices in my home doesn't recognize some Unicode characters.

Previously: if you remove the Segoe UI font, Windows 10 doesn't recognise the colon character in the taskbar clock.
posted by Cardinal Fang at 9:10 AM on January 8 [5 favorites]


Look, we can go on, but essentially text is limitation... It's okay to not capture everything

While that might be technically true, often, as with the names and dates lists, the baked-in assumptions are rooted in English-language/Western-worldview. When you're the one who has to bend yourself to fit someone else's assumptions, that says "this world is not meant for you." Anything that gets programmers to think outside their boxes and create more inclusive software is making the world slightly better than it was.

I haven't looked at the collection of lists but hopefully there's also something for improving accessibility.
posted by kokaku at 9:10 AM on January 8 [5 favorites]


OK, jim in austin, which character in this great universally-compatible ASCII universe of yours signifies the end of a line? What does every system in the universe (be it paper, screen, or otherwise) do with the ASCII 0x0B VERTICAL TAB character? What column do we wrap at? Are there page breaks inserted automatically due to printer constraints?
posted by rum-soaked space hobo at 9:20 AM on January 8


Philosophically that is nice but up through the late 1980’s “plain text” had the very specific meaning of “ASCII” for programmers, and I could have written you a long list of things people constantly got wrong about it.

Not quite. Lots of us meant "ISO-8859-X" when we said plain text, and we would consciously change from ISO-8859-1 (Western European Latin, with accents for Spanish, French, and German) to ISO-8859-whatever as needed. (-8, for me)

This of course had its own compromises. (French speakers had to use English style quotation marks, for example.)

And for older people, "plain text" meant ticker tape and Telex, which is what ASCII is based on. This is why ASCII had characters for backspace (turn the spooler backwards one step), delete (push ALL the pins into the ticker tape, overwriting whatever might have been there), carriage return, new line, AND line feed (spool lots of paper, don't worry about detaching the spooler, since we're about to eject the paper).
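[The punch-everything delete is still visible in the ASCII chart: DEL (0x7F) is the only character with all seven bits set, the mirror of NUL's blank tape.]

```python
NUL, DEL = 0x00, 0x7F
print(format(DEL, "07b"))  # 1111111 -- every pin punched
print(format(NUL, "07b"))  # 0000000 -- untouched tape
```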
posted by ocschwar at 9:21 AM on January 8 [8 favorites]


"What isn't tolerable is that I seem to have run into a handful of instances where the tool that allows a user to set or reset a password is different from the tool that actually allows one to log in."

Related: Being able to set an email address with a + to register on a site/service (e.g. "user+roku@domain.tld") but being unable to log in with that address on various platforms because "invalid character" or whatever.

On the theme of edge cases when dealing with music, I enjoyed this post from 2022: "Horrible edge cases to consider when dealing with music" -- includes Prince's changing his name to a symbol, Spinal Tap using a dotless "i", and Ministry's ΚΕΦΑΛΗΞΘ.
posted by jzb at 9:22 AM on January 8 [3 favorites]


I still sigh to myself every time I see one where the user has misused the name fields.

I see where you’re coming from, but unless you have concrete evidence that people are screwing with your fields out of malice, it would seem that your “obvious” solution is about as obvious to people as all the other misconceptions about what goes in a name field.

How you want to be addressed in emails: "They call me Mister Tibbs!"
posted by Quinbus Flestrin at 9:28 AM on January 8


I have the "names" list bookmarked because it's a useful and humorous way to remind people that their conception of a name is based on WASP traditions. Usually this comes up in relation to records in which the person has only one name. Someone complains and I have to reassure them that it is indeed normal and ok to have a single-part name.
posted by tofu_crouton at 9:54 AM on January 8 [4 favorites]


axiom never used the word "obvious". Quoting a single word definitely can't be mistaken for paraphrasing. Please don't use it this way.
posted by tigrrrlily at 9:58 AM on January 8 [1 favorite]


I have no idea exactly what the plot would be, and I know that isn’t actually possible, but I kind of want to write a story now where someone writes a virus that breaks Unicode. Not just locally, but, like, the virus breaks the standard.

It’s called “babel” naturally. Or, if I’m feeling very fancy, babel-16.
posted by thecaddy at 10:23 AM on January 8 [5 favorites]


Now, instead of having to make a grim scarecrow out of the flayed skin on one of them, I can instead just print this article out and staple paper copies of it to each of the offenders' foreheads.

It's a while since I did Agile. Is this a formal part of the process now?
posted by Cardinal Fang at 10:24 AM on January 8 [12 favorites]


Now do calendars.

Someone made a pretty darn impressive attempt back in 1971 with the UNIX cal command. If you are inclined to do such things, check out the man page options
posted by treepour at 10:26 AM on January 8 [1 favorite]


It’s called “babel” naturally. Or, if I’m feeling very fancy, babel-16.

thecaddy, perhaps you meant Babel-17?
posted by panhopticon at 10:37 AM on January 8 [2 favorites]


Now, instead of having to make a grim scarecrow out of the flayed skin on one of them, I can instead just print this article out and staple paper copies of it to each of the offenders' foreheads.

Assuming you are already UTF-8 internally, the transformation - UTF-8 <> UTF-16 via unicode code points - is a clean bijection.

The hard part comes if you start talking about information implicitly based off the encoding of the text, like "at position 27 in this string". God help you then.
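A small Python sketch of that last point — the same "position" in a string means three different things depending on whether you count bytes, UTF-16 code units, or code points:

```python
# One emoji (U+1F970, plane 1) followed by "x". Where is the "x"?
s = "\U0001F970x"

print(len(s.encode("utf-8")))           # 5 bytes; "x" is at byte offset 4
print(len(s.encode("utf-16-le")) // 2)  # 3 UTF-16 units; "x" at unit offset 2
print(len(s))                           # 2 code points; "x" at index 1
```

Three different answers for one string, which is exactly why "at position 27" is meaningless unless both sides agree on the unit.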
posted by NotAYakk at 11:05 AM on January 8 [1 favorite]


(I went through babel-18, babel-19, and babel-24 before settling on babel-16. For 16 bits, ultimately, but the reference to Delany was absolutely intentional.)
posted by thecaddy at 11:11 AM on January 8 [2 favorites]


Lots of us meant "ISO-8859-X" when we said plain text

Maybe! Or maybe we meant Windows-1252 or DOS code page 437. They all cover more or less the same characters as ISO-8859-1 but are awkwardly different. For years most of us (including me) didn't really understand any of this; text was just the stuff you got and spit out.

Unicode forced system designers to really think through text and the difference between bytes, characters, and code points. And it took about a decade to make that transition. Lots of stuff got it wrong in the interim, Python was a huge PITA for about five years. In some written languages (like Chinese) consensus implementation still has not fully converged.
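A quick Python illustration of how awkwardly different those encodings are — the same byte decodes to three different characters:

```python
# Byte 0x93 in three legacy "plain text" encodings.
b = b"\x93"

print(b.decode("windows-1252"))  # '“' (left double quotation mark)
print(repr(b.decode("latin-1"))) # '\x93' (a C1 control character in ISO-8859-1)
print(b.decode("cp437"))         # 'ô' (DOS code page 437)
```

All three encodings cover "roughly the same" Western European text, but bytes above 0x7F disagree, which is why guessing wrong produces quiet corruption rather than an error.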

I can't do non-lining numerals in plain text because the Unicode Consortium say it's just a text style. Yet I can do Arabic numerals in a circle (①②③ ...), in parentheses (⑴⑵⑶ ...), with a full stop (⒈⒉⒊ ...)...

Point 38 in the original article speaks to this. Unicode's primary design goal was not to represent all written languages, which is certainly how I think of it most of the time. Their actual goal was "a system for text where you can translate existing documents into it and back out without any loss". As a result a whole bunch of weird stuff that isn't really an atomic bit of writing got encoded. Some other system before Unicode used ① and so Unicode needed a way to represent it. I haven't done the research for the alternate numerals, but often you can find some fascinating scholarship in the Unicode discussions about an obscure 1950s punched-tape data processing system that did things a certain way.

This desire for transcoding is also what gave us emoji. ✨ is not really a letter or word. But emoji was added to Unicode because existing Japanese phone messaging systems were already doing something like emoji and Unicode needed a way to represent those characters. For instance 💩 was added to Unicode in 2010 because Shigetaka Kurita 栗田穣崇 had put it in a DoCoMo pager for teenagers in the 1990s.

Emoji has quickly become part of written language though. I'd argue ❤️ is as much a word in modern English as "heart" or "love". It's also turned out to be a good ambassador for Unicode, particularly the supplementary planes. A whole lot of broken Unicode software got fixed because people were mad that they couldn't store 🥰 in their database and get it back again.

(I love Unicode. Sometimes I think it'd be great to collect all the scholarship that went into the development of the blocks and make it available as a database indexed by code point.)
posted by Nelson at 11:20 AM on January 8 [18 favorites]


A while back I was mentioning some text-editor shenanigans to a friend who responded, "what's a 'text editor'?"

I knew they understood what "editing" is, so I tried to explain the concept of "text". I quickly and naively told them "letters and numbers in a computer file" but was immediately dissatisfied. For one thing, numbers are NOT text, though text can contain numbers, or at least numerals. Do I want to delve into type theory with this person? I do not.

Text is a subtle concept, or I'm dumb. Either way, it seems I need to review some of my assumptions.
posted by Rev. Irreverent Revenant at 11:21 AM on January 8 [3 favorites]


BigHeartedGuy: “This is the very reason I do all my programming with semaphore flags.”
Semaphore flags? An Aldis lamp is clearly superior!
posted by ob1quixote at 11:39 AM on January 8 [1 favorite]


The absolute minimum every software developer absolutely positively must know about Unicode and character sets by Joel Spolsky. It is 20+ years old, but I remember it being the first time I saw all this information put together in one place that was reasonably easy to digest. For reference, Joel started as an intern working on Excel at Microsoft, then left to form his own software company that made lots of pieces of software, including one you might have heard of: Stack Overflow.
posted by mmascolino at 12:05 PM on January 8 [7 favorites]


I reacted to both of those positively, with a sense of "oh crap there's so much I don't know" whereas my reaction to this list was that it was clickbaity/pedantry. Some other commenters on this post seem to have had the same reaction as me.

I also had that reaction a little bit (it’s not so much that I think it’s clickbaity as that it feels like a weaker entry in the informal series) because while it highlights a few cultural blind spots along the lines of the “names” one… I feel like programmers of moderate skill and experience generally know that handling text isn’t easy, because they’ve learned the hard way? And the list doesn’t necessarily address that in a way that explains how one might do better. I guess one could say the same about time to an extent, though.
posted by atoxyl at 12:10 PM on January 8


Also, yeah, I think “plain text” means “ASCII” to a lot of people still and that’s a big ol’ piece of “I’m going to make your life harder to make my life easier” cultural chauvinism. But people do it because they know that doing better is hard.
posted by atoxyl at 12:12 PM on January 8


>delete (push ALL the pins into the ticker tape, overwriting whatever might have been there)

heh, delete did actually mean 'to erase by smudging', not the "eliminate/remove" operation it is associated with now.
posted by torokunai at 12:32 PM on January 8 [4 favorites]


78. Any modern software should be able to handle UTF-8 by default.

Counterexample: Excel, which opens CSV files using Windows-1252.
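A quick Python sketch of the resulting mojibake — UTF-8 bytes read back as Windows-1252:

```python
# What a UTF-8 CSV file actually contains for "café":
data = "café".encode("utf-8")       # b'caf\xc3\xa9'

# What a program sees if it assumes Windows-1252:
print(data.decode("windows-1252"))  # cafÃ©
```

The two-byte UTF-8 sequence for "é" decodes as two separate Windows-1252 characters, which is why the classic symptom is every accented letter turning into a "Ã" pair.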
posted by frogmanjack at 1:10 PM on January 8 [2 favorites]


Lovely. Where is the definitive reference on the Truth About Plain Text? And names? Is someone keeping a good blog so I can check future regexps?
posted by amtho at 1:27 PM on January 8


Spolsky's piece mostly holds up 20 years later. Except this part:
The traditional store-it-in-two-byte methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits) ...

we decided to do everything internally in UCS-2 (two byte) Unicode, which is what Visual Basic, COM, and Windows NT/2000/XP use as their native string type.
That's falsehood #29. That choice was reasonable in 2001. But it only supports Plane 0. That basically didn't matter in 2001 but these days there's lots of stuff (including emoji) on plane 1 and some supplementary CJK stuff on plane 2. UTF-16 is not UCS-2, it can have variable-length encodings for the supplementary planes. I think in practice naive code treating Unicode as two bytes mostly works unless you try to find, say, the length of a string containing emoji.
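A Python sketch of why plane 1 breaks the two-byte assumption — one emoji becomes a surrogate pair in UTF-16:

```python
# 🥰 (U+1F970) lives on plane 1, beyond what two-byte UCS-2 can address.
ch = "\U0001F970"
utf16 = ch.encode("utf-16-be")

print(utf16.hex())        # d83edd70 — a surrogate pair, two 16-bit units
print(len(utf16) // 2)    # 2 UTF-16 code units ...
print(len(ch))            # ... for 1 code point
```

Code that assumes one character per 16-bit unit reports the wrong length (or worse, splits strings in the middle of a pair) as soon as a supplementary-plane character shows up.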

Here's where I pick on MySQL again. Because it's 2024 and they still have a broken utf8 type that can only store plane 0 characters. It loses emoji. The workaround is to use utf8mb4 instead but the simple sounding utf8 is still there waiting to ensnare newbies. Last I checked it didn't even print a warning if you used the broken one. It's been like this since 2010.
posted by Nelson at 3:14 PM on January 8 [6 favorites]


When I applied for and received a new citizenship, the government employee handling my application was genuinely excited to tell me they can now write my name including a letter in my native language. Previously, I suppose they would have plaintexted it to whatever seemed closest.

And so that was nice.
posted by UN at 3:36 PM on January 8 [8 favorites]


Another customer has a client whose last name is literally "TEST", luckily we caught this before running any of our own tests

For a while there some poor bastard in the UK kept receiving random medicines, medical supplies and occasionally quite expensive medical gear in the mail from the NHS because his last name was “Test”.
posted by mhoye at 3:59 PM on January 8 [5 favorites]


Don't worry, you're still a number on the backend.
posted by clawsoon at 4:24 PM on January 8 [3 favorites]


The absolute minimum every software developer absolutely positively must know about Unicode and character sets by Joel Spolsky.

Spolsky is talking about software developers; the original article is talking about programmers. They are not synonymous.
posted by Cardinal Fang at 4:25 PM on January 8


If you want to get into edge cases about how language can be mishandled by systems due to people not necessarily thinking through all the steps, the Radiolab episode "Null" [link includes options for listening, transcripts, and even Braille] describes how things can go wrong in the way data gets processed in SO many ways!
posted by hippybear at 4:34 PM on January 8 [1 favorite]


Are they are being ambiguous on purpose or by accident? ie, what do they mean by plain text?

It's a standard terminology, to disambiguate versus the cipher text, ie text that has been treated by encryption algorithms like UTF or XML

/hamburger
posted by pwnguin at 5:12 PM on January 8 [1 favorite]


I also enjoy the stories of people with the last name NULL.
posted by slogger at 7:36 on January 8 [6 favorites]
I worked with a colleague for many years whose last name is Null. It was always interesting to try and intuit how each system interpreted the name. Often, adding a third "L" solved the problem, which implies that many systems would interpret a string with the contents "NULL" the same as a null SQL value. I...don't think that's correct, because a null value isn't the same thing as the string "NULL". Very interesting, when it's not your immediate problem.
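A minimal sqlite3 sketch (a toy example, not any of the real systems involved) backing up that intuition — SQL itself keeps the string 'Null' and a NULL value distinct:

```python
import sqlite3

# An in-memory table with one Mr. Null and one genuinely missing name.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE people (last_name TEXT)")
db.execute("INSERT INTO people VALUES ('Null'), (NULL)")

# String comparison finds only the person named Null...
print(db.execute(
    "SELECT COUNT(*) FROM people WHERE last_name = 'Null'").fetchone()[0])  # 1
# ...and IS NULL finds only the missing value.
print(db.execute(
    "SELECT COUNT(*) FROM people WHERE last_name IS NULL").fetchone()[0])   # 1
```

So any system that chokes on the name is conflating the two somewhere outside the database, typically in a text serialization layer.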
posted by wintermind at 6:22 PM on January 8 [2 favorites]


I remember being in meetings, begging the DBAs and SAs to prioritize converting to UTF-8 from Latin-1 in the database. It finally happened, and shortly thereafter I was talking to a user from Hawai‘i. They really appreciated having place names and animals with proper diacritics in the system, an indication that at least someone on the mainland actually cared.
posted by rockindata at 6:53 PM on January 8 [3 favorites]


Some reporting systems represent the null value as "Null." Sometimes I'll be asked to do a custom report for a user, so I'll whip up some gnarly SQL in SQL Management Studio, and paste the results into Excel. All of my null values will be represented as (Null), so I'll have to clean that up before passing it on to them, because their R code probably won't be happy with it.
posted by Spike Glee at 6:55 PM on January 8


Wintermind, many times data is passed between systems as some kind of text serialization, and “null” is output instead of a blank value. It’s often in these kinds of systems that a name of Null can start to cause all kinds of problems.
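A Python sketch of that failure mode — JSON keeps null distinct from the string, but a naive stringifying serializer (naive_serialize below is a hypothetical illustration) collapses them:

```python
import json

# JSON has a real null, distinct from the string "Null".
print(json.dumps({"last_name": None}))    # {"last_name": null}
print(json.dumps({"last_name": "Null"}))  # {"last_name": "Null"}

# A sloppy serializer that just stringifies values collapses the distinction:
def naive_serialize(value):
    return str(value) if value is not None else "null"

print(naive_serialize(None))    # null
print(naive_serialize("null"))  # null — Mr. Null's record now looks missing
```

Once the collapsed text crosses a system boundary, no downstream consumer can recover which value was meant.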
posted by rockindata at 6:56 PM on January 8


I'm pretty surprised I don't see a single mention of Dylan Beattie's "Plain Text" talk in this thread. There are a few versions of it, but this one from NDC Oslo 2021 is as good a choice as any.

PIKE MATCHBOX
posted by cardioid at 7:05 PM on January 8 [4 favorites]


> 38. Unicode has an elegant and harmonious design, otherwise it wouldn't be the most widely used encoding.
I've been programming computers since the late 1900s and I don't know a single person that thinks this.
posted by pmb at 4:01 AM on January 9 [1 favorite]


Disappointed that the OP didn't reference the original piece about time that inspired this, but glad that it was subsequently mentioned in the comments here, notably by LUMWPA above who mentioned this programmer meme in general including the names entry.

Credit where it's due, here are the original two articles:
http://infiniteundo.com/post/25326999628/falsehoods-programmers-believe-about-time
http://infiniteundo.com/post/25509354022/more-falsehoods-programmers-believe-about-time

And the Mefi previously.
posted by intermod at 8:20 AM on January 9


Numbers in modern Arabic, when written out, do actually go left-to-right. Unless you're spelling them out with letters.
posted by lauranesson at 2:44 PM on January 9 [1 favorite]


Another recent entry in the "programmers vs reality" category: Weird things engineers believe about development:

Since I quit Mozilla and went back to full-time Web development, I’ve discovered a few surprises. It turns out Web development is actually pretty hard, Web developers are actually very smart, and some of these frameworks and techniques we mocked as browser engineers aren’t so bad. Oops.

At the same time, it turns out some Web developers have ideas about browsers and the Web that, as a former browser engineer and standards editor, I’m a bit dubious of.

posted by pwnguin at 5:43 PM on January 9


There's a theme here that is par for the course and nigh-inescapable: expanding inclusion to people not originally considered part of the in-group results in grousing from the original in-group.

If a system you're putting in place is made to be easy and intuitive to work with for most people... it's only going to reinforce the status quo. There's no magic "just make it better", you're going to have to let go of some assumptions that you've been using as a convenient shortcut up until now. Because we want people you've been ignoring so far to be able to come to the table, too.
posted by tigrrrlily at 8:34 AM on January 10 [2 favorites]


There's a theme here that is par for the course and nigh-inescapable: expanding inclusion to people not originally considered part of the in-group results in grousing from the original in-group.

That's basically the entire history of the Internet.
posted by Cardinal Fang at 4:01 AM on January 11


That's basically the entire history of the Internet.

And the entire world, I expect.
posted by Tell Me No Lies at 9:55 AM on January 11




This thread has been archived and is closed to new comments