Plain as the 👃 on your :-)
March 13, 2022 9:41 AM   Subscribe

"So your database now needs to know, for every single piece of text... whether that is a place, person or company in Scandinavia or not, to get the alphabetical ordering correct." - Plain Text - Dylan Beattie - NDC Oslo 2021 (YouTube, 54m12s)
posted by flabdablet (45 comments total) 24 users marked this as a favorite
 
Both entertaining and informative; a rare combo. I am familiar with the topic but in reverse. On my website I display a number of "plain text" files with a .txt suffix. Modern browsers assume this suffix means they are documents and are US-ASCII compliant. Few are initially due to UTF-8 and alternative code page stuff. I had to create my own stream editor (sed) filter employing a brute force conversion into the 128 character codes available. Currently I scan each line of a file for any of ~50 possible conversions. Some are admittedly ugly...
posted by jim in austin at 11:33 AM on March 13, 2022 [3 favorites]


This was great! Put a lot of things together that I'd heard of before but didn't really understand -- big/little endian encoding, zalgotext, etc.
posted by ropeladder at 11:39 AM on March 13, 2022


This video is really quite delightful. I love the let's start from the Actual Beginning approach to the historical aspect. I've not finished watching all of it yet, but it's already in the category of "should be shown to first-year comp sci students" in my opinion.

Some of it puts me in mind of Tom Scott's classic Computerphile video about timezones and the falsehoods programmers believe about names. The conceptual space of "things that programmers don't know, or disregard, about The Whole World" is always one that benefits from people making approachable and engaging explainers.
posted by BuxtonTheRed at 11:46 AM on March 13, 2022 [7 favorites]


UTF-8 really is tidy. One of the things that makes my world slightly less satisfactory is the way that Unicode itself will be forever limited to 2²⁰+2¹⁶ code points because of the horrible brokenness that is UTF-16.

I wish the world could give up entirely on half-assed 16-bit in-memory character codings and use UTF-8 for everything everywhere, but Java and Windows between them have UTF-16 permanently locked in now.
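As an aside, the arithmetic behind that ceiling is easy to check; a throwaway Python sketch, not anything from the talk:

```python
# Unicode tops out at U+10FFFF because UTF-16 can address only
# 2^16 code points directly plus 2^20 more via surrogate pairs.
bmp = 2 ** 16            # one 16-bit unit: the Basic Multilingual Plane
supplementary = 2 ** 20  # two units (a surrogate pair): planes 1-16
assert bmp + supplementary == 0x110000   # 1,114,112 code points total

last = chr(0x10FFFF)     # the last legal code point; accepted
try:
    chr(0x110000)        # one past the end; refused
except ValueError:
    pass
```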
posted by flabdablet at 12:00 PM on March 13, 2022


I suggest tactical BOMs to break uses of UTF-16, flabdablet. It's in our style guide as recommended best practice: screw over UTF-16 use.
posted by k3ninho at 12:19 PM on March 13, 2022 [1 favorite]


This talk is fantastic. Thanks! I meant to preview the first few minutes while eating lunch and couldn't avoid watching the whole thing. A few random thoughts:

Looking up Wheatstone, which I only know about as the bridge guy, I'm surprised to discover all the other stuff he did. I've also had the experience of trying to spell something in Greek and realizing that I only actually know the letters that don't look like Roman letters, 'cause the other ones aren't used in physics. I'm not surprised they left those out.

Though I'm not a real programmer, I'm reasonably convinced one of the lessons of programming history is to always use more bits than you think you could possibly need. (I'm thinking of ASCII, Y2K, Y2038, hard-disk size limits, IPV4, every code I've had to debug because someone used a short float in the wrong place. . .) At the very least, one extra bit that means "do or don't use the next word also" seems worth it in almost any scenario I can think of.

@FakeUnicode on twitter is amazing. I posted some random confusion about flag emoji codes without mentioning their account and got a detailed and correct explanation in 15 minutes. Cheers to the passionate lunatic(s) who run it.

Part of my partner's job is populating a database in a language that had content-rich writing but nothing close to an alphabet until the people who speak it were colonized. For historical reasons, if you move 100 km or 100 years in any direction from any city, the orthography is often radically different, even when the pronunciation is nearly identical. Most are only five or six characters beyond ASCII, but figuring out how to deal with searches properly is a real pain in the neck. Especially when it's meant to be used by people who may not have formal training. I don't know how you solve that problem without a Google of dollars and ten years to train archivists.
posted by eotvos at 12:53 PM on March 13, 2022 [1 favorite]


This was fascinating. Thanks for posting it!
posted by The Lurkers Support Me in Email at 1:20 PM on March 13, 2022 [4 favorites]


I love the let's start from the Actual Beginning approach to the historical aspect

Me too. I was a little miffed at not seeing any mention of Baudot and Murray codes or the horrible Babel of ASCII contemporaries like CDC display code and EBCDIC, but on balance I think the stuff he managed to wedge in by leaving those out was more than worth it.
posted by flabdablet at 1:44 PM on March 13, 2022 [2 favorites]


The things which stand out to me are the cooperation, willingness to change, and functionality, putting clear communication to the fore. Otherwise I do not understand one bit of it, other than brief run-ins with umlauts whose codes I forget. He is very nice to listen to while working on a project, because of his continually upbeat tone, and overall hopefulness. We gotta help where and when we can.
posted by Oyéah at 1:56 PM on March 13, 2022


I was a little miffed at not seeing any mention of Baudot and Murray codes

Yeah, the key difference between the stuff he mentions and those is that the Wheatstone and Morse encodings were meant to be interpreted by humans, while Baudot had the first code that was meant to be interpreted by a machine and just read directly from a printout. And then Donald Murray went "what if we could compose messages and save them as punched holes on paper instead of having to type everything in real-time" and quietly revolutionized telecommunications, because now messages could be queued up and sent as fast as the tapes could be run through a machine instead of how fast a typist could go, and that laid the basis for telegrams.
posted by wanderingmind at 4:01 PM on March 13, 2022 [1 favorite]


Modern browsers assume this suffix means they are documents and are US-ASCII compliant.
The browser is almost certainly responding to a Content-type: header sent by your web server, not to the file name. However, the web server is probably deciding which header to send based on the filename. You can modify this header to something like
Content-type: text/plain; charset=utf-8
or other character sets as appropriate.
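For a concrete sense of why the charset declaration matters, here are the same bytes decoded both ways (a quick illustrative sketch in Python):

```python
# The same bytes render differently depending on which charset
# the browser is told (or guesses) to use.
data = "café".encode("utf-8")       # b'caf\xc3\xa9'

as_utf8 = data.decode("utf-8")      # what the author meant: 'café'
as_latin1 = data.decode("latin-1")  # what a wrong guess shows: 'cafÃ©'

assert as_utf8 == "café"
assert as_latin1 == "cafÃ©"
```

Declaring the charset explicitly in the header takes the guess out of the browser's hands.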
posted by fantabulous timewaster at 7:41 PM on March 13, 2022 [5 favorites]


Though I'm not a real programmer, I'm reasonably convinced one of the lessons of programming history is to always use more bits than you think you could possibly need.

Nah, you really have to justify using more in-memory/storage space than you need, even now in the 'big data' world. Efficiency is really what divides programmers from us dilettantes. I see this every day - the choices you make in data storage come out the back end in system complexity and maintenance cost.
posted by The_Vegetables at 8:51 PM on March 13, 2022 [1 favorite]


That was fascinating and delightful! (the mention of visiting Ukraine towards the end is heartbreaking though (the video was posted a week before the war began))

(Oh, and: Do date/time next!)
posted by gwint at 9:01 PM on March 13, 2022


There were a few things that weren't quite correct, but they were historical and/or tangential.

I feel like there was a third, but the two bits I remember were:
  • The \n vs \r\n division came from having vs not having device drivers. Consider the systems that used \r and it becomes clear that's not really the case.
  • The systems that lock us into UTF-16 now didn't do so for speed, they did so because at the time those decisions were being made UCS-2 was going to be enough for anyone.
Nitpicks aside - he focused on the right things and it was an enormously engaging presentation. I was unaware of the Scandinavian collation issue around the aa/å changes. That might be even more 'fun' than the Turkish i problem.
posted by bcd at 9:49 PM on March 13, 2022 [1 favorite]


My favorite thing about "plain text" is that pretty much nobody is aware of their local encoding. Your Windows machine chooses the right one when you set up your computer, and then your files are saved and retrieved in that encoding and you never have to think about it.

Except when I get your files and the encoding is a mystery and it's very nearly interpretable but not quite and I end up with those fancy diamonds, or more insidiously, suspiciously wrong or dropped characters. And then I get to do the dance of "where did this come from and what is the set of potential encodings so I can look and see if I get a set of reasonable letters with diacritical marks out of it".
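That dance can at least be semi-automated; here's a rough sketch of the brute-force version (the candidate list is made up for illustration):

```python
# Brute-force the "which encoding was this file?" dance: try each
# candidate and keep the ones that decode without blowing up.
CANDIDATES = ["utf-8", "cp1252", "iso-8859-2", "iso-8859-1"]

def plausible_decodings(raw: bytes) -> dict:
    """Return {encoding: decoded text} for every candidate that decodes cleanly."""
    results = {}
    for enc in CANDIDATES:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            pass  # this one blew up; cross it off the list
    return results

# A UTF-8 ellipsis, as it might arrive in a mystery file:
sample = "…".encode("utf-8")            # b'\xe2\x80\xa6'
options = plausible_decodings(sample)
# utf-8 yields '…'; cp1252 yields the telltale mojibake 'â€¦';
# a human still has to pick which result "looks reasonable".
```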
posted by that girl at 10:49 PM on March 13, 2022


You can modify this header to something like

Content-type: text/plain; charset=utf-8

or other character sets as appropriate.


I agree with this advice, except for the implied support for the notion that any charset other than utf-8 could be appropriate.

Advising UTF-8 as the default encoding for all text files coming from a web server can cause the occasional problem, but nowhere near as bad or as many as declaring other encodings and certainly not as many as trying to shoehorn absolutely everything into ASCII.

The UTF-8 encoding of any existing ASCII text file is bit-for-bit identical with the ASCII encoding, and UTF-8 files do not require bullshit garbage like UTF-16 byte order marks to be prepended when saved, so browsers will process all of the files you've already shoehorned exactly as they're already doing if the only change you make is telling them to expect UTF-8 instead of ASCII.

Given that you've already got some kind of workflow in place for doing ASCII shoehorning with sed, it shouldn't be too hard to upgrade that to a completely clean conversion based on iconv -t UTF-8 instead.

Even if you insist on staying stuck in the 1960s with ASCII, iconv -f something -t ASCII//TRANSLIT is probably going to be more maintainable and yield more consistent results than an evolving 50 line sed monstrosity.
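The "bit-for-bit identical" claim is easy to verify, and Python's unicodedata can approximate the //TRANSLIT trick for accented letters (a crude sketch; iconv's transliteration tables are far more thorough):

```python
import unicodedata

# Any pure-ASCII text encodes to identical bytes in ASCII and UTF-8,
# which is why flipping the declared charset to UTF-8 is safe.
text = "plain old ASCII"
assert text.encode("ascii") == text.encode("utf-8")

# A crude stand-in for iconv -t ASCII//TRANSLIT: decompose accented
# letters (é -> e + combining accent) and drop whatever won't fit.
def asciify(s: str) -> str:
    decomposed = unicodedata.normalize("NFKD", s)
    return decomposed.encode("ascii", "ignore").decode("ascii")

assert asciify("Ångström café") == "Angstrom cafe"
# Note this silently drops characters with no decomposition (€, £),
# which iconv's transliteration tables would map to EUR or GBP.
```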
posted by flabdablet at 2:05 AM on March 14, 2022 [1 favorite]


Ha, I was just going to mention that I used to have an iconv based Perl script that tried every conversion to at least attempt to narrow things down to the ones that didn't go BOOM when converting. I'd still guess that Perl, and even more so Raku, probably still have the most comprehensive Unicode handling of anything out there.
posted by zengargoyle at 2:24 AM on March 14, 2022


It's technically not full ASCII. It's US-ASCII in all its 128 character-code glory. And it's running on a public server. Sometimes I simply have to get creative, as in:

s/Є/EUR/g
s/£/GBP/g
posted by jim in austin at 6:47 AM on March 14, 2022


The \n vs \r\n division came from having vs not having device drivers. Consider the systems that used \r and that becomes clear that's not really the case.

How so? He effectively says that it is an inherited convention - Windows does it because DOS did, DOS does it because CP/M did, and CP/M did it because it did in fact not have typical device drivers in its early implementations. So it's not that the systems that use \r don't have device drivers themselves, but use a convention that is that way because the originators of that convention weren't using a system with a typical modern device driver model.
posted by Dysk at 7:50 AM on March 14, 2022


Sorry, I should have sketched my disagreement out in more detail. He asserts that the reason the CP/M-DOS-Windows lineage of systems use \r\n is because it started in an era without a modern device driver model, unlike Multics-Unix-Linux.

That doesn't hold water when you consider the systems that used \r, i.e. pretty much all pre-IBM PC 8-bit computers - Apple II, C64, TRS-80, Acorn, Spectrum, BBC Micro. They certainly didn't have a modern device driver model either, and that didn't push them to \r\n.

It's just backcompat within each family, not really about any of them being a more modern design.
posted by bcd at 8:17 AM on March 14, 2022 [1 favorite]


It's technically not full ASCII. It's US-ASCII in all its 128 character-code glory.

I was unaware of any difference; I'd always understood "ASCII" and "US-ASCII" to refer to the same 7-bit encoding standard. What is this "full ASCII" that your site is technically not serving? ASCII is not CP-1252, and nor is it WTF-8.

Just tested iconv -t ASCII//TRANSLIT on a machine with a UTF-8 locale, and it does convert € and £ to EUR and GBP. But if you're actually needing to deal with source material that's abusing Є to mean €, I can offer you nothing more useful than sympathy.
posted by flabdablet at 8:51 AM on March 14, 2022 [1 favorite]


They certainly didn't have a modern device driver model either, and that didn't push them to \r\n.

But they also didn't generally need to interface with teletypes, so their choice of newline character is entirely arbitrary, not bound by the same context. The fact that the logic doesn't apply to those machines doesn't mean it didn't to OSes with a different lineage and context.

(And yes it is backcompat in both cases, it's more about the reason for the original decision that set the standard)
posted by Dysk at 9:08 AM on March 14, 2022


I can offer you nothing more useful than sympathy

QFT

they also didn't generally need to interface with teletypes

But neither did anything running CP/M - its choice was equally arbitrary, not driven by device driver models.

In fact, I'd hazard that most of the machines that were born with teletype consoles didn't save text files with any sort of newline sequence. IBM mainframe OSes and DEC VMS used record-oriented file systems with a line per record and not text streams at all.
posted by bcd at 9:24 AM on March 14, 2022 [1 favorite]


But neither did anything running CP/M

Not so. CP/M was in use on loads of S-100 machines that existed before video displays and inbuilt keyboards became ubiquitous.

The rationale for using CR rather than LF as an end-of-line marker in text files on systems designed after video displays and inbuilt keyboards did become the norm is that by then it was actually pretty rare to encounter a keyboard data entry workflow that required separate CR and LF keystrokes. The Return key was what you hit to terminate a line of entered text, so that's what ended up being used as the line separator in text files on those systems.

I am, however, dubious that the rationale for using LF alone as a line separator in Unix text files had much to do with the presence or absence of device drivers. Seems to me that it's a natural consequence of the fact that in order to make a Teletype print an empty line, all you need to do is send it a LF. The CR isn't necessary unless the line isn't empty which, if you tilt your head and squint just right, makes the CR part of the contents of a line, not part of the separator.

AFAIK the only system that ever put any actual thought into this was Niklaus Wirth's Modula language, which used the ASCII record separator character (RS, \x1E) to separate records in text files. All the other "plain text" conventions are just various degrees of convenience hack; as the presentation in the OP makes perfectly clear, "plain" text never is, and as soon as what you need is a format that isn't just a straight-up capture of either keyboard input or line printer output but is instead a somewhat abstracted sequence of records, repurposing printer control characters to do the record separation job is pretty bogus.
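All three conventions are still with us, which is why splitting on a literal \n remains a bug factory; Python's splitlines honours the lot (a small demonstration, not anything from the thread):

```python
# The same three lines, saved under the three historical conventions.
unix_style = "one\ntwo\nthree"      # LF       (Multics/Unix lineage)
dos_style = "one\r\ntwo\r\nthree"   # CR LF    (CP/M -> DOS -> Windows)
cr_style = "one\rtwo\rthree"        # CR alone (Apple II, C64, classic Mac)

# str.splitlines treats all three as line separators...
for text in (unix_style, dos_style, cr_style):
    assert text.splitlines() == ["one", "two", "three"]

# ...whereas a naive split on '\n' only handles the Unix case
# and leaves stray carriage returns in the others.
assert dos_style.split("\n") == ["one\r", "two\r", "three"]
```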
posted by flabdablet at 9:44 AM on March 14, 2022 [1 favorite]


CP/M was in use on loads of S-100 machines that existed before video displays and inbuilt keyboards became ubiquitous.

With teletypes though? I only remember seeing them with CRT terminals attached. That said, they'd have been pretty old/dumb terminals, and a 'glass tty' is still a teletype as far as needing both a CR and LF before emitting the next line.

Squinting and calling the LF a line separator and the CR the terminator on the end of the previous line is... actually the best rationale I've heard for that decision. Let's go with it.

I am, however, dubious that the rationale for using LF alone as a line separator in Unix text files had much to do with the presence or absence of device drivers.

Yup. That's been my point. Still, a great presentation, and fun to chat with you all about geeky details few remember anymore.
posted by bcd at 10:10 AM on March 14, 2022 [1 favorite]


With teletypes though?

You bet.
posted by flabdablet at 10:46 AM on March 14, 2022 [1 favorite]


That's why the text editor that came with it was a command-driven line editor, not any kind of visual editor.

It can be quite hard for people who have never used any editor worse than the EDIT that came with DOS to understand just what a luxury it is to have an x-y addressable text screen to work with. Line editors look good only by comparison with punch cards.
posted by flabdablet at 11:07 AM on March 14, 2022 [2 favorites]


Heh. Through half of college I used a line editor on a Honeywell mainframe running CP-6 for most 'plain text'. (Though I preferred the half-way-to-a-screen-editor system that was built into the APL workspace support for code in those years. (I realize now I have no idea what encoding that used.)) I didn't start using the new-fangled vi editor until we got some Unix machines in junior year, I think.
posted by bcd at 11:25 AM on March 14, 2022


The first full-screen editor I ever used was vi as well, on an HP 3000. Until then, the best text editor I'd used was XEDIT on NOS.
posted by flabdablet at 11:31 AM on March 14, 2022


Actually that could be a lie. By the time I got access to a campus computer capable of running vi, I'd probably already been using UCSD Pascal on the family Apple II+ for some while, and that had a screen editor for writing code in.
posted by flabdablet at 11:35 AM on March 14, 2022 [1 favorite]


I was unaware of any difference; I'd always understood "ASCII" and "US-ASCII" to refer to the same 7-bit encoding standard.

See: Extended ASCII
posted by jim in austin at 2:54 PM on March 14, 2022


This is great, thank you.

But potentially dangerous and expensive.

Please be warned that there is an extremely funny joke about 27 minutes in.

This joke just caused me to choke on a mouthful of red wine: missing my laptop by inches, I instead spat it all over the carpet.

it's been one of those days
posted by motty at 3:29 PM on March 14, 2022


See: Extended ASCII

If you're going to continue belittling ASCII as "not full ASCII" because it's not the same as any of the 8 bit encodings that embed ASCII as a subset, you and I cannot be friends.

In the circles I move in, "not full ASCII" is typically reserved for describing character sets that are somewhat ASCII compatible but have ranges missing, like the text display encoding used by the Apple II+.

That one was essentially a six-bit encoding that left out all the characters coded as 0x00-0x1F and 0x60-0x7F in ASCII and used the bottom 6 bits of their ASCII code to represent the rest, with the top two bits of a display memory byte specifying display attributes.

My own Apple II+ has a lowercase mod, a physical toggle switch mounted under the keyboard that's wired back to a character generator EPROM with twice the original PROM's capacity and lets me choose whether display data bit 6 goes to the original flashing text generation circuit or to the extra address line on the EPROM. I burnt my own EPROM for this too, being dissatisfied with the font quality of the commercially available Dan Paymar mod.
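That 6-bit scheme can be sketched roughly like this; my reconstruction of the mapping described above, so treat the details as approximate:

```python
# Apple II text-mode display bytes: the top two bits pick the
# display attribute, the bottom six bits pick the glyph.
ATTRS = {0b00: "inverse", 0b01: "flashing", 0b10: "normal", 0b11: "normal"}

def decode_screen_byte(b: int):
    """Return (character, attribute) for one display-memory byte."""
    attr = ATTRS[b >> 6]
    low6 = b & 0x3F
    # Glyphs 0x00-0x1F are ASCII @ A-Z [ \ ] ^ _ ; glyphs 0x20-0x3F
    # are ASCII space through '?'. No lowercase anywhere in sight.
    char = chr(low6 | 0x40) if low6 < 0x20 else chr(low6)
    return char, attr

assert decode_screen_byte(0xC1) == ("A", "normal")   # normal 'A'
assert decode_screen_byte(0x01) == ("A", "inverse")  # same glyph, inverse
```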
posted by flabdablet at 9:42 PM on March 14, 2022


If you're going to continue belittling ASCII as "not full ASCII" because it's not the same as any of the 8 bit encodings that embed ASCII as a subset, you and I cannot be friends.

Heh! Actually I'm doing just the opposite. The IANA specifically supports the name US-ASCII for the 7-bit character set suggesting that the term ASCII is more generic in usage. A glance at my website (linked on my profile) will show upwards of 100 .txt US-ASCII compliant files with nary a whiff of code page 437 to be found...
posted by jim in austin at 7:59 AM on March 15, 2022


The IANA specifically supports the name US-ASCII for the 7-bit character set suggesting that the term ASCII is more generic in usage.

Heresy! Mendacity! Calumny! Lies!

Wikipedia has it right:
ASCII
From Wikipedia, the free encyclopedia
(Redirected from US-ASCII)

Not to be confused with MS Windows-1252 or other types of extended ASCII.
This article is about the character encoding. For other uses, see ASCII (disambiguation).
The international standard ISO 646, which followed ASCII and also specifies a 7-bit encoding, was developed in conjunction with ANSI (the organization that published the original ASCII spec) and sometimes did get referred to as "ASCII" even though it also specified non-US variants; for example, for a while it was possible to buy "UK ASCII keyboards" with e.g. a £ on the 3 key where the # would be on a US one, and "ASCII" printers with DIP switches you could set to make them print e.g. £ instead of #.

So I can see some justification for IANA's preference for its made-up name "US-ASCII" on disambiguation grounds. But I remain firmly of the opinion that referring to ASCII as "not full ASCII" on no better basis than that it isn't CP-437 or Windows-1252 or any of the other 8-bit extended hacks is horrendously disrespectful and not to be tolerated. If you gainsay that, I'm afraid it will have to be pistols at dawn. Sir.
posted by flabdablet at 9:14 AM on March 15, 2022


As for anybody who insists on pronouncing "ASCII" as eh ess see two, it won't be pistols at dawn, it will be a great big net and relentless poking with pointed sticks.

You know who you are.
posted by flabdablet at 9:22 AM on March 15, 2022


This seems to be a problem of usage and scope. I'm saying "golf ball" while holding a golf ball in my hand. I use it specifically to play golf. You're simply saying "ball" as you look at it. Technically we're both correct, only I'm being more precise by excluding all other types of ball-like objects. Now I think I'll take my US-ASCII Titleist and go work on my short game...
posted by jim in austin at 9:45 AM on March 15, 2022


This was great. The nit that I can pick is that æ really is a distinct character from an a-e ligature. It's named "ash," pronounced like the "a" in "cat," and was used in Old English (admittedly, I don't know if "ae" ever would have appeared in Old English apart from æ).

He edged right up to talking about Volapuk encoding, which would have been a fun digression. And apart from saying "we can't even get out of bed with 7 bits," he didn't talk about any of the multibyte encoding systems that preceded Unicode. Japanese alone had three systems in widespread use, one of which used only the lower seven bits in a byte, together with escape sequences, to survive old e-mail systems intact. I don't miss dealing with all that.
posted by adamrice at 10:52 AM on March 15, 2022 [1 favorite]


I'm saying "golf ball" while holding a golf ball in my hand. I use it specifically to play golf.

Sure, and I'm not objecting to that.

I'm just all up in arms about the part where you said it was technically not a full golf ball because it's only golf ball sized.

For the dignity of golf balls everywhere, I demand a full retraction.
posted by flabdablet at 12:01 PM on March 15, 2022


Volapuk looks quite 1337.
posted by flabdablet at 12:09 PM on March 15, 2022


I got my "well volunteered" side-not-job at a Japanese language learning website because of SJIS plus PHP plus crappy PHP code that didn't use proper escaping with the SJIS-configured MySQL database. There were like 12 kanji that couldn't be used because their encoding included a '\' that got mangled. I proposed a fix to the owner; he just gave me the credentials and said "it's your problem now". I fixed the PHP, then fixed about 7 of the bad kanji by reverse engineering, but had to crowd-source the remainder because the mapping wasn't isomorphic. Then I just converted the database and site to UTF-8, fuck SJIS, and submitted patches to the PHP bulletin board software to fix their horrible code.
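The byte-level cause is easy to reproduce: Shift-JIS trail bytes can land on 0x5C, ASCII's backslash, which naive escaping code then mangles. The katakana ソ is the classic offender; a quick check:

```python
# In Shift-JIS, the second byte of a two-byte character can be 0x5C,
# which is '\' in ASCII. Naive byte-level escaping corrupts it.
so = "ソ"                        # katakana SO, the classic offender
encoded = so.encode("shift_jis")
assert encoded == b"\x83\x5c"    # trail byte is 0x5C, i.e. backslash

# What a byte-level "addslashes" style escaper would do to it:
mangled = encoded.replace(b"\\", b"\\\\")
assert mangled != encoded        # the character is now corrupted

# UTF-8 never has this problem: its continuation bytes are all >= 0x80.
assert all(b >= 0x80 for b in so.encode("utf-8"))
```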

I have a feeling that horrible encoding stories are a shibboleth of text wrangling.
posted by zengargoyle at 12:30 PM on March 15, 2022 [1 favorite]


Interestingly, UTF-8 has gotten beyond all this ASCII, US-ASCII nonsense and simply refers to its first 128 code points as Basic Latin. Works for me...
posted by jim in austin at 1:00 PM on March 15, 2022


A glance at my website (linked on my profile) will show upwards of 100 .txt US-ASCII compliant files with nary a whiff of code page 437 to be found...

Finally got around to taking that glance and, as it happens, your site appears to be configured to serve those files with a content-type: text/plain HTTP header that has no explicit charset specification, which is going to make browsers pick a default encoding client-side, which has a good chance of being wrong.

For example, on my browser the last line of /US-ASCII/Reads/jefferson_bible_edit.txt has a Š in it that's clearly supposed to be a ©, the browser having guessed that the intended encoding was ISO-8859-2 when it really should have been Windows-1252; /US-ASCII/Reads/dicta_philosophi.txt has a bunch of UTF-8-encoded … ellipsis characters sprinkled throughout, which my browser renders as â€¦ because it's incorrectly assumed Windows-1252. US-ASCII/Geekish/various/units.txt is clearly supposed to be UTF-8 (parts of it are even tagged as such with !utf8 and !endutf8 markers), but again the browser guesses Windows-1252 and mojibakes it.

The complete list of non-ASCII files I found on the site is as follows:

/index.html
/US-ASCII/Comestibles/recipes/1886_cheese_soup.txt
/US-ASCII/Comestibles/various/maillard.txt
/US-ASCII/Geekish/various/misophonia.txt
/US-ASCII/Geekish/various/rfc-1121_poets.txt
/US-ASCII/Geekish/various/units.txt
/US-ASCII/Reads/borges_enigmas.txt
/US-ASCII/Reads/boswell.txt
/US-ASCII/Reads/burkean.txt
/US-ASCII/Reads/dicta_philosophi.txt
/US-ASCII/Reads/jefferson_bible_edit.txt

and my best guess is that all of these are UTF-8 encoded except jefferson_bible_edit.txt, which is Windows-1252. So adding a charset=UTF-8 qualifier to your text/plain and text/html headers would fix almost all of these issues without messing up any of the majority of files that are indeed ASCII, leaving only jefferson_bible_edit.txt in need of format conversion.
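Both failure modes described above reproduce exactly in a couple of lines, assuming those encoding guesses are right:

```python
# Case 1: a Windows-1252 © byte, read as ISO-8859-2, shows up as Š.
copyright_byte = "©".encode("cp1252")       # b'\xa9'
assert copyright_byte.decode("iso-8859-2") == "Š"

# Case 2: a UTF-8 ellipsis, read as Windows-1252, shows up as â€¦.
ellipsis_bytes = "…".encode("utf-8")        # b'\xe2\x80\xa6'
assert ellipsis_bytes.decode("cp1252") == "â€¦"
```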
posted by flabdablet at 2:13 PM on March 15, 2022 [1 favorite]


Metafilter: Calumny! Lies! Pistols at dawn! Poking with pointy sticks! btw, I fixed your website.
posted by fantabulous timewaster at 2:42 PM on March 15, 2022 [4 favorites]


The same speaker has another really fun video on yt -> The Art of Code

I am not a CS person at all and know scarcely enough to be dangerous but I always find these threads to be ENDLESSLY diverting. A couple of memories have been coughed up, however:

- typing a program on a vic 20 out of Compute! magazine in the early 80s that allowed one to edit the pixels in a given character map and either remap it to the keyboard or park it to call elsewhere. We had a huge amount of fun with that. and yet, like everything else with me and computers as a kid, I never got curious about what was going on under the covers, and ended up studying chemical engineering of all things. The amount of time I spent AROUND computers as a kid you'd have thought for sure statistically I'd end up a CS nutjob of some flavor.
- friend of mine had a PC junior and it had, in addition to "mode CO80" and "mode CO40" in the dos text view, an option for "mode CO20". the characters all looked like fat donuts. I used to try and use it all the time because I thought it was so much fun but just managed to piss off my friend.
- I know FONTS are a whole other thing...but we had an epson dot matrix hooked up to an original ibm pc (not even an XT! what's a hard drive?!) and it had dip switches on the back that you could use to set a few different fonts: italics, small caps, I think there was some gobbledygook wingdings. I used to fuck with it late at night and then my dad would get up and lose his shit because his stuff wouldn't print right.
posted by hearthpig at 6:18 AM on March 16, 2022 [1 favorite]




This thread has been archived and is closed to new comments