iloveyou
February 15, 2021 1:06 PM   Subscribe

A belated Valentine's Day post on exploring foldable words by Brian Hayes
posted by They sucked his brains out! (15 comments total) 18 users marked this as a favorite
 
I like the original idea, but dear LORD this soon got complicated with all the math.
posted by jenfullmoon at 1:28 PM on February 15, 2021


I was charmed by the original thing, and then sort of just scrolled down. He's having fun figuring this stuff out, really, so good on him. Mostly, though, the whole opening is charming, and is a creativity I never would have thought of myself.
posted by hippybear at 1:31 PM on February 15, 2021


Or...

egrep '^i?t?w?a?s?a?p?l?e?a?s?u?r?e?t?o?s?e?r?v?e?y?o?u?$' /usr/share/dict/words
posted by Tad Naff at 1:52 PM on February 15, 2021 [5 favorites]


But is he still married to Lynn ?!
posted by Morpeth at 1:57 PM on February 15, 2021


But is he still married to Lynn ?!

The post indicates (somewhat obliquely) that they were together for 17 years.
posted by jedicus at 2:15 PM on February 15, 2021 [3 favorites]


I appreciated the walk through a series of efforts before he arrived at a solution he was satisfied with. Oftentimes algorithms are presented as though they sprang fully formed from the head of Knuth, which is useful if all you want is a recipe, but in my experience it's far more educational to be guided from the naïve solution, past some blind alleys, and through some improvements before finally arriving at an optimal or at least good-enough solution.
posted by jedicus at 2:19 PM on February 15, 2021 [6 favorites]


egrep '^i?t?w?a?s?a?p?l?e?a?s?u?r?e?t?o?s?e?r?v?e?y?o?u?$' /usr/share/dict/words
Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.
posted by They sucked his brains out! at 3:40 PM on February 15, 2021 [3 favorites]


I think he misses Lynn.

[There's no "m" and no "n" in "It's a pleasure to serve you," so he can't just fold it to "I miss Lynn." *sadface emoji*]
posted by chavenet at 3:48 PM on February 15, 2021 [2 favorites]


If I'd been there in the Maple Diner in 1967 and I noticed for the first time that I could fold the straw wrapper to say i love you, I might have been so overwhelmed by the coolness of it that I would hand it over even though I didn't love him, and then be too embarrassed to back out until we'd been married for 17 years.
posted by moonmilk at 4:45 PM on February 15, 2021 [6 favorites]


Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

Some people take Jamie Zawinski's pithy observation altogether too seriously and end up doing far too much work as a result.

The thing about regexes is that the class of problems they can address is enormous and the underlying engines have been so finely tuned over the years that a well considered regex solution will usually run orders of magnitude faster than anything else.

Here's how Tad Naff's very tidy solution above runs on my rather elderly desktop PC:

stephen@jellyshot:~$ time egrep --ignore-case '^i?t?w?a?s?a?p?l?e?a?s?u?r?e?t?o?s?e?r?v?e?y?o?u?$' /usr/share/dict/words >/tmp/folds

real 0m0.009s
user 0m0.008s
sys 0m0.000s
stephen@jellyshot:~$ cat /tmp/folds
A
Al
Alar
Aleut

(540 lines snipped)

wove
wry
y
yo
you
stephen@jellyshot:~$

If I were going to turn it into a tool that I'd be using more than once, all I'd add is a tiny dab of code (itself built around a couple of regexes) to convert a plain-text candidate folding phrase into the corresponding search regex.
posted by flabdablet at 6:19 PM on February 15, 2021 [1 favorite]


in my experience it's far more educational to be guided from the naïve solution, past some blind alleys, and through some improvements before finally arriving at an optimal or at least good-enough solution.

I agree, so let me break down that way Tad Naff's solution works.

For those not yet initiated into the Dark Arts, here are the applicable rules of the regex game:

1. Every character in a regex except those considered special by the regex parser just matches itself.

2. ^ is a special character. When it occurs at the beginning of a regex, it says that the rest of the regex must match starting at the beginning of the line it's being matched against in order to succeed.

3. $ is a special character. When it occurs at the end of a regex, it says that the rest of the regex must match all the way up to the end of the line it's being matched against in order to succeed.

A consequence of rules 2 and 3 is that any regex beginning with ^ and ending with $ has to match the whole input line in order for the match to succeed.

Some examples: band is a regex that would match any of the input lines band or bandolier or abandon or contraband, because the sequence "band" occurs somewhere within each of those input lines; ^band would match band or bandolier, but not abandon or contraband; band$ would match band or contraband but not bandolier or abandon; and ^band$ would match band and nothing else.

3. ? is a special character. It means that matching the character immediately before it is optional. So ^b?and$ is a regex that would match only band or and and nothing else, and ^ban?d$ is a regex that would match only band or bad and nothing else.

The way to read something like ^b?a?n?d?$ is that it will match any input line consisting solely of each of the letters b, a, n and d in that order, of which any could be missing. band or bad or and or ad would match; dab wouldn't, and neither would bandaid or flubber.

In other words, it's a specification of a test that checks if you could make a given input line out of a given sequence of letters just by deleting some of them. Which is exactly the "foldable words" requirement.

egrep is a filter tool that matches a given regex against each of the lines in an input file in turn. Lines that match the regex make it through to the output of the filter, while those that don't are discarded. /usr/share/dict/words is a massive list of words, formatted as one word per line. So what Tad Naff's one-liner is doing is matching each of the words in a large dictionary, in turn, against the letter sequence from the drinking straw and outputting only those that can be constructed from it by deleting some of its letters.

And because the matching is done by a regex engine, there's masses of optimization going on under the hood to make it run fast. The cunning indexing scheme that Brian Hayes codes up explicitly in Python is almost sure to be in there somewhere, along with decades worth of other speedups invented by many, many people who have been thinking about these things for longer than anybody.

If any of this piques your interest, I strongly recommend taking note of Zawinski's warning and then boldly dipping a toe into the deep dark waters of regex coding regardless. It's certainly a decision I've never regretted. It's also easier to do nowadays than it was when I got started with it.
posted by flabdablet at 7:37 PM on February 15, 2021 [6 favorites]


Whether or not the author meant it to be, that's quite a nice poem he made there. Reminds me of this.
posted by How much is that froggie in the window at 10:29 PM on February 15, 2021


that's quite a nice poem he made there

I like this viewpoint. I thought his foldable work was sweet and sad and sincere, as human as it was mechanically generated. It was interesting to me in a deep way that not much programming does for me, these days.
posted by They sucked his brains out! at 10:59 PM on February 15, 2021 [1 favorite]


I like Debuggex for working on regexes; it generates railroad diagrams, which can help with debugging.
posted by Pronoiac at 11:01 PM on February 15, 2021


Ugh, I just noticed that I used "It was a pleasure", not "It's a pleasure" so all those results with W don't count, sorry. Also thanks to flabdablet for a) compensating for capital letters (--ignore-case) and b) that very generous and thorough explanation of my flip and doubtlessly inscrutable-for-most post. I'll add that this all assumes a Linux (or Mac) system; maybe Windows supplies grep or an equivalent now in Powershell, but I hopped off the Windows train years ago (mostly because it didn't provide useful tools like grep, or a handy list of all the words in the language).

Regexes are great and I don't deny that my one-liner is thanks to a lot of shoulder-standing, but the saying attributed to Zawinski does hold some truth -- for example, it's a common dictum that one shouldn't try to parse HTML with regexes. Right tool for the job, as they say. Of course, I parse HTML with regex because I am a bad person.

To make phrases out of a given text is a much bigger challenge. Getting "I", "love", and "you" in the results doesn't guarantee that the phrase could be made by folding: that could be the same "o" from the original occurring in both "love" and "you", for example, or the starting text might have the "y" preceding the "v". This is where the Python (or whatever other higher-level language) would come in, to track where in the original that the found words occur and then find possible phrases of non-overlapping words.

And then you'd get a bunch of "phrases" like "Aleut ore you" because the computer doesn't know about parts of speech, so in the end you still need a human to pick out the ones that make sense. It's kind of wonderful that Lynn could just, you know, do it.

I could tell a similar story about a childhood friend who showed me what one can do with an Eat-More candy wrapper, but maybe not.
posted by Tad Naff at 11:29 PM on February 15, 2021


« Older Have you ever felt excluded from the definition of...   |   Pier Paolo Pasolini's The Gospel of Saint Matthew Newer »


This thread has been archived and is closed to new comments