The Bit Bomb
August 30, 2017 4:45 AM   Subscribe

"Shannon paved the way to do this rigorously, by encoding our messages in a series of digital bits, each one represented by a 0 or 1. He showed that the speed with which we send messages depends not just on the kind of communication channel we use, but on the skill with which we encode our messages in bits. ... If we could not compress our messages, a single audio file would take hours to download, streaming videos would be impossibly slow, and hours of television would demand a bookshelf of tapes, not a small box of discs. All of this communication – faster, cheaper, more voluminous – rests on Shannon’s realisation of our predictability."
posted by jenkinsEar (26 comments total) 47 users marked this as a favorite
 
That's a lovely article. I remember learning information theory in college and how that changed my perception and understanding of communication, compression and encryption forever.
posted by Emmy Noether at 5:07 AM on August 30, 2017 [3 favorites]


That's a remarkably clear and well written explanation of information theory, and entertaining to boot.
posted by diziet at 5:19 AM on August 30, 2017 [2 favorites]


Yay, a [short] Claude Shannon thread!

I welcome any chance to recommend "An Introduction to Information Theory: Symbols, Signals, and Noise," which synthesizes/summarizes a lot of the early information theory.
posted by aspersioncast at 5:25 AM on August 30, 2017 [5 favorites]


I loooooooove teaching error-correcting codes. It's one of those things any student can grasp, and it kind of leaves you no room to deny that math is awesome.
posted by escabeche at 5:43 AM on August 30, 2017 [7 favorites]


My discipline, which can benefit massively from information theory, remains largely unimpressed by the idea. In fact it's a bit hard to find peers who are willing to review. ¯\_(ツ)_/¯

/randomfrustratedrantonmefi
posted by runcifex at 5:47 AM on August 30, 2017


That sounds like an opportunity! What discipline is this?
posted by TreeRooster at 5:49 AM on August 30, 2017 [2 favorites]


Pedant on the second quote: Audio/visual compression relies on discarding selected information, which contributes less to meaning. E.g. lossless audio compression in the form of FLAC can only about halve the bitrate. Compare to Opus or MP3, which reduce the bitrate 1000 or 500 times, without us being able to tell the difference. Hence why the full quote is that he "pointed the way toward some of those codes"; the article uses it in passing, as a little sleight of hand to introduce and contrast error-correcting codes.

Lossless coding is still important in encoding the information that remains. But the first quote is more significant: getting to the point where you can talk interchangeably about moving pictures, magnetic tape, or patterned disks as holding the exact same information. (If I were forced to recast that paragraph in terms of lossless codes, I would say they're used to transfer/store graphical interfaces, including websites; there the redundancy is much stricter than in audio.)
posted by sourcejedi at 5:51 AM on August 30, 2017 [3 favorites]


The article is quite good. The point is really well made that random strings contain more information, so that information and meaning are inversely related, in some sense.

Are there easy examples of how compression works to take advantage of this? Is it entirely ad-hoc, as in dependent on the language and the alphabet, so that we can introduce a code for "QU" and save a bit or two?
posted by TreeRooster at 5:53 AM on August 30, 2017 [1 favorite]




sourcejedi: Compare to Opus or MP3, which reduce the bitrate 1000 or 500 times, without us being able to tell the difference.

500 times? More like 10 times for MP3 if we treat 128 kbps as transparent. And 40-50 times for Opus encodes of speech.
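
(Back-of-the-envelope, assuming the baseline is uncompressed CD audio at 16 bits x 44.1 kHz x 2 channels:)

    cd_kbps = 16 * 44_100 * 2 / 1000   # uncompressed CD audio, ~1411 kbps
    print(cd_kbps / 128)               # ~11x for 128 kbps MP3
    print(cd_kbps / 32)                # ~44x for ~32 kbps Opus speech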
posted by Gyan at 6:06 AM on August 30, 2017 [4 favorites]


Comparing the beginning of the experiment with its end – ‘XFOML RXKHRJFFJUJ’ versus ‘ATTACK ON AN ENGLISH WRITER’ – sheds some light on the difference between the technical and colloquial significance of ‘information’. We might be tempted to say that ‘ATTACK ON AN ENGLISH WRITER’ is the more informative of the two phrases. But it would be better to call it more meaningful. In fact, it is meaningful to English speakers precisely because each character is less surprising, that is, it carries less (Shannon) information.

See "The First Law of Complexodynamics".
posted by a snickering nuthatch at 6:11 AM on August 30, 2017 [1 favorite]


What's a couple of zeros between friends? I was treating Opus as transparent at 128 kbit/s, which means around 10x for Opus (and 5x for MP3 at 256 kbit/s).
posted by sourcejedi at 6:14 AM on August 30, 2017


He flipped to a random passage in Chandler’s story, which he then read out letter by letter to his wife, Betty. Her role was to guess each subsequent letter until she got the letter right, at which point Shannon moved on to the next one. Hard as this was at the beginning of words, and especially at the beginning of sentences, Betty’s job grew progressively easier as context accumulated. For instance, by the time they arrived at ‘A S-M-A-L-L O-B-L-O-N-G R-E-A-D-I-N-G L-A-M-P O-N T-H-E D’, she could guess the next three letters with perfect accuracy: E-S-K.

Claude expanded on this idea by imagining a game: have one partner guess the successive letters of a sentence, as Betty did here, and write down (for each letter) how many guesses it took. Then give a third player just those numbers. That third player guesses in the same way, letter by letter; each number tells them which of their guesses is the actual letter, so the list of counts alone is enough to reconstruct the sentence. (I recently used this to encode an FPP title.)

Of course, unless the second and third players are in perfect agreement about letter, word, and sentence frequency, they will get different messages.
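
Here's a minimal sketch of that game as code in Python, assuming (unrealistically) that both players share the same context-free model: just a fixed guessing order over letters. Shannon's real game leans on context, which is why Betty's counts were so small; the ranking below is made up for illustration.

    RANKING = " ETAOINSHRDLCUMWFGYPBVKJXQZ"   # shared guess order, commonest first

    def encode(message):
        # For each character, record how many guesses the shared order needs.
        return [RANKING.index(ch) + 1 for ch in message.upper()]

    def decode(counts):
        # Replay the same guessing order; each count says which guess was right.
        return "".join(RANKING[n - 1] for n in counts)

    counts = encode("ATTACK ON AN ENGLISH WRITER")
    print(counts)            # mostly small numbers, since common letters come up early
    print(decode(counts))    # ATTACK ON AN ENGLISH WRITER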
posted by a snickering nuthatch at 6:25 AM on August 30, 2017 [2 favorites]


A short comic peripherally about Claude Shannon.

posted by leotrotsky at 6:41 AM on August 30, 2017 [5 favorites]


Are there easy examples of how compression works to take advantage of this? Is it entirely ad-hoc, as in dependent on the language and the alphabet, so that we can introduce a code for "QU" and save a bit or two?

The compression is running on top of another encoding - these days text strings are often encoded as UTF-8. So before the compression algorithm gets to the data, it's already a stream of bits, and the algorithm doesn't know where one character ends and another begins.

However, if there are multiple "QU" pairs in your string, the algorithm will "notice" that repeating bit pattern and take that into account.

tl;dr - lossless compression algorithms are usually data agnostic and just look at the bits involved.
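
A quick way to see that with Python's zlib (DEFLATE), on made-up data: a string full of repeated QU-words versus random bytes of the same length.

    import os
    import zlib

    repetitive = b"QUICK QUILT QUERY QUOTE " * 40
    noise = os.urandom(len(repetitive))

    print(len(repetitive), "->", len(zlib.compress(repetitive)))   # shrinks a lot
    print(len(noise), "->", len(zlib.compress(noise)))             # barely changes, may even grow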
posted by murphy slaw at 6:50 AM on August 30, 2017 [3 favorites]


> Are there easy examples of how compression works to take advantage of this? Is it entirely ad-hoc, as in dependent on the language and the alphabet, so that we can introduce a code for "QU" and save a bit or two?

One of the most powerful examples, for me, was taking a short phrase and compressing it using the LZW algorithm by hand (as in, on paper). You really see, almost instantly, how the compression takes advantage of repetition!

As to "ad hoc" -- yes and no, depending on how you mean it. The way LZW works is by building a dictionary as it walks down the input string. In that sense, the encoding is specific to its input, and hence dependent on the language and alphabet of whatever is being encoded. But it's not pre-specified in any way; that is, I don't tell it ahead of time "my input is English, so use a single bit for QU" -- it "learns" that itself as it chugs along.

(Doing LZW by hand was so striking to me that I still have the piece of paper I did it on, years later. The phrase I used was "the rain in spain falls mainly on the plain," which is nicely repetitive & gives results quickly. It was nearly a decade ago, and I can still tell you exactly where that piece of paper is. I recommend it -- it's a really fun and instructive exercise!)
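
If you'd rather not do it on paper, here's a bare-bones LZW encoder in Python: a sketch of the same pencil-and-paper procedure, not any particular library's implementation.

    def lzw_encode(text):
        dictionary = {chr(i): i for i in range(256)}    # start with single-character codes
        current, output = "", []
        for ch in text:
            candidate = current + ch
            if candidate in dictionary:
                current = candidate                     # keep extending the match
            else:
                output.append(dictionary[current])      # emit the code for what we had
                dictionary[candidate] = len(dictionary) # learn the new substring
                current = ch
        if current:
            output.append(dictionary[current])
        return output

    phrase = "the rain in spain falls mainly on the plain"
    codes = lzw_encode(phrase)
    print(len(phrase), "characters ->", len(codes), "codes")
    print(codes)

Each output code is wider than a byte, so the real savings show up on longer inputs, but even here you can watch it reuse "in" and "ai" and then learn longer chunks like "ain" as the phrase repeats.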
posted by Westringia F. at 6:53 AM on August 30, 2017 [14 favorites]


Would have been nice to give a reference to the original 1948 paper; it's not hard to get hold of, even if you're worried about linkrot. (E.g. here is a pdf.)

Shannon is definitely one of the architects of the digital era.
posted by PMdixon at 7:00 AM on August 30, 2017 [1 favorite]


It is in the first paragraph. I am an idiot for missing it the first time.
posted by PMdixon at 7:01 AM on August 30, 2017


Are there easy examples of how compression works to take advantage of this?

Visually, think about a news anchor in front of a plain background. Just keeping track of the plain area across all the frames as one "byte" of blue reduces the video by perhaps 50%. This is NOT how it works, and the algorithms use incredible math to shake out every unneeded bit of data. But it is kinda essentially how it works: remove repeats and redundancies, make a list (table) of each color (bit pattern) actually used, and ignore stuff not in the picture/file.
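
Not how real codecs work either, but here's a toy run-length encoder in Python in the same spirit: a long stretch of identical "blue background" pixels collapses into one (value, count) pair.

    from itertools import groupby

    def rle(pixels):
        # Collapse each run of identical values into (value, run length).
        return [(value, len(list(run))) for value, run in groupby(pixels)]

    row = ["blue"] * 300 + ["skin", "skin", "hair"] + ["blue"] * 200
    print(rle(row))   # [('blue', 300), ('skin', 2), ('hair', 1), ('blue', 200)]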
posted by sammyo at 8:05 AM on August 30, 2017 [2 favorites]


tl;dr - lossless compression algorithms are usually data agnostic and just look at the bits involved.

Maybe I'm misunderstanding you, but something like FLAC isn't exactly data-agnostic, is it? At least, it's presumably optimized for the properties of audio? Generic compression formats like zip or rar (which are generally based on LZ*, Huffman coding, I dunno what else) are pretty data-agnostic.
posted by atoxyl at 9:40 AM on August 30, 2017


It is in the first paragraph I am an idiot for missing that the first time.

On the bright side, by repeating the message you've reduced the probability that other people will fail to receive it.
posted by justsomebodythatyouusedtoknow at 9:52 AM on August 30, 2017 [11 favorites]


If we reverse the Shannon letter guessing game, and I give you a letter and you guess the one that came before it, you'll still do well enough at it. This might be the key insight for why the Burrows-Wheeler transform works. If I give you an "n", there's a very short list, mostly vowels, of letters likely to precede it.

The only good explanation of BWT I've found online is from a bioinformatics researcher. BWT helps compression because it tends to put strings of the same letter in a row. And BWT achieves this because of the Markov properties of the English language.

The Compressor Head series covers a small range of compression topics, but of relevance to my comment is this one on BWT, featuring an interview with Burrows.
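
For the curious, a naive BWT is only a few lines of Python: sort every rotation of the string and keep the last column. (Real implementations use suffix arrays rather than building all the rotations, but the clustering effect is the same.)

    def bwt(text, sentinel="\0"):
        text += sentinel                      # marks where the original string ends
        rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
        return "".join(rotation[-1] for rotation in rotations)

    print(repr(bwt("banana")))    # 'annb\x00aa': the a's and n's bunch together
    print(repr(bwt("the rain in spain falls mainly on the plain")))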
posted by pwnguin at 9:54 AM on August 30, 2017 [1 favorite]


Maybe I'm misunderstanding you but something like FLAC isn't exactly data-agnostic, is it?

Its compression algorithm, contained in libFLAC, is indeed optimised for audio, but nothing technically prevents it from being used to compress a JPEG or something. It just won't do it as efficiently as it compresses audio.

However, the need for archival digital compression of visual and audio media is relatively new, so most of the older compression algorithms are indeed data-agnostic.
posted by tobascodagama at 11:07 AM on August 30, 2017


Its compression algorithm, contained in libFLAC, is indeed optimised for audio, but nothing technically prevents it from being used to compress a JPEG or something. It just won't do it as efficiently as it compresses audio.

A fundamental principle of lossless compression is that, whatever your algorithm, if you want to make some inputs smaller, you have to make other inputs bigger. FLAC is optimized for "typical" audio, and assumes that its input represents a continuous function. It works by extrapolating the next sample from the previous ones, and only compressing the difference. This might kind of work okay for some images, but it'll pretty much always be worse than PNG. FLAC would work particularly badly on discrete inputs, like text or compiled software, where the bytes are just mapped to arbitrary symbols or operations, and there isn't really significance to the numeric relations between successive values. Like in English, the sequence "def" is never followed by "g" despite what numeric extrapolation might suggest.
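
A sketch of that idea in Python, nothing like FLAC's actual format (which uses higher-order predictors and entropy coding), just the shape of it: predict each sample as the previous one and keep only the residual.

    def residuals(samples):
        previous, out = 0, []
        for s in samples:
            out.append(s - previous)   # store only the prediction error
            previous = s
        return out

    smooth = [100, 102, 105, 107, 108, 108, 106]   # audio-like: small steps
    text = [ord(c) for c in "the rain"]            # arbitrary symbol codes
    print(residuals(smooth))   # [100, 2, 3, 2, 1, 0, -2]  small, cheap to encode
    print(residuals(text))     # [116, -12, -3, -69, 82, -17, 8, 5]  large, erratic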
posted by aubilenon at 12:38 PM on August 30, 2017 [5 favorites]


That sounds like an opportunity! What discipline is this?

I'm curious too.
posted by Coventry at 7:30 PM on August 30, 2017


> That sounds like an opportunity! What discipline is this?
> I'm curious too.

Astrophysics.

Whenever I try to raise awareness, I'm mostly seen as a stupid git at best in need of tolerance.

Maybe I am.
posted by runcifex at 10:42 PM on August 30, 2017 [1 favorite]




This thread has been archived and is closed to new comments