Cat images reportedly unaffected
August 5, 2013 7:06 AM

Xerox scanners/photocopiers randomly alter numbers in scanned documents
posted by a snickering nuthatch (108 comments total) 38 users marked this as a favorite

While this sounds like some kind of ominous copyright protection or counterfeit prevention feature, it's almost certainly just a bug in image processing. Great technical discussion on Hacker News
posted by Frayed Knot at 7:11 AM on August 5, 2013 [5 favorites]

There's no point in writing fiction set in the modern day, is there?
posted by The Whelk at 7:13 AM on August 5, 2013 [20 favorites]

Not if the printer's going to change the dates, at least.
posted by ardgedee at 7:15 AM on August 5, 2013 [22 favorites]

My workplace has a possessed Xerox Workcentre (photocopier of the year 2008 my butt!) 1st and I have noticed this on occasion along with letters being smushed enough - anything with "rn" tends to come out as an "m", for example.
posted by Calzephyr at 7:15 AM on August 5, 2013

I just can't imagine how huge this could be. The liability could be crushing.
posted by Goofyy at 7:18 AM on August 5, 2013 [8 favorites]

Because copying documents accurately is something out of the future.
posted by swift at 7:21 AM on August 5, 2013 [1 favorite]

Photocopying is something that can be done without computer assistance at all. That's something computers have added to this process, incredible new ways to have bad output.
posted by JHarris at 7:25 AM on August 5, 2013 [23 favorites]

OMG photocopies are not as clear as originals?!?!!!? Next you'll be telling me that those cassette recordings I made off the radio aren't indistinguishable from the original masters!
posted by yoink at 7:25 AM on August 5, 2013

You should maybe look at the article, yoink, it's not just clarity. There are clearly switched & duplicated numbers.
posted by echo target at 7:27 AM on August 5, 2013 [38 favorites]

OMG photocopies are not as clear as originals?!?!!!? Next you'll be telling me that those cassette recordings I made off the radio aren't indistinguishable from the original masters!

Didn't read the article, huh?

This is a bug caused by computers attempting to FIX the issue of poor copy readability and actually making it far, far worse.
posted by showbiz_liz at 7:29 AM on August 5, 2013 [14 favorites]

It's not a question of clarity, it's an issue of clearly printed wrong numbers. The copier is trying to be clever and actually read the text to re-print it clearly. It's more akin to uploading your radio show to youtube and allowing the content to be determined by the automatically-generated subtitles.

Except in that case, no one is going to use your robo-captioned radio rip to determine the health of a corporation... or an individual.
posted by Mr. Anthropomorphism at 7:30 AM on August 5, 2013 [2 favorites]

Yeah pipe down it's just an artifact. The examples are sixes becoming eights; that sort of thing happens because 6 and 8 are very similar. It won't change a 4 to a 5 because this machine doesn't recognize characters, it just reproduces the image on the page (with a certain amount of noise, obvs).
posted by Mister_A at 7:30 AM on August 5, 2013

Oops looks like maybe there is more to the story!
posted by Mister_A at 7:31 AM on August 5, 2013 [3 favorites]

The examples are sixes becoming eights; that sort of thing happens because 6 and 8 are very similar. It won't change a 4 to a 5 because this machine doesn't recognize

So, the example where it changed a box with the number 14 in it to one with the number 17, and the one with 21 into 14, fits into that theory how?
posted by eriko at 7:33 AM on August 5, 2013 [9 favorites]

It's sort of cute, like the scanner really wants to help! but doesn't know how to read or do anything correctly
posted by theodolite at 7:34 AM on August 5, 2013 [33 favorites]

Correct link to Hacker News discussion.
posted by Pyrogenesis at 7:35 AM on August 5, 2013 [2 favorites]

Photocopying is something that can be done without computer assistance at all. That's something computers have added to this process, incredible new ways to have bad output.

In other words, "To err is human, but to really screw something up you need a computer."
posted by TedW at 7:35 AM on August 5, 2013 [5 favorites]

as per usual consider reading the article or at least a handful of the other comments.
posted by jessamyn (staff) at 7:35 AM on August 5, 2013 [32 favorites]

For anyone who is still not clear, basically Xerox decided to use a compression algorithm called JBIG2 which finds "symbols" in the image and attempts to group them into duplicates. So it tries to find all of the 7s in the document, keeps a copy of one, and repeats that copy throughout the document. It is not hard-coded to know exactly what a 7 looks like in the target font, it just groups things that look the same together, and in a scanned document all of the 7s are going to look pretty similar. The problem is that it looks like in a lot of cases it's for example grouping a bunch of 1s, a few 7s, and a few 2s together and is replacing them all with 1. That's a problem for a document full of numbers.
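
A minimal sketch of that grouping step, as an illustration only: this is a toy model and not the real JBIG2 codec or Xerox's firmware. The 3x5 glyph bitmaps, the `max_diff` threshold, and the function names `hamming`, `compress`, and `decompress` are all assumptions made up for the example.

```python
# Toy sketch of "pattern match and substitute" symbol coding (NOT real JBIG2).
# Glyphs are tiny 3x5 bitmaps written as strings: '#' is ink, '.' is paper.

def hamming(a, b):
    """Count the pixels that differ between two equal-sized bitmaps."""
    return sum(pa != pb for pa, pb in zip(a, b))

def compress(glyphs, max_diff):
    """Store each glyph as an index into a dictionary of representatives.

    A glyph reuses the first dictionary entry that differs from it by at
    most max_diff pixels; otherwise it becomes a new entry.
    """
    dictionary, indices = [], []
    for g in glyphs:
        for i, rep in enumerate(dictionary):
            if hamming(g, rep) <= max_diff:
                indices.append(i)
                break
        else:
            dictionary.append(g)
            indices.append(len(dictionary) - 1)
    return dictionary, indices

def decompress(dictionary, indices):
    """Rebuild the page by pasting the stored representative for each index."""
    return [dictionary[i] for i in indices]

# Hypothetical glyphs: the "6" differs from the "8" by a single pixel.
EIGHT = "###" "#.#" "###" "#.#" "###"
SIX   = "###" "#.." "###" "#.#" "###"

page = [EIGHT, SIX, EIGHT]
lossy = decompress(*compress(page, max_diff=1))  # tolerant matching
exact = decompress(*compress(page, max_diff=0))  # only identical glyphs merge
```

With the tolerant threshold the 6 is merged into the 8's group and comes back as a crisp, clean 8; with a threshold of zero the page round-trips unchanged. The output never looks damaged, which is why nobody notices.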
posted by burnmp3s at 7:37 AM on August 5, 2013 [22 favorites]

yoink: "OMG photocopies are not as clear as originals?!?!!!? Next you'll be telling me that those cassette recordings I made off the radio aren't indistinguishable from the original masters!"

Pipe down, you internet yokel. It's not just about fidelity -- there's a problem with the device deciding "EHHhh, this looks good enough. "

I can't imagine what kind of gains JBIG2 offered in operational speed or storage that would have made this OK. Also kinda curious now to see what kind of testing one of these devices goes through before sign-off. I would have thought some kind of test pattern/ halftone test/ etc would have been employed.
posted by boo_radley at 7:43 AM on August 5, 2013 [4 favorites]

The author approves of the title of this post on Twitter: "Nicely put, @metafilter: "Cat images reportedly unaffected". - http://tinyurl.com/pv8g3nu :-)."
posted by jetlagaddict at 7:44 AM on August 5, 2013 [15 favorites]

It's astonishing a bug this fundamental would ship in a Xerox product. This isn't some edge case; it's a flaw in the basic design of the image processing software. Maybe it only shows up in certain size images of certain degraded quality? Still the fact that it happens at all is terrible. It seems mostly an artifact of the way copiers aren't copy-ers anymore, they're combo scanners and printers. Only in this case the scanner works fine, at least in OCR mode; it's only the bad compression algorithm for copies. Weird.
posted by Nelson at 7:47 AM on August 5, 2013 [2 favorites]

MetaFilter: Pipe down, you Internet yokel.
posted by one more dead town's last parade at 7:47 AM on August 5, 2013 [25 favorites]

Perhaps the commenters who misunderstand the article are reading a scanned copy, not the original.
posted by ogooglebar at 7:48 AM on August 5, 2013 [64 favorites]

ITS THE NSA!!11
posted by goethean at 7:48 AM on August 5, 2013

I have never been so glad to have been laid off as I am today.

My previous employer was a mid-sized law firm which relied on the Xerox WorkCentre copiers of that model for their scanning. And they were forward looking and trying to reduce paper in the office so they scanned just about everything and then stuffed the original into a warehouse (fortunately I suppose they didn't shred the original).

In light of this, if I still worked there, today would be an absolute nightmare. We'd be checking to see if the desktop scanners had similar problems, we'd be discussing plans to try and get the originals re-scanned in a way that wouldn't cause this. The next many weeks would be filled with excitement, chaos, and long hours.

But, since they decided to switch to a paid per incident managed computer support solution and my job was therefore redundant and eliminated, I'm having a rather normal day.

For those of you in a computer job where this is going to be a big issue, you have my condolences. This is the stuff of which nightmares are made. Also: WTF was Xerox thinking?
posted by sotonohito at 7:48 AM on August 5, 2013 [17 favorites]

ITS THE NSA!!11

Actually, I think it's a perfect opportunity for some Tuttle/Buttle madness, with guys with guns added for good measure! The specific tie-in is the 60% accuracy number from the NSA/law-enforcement thread (but I'm sure the data we use to target terrurists for drone strikes is 140% accurate, to make up for the low domestic accuracy).
posted by spacewrench at 7:54 AM on August 5, 2013 [4 favorites]

It's astonishing a bug this fundamental would ship in a Xerox product.

I argue it's not a bug. The compression algorithm is operating within spec. Bugs are when you write something to do X and it does Y.

This is a design error. They picked a compression method poorly suited to the job. JBIG2 operated within the spec -- it's the fact that it was chosen that was so wrong.

If you implement your file save routine with "rm", well, it's not the fault of rm that you aren't saving any files.
posted by eriko at 7:56 AM on August 5, 2013 [14 favorites]

“A computer lets you make more mistakes faster than any other invention with the possible exceptions of handguns and Tequila.” ― Mitch Ratcliffe.
posted by Pope Guilty at 7:57 AM on August 5, 2013 [20 favorites]

This is exactly the kind of non-linear failure mode I was talking about in the "computer-driven car" thread. Things getting fuzzy is a linear error. Very clear, discrete mistakes are a non-linear error.

When humans make driving errors, they are usually of the linear kind. Computer errors are a whole nother animal and our streets, our laws and our minds are not set up to handle them and some of those things may never be.
posted by DU at 7:57 AM on August 5, 2013 [4 favorites]

Well, so far, computers are actually safer drivers than humans, but let's not let facts get in the way of Luddism.
posted by Pope Guilty at 7:58 AM on August 5, 2013 [2 favorites]

SUVs are also safer...for some.
posted by DU at 7:58 AM on August 5, 2013

Not very important, compared to the larger issue. Numbers decay, like radioactive elements. When you put a 347 in your document, over enough time it naturally produces an 89, a 48, and three 7s. This has been hushed up for a long time by Big Math.
posted by thelonius at 7:59 AM on August 5, 2013 [49 favorites]

naturally produces an 89, a 48, and three 7s.

Well, you're forgetting the -.00001/+.00001 pair it emits, but at normal energy levels, that's basically noise.
posted by eriko at 8:01 AM on August 5, 2013 [21 favorites]

For those who haven't yet read TFA it's definitely worth looking at the examples. It's not something that a typical user would identify as a bad photocopy, and that's the crux of the whole issue.

Photocopiers have been around long enough that most people have a pretty good understanding of how they fail -- not in the sense of break down completely, but in how, if you make a copy, and a copy of that copy, they will slowly become more difficult to read and eventually be totally illegible. They are, at the end of the day (since they produce printed output on a page) basically an analog process and there's degradation / generation loss. Sometimes you get more catastrophic problems, like smearing or smudging or too-light / too-dark copies, but you can pretty much glance at the output and see what's wrong.

However, the newer digital photocopiers have very little of their internals in common with the old photostatic machines that some of us remember, and which set user expectations concerning what photocopiers actually do. The digital copiers have a range of failure modes that don't exist on analog machines. The problem described is one of those.

The digital machines are in some ways worse than the old analog ones in this regard, because when an analog machine chews up your data, the result is predictable and obvious. It would be very hard for an analog machine to subtly change only a few digits in the output here and there; at worst you might get a blotch that would erase or obscure part of the image, but that's about all that I've ever seen. But a digital machine can easily mess up part of the image but leave the rest of it crystal-clear.

Part of the problem here is an interesting technical problem with the copier's software, but there's another broader problem in making devices that superficially resemble things that we're familiar with, but are completely different "under the hood", to the point where our intuition about how they work and how they break down is totally wrong.
posted by jtuttle at 8:01 AM on August 5, 2013 [49 favorites]

Numbers are just, like, things, man. Spirits. Floating in the universe. A number isn't just something for the patriarchy to cage and define, to pin to some styrofoam board like a dead insect. Freedom, man.
posted by chasing at 8:05 AM on August 5, 2013 [2 favorites]

Oooof: To be most clear: #Xerox was told about the issue 1 week *before* I published my blog post. I'm not an asshole.
posted by boo_radley at 8:06 AM on August 5, 2013 [4 favorites]

SUVs are also safer...for some.

SUVs sacrifice the safety of others for the protection of the driver and passengers. Computer driven cars thus far don't have accidents at anything resembling the rate of human drivers. You want this to be some sort of thing where computer driving is inherently unsafe. You are completely opposite reality.
posted by Pope Guilty at 8:07 AM on August 5, 2013 [4 favorites]

I'm not able to find a single news outlet reporting this yet. Should I short-sell some Xerox stock?
posted by zeikka at 8:07 AM on August 5, 2013 [2 favorites]

jtuttle That is a remarkably clear, concise, and widely applicable description of a general, and likely to be growing as time passes, problem.

Offhand I can think of a few other examples of technology which, on the surface, appears to work the same as it used to, but in reality does not. Phones, for example, no longer send analog representations of your voice, instead they compress what you are saying in realtime, transmit that as binary data and it is decoded and played back on the other end. There hasn't yet been a problem with this, but if one day there is it'll be as counterintuitive and bizarre seeming as the Xerox problem is.

Again, bravo for your exquisite statement of the general problem.
posted by sotonohito at 8:13 AM on August 5, 2013 [1 favorite]

IT support just called. They said this is clearly user error and the fix for this is to use a different typeface in your documents, and to not use certain numerals.
posted by Thorzdad at 8:15 AM on August 5, 2013 [23 favorites]

Part of the problem here is an interesting technical problem with the copier's software, but there's another broader problem in making devices that superficially resemble things that we're familiar with, but are completely different "under the hood", to the point where our intuition about how they work and how they break down is totally wrong.

Skeuomorphism in hardware.
posted by jaduncan at 8:15 AM on August 5, 2013 [9 favorites]

jtuttle, I like your point about complex digital fidelity failures being harder to understand and predict. It's like how in the long long ago the worst that happened on a bad phone line was some static. Now it's this maddening stuttering.

Another example: video compression glitches. Like this beautiful Mad Men glitched video. Or glitched GIFs. A long way from the TV snow you get when the rabbit ears are pointed the wrong way.

The particularly bad thing about this Xerox bug is the resulting images don't look like errors at all.
posted by Nelson at 8:15 AM on August 5, 2013 [2 favorites]

Phones, for example, no longer send analog representations of your voice, instead they compress what you are saying in realtime, transmit that as binary data and it is decoded and played back on the other end. There hasn't yet been a problem with this, but if one day there is it'll be as counterintuitive and bizarre seeming as the Xerox problem is.

Sure, there is a problem with this. If you've ever listened to hold music on a cellphone, it sounds like it's being broadcast on AM radio from the bottom of a fishtank. But listen to the same hold music on a landline and it's clear as day. The voice-optimized compression algorithms are terrible for sounds that aren't human voices.
posted by stopgap at 8:17 AM on August 5, 2013 [6 favorites]

The voice-optimized compression algorithms are terrible for sounds that aren't human voices.

I think the point is there may be a time in the future when it's not just compress-then-decompress but some sort of guessing algorithm that actually tries to make your decompressed voice sound clearer by basically making highly statistically supported guesses at what words you were trying to say. And so someone who, for all intents and purposes IS you, is then saying words that you never said because a computer was in the middle of the transaction, "helping."
posted by jessamyn at 8:28 AM on August 5, 2013 [20 favorites]

¡Viva Markov Cheney!
posted by changoperezoso at 8:33 AM on August 5, 2013 [1 favorite]

Jessamyn: I would be surprised if anyone was dumb enough to do that just for the sake of voice clarity, because the pitfalls are so obvious, but I would be less surprised to see it as part of a filter that's designed to strip out background noise when you're talking on speakerphone or something like that.

Of course, at the same time, I'm surprised Xerox was dumb enough to think this was a good idea, so.
posted by Holy Zarquon's Singing Fish at 8:35 AM on August 5, 2013

You only had one job!

(╯°□°)╯︵ ┻━┻
posted by starman at 8:37 AM on August 5, 2013 [35 favorites]

One time I scanned my butt and the Xerox output was J-Lo's butt. Crazy, man.
posted by Mister_A at 8:38 AM on August 5, 2013 [2 favorites]

I would be surprised if anyone was dumb enough to do that just for the sake of voice clarity, because the pitfalls are so obvious...

You mean like on-the-fly word correction?
posted by Thorzdad at 8:38 AM on August 5, 2013 [4 favorites]

jessamyn: I think the point is there may be a time in the future when it's not just compress-then-decompress but some sort of guessing algorithm that actually tries to make your decompressed voice sound clearer by basically making highly statistically supported guesses at what words you were trying to say.

Or maybe tries to fill in gaps in your speech caused by momentary signal problems.
posted by Rock Steady at 8:40 AM on August 5, 2013 [1 favorite]

There are codecs out there for very-low-bitrate audio; they analyze the audio into phonemes, send the data, and re-synthesize the audio on the far end:

VLBR-SPEECH CODING demonstration page, 2012

Not gonna fool anyone anytime soon, but there are legitimate applications of this for audio- particularly in remote / military uses.
posted by jenkinsEar at 8:43 AM on August 5, 2013 [1 favorite]

One time I scanned my butt and the Xerox output was J-Lo's butt. Crazy, man.

A significantly altered figure right there, I would assume.
posted by jaduncan at 8:45 AM on August 5, 2013

Thorzdad: Well in that case the user gets instant feedback about what happened. That wouldn't work for voice though (see the various speech jamming apps around).
posted by edd at 8:46 AM on August 5, 2013

Or maybe tries to fill in gaps in your speech caused by momentary signal problems.

If this works anything like iPhone or Swype autocorrect, those could be some interesting phone calls. I'm creating the inevitable Tumblr now to get ready.
posted by jennaratrix at 8:46 AM on August 5, 2013 [1 favorite]

One of the stated purposes of JBIG2 is archival (it's one of the accepted compressors for PDF), so this seems like a design error in the standard itself that an implementation could be that flawed and still be called "JBIG2" (are there *no* standard tests for this?). In fact, the doc I linked to discusses loosely verifying the integrity of a JBIG2 compressor that uses "pattern match & substitute" by using OCR recognition rates (seems awfully indirect to me) *and* an example where 'a' is swapped with 'o' in a bad compressor (uhhh...)

Xerox has some major QA issues to address, I expect...
posted by smidgen at 8:54 AM on August 5, 2013 [1 favorite]

This is a design error. They picked a compression method poorly suited to the job. JBIG2 operated within the spec -- it's the fact that it was chosen that was so wrong.

What other possible use case could JBIG2 have? It seems to be designed specifically to compress documents. The issue is that it (apparently) does this poorly.

There are codecs out there for very-low-bitrate audio; they analyze the audio into phonemes, send the data, and re-synthesize the audio on the far end:

This idea pre-dates the computer age, actually.
posted by neckro23 at 8:56 AM on August 5, 2013 [1 favorite]

I'm creating the inevitable Tumblr now to get ready.

Part of the issue with speech too is the inability to correct it as it's going. I have an ongoing war with Siri where it's basically a crapshoot if the app understands that when I say [dӡIm] I am trying to text my boyfriend and not the gymnasium. Despite the fact that I have never texted the gymnasium and I text my boyfriend multiple times per day. The system's failure to learn while being touted as some sort of "smart" solution to anything makes me somewhat cross-eyed.

Now that I have rtfa. How do you even fix stuff like this? I assume you can do firmware upgrades? Maybe?
posted by jessamyn at 8:58 AM on August 5, 2013 [1 favorite]

A bit of background on Xerox's test team:

In order to forgo benefits or dealing with the actual hiring process, all hiring for test engineers is done through a temp agency where they hire you for just long enough (18 months in Rochester, NY) so that you're not a full time employee, and then lay you off just long enough (3 months) so that they can rehire for the same period again. Some people are in this loop for about 10 years or so, though it seems most of the people are fairly new.

Meanwhile, the engineering team was sold off to an Indian IT company HCL so these temp workers do not work for the same company as the full time people they answer to.

And of course the actual testing, since it really involves reading a sheet and printing some things out and printing out whether it is okay or not, is largely staffed by low paid just out of high school or college summer intern (from non-tech fields) people who just press the buttons till they get out of there with no accountability on quality.

In disclosure I had a week stint through a temp agency trying to work there and happily am no longer there. I occasionally saw my 'boss' attempting to use some basic QA database program where he'd get flustered and have to go find some higher up to explain how the software worked. It surprises me that they can push any product out the door, though I suppose I noticed a few veteran PLC programmers who were probably pulling the weight for a lot of the company.
posted by kigpig at 9:07 AM on August 5, 2013 [40 favorites]

Just last week I read that Google would love to do live translation for phone calls, essentially putting their Babel Fish in the middle of conversations where the callers lack a common language. My dad & I were talking about it this weekend, and we marveled at the idea that neither of you would ever know if and what kind of errors might be introduced. Hilarious! And terrible!

Paywall original interview at London Times, or read the Mashable piece that's free to read: http://mashable.com/2013/07/26/google-universal-translator/

(I hate Xerox gear, and reading this has been difficult thanks to all the laughing Nelson Muntzes and Sad Trombones.)
posted by wenestvedt at 9:11 AM on August 5, 2013

If you've ever listened to hold music on a cellphone, it sounds like it's being broadcast on AM radio from the bottom of a fishtank. But listen to the same hold music on a landline and it's clear as day. The voice-optimized compression algorithms are terrible for sounds that aren't human voices.

As someone I don't remember pointed out, we've gone from "you can hear a pin drop" to "Can you hear me now?"
posted by dirigibleman at 9:15 AM on August 5, 2013 [12 favorites]

That is really interesting. According to Hacker News this particular issue has been seen before in Google Books. It is because of the way JBIG2 compression with segments works.

Seems like in order to reduce size, they split the image into segments and reuse individual segments as many times as possible as sort of Run Length Encoding.

The class of bug even has a name, they call it a "contoot" because the same issue replaced the word contents with contoots in many Google Books.

There has got to be a way to turn the segment based compression off.
posted by Ad hominem at 9:16 AM on August 5, 2013 [4 favorites]

smidgen: "One of the stated purposes of JBIG2 is archival (it's one of the accepted compressors for PDF), so this seems like a design error in the standard itself that an implementation could be that flawed and still be called "JBIG2" "

Once you get to "The power behind JBIG2 technology is its ability to support both lossless and perceptually lossless black and white image compression.", there's more to go on. Maybe Xerox was using lossy. Suppose JBIG2 uses a straight dictionary of 8x8 bitmaps (or whatever size), and the implementer can define the size of the dictionary. If it's too small, the algorithm will be forced to compromise clarity, but it will still be considered to be a good working implementation. It's running under constraints that it's designed for and producing consistent output that follows the rules.
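
To make that supposition concrete, here is a rough sketch of a size-capped dictionary. It is purely hypothetical: nothing here is taken from the JBIG2 spec or from Xerox's code, and the 3x3 glyphs, the `max_entries` cap, and the `encode`/`decode` names are invented for illustration.

```python
# Toy model of a size-capped symbol dictionary (a guess, not real JBIG2).
# Glyphs are 3x3 bitmaps written as strings: '#' is ink, '.' is paper.

def hamming(a, b):
    """Count the pixels that differ between two equal-sized bitmaps."""
    return sum(pa != pb for pa, pb in zip(a, b))

def encode(glyphs, max_entries):
    """Encode glyphs as indices into a dictionary holding at most max_entries.

    New glyphs get their own entry while there is room. Once the dictionary
    is full, an unseen glyph is mapped to the closest existing entry.
    """
    dictionary, indices = [], []
    for g in glyphs:
        if g in dictionary:
            indices.append(dictionary.index(g))
        elif len(dictionary) < max_entries:
            dictionary.append(g)
            indices.append(len(dictionary) - 1)
        else:
            # Dictionary is full: fall back to the nearest stored bitmap.
            nearest = min(range(len(dictionary)),
                          key=lambda i: hamming(g, dictionary[i]))
            indices.append(nearest)
    return dictionary, indices

def decode(dictionary, indices):
    """Rebuild the glyph sequence from the dictionary and index list."""
    return [dictionary[i] for i in indices]

# Hypothetical glyphs for "1" and "7".
ONE   = ".#." ".#." ".#."
SEVEN = "###" "..#" "..#"

page = [ONE, SEVEN, ONE, SEVEN]
cramped = decode(*encode(page, max_entries=1))  # every 7 becomes a 1
roomy   = decode(*encode(page, max_entries=2))  # page survives intact
```

In this model the encoder with the one-entry dictionary is still "following the rules": it produces consistent, clean-looking output, and every 7 silently turns into a 1.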

Reading a little more into the JBIG2 site, on the web optimization page, we have this section (emph. mine):

Utilizing a JBIG2 Encoder with No Information Loss

The JBIG2 specifications caution against using halftoning since this operation can seriously degrade the image. Similarly, the specs caution against using lossy JBIG2, which can introduce mismatches that degrade document quality, readability, and recognition rates.

The best way to understand such cautions with respect to JBIG2 is that the quality of the JBIG2 encoder is crucial. After all, image thresholding is potentially much more degrading than either font learning or halftoning since the part of the image that needs to be retained, e.g., signature, may disappear entirely. Yet most corporations capture their documents to black and white because they trust that the thresholding function is reliable, and that essential information in the image document will be preserved. They probably also test this assumption before putting their document imaging system into production.

(...)

It is important to understand that effective JBIG2 compression, with compression rates 5x-10x smaller than TIFF G4 (or standard G4-based PDF), cannot be achieved with lossless JBIG2. Moreover, once a lossy JBIG2 encoder is utilized, it becomes essential for the IT manager or project leader responsible for document integrity to ensure that the JBIG2 converter supports perceptually lossless conversion.

posted by boo_radley at 9:19 AM on August 5, 2013 [3 favorites]

Someplace in the Xerox codebase there is a comment reading TODO: CHANGE COMPRESSION ALGORITHM
posted by shothotbot at 9:20 AM on August 5, 2013 [2 favorites]

...I would be less surprised to see it as part of a filter that's designed to strip out background noise when you're talking on speakerphone or something like that.

Or when the crime techs are assembling evidence in your terror-conspiracy case...
posted by Kirth Gerson at 9:22 AM on August 5, 2013

I think the point is there may be a time in the future when it's not just compress-then-decompress but some sort of guessing algorithm that actually tries to make your decompressed voice sound clearer by basically making highly statistically supported guesses at what words you were trying to say. And so someone who, for all intents and purposes IS you, is then saying words that you never said because a computer was in the middle of the transaction, "helping."

There is a similar plot point in Vernor Vinge's "A Fire Upon the Deep" (not much of a spoiler, since one of the protagonists worried about the problem right away) - over a low-bandwidth lossy communication channel the attackers just needed to mimic an over-simplified heavily-compressed intermediate data format, which the receiver's computers would helpfully reconstruct into convincingly detailed video + audio based on previous recordings from the same sender pre-attack.

No point in writing modern fiction indeed.
posted by roystgnr at 9:24 AM on August 5, 2013 [6 favorites]

This would explain a lot of what seems to go wrong with those credit reporting agencies...
posted by jeffamaphone at 9:27 AM on August 5, 2013

I have an ongoing war with Siri where it's basically a crapshoot if the app understands that when I say [dӡIm] I am trying to text my boyfriend and not the gymnasium. Despite the fact that I have never texted the gymnasium and I text my boyfriend multiple times per day.

Maybe if you pronounced "gym" [gaim], akin to gynecologist. Homer Simpson approved!
posted by explosion at 9:33 AM on August 5, 2013

Jessamyn: Why don't you rename your contact "Gymnasium" to "Fitness Center"?

I bet that'd cut down on the Siri speech-rec collision frequency a bit ...
posted by cstross at 9:34 AM on August 5, 2013

I wonder how long until they start applying high-level semantic compression algorithms to photographs or video, with the bizarre and hilarious bugs that will ensue. Imagine, for example, seeing a picture of yourself on some social photo-sharing site, only sporting a set of bushy eyebrows you've never had. The eyebrows came from the facial-model dictionary of the image compression/rendering algorithm used.
posted by acb at 9:39 AM on August 5, 2013 [1 favorite]

I have replaced all homonyms in my iPhone database with pronounceable, non-word three-syllable names and it works well for me.
posted by shothotbot at 9:41 AM on August 5, 2013 [2 favorites]

That is basically what I did once I realized that Siri wasn't actually going to learn anything ever.
posted by jessamyn at 9:44 AM on August 5, 2013 [3 favorites]

An interesting thing about Vinge's AFUtD and related works is that he was actually writing about contemporary technology like USENET and the early internet, the surprising difficulty of AI, the Y2K problem, ubiquitous surveillance, the autistic spectrum, and other contemporary issues. He wraps it in far future space opera, but he's inspired by current trends and technologies.

So this may be based on something current to the time, not SF.
posted by jclarkin at 9:45 AM on August 5, 2013 [3 favorites]

I don't think this is interesting, this is frightening. Think of all the copying/archiving/scanning/pdfing that occurs in the engineering world. "Hey can you send me the updated spec for that girder we're building today, the one for the big ass bridge?" "Sure let me scan this real quick. Check your email in five minutes."

This is potentially a very major issue, and for things that have already been set in motion. I'm not worried about your bills being correct, I'm worried about stuff falling down.
posted by Big_B at 9:48 AM on August 5, 2013 [18 favorites]

Ok, reading the linked article, it does not give a good history of how long this "bug" has existed.

I am curious, though, mainly because of a political/social ramification. Anyone remember Rathergate?

Wasn't the authenticity of the documents questioned rather harshly by critics, and weren't the documents analyzed by thousands of armchair font experts? What if, since I highly doubt that the original documents were handed over to the CBS producers, the documents were copied by a Xerox copier, and this character replacement error was introduced? According to the Wikipedia entry:
"The authenticity of the documents was challenged within hours on Internet forums and blogs, with questions initially focused on alleged anachronisms in the documents' typography and content soon spreading to the mass media"

There's even an animated gif in the wikipedia entry showing how said original would look, versus something typed up in Word. So what if, instead of a forgery, it was actually something caused by the copier trying to "help" and thus, rendering the copies questionable, because it wasn't an actual duplicate, it was altered by the machine that made the copies, not by human malfeasance?

I know it's a bit of a stretch, but, um... I find the idea interesting.
posted by daq at 9:51 AM on August 5, 2013 [4 favorites]

I don't mean it is cool. I mean from a technology standpoint it is an interesting failure mode. I've worked on stuff like this in the past in a much more narrow field and we hired professional proofreaders to "full read" thousands of documents we had processed.
posted by Ad hominem at 9:52 AM on August 5, 2013

Also, as someone who used to have to try and get OCR software to work to create PDF forms from scanned printed documents, the error rate of that stuff is ridiculous. I know it's a different technology, but man, the number of times I had to tell my client "look, it would be easier and faster to just have someone manually create the forms in Acrobat, and simply type in the data, than to rely on 100% accuracy and legibility of the scanned documents". People are better at catching pattern errors in human readable information. Hence, Amazon Mechanical Turk.
posted by daq at 9:58 AM on August 5, 2013 [1 favorite]

I think the point is there may be a time in the future when it's not just compress-then-decompress but some sort of guessing algorithm that actually tries to make your decompressed voice sound clearer by basically making highly statistically supported guesses at what words you were trying to say.

Or maybe tries to fill in gaps in your speech caused by momentary signal problems.

The mathematics of compression and prediction are inextricably linked. It would be a fairly incremental change to existing methods to do what you both are talking about.
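
A rough sketch of that link, for the curious: an ideal coder spends about -log2(p) bits on a symbol its model predicted with probability p, so a better predictor means fewer bits. The two toy models below (the function names `bits_order0` and `bits_order1` are made up for this example) are fit on the sample text itself and ignore the cost of storing the model, so the numbers are illustrative only.

```python
import math
from collections import Counter

def bits_order0(text):
    """Ideal code length when each character is predicted from overall frequency."""
    counts = Counter(text)
    total = len(text)
    return sum(-math.log2(counts[ch] / total) for ch in text)

def bits_order1(text):
    """Ideal code length when each character is predicted from the previous one."""
    pair_counts = Counter(zip(text, text[1:]))
    prev_counts = Counter(text[:-1])
    bits = 0.0
    for prev, ch in zip(text, text[1:]):
        # Probability of ch given prev, estimated from the text itself.
        bits += -math.log2(pair_counts[(prev, ch)] / prev_counts[prev])
    return bits

sample = "the quick brown fox jumps over the lazy dog " * 20
print(round(bits_order0(sample)), round(bits_order1(sample)))
```

The context-aware model needs far fewer bits on this sample. A "guessing" codec would be the same machinery pointed the other way: when data is missing, emit whatever the model considers most likely.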
posted by a snickering nuthatch at 10:12 AM on August 5, 2013 [2 favorites]

It seems like the issue is that JBIG2 works with really big features (since you can still tell what the letters are) and works with really small features (since it's illegible anyways) but has a big hole in the middle where you have features that are just on this side of legible, but that get swapped out for different ones. I don't really know how you can fix this on something like a poor quality scan except to get to the pixel level, and now it's a completely different kind of compression.
posted by smackfu at 10:27 AM on August 5, 2013

@ daq - Rathergate was before digital copiers, I believe.

And besides, the guy pushing out the supposed 'memo' stated he'd taken the memo and retyped it, then copied that and destroyed the original memo.

Which, I'd think, turns it from something verifiable into sheer fiction.
posted by JB71 at 10:32 AM on August 5, 2013

Oh jeez, be aware that Adobe Acrobat also uses JBIG2 as an optional compression method for B&W scans. I don't know how good Adobe's implementation is, but I'm not using it any more. Anyway, a solution for this problem is higher rez, which should not be a problem for greyscale documents. I almost always scan documents at 400dpi in bitmap, which produces incredibly small files with no lossy compression.
posted by charlie don't surf at 11:29 AM on August 5, 2013

I think the point is there may be a time in the future when it's not just compress-then-decompress but some sort of guessing algorithm that actually tries to make your decompressed voice sound clearer by basically making highly statistically supported guesses at what words you were trying to say. And so someone who, for all intents and purposes IS you, is then saying words that you never said because a computer was in the middle of the transaction, "helping."

So basically...autocorrect for voice messages rather than just for text messages? Because I'm sure nobody here has ever seen an example of how autocorrect can mess things up spectacularly.
posted by mstokes650 at 11:47 AM on August 5, 2013

This is a classic clash between providers and users of technology.

In the PDF sphere, when images are stored within the PDF there are a number of compressions that are available:

ASCIIHex
ASCII85
None
Run Length Encoding (aka, RLE, PackBits)
LZW/Flate
DCT (aka, JPEG)
CCITT (aka, Fax)
JBIG2
JPX (aka, JPEG2000)
Crypt

ASCIIHex, ASCII85, and Crypt are never used for images directly, as they don't actually do any compression: the first two always increase the amount of data, and the last sometimes increases it.

Within the useful compressions, you're now entering into certain restrictions or limitations of the compression or both. Here is what they are briefly:

None - works on everything, saves no space; rarely used
RLE - works on every format, but works best on images with a single channel (ie, gray scale or black and white) or paletted images. Even then, it doesn't work particularly well as the best it can manage is something like 63:1, but it rarely does - more like 5:1 for typical documents. Super fast. Supported by all versions of Acrobat. It is lossless.
Flate - flate is a funny compression in that in PDF it's not just flate, image data could be run through any of a set of filters to make the image data more conducive to compression, for example the simplest stores the first color in each row and then the differences between colors per pixel. The compression tends to do much better with these. The theoretical is something like 1200:1, but in reality you get something like 6:1 up to 20:1 for typical images. It works on color (rgb and cmyk), paletted and gray images. It works on black and white images, but not particularly well. It is lossless and supported by all versions of PDF. It is moderately fast.
DCT - JPEG compression supports gray, color, and in theory cmyk images (in theory, because the support I've seen in generation software is spotty). JPEG compresses very well compared with the previous codecs and can compress as well as Flate with no visual loss and can compress 50:1, give or take, depending on the "quality" setting. JPEG is craptastic for text as anything with a high visual frequency gets all kinds of interesting artifacts. JPEG is considered lossy, although there is a setting by which JPEG is lossless. In spite of this knowledge being well-disseminated, it doesn't stop idiots from doing things like compressing text documents with it. In one case, I know of a person who was given the task of archiving a set of legal documents and reducing the total storage so he blindly compressed all the documents with JPEG and settings that were great for size but bad for fidelity, rendering these documents (and for the sake of the curious, let's call them 'evidence') unreadable. It is supported by all versions of PDF. In the first version of Acrobat, it was the slowest by far.
CCITT - is a decent enough compression that works only on black and white images. CCITT is meant for handwritten or typed documents and with that domain will give 8:1, give or take, for the group 3 variant and up to 16:1 for the group 4 variant. CCITT is lossless and very snappy. Probably faster than flate but slower than RLE. It is supported by all versions of PDF.
JBIG2 - JBIG2 offers both lossless and lossy compression of black and white images only. It is meant for images with small recurring symbols (aka, Roman-ish text). When operating in lossy mode, it slices and dices the image and searches for two dimensional blocks that are "close enough" and stores a representative one as a single symbol, remembering where it was in the source. It then removes it, essentially leaving an easy to compress white spot behind. It can compress around 50:1 for most typical documents. It is lossy. It is pokey to compress and somewhat to decompress. It is only supported in PDF 1.4 and above. It is allowable in PDF/A.
JPX - The more modern version of DCT compression. JPEG2000 is an odd bird. It can do lossless and lossy compression and can get very high compression ratios, but it has to be tuned properly. In theory, there is a setting wherein you specify a limit on the compressed data size and it will try to honor that. In practice, it's hard to get it to actually do that well and there seems to be a tipping point where it quickly goes from OK to pure crap and that tipping point depends on the image. It works on grayscale and color (rgb, rgba, and cmyk), but isn't meant for black and white or paletted. It is pokey. It is only allowable in PDF 1.5 and above, making it unsuitable for archival documents (disallowed in PDF/A-1b, but allowed in PDF/A-2).
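
To make the Flate entry concrete, here is a minimal sketch of the "first value in each row, then differences" idea, similar in spirit to the PNG "Sub" filter. It is not the exact PDF predictor byte layout. The helper names (`sub_filter`, `sub_unfilter`, `make_rows`) and the synthetic test image are assumptions made up for this example; only Python's standard `zlib` module is real.

```python
import random
import zlib

def sub_filter(rows):
    """Keep the first byte of each row; replace the rest with left-neighbor differences."""
    out = bytearray()
    for row in rows:
        out.append(row[0])
        for left, cur in zip(row, row[1:]):
            out.append((cur - left) % 256)
    return bytes(out)

def sub_unfilter(data, width):
    """Invert sub_filter for rows of the given width."""
    rows = []
    for start in range(0, len(data), width):
        row = bytearray(data[start:start + width])
        for i in range(1, len(row)):
            row[i] = (row[i - 1] + row[i]) % 256
        rows.append(bytes(row))
    return rows

def make_rows(width=256, height=64, seed=0):
    """Synthetic grayscale rows that drift slowly, like a smooth gradient."""
    rng = random.Random(seed)
    rows = []
    for _ in range(height):
        value = 128
        row = bytearray()
        for _ in range(width):
            value = max(0, min(255, value + rng.choice((-1, 0, 1))))
            row.append(value)
        rows.append(bytes(row))
    return rows

rows = make_rows()
raw = b"".join(rows)
filtered = sub_filter(rows)
# Uncompressed size, deflate on raw pixels, deflate on filtered pixels.
print(len(raw), len(zlib.compress(raw, 9)), len(zlib.compress(filtered, 9)))
```

On smooth data like this the filtered stream contains only a handful of distinct byte values, so deflate does noticeably better on it than on the raw pixels, and the filter is fully reversible, so nothing is lost.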

Given all of that, a fair proportion of the people on MeFi will read that and understand most of it and be able to choose appropriate settings for any given image. The proportion of people who create/consume PDF who understand that (or even understand that there is an issue) is far smaller.

Enter the scanner manufacturer. They have discovered a number of things about users of scanners, especially business users: they're idiots. They can't deal with or absolutely hate most scan software (and with reason - most scan software is pure shit. HP is one of the worst offenders - the download for the driver for my scanner/printer is 362.7MB and it works about 1 in 3 times, fixable by rebooting my computer, whereas VueScan works 100% of the time with my scanner and is an 8MB download. An engineer at my company wrote a nice piece of software called Inspector Twain, which examines scanners and scanner drivers for spec compliance (SPOILER: they're not very good)). So the main kind of scanner driver, TWAIN, defines the ability for the scanner to transfer a file of scanned data rather than just pixels. This is great, because when there is more horsepower available in the scanner, they can build in code to generate TIFF (disliked by typical users) or PDF (disliked by fewer users). Somebody at Xerox had PDF as a checkbox and put it into the scanner, and since Xerox is part of the JBIG committee, you can bet that they have free access to that and embedded that too.

Smaller files that are delivered in standard formats are a marketing/sales point. It's in a blurb from Xerox right here. The problem is that in implementing this feature, they don't have a particularly good way to turn it off. Maybe there is - I read through the user manual and system admin manual and I don't see anything that would appear to do that, although it seems like there is a templating system under the hood that could be manipulated into changing those settings; they didn't really go out of their way to document it.

It comes down to complicated choices that were informed by a marketing decision and hidden behind friendly menus that hide the complication.

When I read the story my first thought was "oh, some idiot didn't know how to set JBIG2 properly - hope s/he put in a menu somewhere."

Here are two choice quotes from our engineers on this:

"[it's] just conflating small connected regions whose differences fall below a threshold. Which is the core algorithm of lossy JBIG2.... I love it: The wikipedia page for JBIG2 is already updated with this issue."

"When you consider this particular issue stems from something close to a 8px x 16px glyph and the difference between correct and incorrect glyph is roughly 4px it isn't so hard to see how the lossy algorithm gets confused. Still, definitely a warning to know the tools you're using."

In the PDF generation library I wrote for my company, you have to take specific steps to use JPEG2000 or JBIG2 and not just because they are add-on products, you need to think first before you use them.
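
A toy illustration of what the engineers' quotes describe, heavily simplified: the encoder keeps a dictionary of glyph bitmaps and reuses a stored glyph whenever a new one differs from it by no more than some number of pixels. This is a sketch under stated assumptions, not how any real JBIG2 encoder (or Xerox's firmware) is implemented. The 5x7 bitmaps, the greedy first-match rule, and the names `pixel_diff`, `encode`, `decode`, and `read_back` are all invented for the example.

```python
GLYPHS = {
    "6": (".###.", "#....", "#....", "####.", "#...#", "#...#", ".###."),
    "8": (".###.", "#...#", "#...#", ".###.", "#...#", "#...#", ".###."),
    "1": ("..#..", ".##..", "..#..", "..#..", "..#..", "..#..", ".###."),
}

def pixel_diff(a, b):
    """Count pixels that differ between two same-sized glyph bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def encode(glyphs, threshold):
    """Greedy symbol dictionary: reuse the first stored glyph within `threshold` pixels."""
    dictionary, indices = [], []
    for glyph in glyphs:
        for i, known in enumerate(dictionary):
            if pixel_diff(glyph, known) <= threshold:
                indices.append(i)
                break
        else:
            # No stored glyph was close enough, so store this one as a new symbol.
            dictionary.append(glyph)
            indices.append(len(dictionary) - 1)
    return dictionary, indices

def decode(dictionary, indices):
    """Rebuild the page by looking each index up in the symbol dictionary."""
    return [dictionary[i] for i in indices]

def read_back(glyphs):
    """Map bitmaps back to characters, for display only."""
    names = {bitmap: ch for ch, bitmap in GLYPHS.items()}
    return "".join(names[g] for g in glyphs)

page = [GLYPHS[ch] for ch in "8616"]
for threshold in (0, 4):
    dictionary, indices = encode(page, threshold)
    print(threshold, read_back(decode(dictionary, indices)))
# prints "0 8616" then "4 8818"
```

In this toy font the "6" and "8" differ by only 3 pixels, so a threshold of 4 quietly turns every 6 into a crisp, perfectly legible 8, while the "1" (16 or more pixels away from either) survives. The output looks clean, which is exactly why nobody notices.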
posted by plinth at 12:15 PM on August 5, 2013 [118 favorites]

When I read the story my first thought was "oh, some idiot didn't know how to set JBIG2 proper

My impression is that idiot works for Xerox. It's great you understand all this fancy-schmancy encoding stuff. But it's ridiculous to ask ordinary users to understand it too. There's a button on the copy machine that says "copy". You press it and it makes a copy. That's what a copy machine does. The fact that some Xerox team mis-tuned their JBIG2 implementation doesn't matter, what matters is the copy button doesn't really copy.
posted by Nelson at 12:26 PM on August 5, 2013 [8 favorites]

Nelson: "My impression is that idiot works for Xerox. It's great you understand all this fancy-schmancy encoding stuff."

Clearly, plinth believes the idiot's at xerox, too. From the sentence you quoted: " hope s/he put in a menu somewhere." An end user -- the copy-button pusher -- can't do that. Only the product team can. And that menu would be for maintenance only, not something some POWER USER would be jamming on.
posted by boo_radley at 12:33 PM on August 5, 2013 [4 favorites]

And honestly, having poked around PDF generation a bit, I think this story is just a small chapter in a big book titled "PDF: What a Bastard".
posted by boo_radley at 12:35 PM on August 5, 2013 [22 favorites]

a small chapter in a big book titled "PDF: What a Bastard"
Trust me, you have no idea just how big that book is (Hint: the chapter on the color space data structures would be longer than a typical Neal Stephenson novel and can be summed up with four words: "what were we thinking?" (I say 'we' because I worked on Acrobat 1.0 (and several later versions), but that's slightly unfair since I didn't work on the standards committee)).
posted by plinth at 1:04 PM on August 5, 2013 [17 favorites]

Symbol copy errors? Maybe the documents are evolving.
posted by hanoixan at 1:47 PM on August 5, 2013 [1 favorite]

This problem would be more serious if you were beaming down to a planet.

Teleport mechanism: *Hmmm. Eight fingers? That seems odd. Oh well.*
posted by Twang at 3:38 PM on August 5, 2013 [1 favorite]

Following on from plinth's comment, there's a good chance that some network/printer admin somewhere decided to frob the JBIG2 settings for tiny files, not realising the mayhem it would cause later.
posted by scruss at 3:42 PM on August 5, 2013

You only had one job!

(╯°□°)╯︵ ┻━┻

You on!u hod one job!

(╯o□o)╯︵ !━!

Yoo on!o hod ono joh!

Y oooY ︵ !︵!
posted by BrotherCaine at 5:12 PM on August 5, 2013 [46 favorites]

yoko ono had one john
posted by jenkinsEar at 8:35 PM on August 5, 2013 [30 favorites]

(picks up phone... landline, natch...) Hello, operator? Gimme my broker... Jimmy? Yeah, me. Sell all the Xerox stock. Come to think of it, get rid of any copier stock. Yeah, NOW. (click)
posted by drhydro at 11:36 PM on August 5, 2013

No reports of this at the BBC, Guardian or NYT yet. The thought of all the lawsuits alone is staggering; this could be the end of the Xerox brand, if not the underlying business.

There's a Xerox WorkCentre 7535 sitting a few yards from my office. I'll certainly think twice now before photocopying anything with it.

Yep, if I had any Xerox stock right now I wouldn't for long.
posted by rory at 2:22 AM on August 6, 2013

Pope Guilty: "Well, so far, computers are actually safer drivers than humans"

I don't disagree with this statement but as I understand it, more and more cars are becoming "drive-by-wire" instead of mechanical connections, correct? So we are mixing human driving with electronic driving already. And suppose a critical bug like this Xerox bug got into a lot of cars, it could be disastrous. In the sense of drive-by-wire there are a lot more computer-controlled (to an extent) cars on the road than just Google's.
posted by IndigoRain at 2:23 AM on August 6, 2013

I don't disagree with this statement but as I understand it, more and more cars are becoming "drive-by-wire" instead of mechanical connections, correct? So we are mixing human driving with electronic driving already.

That is completely distinct from the topic of an AI driving my car and is not comparable.
posted by cellphone at 7:07 AM on August 6, 2013

Now that I have rtfa. How do you even fix stuff like this? I assume you can do firmware upgrades? Maybe?

Yeah, you just print out the firmware in a machine readable format using a special tool, and then you scan it in...
posted by atbash at 7:25 AM on August 6, 2013 [2 favorites]

BBC has picked up the story now, and so have some of the bigger blogs like Ars Technica.
posted by zeikka at 7:42 AM on August 6, 2013

I'll certainly think twice now before photocopying anything with it.

There isn't any indication that this affects copying; it's an artifact of the (overly-enthusiastic) compression in scan-to-PDF.

There's a follow-up post: there is a quality setting in the copier UI:

The reader raised the quality from “normal” to a higher setting, which – counter-intuitively – reduced the readability of the scanned document, however, reduced the number of mangled numbers drastically (maybe even to zero).

posted by We had a deal, Kyle at 8:03 AM on August 6, 2013 [1 favorite]

There isn't any indication that this affects copying; it's an artifact of the (overly-enthusiastic) compression in scan-to-PDF.

Ah, right. Somewhat reassuring. But I probably use that machine as much to scan as to copy, so only somewhat.
posted by rory at 8:36 AM on August 6, 2013

Xerox says Always Listening To Our Customers: Clarification On Scanning Issue, a classic example of a bullshit press release that neither clarifies the problem nor admits responsibility.
posted by Nelson at 12:40 PM on August 6, 2013 [2 favorites]

Luckily we don't have Xerox copiers here at work. Like many companies, we have had a paper reduction initiative that centers around scanning in paper documents, putting them into a document management system that has multiple backup mechanisms, and then shredding the paper documents. There just aren't giant binders anywhere with ~1 million lengthy physical documents in them. Generally a human checks the scanned documents for quality but if a lease amount said "$2250.00" instead of "$2150.00" I don't know if that would set off any alarm bells. Even worse, things like errors in lease expiration dates might not be noticed for many years but would cause a huge problem down the road.

Heck, I'm a computer programmer but not being a copier/firmware person I just assumed scanners faithfully reproduced the original content the same way photocopies do, except for artifacts introduced by dirt or smudges on the copy mechanism.
posted by freecellwizard at 1:06 PM on August 6, 2013

plinth: "I say 'we' because I worked on Acrobat 1.0"

BURN THE WITCH

(Disclosure: former and probably future technical writer who has spent way too much time with FrameMaker and Acrobat)
posted by scrump at 2:00 PM on August 6, 2013 [1 favorite]

rory:
Yep, if I had any Xerox stock right now I wouldn't for long.

And my first reaction was: Hell, if anything it's time to buy... just after the market overreacts.

But then I checked, since the window for timely reaction has passed... and nothing at all really happened to XRX that day. I mean, sure, it bounced down a couple percent after 8/5, but then it bounced back up 1%... and all of that is dwarfed by much bigger spikes earlier in the month.

Typical.
posted by IAmBroom at 11:29 AM on August 8, 2013

Curious message on my phone: "ToTIME20130808: abandon JBIG2 responsible 4 begin of earth econ crisis begin 2008 Sept STOP oops shit too late dammit Vern YOU HAD ONE JOB!"
posted by Monkey0nCrack at 8:41 PM on August 8, 2013 [1 favorite]

Here's something quite extraordinary: a live example of a document apparently modified by faulty compression!
1912 Eighth-Grade Exam Stumps 21st-Century Test-Takers

Errors appear to have been introduced into some of the questions. The ones I have spotted are "eneeavor" for "endeavor" in the spelling test; and in the arithmetic questions, q. 3, "dodr" for "door" and in q. 5, "a/c" for "p/c".
posted by Joe in Australia at 3:27 AM on August 15, 2013

First Wave Of Scanning Software Patches Ready. Most passive-aggressive downplaying of a bug I've ever seen, but they are fixing it.
posted by Nelson at 9:59 AM on August 23, 2013


This thread has been archived and is closed to new comments
