"I felt a little bit f'ed over after the last conference call with them"
August 26, 2015 12:13 PM

In 2013 computer researcher David Kriesel discovered that certain Xerox scanners were altering numbers in the documents they scanned (MeFi post). At the recent FrOSCon, Kriesel gave an hour-long talk recounting his experiences discovering and reporting the problem, with lots of detail on what it was like dealing with a large multinational corporation like Xerox, and what the impacts and fallout of his discovery have been. (SPOILER: Germany has eliminated JBIG2 as a legally admissible scan format.)
posted by benito.strauss (20 comments total) 30 users marked this as a favorite
 
I'm kind of impressed that Xerox managed to get a patch out less than a month after Kriesel discovered the bug...
posted by mr_roboto at 12:31 PM on August 26, 2015


Wanna bet Xerox already had the patch, but wasn't going to release it unless someone noticed the problem? And, even then, it was going to be handed out on a case-by-case basis, in order to avoid the cost (and bad press) of a world-wide release? But, had to do the big release because poop hit the fan?
posted by Thorzdad at 12:36 PM on August 26, 2015 [2 favorites]


Well, he didn't discover it - he publicized it. Some parts of Xerox knew about it and had a tiny readme suggesting that the 'default' compression might be a problem for some applications. Check the timeline.
posted by rmd1023 at 12:37 PM on August 26, 2015


Some parts of Xerox knew about it and had a tiny readme suggesting that the 'default' compression might be a problem for some applications.

But it turned out compression wasn't the only issue; on Aug. 11th Xerox said it was a bug that affected all compression modes. It's not clear whether or not they knew about this bug previously.
posted by mr_roboto at 12:44 PM on August 26, 2015


I am amazed that Xerox uses a compression approach that can possibly substitute characters at all! And there I was thinking I understood photocopier technology.
posted by pulposus at 1:41 PM on August 26, 2015 [4 favorites]


Can you imagine the effort, pain and injustice this kind of thing can cause? These things can turn up on mortgages, debt statements, geographic claims ... all over the place. And these 'mistakes' will be taken for truth, so if you're on the wrong end of a bug like that, good luck getting out of it.

Often such a mistake gets corrected, maybe, but in certain cases you'd really have to be lucky, and you'd have to find the source document, which might have been destroyed, mislaid, or decayed. And what about errors that only come up decades from now?

Wow ... sometimes the implications are odd to ponder.

Decent device for a story, though.
posted by MacD at 1:43 PM on August 26, 2015 [6 favorites]


That story's already been written, it's called Brazil.
posted by emjaybee at 1:44 PM on August 26, 2015 [30 favorites]


This is so beautiful because it's such an "AI-like" error: if you were a machine with limited time and you had to copy stuff, it's the kind of error you might make. And this kind of thing is going to happen more and more as we rely more and more on statistical/fuzzy solutions to hard problems.
posted by andrewcooke at 1:48 PM on August 26, 2015 [8 favorites]


A screen shot of the 2013 mefi thread was included in his talk.

I wonder how many people have died or will die because of this shitty compression scheme (e.g., when used in medical care / research, engineering, etc.).
posted by D.C. at 2:09 PM on August 26, 2015


...Isn't this the exact same video embedded in the page in the first post?
posted by emptythought at 2:18 PM on August 26, 2015 [1 favorite]


Yes it is. His original post is "Posted on 2013-08-02 by David Kriesel", while the YouTube video is "Published on Aug 22, 2015". I'm fairly sure the talk is recent; I think he went back and inserted the video of his recent talk in his old post. I found the video from a recent post to /r/Programming and didn't think about the duplication.

You just can't be sure about any document any more, can you?
posted by benito.strauss at 2:29 PM on August 26, 2015 [2 favorites]


That was a pretty good talk, thanks. I wonder how many fucked up documents are out there still. Have there been any high profile screw-ups that can be traced back to this?

A screen shot of the 2013 mefi thread was included in his talk.

Metafilter, the "tech portal". :)
posted by ODiV at 2:40 PM on August 26, 2015 [2 favorites]


> I wonder how many people have died or will die because of this shitty compression scheme (e.g., when used in medical care / research, engineering, etc.).

I'm sure that there are disclaimers for medical use. If you've ever had to build anything from electronic components, there's always the huge NOT FOR USE IN MEDICAL OR MILITARY APPLICATIONS warning.

Making JBIG2 an inadmissible format is a bit of a knee-jerk response. First off, it's not trivial to discover the compression of a document. Secondly, there's a chance that some clever document processing system will use it as an intermediate format, so your seemingly legit G4-compressed document could have been through the nasty format. JBIG2 is a cool format, if you know what you're doing with it.
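(To give a sense of how un-trivial: about the best quick-and-dirty check you can do is grep the raw bytes for the filter name, and even that misses filters hidden inside compressed object streams. A rough sketch, assuming Python and a hypothetical scan.pdf:)

    # Rough check, not a forensic tool: look for the JBIG2Decode filter name
    # in a PDF's raw bytes. Filter entries can sit inside compressed object
    # streams, so a negative result here proves nothing.
    def mentions_jbig2(path):
        with open(path, "rb") as f:
            return b"/JBIG2Decode" in f.read()

    print(mentions_jbig2("scan.pdf"))   # hypothetical file name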

For truly hilarious compression artefacts, nothing beats JPEG 2000: blurry watercolour splodges replace detail.
posted by scruss at 3:12 PM on August 26, 2015


I don't know how much it's an essential part of JBIG2, but "replace part of the document with a pixel-close other part of the document" seems like exactly what you don't want when compressing text images. In fact, I think JPEG compression artifacts would be preferable -- at least they flag that the compression algorithm had a problem. JBIG2's strategy is like putting a very nice paint job over substandard or moldy drywall: it looks convincing but is wrong.
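Here's a toy sketch of the kind of "pattern matching and substitution" I mean -- not the real JBIG2 coder, just an illustration in Python with made-up glyphs and a made-up threshold:

    # Toy illustration of symbol reuse: store a glyph once, and whenever a new
    # glyph differs from a stored one by only a few pixels, reuse the stored
    # one instead. That's exactly how an 8 can silently be rendered as a 6.
    import numpy as np

    def bits(rows):
        return np.array([[c == "#" for c in r] for r in rows], dtype=bool)

    glyph_6 = bits([
        " ### ",
        "#    ",
        "#### ",
        "#   #",
        " ### ",
    ])
    glyph_8 = bits([
        " ### ",
        "#   #",
        " ### ",
        "#   #",
        " ### ",
    ])

    dictionary = [glyph_6]            # symbols already stored for this page

    def encode(glyph, threshold=4):   # threshold: max pixel mismatch tolerated
        for i, sym in enumerate(dictionary):
            if np.count_nonzero(sym ^ glyph) <= threshold:
                return ("reuse", i)   # lossy: reader sees the stored symbol
        dictionary.append(glyph)
        return ("new", len(dictionary) - 1)

    print(encode(glyph_8))            # ('reuse', 0) -> the 8 comes out as a 6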
posted by benito.strauss at 5:04 PM on August 26, 2015 [3 favorites]


Has anybody developed a vectorizing compressor for comic book scans yet? I think combining knowledge of the original printing process with the right image compression technology could create much better comic book page scans. You could even have user preferences to re-add page yellowing to taste, or whatever.

I keep dreaming about it, but I'd rather solder than type code.
posted by Chuckles at 5:27 PM on August 26, 2015 [1 favorite]


DjVu?
posted by benito.strauss at 5:41 PM on August 26, 2015 [1 favorite]


Sure, but... look at the Obama birth certificate image in the presentation. One number ends up being part of the background, so it looks blurry and unsaturated, while most other numbers are part of the foreground and appear black with sharp edges. When I was scanning the odd comic close to 10 years ago now, DjVu routinely made that kind of mistake, causing inked lines to randomly end up either in the foreground (full black) or in the background (blurry dark grey). It actually made comics look much worse.

However, there are some constants in the comic printing process that can be leveraged to improve things: anything remotely like black is black, period; colour is a matrix of dots on a regular grid, drawn from a small set of colours; lettering always has a frame around it; etc. The printing processes evolved over the years, of course, so you would adjust the algorithm for the process used in the original publication.
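A rough sketch of just the "anything remotely like black is black" rule (Pillow, with a cutoff I pulled out of thin air and hypothetical file names):

    # Crush near-black pixels to pure black; the cutoff is a guess, not a
    # calibrated value, and a real tool would tune it per printing process.
    from PIL import Image

    def crush_blacks(path, out_path, cutoff=60):
        img = Image.open(path).convert("RGB")
        px = img.load()
        w, h = img.size
        for y in range(h):
            for x in range(w):
                r, g, b = px[x, y]
                if max(r, g, b) < cutoff:   # "remotely like black"
                    px[x, y] = (0, 0, 0)    # becomes pure black
        img.save(out_path)

    crush_blacks("page_scan.png", "page_clean.png")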
posted by Chuckles at 6:42 PM on August 26, 2015


Wanna bet Xerox already had the patch, but wasn't going to release it unless someone noticed the problem?

Don't attribute to malice what can be explained by incompetence.

Like I said in the original MeFi thread, this type of compression artifact is a feature of JBIG2 when it's pushed into its lossy mode. You can configure the compressor not to do this, at some loss of compression ratio, which is reasonable; under those circumstances it still performs better than CCITT Group 4, which is your next best bet.

The issue was far more likely that PDF and highly compressed PDF were a marketing checkbox, and an engineer working on this product integrated the code (Xerox is a member of the consortium that handles JBIG2, so no big surprise that they use it, since they likely have low-cost/free licensing) WITHOUT REALLY UNDERSTANDING IT. The company I work for ships a JBIG2 codec, and we have the lossy/lossless setting front and center because that's the way it should be.

When this event happened, I searched through the Xerox manual to see which of the JBIG2 codec specifics were exposed, and it's basically none. So IMHO it was a move made out of ignorance: make the files small.

This is not the only time I have witnessed this. A customer of ours was tasked with "make our scanned document repository small(er)" and wanted guidelines/sample code. I provided some code that would do a bunch of things: check whether the image is really in color and, if not, drop it to grayscale; if it's grayscale, decide whether it's a candidate for 1-bit; drop the resolution by a selectable amount in the process; and, if you select JPEG, expose a configurable amount of compression -- with a note that pushing the resolution drop or the JPEG compression too far would affect quality. The engineer on the other end didn't think about this, set everything to maximum compression, and then set it loose on their document repository, effectively destroying it WITHOUT KEEPING A BACKUP. And while it would be bad form of me to say where this happened, I will say simply that this particular engineer destroyed a shit-ton of documents that could more accurately be described as 'evidence', all because he didn't bother to learn about the tools he was using, nor did he use proper care before unleashing them.
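For the curious, the shape of that code was roughly this -- a from-memory Pillow sketch, not the actual code I shipped, and every knob and name here is a placeholder:

    # Shrink a scanned page: colour -> grayscale if it's not really colour,
    # optionally grayscale -> 1-bit, downsample, then save with configurable
    # JPEG quality. The scale and quality knobs are exactly what a careless
    # operator can crank to "maximum" and ruin an archive -- keep backups.
    from PIL import Image, ImageChops

    def shrink(path, out_path, scale=0.5, jpeg_quality=75, bilevel=False):
        img = Image.open(path).convert("RGB")

        # 1. Is the page really colour? If it matches its own grayscale
        #    rendering exactly, drop it to grayscale.
        gray = img.convert("L")
        if ImageChops.difference(img, gray.convert("RGB")).getbbox() is None:
            img = gray

        # 2. Optionally go further to 1-bit for text-only pages
        #    (a judgement call; here it's just a caller-supplied flag).
        if bilevel and img.mode == "L":
            img = img.convert("1")

        # 3. Drop the resolution by a selectable amount.
        if scale < 1.0:
            w, h = img.size
            img = img.resize((int(w * scale), int(h * scale)))

        # 4. Save with a configurable amount of JPEG compression
        #    (1-bit pages go to PNG instead, since JPEG can't store them).
        if img.mode == "1":
            img.save(out_path, format="PNG")
        else:
            img.save(out_path, format="JPEG", quality=jpeg_quality)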

I consider the Xerox thing to be a cock-up of smaller scale but much higher profile.
posted by plinth at 7:59 PM on August 26, 2015 [7 favorites]


D.C.: I wonder how many people have died or will die because of this shitty compression scheme (e.g., when used in medical care / research, engineering, etc.).

Probably fewer than will die because of any of 1,000 recalled auto defects that didn't reach 100% of the consumer base, or because a slightly shittier grade of coal was burned in a power plant one day, or....
posted by IAmBroom at 10:37 PM on August 26, 2015 [3 favorites]


For truly hilarious compression artefacts, nothing beats JPEG 2000: blurry watercolour splodges replace detail.

Well, at least that's in line with what people sorta expect to get out of digital systems. The weird/dangerous thing about the Xerox bug is that it degrades images in a way that people reasonably don't expect.

Most people understand how you can lose a document due to low resolution: if you take a 300 dpi document and downsample it to 72 dpi, you're losing detail. Anyone can see how that happens just by zooming in on it. (And a few decades ago it was easy to explain this to people coming from the analog world, because it was analogous to film grain.)

JPEG artifacts are a bit weirder, but as they've become more common, people have gotten used to seeing them. They're pretty obvious once you've seen a really bad compression job once or twice. (The fact that many video formats use JPEG-like block compression that fails in the same way probably helps a lot too.) And when they chew up an image, it's obvious.

The Xerox thing is fairly unique as far as image degradation goes, at least that I've heard of. You don't expect a 7 to become a 1 in a photo without the whole thing being blurry or blocky; that's not a failure mode that people have been taught to expect (and in fairness, it's pretty rare).

Though I think the real takeaway point that people ought to get from this is not to always blindly believe in the infallibility of the data in the computer. The number of corrupt images produced by Xerox scanners is probably quite a bit smaller than the amount of data in databases, stored numerically or as text, that's been silently corrupted due to soft errors. If you wanted to update Brazil for the 21st century, that's what it would look like: instead of a fly in a typewriter, you'd have a radioactive decay particle or cosmic ray striking a non-ECC RAM chip, or some weird electromagnetic spike on a data bus line, just enough to flip a bit and lead to an uncorrected error in some poorly-designed system.
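To make that concrete, here's a tiny, contrived Python illustration of what a single flipped bit does to an unprotected digit:

    # One bit flips in non-ECC memory and a digit quietly becomes another
    # digit, with nothing blurry or blocky to warn anyone.
    value = ord("7")                  # ASCII 0x37
    corrupted = value ^ (1 << 1)      # flip a single bit
    print(chr(value), "->", chr(corrupted))   # 7 -> 5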

When data was mostly handled by humans, typing and retyping it or copying it by hand from one sheet of paper or card to another, there was an understanding that you'd occasionally get "keypunch errors" and a procedure would exist, ad hoc or official, to fix things. Perversely, as we have made simple transcription errors less likely, we've eliminated the procedures to fix them when they do happen, with the result that we've gone from errors being a high-probability/low-risk event to a low-probability/high-risk one. It's not clear that's exactly a win.
posted by Kadin2048 at 6:11 AM on August 27, 2015 [2 favorites]




This thread has been archived and is closed to new comments