We are going to hang in there until it works
March 8, 2018 6:59 PM

Why the PDF Is Secretly the World's Most Important File Format
posted by Chrysostom (91 comments total) 32 users marked this as a favorite

GIF or go home
posted by sammyo at 7:09 PM on March 8, 2018 [1 favorite]

Oh yeah. I mean, I don't think "secretly". The PDF is the glue that holds the white collar office world together. I feel like this is largely because actually moving the office into the future is impossible because of lawyers so instead, have e-paper!
posted by selfnoise at 7:11 PM on March 8, 2018 [15 favorites]

GIF or go home

It’s pronounced gif, stupid.
posted by Sangermaine at 7:24 PM on March 8, 2018 [30 favorites]

A case of DJVU?
posted by miyabo at 7:42 PM on March 8, 2018 [4 favorites]

In the printing industry PDF is king. It has changed my workflow to a monumental degree. From messy, smelly, bromides and film to all electronic. The Preflight tools in Acrobat are fabulous. The only area I have problems is in the Accessibility section - and oh boy! are they big problems.
posted by unliteral at 7:58 PM on March 8, 2018 [5 favorites]

I hate PDF, but that's because I've been asked a million times to implement some stupid report or custom letter as a PDF. What took 15 minutes in HTML/JS to display ends up taking hours or days to do as PDF because all the free tools suck.

Want to print? CTRL+P in your browser. Don't like the layout? Too bad.

(Yes, PDF generation just came up again this week.)
posted by Ickster at 8:04 PM on March 8, 2018 [8 favorites]

PDF is basically indispensable for us in mathematics-heavy fields too. If there was a way I could type complicated tensor calculus expressions directly into an email I would be all over it. Instead I have to write a whole separate LaTeX file, compile it to a PDF, and attach that. And then of course, if someone wants to comment on some part of a formula, they have to reply in text and we all have to parse and render their LaTeX code in our heads.

Any math nerds in the house? You know what I'm talking about.
posted by a car full of lions at 8:28 PM on March 8, 2018 [16 favorites]

uugh, my pops is an academic author currently working on a revision of a book with his coauthor. They submitted initial proof revs in Word to the proof team and are receiving them back in commented PDF, which his coauthor (running Win something) can see and mark up, but when my dad opens the files, he can see the proofer's work sometimes or once or occasionally not at all, running Mac Sierra or possibly High Sierra. As a result, his workflow is to open two copies of the proofer PDFs on different machines and commit his revisions to a separate textfile as he looks from screen to screen. I tersely informed him that if I were his manager I would make him match his coauthor's hardware setup and use that, but, you know, he's my dad.

In my life, I occasionally get layout gigs for a direct-mail catalog for a small company with a bunch of pals working for it. The layout work is executed in InDesign and submitted in PDF to the company exec, who uses an off-brand PDF reader and editor to submit the revs back. Naturally, that program cannot read the PDFs I generate in my copy of iD to meet the exec's expectations and needs with respect to creating and submitting the revs.

In short, screw PDF.
posted by mwhybark at 8:39 PM on March 8, 2018 [3 favorites]

Like other people are saying, my own experience is of trying to submit homework in PDF form in a way my profs would be able to read, then later to try to load my students' PDF homework successfully the same way. PDF isn't "a" file format, it's a metastatic tumor of file dialects.

The more niche the writer, the less likely the files it emits would be successfully readable. For whatever reason SolidWorks always seemed particularly bad at it.

And now in the private sector, the company I work for spends far too many developer hours on ongoing support for importing and exporting PDF files successfully. Thank heavens I've not been on that job, though I have seen how many problems they deal with.

P.S. a car full of lions, have you ever used Overleaf? Not bad for online collaborative LaTeX editing.
posted by traveler_ at 8:51 PM on March 8, 2018 [7 favorites]

In my work I often have to replicate PDFs provided to us by new customers within our own software. The distinction between "We made this in Word or whatever and printed it to PDF" and "We scanned this to PDF, on a dirty scanner, 15 degrees off center, and from a 9th generation photocopy" is a VERY IMPORTANT one.
posted by thecjm at 8:52 PM on March 8, 2018 [10 favorites]

It's so nice to know that closed source proprietary bullshit with all kinds of security issues is "king" in the workplace.

Adobe can get off my lawn.

They can pry my GIMP, Inkscape, and Audacity from my cold, dead hands. (Because there are zero good open source options for creating/editing PDF's, so why fucking bother?)
posted by deadaluspark at 9:03 PM on March 8, 2018 [8 favorites]

I used to use gmailtex for typing equations in Tex in an email.
Now I just type them in Tex. Whatever, anyone reading them should know how to read Tex anyhow. And anything more complicated lives on Dropbox or sharelatex.

PDF is kind of a disaster, but editing Word files if you don’t have Word is worse.
posted by nat at 9:16 PM on March 8, 2018 [2 favorites]

I stopped taking PDF seriously when I found out they'd extended it to allow for embedded Javascript.

PostScript is a complete computer language. Any task that can be expressed in Javascript can also be expressed in PostScript.

The whole point of PDF was to be a cut-down subset of PostScript that was not a complete scripting language and therefore wasn't so demanding on the limited processors of the day. If the format had any integrity at all, and if the designers had any real interest in minimizing its security attack surface, such scripting as it needs would be provided by relaxing restrictions and allowing more of PostScript (perhaps just enough to allow for execution of transpiled Javascript) to be supported, not by bolting in a second script interpreter.

Now that it's evolved into just another everything-and-the-kitchen-sink container format like RIFF (and a stupidly overcomplicated RIFF, at that), PDF has lost all claim to actual portability.

There's a portable subset of PDF that does remain genuinely useful: it's those tiny parts of present-day PDF that used to be known as, you know, PDF.
posted by flabdablet at 9:18 PM on March 8, 2018 [12 favorites]

There really should be a good open source option for creating and editing PDFs, because the point of PDF as a format is that it's just PostScript, a language basically every printer understands. There's an open-source tool that re-implements PostScript called Ghostscript, and it does PDF conversions.

Like, I'm genuinely baffled that the open-source world hasn't noticed. Mac has PDF generation built into the operating system, because it's basically a print operation.

Is everyone using LaTeX over there or something? (What is the deal with LaTeX font any way?)
posted by Merus at 9:18 PM on March 8, 2018 [1 favorite]

Mac has PDF generation built into the operating system, because it's basically a print operation.

So does any Linux that uses CUPS for printing, which is most of them, and so does Windows 10 (Microsoft having now apparently given up on trying to convince people that XPS is good for something).

The point at hand is that good PDF editors - software capable of taking an existing PDF and making alterations inside it - are thin on the ground.
posted by flabdablet at 9:25 PM on March 8, 2018 [9 favorites]

The reason it's hard to make a general purpose PDF editor, by the way, is that PDF was initially designed as a presentation format, not a semantic format, and such semantic support as now exists in it has been bolted on afterwards.

I've seen, and attempted unsuccessfully to edit, PDF documents full of text that looks absolutely perfect but where none of the words are selectable for editing, apparently because the printer driver that created the document was given the content to render as a set of fixed-size and quite small tiles covering the output page.

Printer drivers are designed for faithful rendering of whatever document you feed them, and are almost never designed to preserve the semantics of those documents in their output. This is because most printer control languages have no ability whatsoever to represent semantic entities such as words and paragraphs and tables; PDF and PS can do this but are exceptional in this regard, and if semantic information does survive the print driver translation layer this is largely by accident. The PDF export facilities built into most modern word processing software do a rather better job.

So because the overwhelming majority of extant PDF documents are in fact printer driver output, much of their semantic content is just scrambled and broken, and this is what makes real-world useful PDF editors so hard to come by. Because to be useful they have to do much, much more than a typical word processor: they also have to solve what amounts to a complicated jigsaw puzzle before even presenting the user with the content to be edited.
posted by flabdablet at 9:47 PM on March 8, 2018 [42 favorites]

In 50 years, these PDFs, even with their weaknesses, will help us document history with little of the ephemeral nature of the web. And unlike in paper form, those PDFs won’t suffer from frayed pages.

And just like in paper form, the most feasible way to extract machine-readable versions of most of the information these things contain is going to require running them through OCR and fixing the resulting errors by hand. Perhaps we will see a whole new wave of illuminated manuscripts emerge from our digital Dark Age.
posted by flabdablet at 9:55 PM on March 8, 2018 [5 favorites]

PDF is a ridiculously overcomplicated format that is hard for open source software to fully support, but Okular is really remarkably good if you don't want to pay for Acrobat Pro.
posted by miyabo at 9:55 PM on March 8, 2018 [3 favorites]

I've seen, and attempted unsuccessfully to edit, PDF documents full of text that looks absolutely perfect but where none of the words are selectable for editing, apparently because the printer driver that created the document was given the content to render as a set of fixed-size and quite small tiles covering the output page.

Or when you can select the text, but when you copy and paste it you get a line break wherever the text wraps.

I've also had plenty of issues where printing from the Illustrator version of a file is perfect, but printing from the PDF copy of that file is awful, which really shouldn't be a problem considering PDF is supposed to be so accurate to print from and we're talking two Adobe products... it's endlessly frustrating.
posted by jason_steakums at 9:57 PM on March 8, 2018 [5 favorites]

when you copy and paste it you get a line break wherever the text wraps

That's the PDF reader/editor you're using attempting to reconstruct semantic information that was lost during translation to PDF: in this case, the distinction between a line break and a paragraph break. There's simply no universally good way to do it.
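A minimal sketch of the kind of guesswork involved, in Python. This is not how any particular reader does it; the function name `reflow`, the `max_width` parameter, and the 80% threshold are all assumptions made up for illustration. The idea is that the extracted text only carries line ends, so the code has to guess which of them were paragraph ends.

```python
def reflow(lines, max_width):
    """Join visually wrapped lines back into paragraphs using a crude guess."""
    paragraphs, current = [], []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            # A blank line is treated as a definite paragraph break.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(stripped)
        # Assumed heuristic: a line that ends in sentence punctuation and is
        # clearly shorter than the column width probably ends a paragraph.
        if stripped[-1] in ".!?" and len(stripped) < 0.8 * max_width:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


extracted = [
    "PDF stores glyphs at positions, not",
    "paragraphs. Short end.",
    "A new paragraph that wraps onto",
    "another line here.",
]
for paragraph in reflow(extracted, max_width=36):
    print(paragraph)
```

The weak spot is easy to see: a paragraph whose last line happens to fill the column, or a wrapped line that happens to end in a full stop, gets guessed wrong, which is why no single rule works everywhere.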
posted by flabdablet at 10:02 PM on March 8, 2018 [2 favorites]

I find the blind trust folks place in PDFs especially hair-raising since the one time when some numbers in a pdf that Preview on my Mac let me select and copy, when pasted, turned out as other numbers (something like that weird old photocopier effect).

Though this awareness that every hour we’re dealing with actual arcana can sometimes feel strangely thrilling, like constantly having to detect and tame a secret beast.
posted by progosk at 10:21 PM on March 8, 2018 [8 favorites]

So basically PDF's are great as long as you're not trying to actually create them, edit them, or extract information from them. Just look at the pretty picture and don't peer underneath the many layers if you value your sanity.
posted by dilaudid at 10:27 PM on March 8, 2018 [26 favorites]

PDF is a reasonable electronic substitute for printing on paper. Attempting to make it do almost anything you can't also do with printed paper is asking for trouble.

the one time when some numbers in a pdf that Preview on my Mac let me select and copy, when pasted, turned out as other numbers (something like that weird old photocopier effect)

I've seen PDFs where everything that actually makes it onto the paper is just a single pre-rendered 300dpi compressed image; it's entirely possible that the weird old photocopier effect is exactly what was going on with your PDF, and that the numbers you copied and pasted, being derived from a hidden semantic layer rather than the visible graphics so tempting to take as definitive, were actually the correct ones.

On the other hand, those graphics might have come from a scanner and the "underlying" semantic layer added after scanning (but possibly before image compression) via OCR. Hard to tell without examining the PDF concerned in a text editor.
posted by flabdablet at 11:19 PM on March 8, 2018 [5 favorites]

PDF forensics, clearly an area with unlimited growth potential, given the format’s institutional creep...
posted by progosk at 11:37 PM on March 8, 2018 [4 favorites]

I develop control software for cutting and milling machines for packaging and signage, and when I want to make stuff, either for test purposes or for my private use, I end up not using our custom, advanced design software that ties into the whole packaging toolchain etc etc. I usually want to make something out of Kaizen foam for my home tools and don't really need the full tool set that Coca Cola use to manage packaging from prototype visualisation and sample making through brand and colour management, production, palletization, store display, point-of-purchase handling and disposal.

What I do is to draw my designs in the FOSS application Inkscape, keep each layer a separate colour, export to PDF which the cutting and milling table happily imports and go on my merry way. So PDF saves me a lot of trouble in my very niche application.
posted by Harald74 at 12:05 AM on March 9, 2018 [2 favorites]

PDFs are hell if you can't easily read print, because they resist text extraction and reformatting with dogged determination. And a lot of the in-browser rendering tweaks for resetting colour, contrast and so on, just don't work with whatever it is the browser is using to render PDFs in-place.

I've tried and failed with many ways to easily read PDFs; my current technique is to import them into Google Docs, because there at least my browser contrast tweakers work. But I read a lot of scientific and engineering papers, where the compulsory formatting conventions of which font is acceptable, footnote, formula and graph presentation, what columnar layout is expected and so on - that are compulsory for publication - don't work well for me at all. I want to spend all my mental energy on understanding the paper, not in crowbarring the content out or guessing what the bits I cannot see might be saying.

PDF is a really good example of an ecosystem with huge assumptions about accessibility, and like many digital such, the biggest frustration is that it doesn't have to be that way. So close, and so far.
posted by Devonian at 12:55 AM on March 9, 2018 [10 favorites]

You can rank on PDF all you want. But when the day comes that you ask someone to send you a copy of a document and what they send you is a jpg image of a page that they took with their cellphone, well, you'll come to appreciate PDF more.
posted by SPrintF at 1:53 AM on March 9, 2018 [10 favorites]

PDF, XLSX, DOCX. Those are the three.
posted by GallonOfAlan at 2:19 AM on March 9, 2018 [1 favorite]

And when Apple added vector artwork to iOS, they settled on PDF as the file format. Which is a colossal WTF.

Presumably there is code somewhere (ideally at compiletime, though probably at runtime) that goes through the PDF, picks out the vector paths, ignoring the myriad of other possibilities for what could be in there, and creates some sort of concise in-memory data structure. One hopes that the sheer openness of the PDF format does not lead to an attack surface measurable in parsecs. (What's to say that, at some tributary of the code path, the PDF isn't passed to a third-party library which runs the JavaScript in it and has a buffer overflow in it somewhere, for example?) Though it still makes me cringe. Especially since, in the age of Swift and algebraic data types, best practice is converging on structuring one's data representations to make illegal states unrepresentable, and bunging PDF in as a vector image format is a Trump-scale fuck-you to this whole notion of best practice.
posted by acb at 2:27 AM on March 9, 2018 [2 favorites]

"A case of DJVU?"

DJVU, and its fucking JB2/JBIG "compression" format, can DIAFF.

For some reason a lot of old schematics & service manuals & such, often poorly printed with values & annotations in tiny fonts & non-English scripts & languages, were scanned by rabid anti-PDF folks during the small window DJVU was slightly popular. And never checked for accuracy because, hey, it's open source & compresses better than PDF so it must be perfect, right?

Or, as it would read in those scans:
"ter semo roasen a /eF of e/d schomaFics & sorvico manua/s & such, efFon poor/y prinFod wiFh va/uos & anneFaFiens in Finy fenFs & nen-Bng/ish scripFs & /anguagos, woro scannod by rabid anFi-PDt fe/ks during Fho sma// windew DJVU was s/ighFly pepu/ar. And novor chockod fer accuracy becauso, hoy, iF's epon seurco & cemprossos boFFor than PDt se iF musF bo porfocF, righF?"
posted by Pinback at 3:36 AM on March 9, 2018 [11 favorites]

PostScript is a complete computer language. Any task that can be expressed in Javascript can also be expressed in PostScript.

Back in the days of automating forms at the insurance/finance company ( with very specific font requirements by the regulators ), I ended up using Vi to hand-code the postscript to print the forms. Think of it like assembly language for printing documents.

Thankfully the Perl module PDF::Merge came out, and we migrated to regulator-provided PDFs with our data overlaid on top, simplifying both my workload and compliance with regulators' requirements.
posted by mikelieman at 3:51 AM on March 9, 2018 [2 favorites]

Back in the days of automating forms at the insurance/finance company ( with very specific font requirements by the regulators ), I ended up using Vi to hand-code the postscript to print the forms. Think of it like assembly language for printing documents.

More like FORTH, only with dynamic types; i.e., everything gets pushed to a stack, which commands consume values from.
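For anyone who hasn't met a stack language, here is a minimal sketch of that evaluation model in Python. It is a toy, not a PostScript interpreter: the function `run` is hypothetical, and it only borrows a handful of real PostScript operator names (`add`, `sub`, `mul`, `dup`, `exch`, `pop`) to show operands being pushed and commands consuming them.

```python
def run(program):
    """Evaluate a whitespace-separated postfix program and return the final stack."""
    stack = []
    binary_ops = {
        "add": lambda a, b: a + b,
        "sub": lambda a, b: a - b,
        "mul": lambda a, b: a * b,
    }
    for token in program.split():
        if token in binary_ops:
            # Binary operators consume the top two values and push one result.
            b = stack.pop()
            a = stack.pop()
            stack.append(binary_ops[token](a, b))
        elif token == "dup":
            # Duplicate the top of the stack.
            stack.append(stack[-1])
        elif token == "exch":
            # Swap the top two values.
            stack[-1], stack[-2] = stack[-2], stack[-1]
        elif token == "pop":
            # Discard the top value.
            stack.pop()
        else:
            # Anything else is treated as a numeric literal and pushed.
            stack.append(float(token) if "." in token else int(token))
    return stack


print(run("3 4 add 2 mul"))  # (3 + 4) * 2
print(run("5 dup mul"))      # 5 squared
```

The dynamic typing shows up in the last branch: values carry their own type onto the stack, and the operators just take whatever they find there.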

Back in the USENET days, I once made a signature which contained PostScript code for rendering my initials in an iterated function system. It was one line over the USENET 4-line signature convention, though, despite having been optimised to within an inch of its life.
posted by acb at 4:01 AM on March 9, 2018 [3 favorites]

PDF is basically indispensable for us in mathematics-heavy fields too. If there was a way I could type complicated tensor calculus expressions directly into an email I would be all over it.

Oh, you're looking for TeX for Gmail.

Enjoy.
posted by entropone at 4:47 AM on March 9, 2018 [3 favorites]

We have blind undergrad and graduate students that use screen readers. While we have accessible pdf tools, the problems with columns, tables, graphs, formulas and math are substantial. If you are using them for courses or online purposes, get familiar with making them accessible. Adobe and YouTube have plenty of information on how to do this. Less information on math.
posted by childofTethys at 5:01 AM on March 9, 2018 [3 favorites]

As someone for whom handling letters to the editor and reader-submitted op-eds is a major part of the day's duties, I find the PDF to be a recurring pain in the arse.

Please, please, PLEASE, if you correspond with a media outlet, do so in the format that is not unbearably cumbersome to edit and extract information from. Or, as I say countless times every day, "Please cut and paste your submission into the body of the email."

Thank you.
posted by virago at 5:04 AM on March 9, 2018 [2 favorites]

Because there are zero good open source options for creating/editing PDF's

I would honestly love to hear about them. I have to do some basic PDF editing on the regular, and I haven't found any that do what I need them to (crop pages, add images, add new pages from the clipboard).
posted by Rock Steady at 5:06 AM on March 9, 2018

PDFescape works better than any of the open source PDF editors I've experimented with.
posted by flabdablet at 5:09 AM on March 9, 2018 [4 favorites]

as I say countless times every day, "Please cut and paste your submission into the body of the email."

Attaching a simple text file works too, and can avoid some of the violence that email clients like to commit against plain text.
posted by flabdablet at 5:12 AM on March 9, 2018 [1 favorite]

...because the point of PDF as a format is that it's just PostScript, a language basically every printer understands.

Oh, if only that were true. Even today, most office-level printers you run across offer Postscript as an add-on module, and not part of the out-of-the-box configuration. And those add-ons are often priced very dearly. At the last office job I had, I was the one-man art department (using the only Mac in the building, of course). There were about a half-dozen big office printers in the place, none of which spoke Postscript, which I needed to print proofs and mock-ups. IT finally (and begrudgingly, because Mac) got a PS module for the one really good printer.
posted by Thorzdad at 6:00 AM on March 9, 2018 [1 favorite]

What I find most questionable is how many US Government offices, such as the Department of State, the Patent Office, etc., absolutely require the use of PDF format, even for simple one page submissions. Seems like an unnecessary dependency on proprietary formats beyond the people's immediate control.
posted by StickyCarpet at 6:02 AM on March 9, 2018

Even today, most office-level printers you run across offer Postscript as an add-on module, and not part of the out-of-the-box configuration.

And the PostScript implementations are sometimes buggy even when they do come as standard. I once had to switch an entire fleet of school workstations driving classroom Samsung ML-3471ND printers over from PostScript to the rather less network-friendly PCL6 just because one of the teachers had found a document on the Web that would make those printers lock themselves up to the point of needing power-cycling to unstick if printed via the PostScript driver.

Apart from a typewriter emulation mode for mostly unformatted text, PCL6 requires that the computer does all the rendering and then transfers the results into the printer as a massive but very straightforward pixel map. It really doesn't matter at all what was in a PCL6-rendered document by the time the printer sees it, because the most complicated job the printer could possibly need to do as a result is image scaling.

This is in sharp contrast to PostScript, which is capable of asking the printer itself to perform arbitrarily complex computation, complete with all the failure modes (e.g. running out of RAM) that could possibly go with that.
posted by flabdablet at 6:29 AM on March 9, 2018 [2 favorites]

Adobe can DIAF for creating a new format of PDF, called XFA -- or colloquially a "Livecycle PDF" -- which has the PDF extension but is most definitely not a PDF and is not backwards compatible with PDF libraries that have been around for years, causing customers to blame us because the file is obviously a PDF so why doesn't your software open it, so it must be your software is too buggy to use. Note that even in the Adobe world, if you open up their all-powerful fully functional Adobe Acrobat program, and try to open an XFA PDF, it says, sorry, it cannot. XFA is a blight on the PDF format, if only because Adobe has enough clout to assign the same extension to wildly incompatible formats simply because they have the only viewer that natively supports both.

Yes, I'm a little bitter. A good chunk of my support calls are because of users trying to send XFA PDFs through software not designed to handle it.
posted by AzraelBrown at 6:48 AM on March 9, 2018 [5 favorites]

how many US Government offices, such as the Department of State, the Patent Office, etc., absolutely require the use of PDF format, even for simple one page submissions. Seems like an unnecessary dependency on proprietary formats beyond the people's immediate control.

TFA explains some of this, but PDF is a number of different things, and a lot of people understandably confuse Adobe's product, the Reader/Editor, with the spec, which has been open since 1993 and is an ISO standard. PDF's archival formats (PDF/A etc.) are specifically intended to preserve documents in a way most likely to assure that they're readable in perpetuity (not relying on external fonts, etc.) while maintaining a chain of provenance.

PDF is actually quite good at what it does, people just keep trying to make it do more things.
See also Excel, which is in my mind MS's best product by far, but due to ubiquitousness gets used for things like project management for which it was never intended.
posted by aspersioncast at 7:04 AM on March 9, 2018 [9 favorites]

PDF is a great terminal file format. End of the pipeline before it hits the paper. That’s it.
posted by oceanjesse at 7:14 AM on March 9, 2018 [2 favorites]

And when Apple added vector artwork to iOS, they settled on PDF as the file format. Which is a colossal WTF.

Because the Quartz stack is literally a working GPU accelerated implementation of Display Postscript.
posted by Talez at 7:22 AM on March 9, 2018 [1 favorite]

PDF is a reasonable electronic substitute for printing on paper. Attempting to make it do almost anything you can't also do with printed paper is asking for trouble.

TRUTH.

PDF is *great* as a terminal format, but it's not something I'd want to use as a collaboration format. Shit, I even hate using Word's markup features; just tell me what changes you want in a companion email, with page numbers and/or headings. That's easier to track for me.
posted by uberchet at 7:33 AM on March 9, 2018 [5 favorites]

When it works as intended, PDF can be very, very good.

I once volunteered to do a newsletter for a club. I was informed by the person who used to do the newsletter that I would need to buy and learn QuarkXpress, because the printer only accepted QuarkXpress files, and I would have to take the newsletter to them on an external drive with all the relevant files and pictures because there was no other way to bundle it all up. I called the printer and asked if they could accept a PDF prepared out of InDesign with "for press" settings. Yes, that would be just fine. Problem solved.

The next person who took over the newsletter gave up on printing it and just emailed around a Word file.
posted by lagomorphius at 7:34 AM on March 9, 2018 [2 favorites]

And when Apple added vector artwork to iOS, they settled on PDF as the file format. Which is a colossal WTF.

Because the Quartz stack is literally a working GPU accelerated implementation of Display Postscript.

NeXT used Display Postscript, so this goes back a while.
posted by lagomorphius at 7:36 AM on March 9, 2018 [1 favorite]

Another annoyance I get calls about is PDF viewers that can view most any PDF just fine on screen, but blow up when you try to print the document. Microsoft Edge makes itself the default PDF viewer in Windows 10, and then clogs up the print queue with ... exactly what I don't know ... but it never comes out the printer. Mac OS's viewer displays documents that look just fine until you go to the printer and find a pile of blank pages. Or a pile of solid black pages. So everybody has to install Adobe Acrobat Reader and make it their default because Adobe Acrobat Reader is the only thing that will actually print anything you throw at it.
posted by lagomorphius at 7:42 AM on March 9, 2018

As someone who printed a lot of stuff that had to be properly formatted (from business cards to administrative supplies) PDF sorted out a lot of problems before they existed - all I needed was make sure the margins cleared and tell the ones printing to respect sizes. For digital implementations of multi-page printed material, it's also OK (as long as the text can be searched, and it's not just scans).

Most other uses range from unnecessary to wrong.
posted by lmfsilva at 7:44 AM on March 9, 2018 [1 favorite]

Lately something else that's been bizarre with pdfs on my Mac (High Sierra) is that over time the icons switch themselves from Preview pdf icons to Adobe pdf icons, and with it the default app they open in. In the "Open with..." panel, Preview no longer appears as an option, and if I go into the file's Info window, in order to switch the default "Open with..." app back to Preview, I need to navigate to it; it's not offered as a primary option. Then, even if I toggle the "Apply to all" option, this lasts for a couple of days, until things get switched again.

This is so weird that I'm actually worried about my Mac - but then would you put it past Adobe to try shit like this?

Truly, the format is proving to be their perfect trojan horse...
posted by progosk at 8:16 AM on March 9, 2018

Ooh, yes: XFA is very special. The way it is always trivially encrypted is a headscratcher.

But PDF (+ a person) was responsible for a very expensive mistake for a publisher I worked for. Proofs for approval had to be printed with the utmost accuracy, because once you signed off, any issues were at your cost, not the printer's. Proofs used to come in on shiny bromides and get physically measured and signed off by the production and editorial team. Then we started getting them as PDFs, which the production team would very carefully print to size on a specially calibrated (size corrections! pre-heat print jobs! paper specially acclimatized!) laser printer for signoff.

But for one book, a new senior editor got the proof PDFs directly by e-mail. We were working on the calibrated proofs when they came by, all smug, saying we should stop as they'd already signed off on PDF proofs printed from their desktop laser. Yes, they had really done "Print to Fit", so this book ended up being bigger than the rest of the range and needing special boxes to ship it. The costs were in the hundreds of thousands of pounds.

They kept their job, for some reason.
posted by scruss at 9:29 AM on March 9, 2018 [6 favorites]

Metafilter: Most other uses range from unnecessary to wrong.
posted by Space Kitty at 10:07 AM on March 9, 2018 [5 favorites]

I spent an hour yesterday trying to export a document with images from Microsoft Word to a PDF in a way that didn't drop the Word Art, on both a Windows machine and a Macbook pro. It was not pleasant. I'm not sure if these incompatibilities were due to antagonisms between Adobe and Microsoft, or Apple, or just one of those things we live with. The internal translator worked on the Windows machine, but the Adobe plugin was a mess, and the opposite was true on the OS X machine.
posted by mecran01 at 10:22 AM on March 9, 2018

I spent an hour yesterday trying to export a document with images from Microsoft Word to a PDF

Couldn't you just print to PDF on the Mac?
posted by mwhybark at 11:11 AM on March 9, 2018 [1 favorite]

I spent an hour yesterday trying to export a document with images from Microsoft Word to a PDF in a way that didn't drop the Word Art, on both a Windows machine and a Macbook pro. It was not pleasant.

To be fair, that's just technology's subtle way of reminding you that Word Art is the Wrong Thing.
posted by flabdablet at 12:17 PM on March 9, 2018 [1 favorite]

Devonian, do people in your field publish preprints to ArXiV? It lets you download DVI or even the raw source files.
posted by MengerSponge at 12:38 PM on March 9, 2018

it's those tiny parts of present-day PDF that used to be known as, you know, PDF.

Out of curiosity, is there a defined dialect that maps to the only-a-subset-of-PostScript part of PDF? Is PDF/A basically that?

At one point I thought you could select PDF/A in the MacOS print-to-PDF dialog, but it seems to have been taken out and shot by Apple sometime in the intervening years in the name of "simplicity", I guess. But I used it fairly often when I was doing technical writing to try and produce output that would be readable across various platforms.

(You can still produce it using Ghostscript, it's just not as easy as the dropdown box in the Print dialog that I swear used to exist.)
posted by Kadin2048 at 12:49 PM on March 9, 2018

I've worked professionally with PDFs for over 15 years. There are features from Acrobat 4.0 that were removed in 5, that I miss horribly.

It's terrific as a print-ready format. It's horrible for pretty much everything else. It makes hard-to-read ebooks, clunky business documents, and horrific collaborative docs. And the new changes that mean "convert to PDF" is different from "print to PDF" are not helping anything.

I love PDF's archive abilities but wish companies would stop using it for in-process business documents; dealing with "RFQ version 3/4/7.6/9.1.1 revised" in PDF is a pain, and trying to track down the requisite Word doc is a study in corporate ineptitude. ("Oh, but we only send out the PDFs because the docs use our custom fonts, and not everyone has those." Well, dipshit, if those are The Official Company Fonts, make sure they're installed on Official Company Computers, and you won't have that problem.)

Feel free to hit me up for a two-hour rant about the problems with pdfs-as-ebooks and why ereaders all suck at PDFs.
posted by ErisLordFreedom at 1:06 PM on March 9, 2018 [6 favorites]

...export a document with images from Microsoft Word to a PDF...

Well, there’s yer problem. /s

Word outputs some of the most heinous PDFs. It’s almost as if Microsoft was pissed that they didn’t invent the PDF, so they implemented a purposely deranged export engine in Word.

Seriously. Surprisingly often, I have to break-open a PDF (using Illustrator) to extract images, logos, etc. for a new job. It can be a bit of a mess, depending on how complex the art/design in the PDF is. But a PDF exported from Word (and, yes, businesses actually do create brochures in Word) is its own kind of dumpster fire. Clipping masks inside clipping masks inside clipping masks, etc. etc. etc. Arrrrrrrrrgh!
posted by Thorzdad at 1:13 PM on March 9, 2018 [2 favorites]

I remember 25 years ago when I believed that document portability was soon to be a solved problem.
posted by Ivan Fyodorovich at 1:30 PM on March 9, 2018 [7 favorites]

My D&D campaign has ended up using some crazy PDF files for our character sheets. They don’t work in OS X’s Preview, only in the Acrobat viewer - in fact they claim that even *opening* them in Preview can irrevocably fuck them up, though that hasn’t happened to me yet. They are chock full of all kinds of JavaScript to deal with all the arcane calculations involved in building and leveling up a D&D character; they have their own *toolbar* that pops up. It’s kind of insane.
posted by egypturnash at 2:20 PM on March 9, 2018 [2 favorites]

→ It’s almost as if Microsoft was pissed that they didn’t invent the PDF

Oh, they've got their own back by now. Every Word document published as PDF on the web has a title starting "Microsoft Word …". It's as if they looked at the metadata standards and then decided, "nah, fuck 'em."

The one very strange omission from the feature-bloated PDF standard is non-rectangular pages. I mean, c'mon …
posted by scruss at 2:30 PM on March 9, 2018

Has anyone ported Doom to a PDF?
posted by paper chromatographologist at 2:40 PM on March 9, 2018 [10 favorites]

The monopolies dominating computer software suck. All of them. That said, (OK maybe screamed over the rooftops), people need to use appropriate software. A lot of the complaints here are about functions PDFs cannot perform. You should never ever edit anything in Acrobat. (Insert long lists of stuff you should never ever do in Acrobat) I use at least 10% of my work time trying to convince people to use the software that is appropriate for whatever they are doing, and 5% arguing with management that they should invest in that software even though that means paying evil monopolies for stuff any kid can do with a ruler and a scissor and an old-school type-writer. I hate it too. But if you hate it and you are a huge publicly financed university, develop your own goddam open source stuff.
posted by mumimor at 2:40 PM on March 9, 2018 [3 favorites]

I like editing PDFs, mainly because it feels like I'm pulling one over on the people who think PDFs can't be edited.

"Oh, your expected process is download this form, print it out, sign it, scan the signed version, and email that to you? I don't have a printer, but hey, I've scanned my signature before; I'll just copy that out in Photoshop, remove the line, save as a transparent PNG, paste that into the PDF, resize it to fit the line available; then create form fields to fill out everything that doesn't require a signature; and then - to avoid the weirdness of image blocks loading separately and the risk of someone editing changes into it by accident - save the PDF as a full-page image file, turn that back into a PDF, and OCR it (because I'm keeping a copy, and the OCR'd version will show up on searches later)."

I am appalled at how many court cases rely on printouts of PDFs as evidence, because I know how easy those are to change.
posted by ErisLordFreedom at 3:41 PM on March 9, 2018 [5 favorites]

I like editing PDFs, mainly because it feels like I'm pulling one over on the people who think PDFs can't be edited.

You have a future in politics, my friend.
posted by deadaluspark at 3:54 PM on March 9, 2018 [3 favorites]

"Has anyone ported Doom to a PDF?"

Don't know - but there used to be a webserver http daemon written in postscript, as well as (less impressively because it's basically its job) quite a few raytracers.
posted by Pinback at 4:03 PM on March 9, 2018 [1 favorite]

I am far from a PDF expert and this is what has been working for me.

Several years ago I was looking for software to export images from files that I could only get in PDF format. I researched a bunch of PDF software and most of it just didn't do what I wanted. I happened across PDF-Xchange viewer, a free program that does exactly what I want it to do. I have been using it ever since. It has been updated to PDF-Xchange Editor, with most features free, and some needing a license for more complex things. The non-commercial license fee is $43.50 which seems reasonable to me.

I feel like I am the only person who has ever used it because it never comes up in discussions like these. The company, Tracker Software, makes several additional end-user and developer tools that I haven't investigated, although I am considering a license that would let me combine sets of images into PDFs.

That being said, I agree with everyone who says one needs to use the appropriate tool for the job and PDF is insane for collaborating.
posted by Altomentis at 4:12 PM on March 9, 2018

A few years ago I stumbled across a surprising life-hack for viewing and printing PDFs that choke Acrobat and Acrobat Viewer: Open them in the Chrome browser. It's actually a lot more robust for PDF viewing than the Adobe products.
posted by ZenMasterThis at 4:58 PM on March 9, 2018

I'm just commenting to plug Sumatra PDF for any Windows folks that need and/or want a nice open source PDF (and other document file) reader without all the editing stuff and it's not Adobe anything.
posted by glonous keming at 5:15 PM on March 9, 2018 [1 favorite]

Any math nerds in the house? You know what I'm talking about.

As well as TeX for Gmail, there's a decent Thunderbird add-on called LaTeX It!, also.

(I'm mainly just commenting to second the love for Okular. I live and die by Kile + Okular. If we're talking about stuff that can DIAF, it's Overleaf/Sharelatex.)
posted by busted_crayons at 5:53 PM on March 9, 2018 [1 favorite]

As many quibbles as I have with Acrobat and PDFs, the biggest issues for me are, as always, the users. "Oh, I have our logo as a PDF, will that work?" "Perfect, thanks!" *receives a PDF scan of a business card from their office copy machine*
posted by jason_steakums at 6:23 PM on March 9, 2018 [5 favorites]

ctrl-f "bold"
Phrase not found

I'm disappointed in all of you.

"Bold.. bold.. bold! D-d- One doesn't need bold. Bold is to make up for not being able to properly use an exclamation point. I think it's deplorable."
posted by numaner at 9:09 PM on March 9, 2018

I'm going to show my future children that commercial and it'll be like how I watch old car commercials from the 60's, that wonder and amazement at such an old-fashioned thing!
posted by numaner at 9:10 PM on March 9, 2018

Any math nerds in the house? You know what I'm talking about.

Yeah.

LaTeX
BibTeX
BibTeX
LaTeX
LaTeX

Why don't you just compile with one command like you should?!?
posted by JoeXIII007 at 9:52 PM on March 9, 2018 [4 favorites]

"Oh, your expected process is download this form, print it out, sign it, scan the signed version, and email that to you? I don't have a printer, but hey, I've scanned my signature before; I'll just copy that out in Photoshop, remove the line, save as a transparent PNG, paste that into the PDF, resize it to fit the line available; then create form fields to fill out everything that doesn't require a signature; and then - to avoid the weirdness of image blocks loading separately and the risk of someone editing changes into it by accident - save the PDF as a full-page image file, turn that back into a PDF, and OCR it (because I'm keeping a copy, and the OCR'd version will show up on searches later)."

"We don't accept digital signatures."

"Wooden table or GTFO" refers to a workflow antipattern often bemoaned on The Daily WTF involving documents being printed, then placed on a wooden table and photographed with a phone, the resulting image then being uploaded to the Web or attached to an email.
posted by flabdablet at 10:47 PM on March 9, 2018 [1 favorite]

My two favorite quotes from the Acrobat 1.0 promo video:

"Bold is to make up for not being able to use a properly-placed explanation point. And I think it's deplorable."

"Within the email program I usually like to take it, fax it to myself and then I fax the fax to myself so I can see what the other person's gonna get."
posted by bendy at 11:35 PM on March 9, 2018

If we are going to whine about Tex compiling then “dude the line number you told me the error was in is half the freaking document away from where I forgot to end that itemize”..

I feel like it is more appropriate though to complain about people who put out PDF publications with graphics that take several minutes to load, because they didn’t bother to optimize for PDF.
posted by nat at 12:02 AM on March 10, 2018 [3 favorites]

when you copy and paste it you get a line break wherever the text wraps
This is trivial, any basic text editor ‘find and replace’ will fix this.
posted by unliteral at 5:11 AM on March 10, 2018
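
For anyone who would rather script that find-and-replace, here is a minimal Python sketch. It assumes that a single newline is a wrap artifact and that a blank line is a real paragraph break; the function name `unwrap` is made up for illustration, and it makes no attempt to rejoin words hyphenated across a line end.

```python
import re


def unwrap(text: str) -> str:
    """Join hard-wrapped lines, keeping blank lines as paragraph breaks."""
    # Normalise Windows and old Mac line endings to "\n" first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A newline with no newline on either side is treated as a wrap
    # artifact and replaced by a space.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse any runs of spaces or tabs left behind by the join.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text


if __name__ == "__main__":
    pasted = "when you copy and paste it\nyou get a line break\n\nA new paragraph\nstarts here"
    print(unwrap(pasted))
```

Text with meaningful single line breaks (poetry, code, addresses) would be mangled by this, so it only suits ordinary prose.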

"Within the email program I usually like to take it, fax it to myself and then I fax the fax to myself so I can see what the other person's gonna get."

The first good reason to install Acrobat way back when was because the New York Times made their condensed "ships at sea" fax edition available as a free download every day (not sure if it was web or ftp). It was nifty at the time.
posted by lagomorphius at 7:10 AM on March 10, 2018 [3 favorites]

This is trivial, any basic text editor ‘find and replace’ will fix this.

This is the complaints department, not the solutions department!
posted by jason_steakums at 7:19 AM on March 10, 2018 [6 favorites]

Devonian, do people in your field publish preprints to ArXiV? It lets you download DVI or even the raw source files.

Sometimes, but with journalism it's not always a good idea to work from preprints, it's not always clear that something is in ArXiV and the reference stuff I have to work with is so often in PDF... It's not that it's impossible to get stuff out of PDF, and it's not my only source, it's that it's time-consuming and error-prone, and the same goes for chasing stuff down through other sources. Time and error are the evil co-joined giants of journalism. (And paywalls, but let's not.) Had I but world enough and time, PDF's coyness would be no crime.

The thing about PDFs is that the data is there, and canonical, and digital. It is the standard. And if you can't read the printed output it represents, it doesn't care. I get that it's designed to do exactly that, but there's nothing else close to it - and something like a PDF microscope would transform my life.
posted by Devonian at 5:20 PM on March 10, 2018

What exactly would you like and/or expect a "PDF microscope" to do? Because there may well be existing tools that would work for you.
posted by flabdablet at 3:11 AM on March 11, 2018

I rather imagine there are, but they're hard to find - it's a case of all happy families are alike, all unhappy families are unhappy in their own way, so it's difficult to generally review or classify how well a particular tool helps individuals with (as in my case) quite weirdly busted sight. There are lots of tools aimed at, for example, macular degeneration, a condition with which I share some attributes but not others, and they're mostly useless.

More snowflakes than Scott Antarctic Base. Thus:

I can in general easily read text provided it's set up in a particular mix of font (heavyish, broadish sans) with a particular mix of line and character spacing, a single column that at my optimum font size (which varies according to content) fits perfectly onto the screen - not too much space at the sides, no wordbreaks at line ends. Ink and paper colours need to be very contrasty, but this varies dramatically with colour choice.

Fortuitously, Metafilter Classic is totes Goldilocks. It's why I spend so much time on here. Honest.

Graphics, maps, tables, charts, etc... as above, only it's super useful to be able to individually tune the background colour, so I can see where they stand and end on the page. In fact, being able to tune colour/contrast/line weight differently between body text and graphics would be dreamland, but I've never seen that.

Thin lines are invisible or, worse, bleed over from their ends and continue in my vision far beyond where they end on page. Contrast and colour are even more important here.

All these things interact. Something as simple as screen invert works a lot of the time, especially with browser zoom, except for buggering up graphics.

Cherries on top: easily manipulated preset settings. Supercherry on top: smart sensing of content with best-fit preset application. Not a cherry at all: any of the above with a UI I can use.

Anything that delves into a PDF and puts these handles on the bits inside would be... well, even if it's 20 percent there and open source, I'd strap on my old coding trousers and help out. (Although for extra bonus points - put your favourite toolchain through the filter described above... super configurable often, usable UI less so, and updates that take me back to Day Zero are not unknown)
posted by Devonian at 4:36 AM on March 11, 2018

Anything that delves into a PDF and puts these handles on the bits inside would be...

disappointing in most cases, I fear, because of the aforementioned presentation vs semantics issue; "the bits inside" will simply not be grouped and/or ordered in any sane way if the PDF you're working with came out of a print driver, which most PDF files do :-(

In your position I'd be giving up on looking for a specifically PDF microscope, and instead searching for an OCR and document analysis application capable of half-decent conversions of input images (in any image format, PDF included) to vaguely well-structured HTML. To the best of my knowledge, the commercial ABBYY FineReader is currently the least bad of these.
posted by flabdablet at 6:03 AM on March 11, 2018 [1 favorite]

Yes, I've gone down that route with Tesseract and may do so again - I got it working for a particular task, and it was quite painful, but I might be able to broaden it out.

ABBYY is expensive and has per-annum page limitations which I'd hit with three 12-page PDFs a day. Will I do that? Probably not, but the free trial is limited to 100 pages, which is really not enough to experiment with and the product is clearly aimed at enterprises, has restrictions on running under VMs... the whole package put me off. Perhaps I should hold my nose and try it, but the combination of enterprise-aimed software + unusual requirements has a poor track record for me. I am not the droid they are looking for.
posted by Devonian at 7:19 AM on March 11, 2018

In my industry (commercial construction), PDF is king. Plans are generated by the architect in CAD and shared with contractors by PDF. We then measure and do material take-offs directly from the PDF. When I generate quotes to customers, I am converting an Excel or Word file to PDF for final presentation. Submittal packages for architect approval require me to take documents from a variety of sources, package it into one file, add markups and comments, and then forward it for review. PDFs have transformed what has traditionally been a very, very paper-intensive process.

Which isn’t to say it’s all sunshine and lollipops. For too many of my subcontractors, a PDF is something that you print off the internet, scan into an email and then forward to me. Image-based PDFs are ridiculously large and you lose the ability to do text searches. Also, they’re almost always scanned upside-down. When architects convert their CAD files, they rarely flatten the layers or divide the file into separate pages. Anyone ever try to do even the most basic work in a 300+mb PDF file? It ain’t pretty. Which brings me to my last complaint — Adobe Acrobat is not that good. Especially when you factor in its cost. The construction industry seems to have adopted Bluebeam as the PDF editor of choice. It is less expensive, more powerful, and most importantly more flexible than Adobe Acrobat.
posted by Big Al 8000 at 9:23 AM on March 11, 2018 [1 favorite]

I've gone down that route with Tesseract ... it was quite painful

Tesseract only does the OCR part though, yes? What's required is something capable of using the positioning of characters and graphics on the "printed" page to infer what's text and what's borders and other graphical elements, which characters form words, which words form text blocks, how the reading flow joins text blocks together, where the paragraph breaks are, and where the graphic blocks are supposed to sit relative to which parts of the text.

That's document analysis rather than mere OCR, and I believe FineReader makes some attempt to do it. There's an open source project called OCRopus that's supposed to do the same thing, but it appears to be progressing at a somewhat glacial pace and I'm not sure how well it works.
posted by flabdablet at 10:19 AM on March 11, 2018
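
To make the positional inference above a bit more concrete, here is a toy Python sketch of its first two steps: clustering characters into lines by vertical position, then splitting each line into words wherever the horizontal gap is large. Everything in it is an assumption for illustration (the `Glyph` record, the tolerance values, a y coordinate that grows down the page); it is not how FineReader, OCRopus, or any other real tool works, and it ignores columns, reading order, and graphics entirely.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One positioned character (hypothetical structure for this sketch)."""
    char: str
    x: float      # left edge
    y: float      # baseline; assumed to increase down the page
    width: float


def group_lines(glyphs, y_tol=2.0):
    """Cluster glyphs into lines: baselines within y_tol of the first
    glyph placed on the current line are treated as the same line."""
    lines = []
    for g in sorted(glyphs, key=lambda g: (g.y, g.x)):
        if lines and abs(lines[-1][0].y - g.y) <= y_tol:
            lines[-1].append(g)
        else:
            lines.append([g])
    return lines


def line_to_words(line, gap_factor=0.3):
    """Split one line into words wherever the gap to the previous glyph
    exceeds gap_factor times that glyph's width."""
    words, current, prev = [], "", None
    for g in sorted(line, key=lambda g: g.x):
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            if gap > gap_factor * prev.width:
                words.append(current)
                current = ""
        current += g.char
        prev = g
    if current:
        words.append(current)
    return words


def reconstruct(glyphs):
    """Return a list of lines, each a list of word strings."""
    return [line_to_words(line) for line in group_lines(glyphs)]


if __name__ == "__main__":
    # Glyphs arrive in arbitrary order, as they might from a print driver.
    page = [
        Glyph("k", 5, 30, 5), Glyph("y", 15, 10, 5), Glyph("H", 0, 10, 5),
        Glyph("o", 0, 30, 5), Glyph("o", 20, 10, 5), Glyph("i", 5, 10.5, 5),
    ]
    print(reconstruct(page))
```

Even this much depends on hand-tuned thresholds, which hints at why joining text blocks into a reading flow and placing graphics relative to the text is the hard part.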

Flavors of ABBYY come with various scanners, and iirc lack the page-count limits. I actually sprung for the paid package a while back when I realized it would be helpful for me in doing some data scraping of old business related documents that I needed to extract a bunch of financial data from and that gaining access to reasonable OCR text across a large body of scanner-produced PDFs was something that could save me a bunch of time. I think it ran me about $100; subsequent updates have been free.

It sure as heck does NOT produce sensible, semantically structured PDFs, that leave the text blocks as selectable units and so forth, but it definitely does produce reasonably accurate OCR underlays of the imaged text, meaning the docs then are searchable and indexable.

IIRC, there is or was an online service that provided a similar functionality but specifically with regard to numeric tables, intended to serve as an adjunct to journalists needing to work with governmentally-produced or mandated data reporting, in which the standard report format as noted upthread tends toward PDF and is therefore resistant to provisioning numeric-data export.
posted by mwhybark at 12:01 PM on March 11, 2018

Here's an oldish (2013) article on journalistic data-scraping. The application I was recalling is Tabula.
posted by mwhybark at 12:04 PM on March 11, 2018 [1 favorite]

This thread has been archived and is closed to new comments