How a tax form kludge gifted the world 25 joyous years of PDF

HTML is the world's most common digital document file format. However, it's not the one everyone turns to when they want to create a precise document that looks, prints and behaves the same on any platform on any device. And it's hardly the format of choice for immediate offline reading, easy sharing or simple portability. For …

      1. VinceH

        Re: Ahem

        I thought that was XXX

    1. WolfFan Silver badge

      Re: Ahem

      I have Acrobat X Pro. So far it reads all PDFs I've thrown at it. I am in the process of de-Adobifying my systems; I will not be getting a newer version of Acrobat. Ever. My PDF needs are simple. I must be able to create PDFs from scans, including from scans from automatic document feeders on assorted scanners, copiers, and multifunction devices. I must be able to combine assorted elements into PDFs, including previously scanned files in PDF, PNG, JPG, TIFF, or GIF format. (You'd be amazed how many pretenders to the Acrobat throne can't handle GIFs...) I must be able to have basic OCR, which generates DOC, DOCX, or RTF files which don't have too many errors. (I can point the file to a dedicated OCR app, usually ReadIRIS, if necessary. The PDF just has to have good enough resolution.) In particular I must be able to generate PDFs from assorted other elements into a single PDF which has sufficient resolution to do OCR if necessary or to just be usable as is, depending on what we want to do with the assorted stuff. (This, of course, means that any image files MUST be scanned in or otherwise generated at a high enough resolution to be useful; anyone who hands us 100 dpi images gets laughed at. This in turn usually means that we get the original document and scan it in ourselves, most people simply scan in at far too low a resolution or use silly formats or both. GIF, I'm looking at you. And BMP. And the idiot who still uses PICT; yo! moron! Apple hasn't used PICT in nearly 20 years! Sigh.)

      1. Chris J

        Re: Ahem

        If you are comfortable with the command line, ImageMagick, OCRmyPDF and PDFTk will probably meet those needs. It looks like you could even script LibreOffice to do the Word file creation from the OCRd PDF: https://stackoverflow.com/questions/44342224/pdf-to-doc-docx-converter
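A rough sketch of how that command-line route could be scripted, assuming ImageMagick, PDFtk and OCRmyPDF are installed and on the PATH. The function name, the `work` directory and the output file names are made up for illustration. It only builds the command lines, so nothing runs until each list is handed to `subprocess.run(cmd, check=True)`.

```python
from pathlib import Path


def build_pipeline(scans, workdir="work"):
    """Return argument lists for a scan -> combined PDF -> OCR'd PDF pipeline.

    Nothing is executed here; the caller decides when to run each command.
    """
    work = Path(workdir)
    commands = []
    page_pdfs = []
    for scan in scans:
        page_pdf = work / (Path(scan).stem + ".pdf")
        # ImageMagick: wrap one scanned image in a single-page PDF.
        # (ImageMagick 7 calls the binary "magick"; "convert" is the older name.)
        commands.append(["convert", str(scan), str(page_pdf)])
        page_pdfs.append(str(page_pdf))
    combined = work / "combined.pdf"
    # PDFtk: concatenate the per-page PDFs into one file.
    commands.append(["pdftk", *page_pdfs, "cat", "output", str(combined)])
    # OCRmyPDF: add a searchable text layer to the combined file.
    commands.append(["ocrmypdf", str(combined), str(work / "searchable.pdf")])
    return commands


if __name__ == "__main__":
    for cmd in build_pipeline(["page1.png", "page2.tiff"]):
        print(" ".join(cmd))
```

The scans still need to be made at a sensible resolution first; no amount of scripting rescues a 100 dpi original.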

        1. Doctor Syntax Silver badge

          Re: Ahem

          "script LibreOffice to do the Word file creation from the OCRd PDF"

          Oh, the irony.

    2. fidodogbreath

      Re: Ahem

      > Next year it will probably be Acrobat DC 1880 just to keep us guessing

      One thing we won't have to guess about: if it's an Adobe cloud product, it will be eye-wateringly expensive.

  1. Christian Berger

    PDF can be cool... if you stay away from Adobe

    I mean look at computer magazines like PoC||GTFO, which are distributed as PDF files. Those use all the sensible features of PDF (so only a tiny fraction) and are usually polyglots. One issue even had the hash of itself printed on the title page.

    One should note that Postscript as a document format has severe security problems, as this is actual code by design. Plus the firmware update feature of many printers works via sending them Postscript. So it would be possible to have a Postscript file having multiple exploits inside owning both your computer _and_ your printer.

    1. phuzz Silver badge

      Re: PDF can be cool... if you stay away from Adobe

      Wait, wasn't it PoC||GTFO who produced a PDF, that also functioned as a valid NES ROM that would display the MD5 sum of itself?

      Oh, yes it was:

      Technical Note: This file, pocorgtfo14.pdf, is a polyglot valid as a Nintendo Entertainment System (NES) ROM cartridge, a PDF document, and a ZIP archive. We collided 9,824 MD5 block pairs to place the hash of this document on its front cover and the title screen of the NES game, but only 609 of them made it to the final release.

      That's damn impressive.

      (source)
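For anyone wondering how one file can be three things at once, here is a minimal sketch. It assumes the usual lenient signature rules: an iNES header at offset 0, a PDF header somewhere in the first 1024 bytes, and a ZIP end-of-central-directory record near the end. The function names are hypothetical, and it says nothing about the hard part, which is making the MD5 printed inside the file match the file itself.

```python
import hashlib


def detect_formats(data: bytes) -> set:
    """Report which format signatures appear where lenient parsers look for them."""
    found = set()
    # iNES ROM images start with the four bytes "NES" followed by 0x1A.
    if data[:4] == b"NES\x1a":
        found.add("nes")
    # Many PDF readers accept the "%PDF-" header anywhere in the first 1024 bytes.
    if b"%PDF-" in data[:1024]:
        found.add("pdf")
    # ZIP readers scan backwards from the end of the file for the
    # end-of-central-directory record (22 bytes plus up to 65535 of comment).
    if b"PK\x05\x06" in data[-65557:]:
        found.add("zip")
    return found


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of the data as a hex string."""
    return hashlib.md5(data).hexdigest()


if __name__ == "__main__":
    # A toy blob carrying all three signatures; it is not a working ROM, PDF or ZIP.
    toy = b"NES\x1a" + bytes(12) + b"%PDF-1.5\n" + b"filler" + b"PK\x05\x06" + bytes(18)
    print(sorted(detect_formats(toy)), md5_hex(toy))
```

The signatures do not overlap, which is why a single file can satisfy all three parsers at once.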

    2. Michael Wojcik Silver badge

      Re: PDF can be cool... if you stay away from Adobe

      PoC||GTFO is a wondrous thing (and let us not forget that two volumes are also available as lovely hardbound books). But using it as an example of the virtues of PDF is a bit like using the Bugatti Chiron to argue that cars are pretty fast. It's something of an edge case, surely.

  2. Anonymous Coward
    Anonymous Coward

    As with everything, it has its uses.

    Incidentally, a lot of "book" PDFs are really a collection of *images* of the pages. Not that it really matters. It's a total abomination either way.

  3. steelpillow Silver badge
    Boffin

    Print Definition File

    History is always written by the winners, so this particular history is about to die.

    PDF originally stood for "Print Definition File". Text was just glyphs, you couldn't even copy-paste from a PDF. But all you needed was a PostScript printer driver and its appearance was guaranteed.

    Technology moved on. Hi-res screens you could read an illustrated print page on appeared, people really did need to copy-paste content, the web and hyperlinks spread like wildfire, you could even embed multimedia in your online document but you couldn't print that off. At the same time, print publishing houses were rejecting PDF and demanding MS Word, specifically so they could edit the content. Nobody in the SOHO market, where the sales volumes are, gave a toss how the castrated print version looked any more. And by now the publishing and print houses already had their print needs catered for by last year's version.

    So Adobe tried to reinvent PDF as an electronic format for onscreen viewing. Embedded text, media and hyperlinking appeared, along with other gimmicks I barely knew and thankfully forget. The name "PDF" suddenly now stood for "Portable Document Format" and the Illustrator airbrush was whipped out on the old Print Definition File.

    But page resizing is a key to onscreen comfort and paper print pages just can't cut it. In the face of more flexible electronic formats such as the XHTML/XML based ePub, the days of PDF as anything more than a print definition format are severely numbered. And even there, the relentless march of page layout markup in HTML/CSS leaves a question mark.

    1. myhandler

      Re: Print Definition File

      Print publishing houses only wanted Word for author's text - everything else was in QuarkXPress.

      There's still an enormous requirement for fixed documents that always appear the same.

      1. steelpillow Silver badge

        Re: Print Definition File

        > Print publishing houses only wanted Word for author's text - everything else was in QuarkXPress.

        Yes indeed. Word > QuarkXPress > PDF it was.

        > There's still an enormous requirement for fixed documents that always appear the same.

        That's where HTML/CSS keeps growing new features (I am never sure whether this is an abomination or a stroke of genius).

        Still, I acknowledge it'll take a lot for Scan > OCR > PDF ever to go away.

  4. poohbear

    PDF uber HTML

    If I remember correctly, back in the 90s, Adobe was trying very hard to have PDF adopted as the standard for WWW. Thankfully HTML won out.

    1. Allonymous Coward
      Coffee/keyboard

      Re: PDF uber HTML

      I vaguely remember that too. IIRC back around Acrobat 4-ish I was reading the (PDF, natch) help file and it was going on about the advantages of PDF over HTML - formatting accuracy, multiple pages per file etc probably.

      Icon, because I was just a little bit sick on my keyboard remembering that.

  5. Anonymous Coward
    Anonymous Coward

    I had to copy data from pdf format into xml

    It was so gratifying when I finally convinced the subject expert that giving me the original of her report was a good idea because I could guarantee every number was copied correctly thanks to copying and pasting the lot to Excel, saving as csv and writing a simple little script.
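The "simple little script" is not shown, so this is only a guess at its shape: a sketch that turns the CSV saved from Excel into XML without retyping anything. The `report`, `row` and `field` element names are invented for illustration. Values are copied through as text, so no number is reformatted on the way.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_xml(csv_text: str) -> str:
    """Convert CSV text with a header row into a small XML document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    root = ET.Element("report")
    for row in reader:
        record = ET.SubElement(root, "row")
        for column, value in row.items():
            # Each cell becomes a <field> element carrying its column name as
            # an attribute; the cell text is copied verbatim.
            ET.SubElement(record, "field", name=column).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(csv_to_xml("item,amount\nwidgets,12.50\n"))
```

Putting the column name in an attribute rather than using it as the tag avoids trouble with headers that contain spaces or start with a digit.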

  6. Bavaria Blu
    Windows

    PDF will only be used for archiving documents in future

    If I think of what I find PDFs being used for these days, only the need to archive documents stops PDF from becoming obsolete. Cinema or theatre tickets, airline tickets are both going to be superseded by apps or a 2D barcode sent by email. Utility bills / bank statements, tax returns are all usefully downloadable in PDF. Some big corporate websites seem to offer a downloadable version of their web pages as PDF; perhaps that would have been useful 10 years ago, before browsers could save as PDF.

    I remember seeing an advert for a competitor product to Acrobat in a US MacWorld in about 1993 - there was a bloated and rotund man in a leotard (mocking an Acrobat) splattered in mud as if he had fallen down during a circus routine. Can anyone name it? Google doesn't help my nostalgic trip down memory lane!

  7. Giovani Tapini

    Ahh, the memories

    File format arbitrage was not limited to Adobe as my now addled brain starts to recall.

    In those days everywhere had loads of different and incompatible word processors and publishing suites.

  8. My other car WAS an IAV Stryker

    PDF Forms

    Drop the submit functions. Just allowing Acrobat Reader to fill out the form, save it and print/email it helps a lot of folks. With email or web-based upload*, there is ZERO need to send the form contents from within a PDF reader.

    * Yes, us readers of El Reg know that FTP not only exists but still has some uses. Most lusers never knew about FTP (thinking email was the only pre-web internet app that mattered; they don't know/care about telnet either except in movies) and many of the others think FTP is essentially dead.

  9. richardcox13
    Boffin

    Jobs Didn't Introduce Typography to Computers

    The Mac did in 1985 make limited typography available to those few who could afford a Mac. But typography had been available long before that and in cheaper form.

    For a start both troff and TeX seriously predate the Mac or Lisa.

    1. Primus Secundus Tertius

      Re: Jobs Didn't Introduce Typography to Computers

      The pdf format has replaced the dvi format used by Stone Age Latex, prior to the modern GUI interfaces that make Latex almost useable.

      1. Yet Another Anonymous coward Silver badge

        Re: Jobs Didn't Introduce Typography to Computers

        > The pdf format has replaced the dvi format used by Stone Age Latex

        So PDF can seamlessly render to any output at any resolution now?

        Arguably ps could replace dvi

    2. Anonymous Coward
      Anonymous Coward

      Re: Jobs Didn't Introduce Typography to Computers

      > The Mac did in 1985 make limited typography available to those few who could afford a Mac.

      I guess the author was referring to the NeXT's display system.

      1. Michael Wojcik Silver badge

        Re: Jobs Didn't Introduce Typography to Computers

        > I guess the author was referring to the NeXT's display system.

        He specifically mentions the LaserWriter in the same paragraph.

    3. Michael Wojcik Silver badge

      Re: Jobs Didn't Introduce Typography to Computers

      Yes. The line about "Steve Jobs introduced typography into computing" is complete rubbish.

      TeX was released in 1978, so a good 7 years before Steve Jobs and the LaserWriter. Even the first version of PostScript was only released in 1982.

      troff was just one descendant of CTSS RUNOFF, from 1964. Arguably RUNOFF didn't do much in the way of "typography", but it did lay out text. troff appeared a couple of years before TeX (circa 1976) and did quite a lot of typesetting.

      Perhaps the "Apple fanboi legend" and some paywall-protected Forbes page aren't ideal sources.

      1. Roland6 Silver badge

        Re: Jobs Didn't Introduce Typography to Computers

        >Yes. The line about "Steve Jobs introduced typography into computing" is complete rubbish.

        Whilst dismissing this we shouldn't overlook Xerox, with XNS, and the Xerox print language that John Warnock and Chuck Geschke worked on before founding Adobe...

  10. Anonymous Coward
    Anonymous Coward

    Not a great PDF user or expert....

    ....but what annoys me about the Adobe reader is that it's massive (above 100MB (IIRC)) - just to view a file!?

    I remember playing about with other readers with varying degrees of success, but being rather impressed with Samutra(?) only being a couple of MB.

    1. Primus Secundus Tertius

      Re: Not a great PDF user or expert....

      Try the free Nitro pdf, if you can find it. Google knows, you know.

    2. Yet Another Anonymous coward Silver badge

      Re: Not a great PDF user or expert....

      sumatra pdf reader perhaps ?

    3. Anonymous Coward
      Anonymous Coward

      Re: Not a great PDF user or expert....

      > but what annoys me about the Adobe reader is that it's massive (above 100MB (IIRC)) - just to view a file!?

      https://mozilla.github.io/pdf.js/

      Adobe were never known for code quality or efficiency.

  11. Michael H.F. Wilkinson Silver badge
    Pint

    I find PDF highly useful

    PDF generated using pdflatex works a treat for me. I can send my pdf slides, or poster design, or article anywhere and they look (and print) just fine. Try running a powerpoint presentation with equations in it on some PC at a conference. All too often some font is missing and the equations are completely messed up. Presenting a pdf file instead using any PDF reader works fine every time. I can add basic animations to my slides using the pdfanim package if I want to. LaTeX may not be the tool of choice for many, but for me it is ideal, and allowing export to PDF rather than DVI files means I can share my work with others easily.

    1. Primus Secundus Tertius

      Re: I find PDF highly useful

      I also find pdf very useful: in many voluntary organisations. If you send a Word document, someone with an old version tells you they can't handle .docx files. If you send a membership list as an Excel file (the traditional cheap database) they ask you for some other format, pdf or csv.

      Email is, of course, HTML based; but with Microsoft HTML thingies that help to reconstruct Word documents but are ignored by other software.

      1. Yet Another Anonymous coward Silver badge

        Re: I find PDF highly useful

        I sent a PDF version of a resume to one software company and was asked to send a word doc instead.

        Because they have a virus scanner for word docs ......

        1. Anonymous Coward
          Anonymous Coward

          Re: I find PDF highly useful

          Considering I just ran into a .pdf with malware embedded, I'd bounce pdf's here as well. One of my tripwires caught it, thankfully.

      2. frank ly

        Re: I find PDF highly useful

        "Email is, of course, HTML based; ..."

        I set my email client to block all HTML content. If you can't type it or attach it, I won't look at it.

        1. Ken Hagan Gold badge

          Re: I find PDF highly useful

          "I set my email client to block all HTML content. If you can't type it or attach it, I won't look at it."

          You realise that, technically, the HTML content *is* an attachment?
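To illustrate the point with Python's standard `email` package (a sketch; the subject and body text are made up): a typical "HTML email" is a multipart/alternative message in which the HTML travels as its own MIME part next to the plain-text one. Whether that part counts as an "attachment" is a matter of terminology, but structurally it sits where an attachment would.

```python
from email.message import EmailMessage


def build_message() -> EmailMessage:
    """Build a message with a plain-text body and an HTML alternative."""
    msg = EmailMessage()
    msg["Subject"] = "Minutes"
    msg.set_content("Plain text body.")
    # Adding an alternative converts the message to multipart/alternative
    # and appends the HTML as a second MIME part.
    msg.add_alternative("<p>HTML body.</p>", subtype="html")
    return msg


def part_types(msg: EmailMessage) -> list:
    """List the content type of each immediate MIME part."""
    return [part.get_content_type() for part in msg.iter_parts()]


if __name__ == "__main__":
    message = build_message()
    print(message.get_content_type(), part_types(message))
```

A client set to "block all HTML" is really choosing to render the text/plain part and ignore its sibling.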

      3. Doctor Syntax Silver badge
        Devil

        Re: I find PDF highly useful

        "Email is, of course, HTML based"

        You were doing well up to this point. HTML email is an abhorrent waste of bandwidth and spawn of the marketing brethren (see icon).

      4. Anonymous Coward
        Anonymous Coward

        Re: I find PDF highly useful

        > Email is, of course, HTML based;

        AAAAAARRRRRRGGGGGGGGHHHHHHHHHH!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

  12. Simone

    Open format !!

    Forget the forms and other nonsense. The basic PDF functionality is a standard and has not changed for years. I don't think people realise how useful this is for keeping archives of useful information. Reference information that does not go stale after a few months, or is useful for historical analysis. Having a page 17 that always contains the same information makes it a lot simpler to refer to something in a document in PDF format than in something that reflows.

    Adding a PDF output printer driver to a device allows (nearly) any software to create an electronic version of whatever it is showing that (usually) stays the same and can be opened years later. How many government organisations, and similar, publish 'stuff' in Word? How many can still be opened today? The UK National Archives have been investigating keeping old technology and software so some of their material can still be referenced in the future (I think I remember this).

    If we want things that can be read for a long time to come, is there anything as robust as PDF? Why do we need Adobe Reader, when there is a lot of other reader software available?

    1. Anonymous Coward
      Anonymous Coward

      Re: Open format !!

      > If we want things that can be read for a long time to come, is there anything as robust as PDF?

      Clay tablets.

    2. Michael Wojcik Silver badge

      Re: Open format !!

      > If we want things that can be read for a long time to come, is there anything as robust as PDF?

      Digital document preservation and archiving is a large and very active field. As with any such, the guesses non-specialists make about it are not likely to be particularly accurate or useful.

      There's a decent short introduction to the subject by David Anderson in the December 2015 issue (58.12) of CACM. Anderson mentions the #nodigitaldarkage discussions on Twitter that were sparked by Vint Cerf's "Digital Dark Age" arguments, and such projects as POCOS and E-ARK. Interested readers may also want to investigate historical efforts such as Acid-Free Bits or the long debates about human-readable versus machine-readable formats, and so forth.

  13. This post has been deleted by its author

  14. Anonymous Coward
    Anonymous Coward

    I have a soft spot for PDF

    It's down near the swamp by the willow tree...

  15. Anonymous Coward
    Anonymous Coward

    PDF is clunky.

    It really took off for me when it was added for free in several apps. It sucked, but it was usable.

    Before that, it sucked even harder. Once upon a time, a certain company that I worked for, lost the original Word .DOC with 5.000 pages worth of documentation, that needed to be revised. However, they had a PDF of it. No program would be able to get that PDF and save as a Word DOC... and they were planning to use a pool of 10 secretaries to type everything AGAIN.

    However, I had an OCR program that could read PDF files instead of a scanner as source and recognize them back to Word documents, despite losing some formatting, failing to recognize pictures... back in the early 2000's. (Omnipage could do that on version 12, I guess.)

    It took several minutes to recognize each page and save them to a single DOC file (the program created temporary BITMAP VERSIONS OF EACH PAGE, consuming several MB of hard drive scratch disk causing the computer to crap itself every 150 pages)... but it worked.

    5.000 word docs, 72 hours, and 40 memory leak crashes later, in a Pentium II with windows 98, the thing could be revised and copy-pasted. One secretary shed a tear when I showed her every page in its lonely word doc, with a few errors, but saving days of work for her.

    All of that because you can't convert a PDF back to anything else. Sure, using an OCR is cheating but you needed to get the job done. Screw copyrights.

    1. Yet Another Anonymous coward Silver badge

      Re: PDF is clunky.

      pdftotext - available on a unix machine near you.

      PDF isn't designed to convert back into text, in the same way that gcode isn't easy to turn back into a Solidworks model

    2. fidodogbreath

      Re: PDF is clunky.

      > All of that because you can't convert a PDF back to anything else.

      Well, you couldn't in 1998. Now there are tons of free and low-cost tools that can convert PDF content into numerous other formats.

      > Screw copyrights.

      Including the copyrights that protected the product for which you had 5K pages of documentation? Or just copyrights that belong to someone else?

    3. e_is_real_i_isnt

      Re: PDF is clunky.

      You are right - Omnipage has always sucked. Blaming PDF for the problem is like blaming cars for crashing. The "F" is for format - a format that allows a huge flexibility to the user of same, including making it difficult to extract information if they so choose. If your organization had a tough time extracting the info, then it's your organization that was at fault for making extraction difficult.

      1. Doctor Syntax Silver badge

        Re: PDF is clunky.

        "If your organization had a tough time extracting the info, then it's your organization that was at fault for making extraction difficult."

        They may have done it because they were frightened of someone extracting the text from it.

    4. Doctor Syntax Silver badge

      Re: PDF is clunky.

      "Once upon a time, a certain company that I worked for, lost the original Word .DOC with 5.000 pages worth of documentation"

      No backups? Careless. And why put 5,000 pages into a single document?

      1. doublelayer Silver badge

        Re: PDF is clunky.

        Whatever reason they may have had for losing the document, their mistakes were not the point. The point is that PDF files, although they are lauded as being useful on any platform, frequently lack the feature of making their content available if you don't want to just look at them. Some PDFs have text that can be extracted, but the number that don't is higher than the number that do. If I want to use the contents for some reason, be that copying and pasting code, quoting accurately, or sending data over something where text is more convenient*, PDFs frequently won't work. Sometimes, this is done for security, because I suppose it would be harder to violate copyright with something where copy and paste are made impossible, but usually it's down to someone messing something up or being a control freak because I should view this document in the font they like. With any text-based format, you have the freedom to make it useful by converting it to any format that would work well. The greatest risk is that it won't look as nice on the other end. With a PDF, the message seems to be that you are not allowed to do anything that the original document-writer didn't think of allowing you to do.

        *Recently, I wanted to give someone some of the documentation for a system they were using. The only problem was that they were on the other side of an e-mail exchange. I can't send the PDF file because it's 48 MB and there's a limit on attachment size. This file was sent to me, so I don't have a link to it online. I could post it somewhere and let them download it, sure, but copying and pasting the ten-item instruction list would really have been more convenient.

        1. Doctor Syntax Silver badge

          Re: PDF is clunky.

          "Sometimes, this is done for security, because I suppose it would be harder to violate copyright with something where copy and paste are made impossible, but usually it's down to someone messing something up or being a control freak because I should view this document in the font they like."

          In general I find that PDFs generated from a word processing document copy and paste just fine. If they don't then it's most likely a deliberate act. But being able to control the presentation in this way is the purpose of PDF. If people have taken advantage of that it's a little unreasonable to blame the format for that. They didn't want you to take the text out. That may rebound on them later but if so it's a problem of their own making.

          PDFs generated from a scan are a different kettle of snakes. At best they've been OCRed into something very approximately resembling the text. At worst you have to hope you can find an OCR program that can deal with the font and the condition of the document that was scanned. If the original was a printed book you have to hope it was early in the print run.
