Gov.UK to make its lovely HTML exportable as parlous PDFs • The Register Forums

Tuesday 17th July 2018 07:19 GMT AlgernonFlowers4

Print to PDF

On the web page, press Ctrl-P to print, select PDF as the target printer and save as a file (.pdf).

10 2 Reply
1. Tuesday 17th July 2018 07:24 GMT cbars
  
  Re: Print to PDF
  
  I think the key word is accessible. I'm not an expert but I imagine you need to structure the metadata in a particular format to make screen readers understand it; perhaps there is a dev around with a more technically complete answer....
  
  5 0 Reply
  1. Tuesday 17th July 2018 07:28 GMT deive
    
    Re: Print to PDF
    
    Did this almost 15 years ago now. It is dumb, for the reasons stated; however, I used https://xmlgraphics.apache.org/fop/ and it wasn't too hard, if your source is pretty well standardised....
    
    3 1 Reply
2. Tuesday 17th July 2018 07:36 GMT A Non e-mouse
  
  Re: Print to PDF
  
  The problem with Ctrl-P is that many smart-arsed web designers implement different style sheets for print and screen because they think they know better. (Our in-house templates, for example, include the URL to links when printing web pages out.)
  
  The most reliable way I've found to print out a web page is, unfortunately, to screen shot it.
  
  14 2 Reply
3. Tuesday 17th July 2018 07:54 GMT stephanh
  
  Re: Print to PDF
  
  You can even automate this with Chrome, since it can be invoked from the command line.
  
  chrome --headless --disable-gpu --print-to-pdf https://www.theregister.co.uk/
  
  See also: https://developers.google.com/web/updates/2017/04/headless-chrome#create_a_pdf_dom
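For anyone wanting to script the command above, here is a rough Python sketch. It assumes a Chrome or Chromium binary called `chrome` is on the PATH and that the `--print-to-pdf=<path>` form of the flag is available in your build; the function names `build_command` and `print_to_pdf` are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(browser: str, url: str, output: Path) -> list[str]:
    """Assemble the headless print-to-PDF command line (same flags as above)."""
    return [
        browser,
        "--headless",
        "--disable-gpu",
        f"--print-to-pdf={output}",  # assumption: flag accepts an output path
        url,
    ]


def print_to_pdf(url: str, output: Path, browser: str = "chrome") -> Path:
    """Run the browser once and return the path of the PDF it was asked to write."""
    executable = shutil.which(browser)
    if executable is None:
        raise FileNotFoundError(f"no '{browser}' binary found on PATH")
    # check=True raises CalledProcessError if the browser exits with an error.
    subprocess.run(build_command(executable, url, output), check=True, timeout=120)
    return output


# Example call (not run here because it needs a browser and a network):
# print_to_pdf("https://www.theregister.co.uk/", Path("register.pdf").resolve())
```

Keeping the command assembly separate from the subprocess call makes it easy to check the flags without launching a browser. None of this says anything about how accessible the resulting PDF is.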
  
  0 0 Reply
4. Tuesday 17th July 2018 08:01 GMT Anonymous Coward
  
  Re: Print to PDF
  
  The issue isn't the generation of the PDF file. It's that they need to format the HTML in particular ways so that the generated PDF is accessible, functional, etc.
  
  Also, they need to convince the coloured pencil department to stop producing PDF files themselves as they're universally shit for accessibility. Produce HTML that can be converted and everyone will be happy.
  
  9 1 Reply
  1. Tuesday 17th July 2018 08:47 GMT stephanh
    
    Re: Print to PDF
    
    "It's that they need to format the HTML in particular ways so that the generated PDF is accessible, functional, etc."
    
    Actually Chrome's print-to-PDF is pretty good at this, frankly. The resulting PDF document is fully searchable, text can be selected, etc.. I presume this means a screen reader would be effective (since clearly the original text is preserved as text). Hyperlinks in the original HTML become hyperlinks in the PDF.
    
    If there is something missing, it would make sense to contribute it to the open-source Chromium codebase rather than invent a wheel with more corners.
    
    Of course, if your original HTML was sh*t from an accessibility POV to begin with, print-to-PDF is unlikely to improve upon the situation.
    
    1 2 Reply
5. Tuesday 17th July 2018 08:35 GMT Anonymous Coward
  
  Re: Print to PDF
  
  It's not as simple as a virtual PDF driver (I use one every day to save Beeb articles for the kids). Trouble is, they print EVERYTHING, when all you want/need is the specific text. Not the links on the left, not a photo SPLIT across two pages (as is the norm with PDF printing). Not that they want people to print pix! :)
  
  6 0 Reply
  1. Tuesday 17th July 2018 10:29 GMT 's water music
    
    Re: Print to PDF
    
    It's not as simple as a virtual PDF driver (I use one every day to save Beeb articles for the kids). Trouble is, they print EVERYTHING, when all you want/need is the specific text. Not the links on the left, not a photo SPLIT across two pages (as is the norm with PDF printing). Not that they want people to print pix! :)
    
    I find the tools at printwhatyoulike.com quite helpful for this. You can easily mark regions/objects to exclude. I have used the Chrome addin and the JS bookmarklet
    
    0 0 Reply
    1. Tuesday 17th July 2018 21:24 GMT Roland6
      
      Re: Print to PDF
      
      >I find the tools at printwhatyoulike.com quite helpful for this.
      
      I've used the extension/add-in Print Edit WE (Chrome/Firefox) for this job.
      
      However, in using such tools, you do get an appreciation of just how varied HTML is and the ease, or not in some cases, with which a webpage can be reduced to its substantive content. I'm sure this variable quality of HTML plays havoc with accessibility tools.
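To illustrate what "reducing a page to its substantive content" involves, here is a minimal Python sketch using only the standard library. It assumes the page marks its content with `<main>` or `<article>` and its clutter with `<nav>`, `<aside>`, `<header>` or `<footer>`, which plenty of real pages do not; it is not how printwhatyoulike.com or Print Edit WE work, and the names `MainTextExtractor` and `extract_main_text` are invented for the example.

```python
from html.parser import HTMLParser

KEEP = {"main", "article"}  # assumed containers for the real content
SKIP = {"nav", "aside", "header", "footer", "script", "style"}  # assumed clutter


class MainTextExtractor(HTMLParser):
    """Collect text that sits inside a KEEP element and outside any SKIP element."""

    def __init__(self) -> None:
        super().__init__()
        self.keep_depth = 0
        self.skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in KEEP:
            self.keep_depth += 1
        elif tag in SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in KEEP and self.keep_depth:
            self.keep_depth -= 1
        elif tag in SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.keep_depth and not self.skip_depth:
            self.chunks.append(text)


def extract_main_text(html: str) -> str:
    """Return the kept text chunks, one per line."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

On a page with no `<main>` or `<article>` at all this returns an empty string, which is rather the point: the approach is only as good as the markup it is fed.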
      
      0 1 Reply
6. Tuesday 17th July 2018 09:18 GMT Tom 7
  
  Re: Print to PDF
  
  Lose the printed PDF. Oh! Look, it's still on the fucking internet; I can find it there. Print and repeat.
  
  Pointless waste of trees/computer space.
  
  Pointless Document Format!
  
  1 9 Reply
  1. Tuesday 17th July 2018 10:45 GMT Doctor Syntax
    
    Re: Print to PDF
    
    "Oh! Look, it's still on the fucking internet; I can find it there."
    
    Except when it isn't, not even on archive.org, and that's assuming I only need to be able to see it when I have an internet connection.
    
    4 0 Reply
    1. Thursday 19th July 2018 09:05 GMT Anonymous Coward
      
      Re: Print to PDF
      
      And, as someone who's been clearing up the estate of a recently deceased relative, I can tell you PDFs have their place. That's one job that would be a LOT easier if the mountain of paperwork had been scanned and OCR'd to PDFs rather than being randomly shoved in boxes over 30 years.
      
      Re: the Internet connection, we're not there yet. When access is 100% ubiquitous, cloud services manage to run years with zero downtime, and companies don't bail out of providing their services with almost no notice.... then offline PDFs and paper will have had their day. Today, the reality is that you just cannot rely on accessing online information when you need to.
      
      1 0 Reply
Tuesday 17th July 2018 07:33 GMT A Non e-mouse

Been there, done that

One of XSLT's selling points was that you could take some XML structured data and, with the relevant XSLT files, transform it into HTML, PDF (via Formatting Objects), RTF, etc.

The reality of it is a tad bit harder, though.

6 0 Reply
1. Tuesday 17th July 2018 10:40 GMT Doctor Syntax
  
  Re: Been there, done that
  
  "take some XML structured data"
  
  If only HTML were XML structured data.
  
  1 2 Reply
  1. Tuesday 17th July 2018 14:52 GMT stephanh
    
    Re: Been there, done that
    
    "If only HTML were XML structured data."
    
    That would have been nice. But since XHTML went nowhere, it isn't.
    
    1 0 Reply
  2. Tuesday 17th July 2018 15:05 GMT Arthur the cat
    
    Re: Been there, done that
    
    If only HTML were XML structured data.
    
    If only XML data was Lisp S-expressions.
    
    1 0 Reply
  3. Tuesday 17th July 2018 15:54 GMT Destroy All Monsters
    
    Re: Been there, done that
    
    If only HTML were XML structured data.
    
    Nothing that a deep-learning neural network can't fix.
    
    1 0 Reply
Tuesday 17th July 2018 07:46 GMT Aladdin Sane

"But we've always done it this way"

When "This way" happens to be fucking shit and no longer fit for purpose.

6 1 Reply
Tuesday 17th July 2018 07:56 GMT nuked

'Long term roadmap' aka the list of things we will never do because by the time we find money for it the next genius would have taken over with a new strategic platform idea.

9 0 Reply
This post has been deleted by its author
1. Tuesday 17th July 2018 08:29 GMT Anonymous Coward
  
  Re: “This work is downstream of some higher priorities, but is on the long-term roadmap.”
  
  I think the Old Man and the Sea is still in copyright. If so, there are better ways of getting it than that link.
  
  1 2 Reply
  1. This post has been deleted by its author
2. Tuesday 17th July 2018 10:42 GMT Doctor Syntax
  
  Re: “This work is downstream of some higher priorities, but is on the long-term roadmap.”
  
  "Aren't these people meant to use plain English?"
  
  Whatever gave you that idea? This is GDS. Whatever they're doing it has to be buried under the most opaque mounds of gibberish to stop anyone finding out.
  
  3 0 Reply
  1. Tuesday 17th July 2018 15:54 GMT Anonymous Coward
    
    Re: “This work is downstream of some higher priorities, but is on the long-term roadmap.”
    
    Probably something from "Heart of Darkness". It would make sense.
    
    1 0 Reply
Tuesday 17th July 2018 08:08 GMT steelpillow

History repeats itself

Are you sure this is .gov.uk and not Wikipedia? The same leisurely approach to customised HTML-to-PDF conversion is under way in both, and Wikipedia have made a .gov.uk-style ballsup of their first two stabs at it (wrt stephanh's comment, round two was a fruitless attempt to make headless Chrome fit for purpose) and in desperation have outsourced Round Three to their book publisher. It's so the same story in different clothing.

HTML5 sucks in more ways than most folks, including .gov.uk, realise >cough< offline >cough< javascript >cough< information layout >cough< and pdf, done properly, has a lot going for it in its own niche. But I have to ask, if dual-media publishing from a single source is the aim, then why fuss about accessibility of the pdf when you have to flippin' access the html edition in the first place in order to get to it?

1 1 Reply
1. Tuesday 17th July 2018 09:18 GMT Anonymous Coward
  
  Re: History repeats itself
  
  "why fuss about accessibility of the pdf when you have to flippin' access the html edition in the first place in order to get to it?"
  
  Probably related to the Public Sector Bodies (Websites and Mobile Applications) Accessibility Regulations 2018.
  
  4 0 Reply
Tuesday 17th July 2018 08:25 GMT Andrew Yeomans

Multi-page documents

The other advantage of a *good* HTML to PDF system is the ability to select multiple web pages, and combine them into a single PDF document, with sections in the correct order.

For example, try to print the NCSC Cloud Security Principles starting from https://www.ncsc.gov.uk/index/topic/151. Similarly try printing appropriate employment and tax pages. The next trick is to make it print double-sided.

I have - once - come across a system which would let you select the desired sections of a larger set of documents, then it would generate a single PDF of them all, in a suitable format for printing.

1 0 Reply
1. Tuesday 17th July 2018 21:08 GMT Roland6
  
  Re: Multi-page documents
  
  I have - once - come across a system which would let you select the desired sections of a larger set of documents, then it would generate a single PDF of them all, in a suitable format for printing.
  
  Expert PDF from Avanquest used to have this feature. There were times when it was really useful, for example taking a printout of a shopping basket (full list of part numbers and descriptions of content and prices) and then a printout of the final checked out order (basic item details and pricing).
  
  0 1 Reply
Tuesday 17th July 2018 08:36 GMT Peter Prof Fox

Stand alone, reliable documents

A PDF is self-contained.

An HTML document has css links and scripts (and trackers)

A PDF can be reliably printed and passed around. (A lot of people are not digitally agile)

An HTML document requires a computer/device, browser and the knowledge to use it. A hard copy to get signatures on is pot-luck.

A PDF can be reliably stored as reference. I have it. I can archive it and index it.

An HTML user manual (say) is moved, deleted, or updated to reflect model 2 features but not my model 1

What they should be doing is banning Word documents.

19 4 Reply
1. Tuesday 17th July 2018 09:21 GMT Tom 7
  
  Re: Stand alone, reliable documents
  
  A PDF is out of date the minute it is made. The online HTML should be up to date, should be legible on phone, tablet or PC and does not self shuffle in the print tray or drink coffee.
  
  2 10 Reply
  1. Tuesday 17th July 2018 10:49 GMT Doctor Syntax
    
    Re: Stand alone, reliable documents
    
    "A PDF is out of date the minute it is made."
    
    A fact which is extremely problematic for those in govt. who might have a shifting relationship with what they said a minute ago and very handy for those who want to hold them to account.
    
    TL;DR? Permanence has value.
    
    3 0 Reply
2. Tuesday 17th July 2018 09:21 GMT Pascal Monett
  
  Re: Stand alone, reliable documents
  
  A pdf also requires a computer/device, the knowledge to install a PDF reader and the ability to use it.
  
  If you're talking about printing then I don't care if you printed from a web page, a Word document or a PDF - it's printed and that's the end of the problem.
  
  0 3 Reply
  1. Tuesday 17th July 2018 09:41 GMT ibmalone
    
    Re: Stand alone, reliable documents
    
    A pdf also requires a computer/device, the knowledge to install a PDF reader and the ability to use it.
    
    Because I so often read HTML over the air and straight into my brain without using any device or software.
    
    I don't know any current consumer OS that doesn't have a PDF reader. Windows - Edge does it. Linux - KDE has Okular, Gnome has Evince. Mac OSX - Preview. Android has Google's pdf reader. Both Chrome and Firefox will have a stab at it on desktop OSes.
    
    In practice, PDF is handy for archiving documents. HTML doesn't work as well because in most cases it requires storing resources alongside it (though yes, you can base64 encode images and stuff them in), and how browsers interpret it changes over time, while display of PDF is more stable and there is the PDF/A standard. Whether the resulting document is accessible / searchable largely depends on the source document, if it was structured text (LaTeX, markdown, office documents, XML, and yes, even HTML) with a sensible interpreter then the resulting PDF can be accessible. If it was scanned pages of an article from 1950 then no, but the HTML version isn't going to be either.
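The "base64 encode images and stuff them in" trick can be sketched in a few lines of Python. This is a minimal illustration, assuming local image files referenced by plain double-quoted `src` attributes; the regular expression is deliberately crude, and `to_data_uri` and `inline_images` are names invented for the example.

```python
import base64
import mimetypes
import re
from pathlib import Path

# Crude pattern: an <img> tag with a double-quoted src attribute.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def to_data_uri(path: Path) -> str:
    """Encode a local file as a data: URI, guessing the MIME type from its name."""
    mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    payload = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"


def inline_images(html: str, base_dir: Path) -> str:
    """Replace local image references with data: URIs so the HTML stands alone."""

    def replace(match: re.Match) -> str:
        src = match.group(2)
        target = base_dir / src
        # Leave remote, already-inlined, or missing images untouched.
        if "://" in src or src.startswith("data:") or not target.is_file():
            return match.group(0)
        return match.group(1) + to_data_uri(target) + match.group(3)

    return IMG_SRC.sub(replace, html)
```

Stylesheets, fonts and scripts would need the same treatment before the file is truly self-contained, and none of it addresses the point about browsers rendering the same HTML differently over time.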
    
    3 1 Reply
    1. Tuesday 17th July 2018 11:08 GMT phuzz
      
      Re: Stand alone, reliable documents
      
      I don't know any current consumer OS that doesn't have a PDF reader. Windows - Edge does it. Linux - KDE has Okular, Gnome has Evince. Mac OSX - Preview. Android has Google's pdf reader. Both Chrome and Firefox will have a stab at it on desktop OSes
      
      Half of those you list are actual web browsers. You know, software designed originally to parse HTML?
      
      At this point we can safely say that HTML and PDF are (roughly) about as easily accessible on any electronic device as each other. Not least because most of the software for reading HTML will also display a PDF and vice-versa.
      
      Mind you, basic HTML is at least somewhat human readable in a text viewer, which isn't something you can say about PDF.
      
      1 1 Reply
      1. Tuesday 17th July 2018 11:38 GMT ibmalone
        
        Re: Stand alone, reliable documents
        
        Hardly the point really. I was pointing out that the idea it's hard to read PDF belongs back in 1995. And yes some of them are web browsers (making "about half of them" if you include firefox and chrome which I tacked on as additional examples of software you almost certainly already have).
        
        You'll find those web browsers also display images, video, audio, plain text and will have a stab at displaying XML. Is HTML a substitute for all of those? Will the available version of those browsers display the same HTML document the same way next year? If you're displaying plain text why not just use plain text? Or markdown? It turns out different tools have different uses.
        
        1 0 Reply
Tuesday 17th July 2018 09:13 GMT xyz

The way government works is this....

Oxbridge types (of the blue sky persuasion) do not use computers; they want a hard copy (emails, info etc) from their "girls" and still give dictation.

SPADs and other assorted climber-upers only believe in something if it's in Excel.

Managers tell their "girls" to type stuff into Word, save as pdf and slap it on their intranet page.

Only "girls" (and other data entry types) use "working class" html.

You can bet that behind this "necessity" is some crusty who wants his "girl" to send him an email with a pdf attachment so he can print it off.

To give you an idea of the arseness available... one top dog was on hols in France and was viewing a 320 page document on the UN web site, he wanted a copy so he phoned his "girl" in London and told her to print it off and fax it to him. I am not joking.

2 5 Reply
1. Tuesday 17th July 2018 10:52 GMT Doctor Syntax
  
  Re: The way government works is this....
  
  SPADs and other assorted climber-upers only believe in something if it's in ~~Excel~~ Powerpoint.
  
  FTFY
  
  1 0 Reply
Tuesday 17th July 2018 10:05 GMT tiggity

XML

Store underlying data as XML - nice and simple content with some basic description

Run appropriate transform(s) to give HTML (the descriptive elements in the XML give appropriate HTML)

Run different transform(s) to give PDF.

Things like XSL-FO are your friend

It works nicely (did some proof of concept stuff on this ages ago, back when mobile devices had weedy screens: the same content gave desktop HTML, mobile HTML and PDF by running the appropriate XSL)
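A rough sketch of the single-source idea follows. Python's standard library has no XSLT processor, so the two transforms here are hand-written with `xml.etree.ElementTree` instead of XSL stylesheets; that is a stand-in for the approach described, not the approach itself. The `<doc>`, `<title>` and `<para>` element names are invented for the example, and the XSL-FO output would still need a formatter such as Apache FOP to become a PDF.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)


def fo(tag: str) -> str:
    """Qualify a tag name with the XSL-FO namespace."""
    return f"{{{FO_NS}}}{tag}"


def to_html(doc: ET.Element) -> str:
    """Transform the source document into a bare-bones HTML page."""
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    ET.SubElement(body, "h1").text = doc.findtext("title", default="")
    for para in doc.iter("para"):
        ET.SubElement(body, "p").text = para.text
    return ET.tostring(html, encoding="unicode", method="html")


def to_fo(doc: ET.Element) -> str:
    """Transform the same source into XSL-FO for a separate PDF formatter."""
    root = ET.Element(fo("root"))
    masters = ET.SubElement(root, fo("layout-master-set"))
    page = ET.SubElement(
        masters,
        fo("simple-page-master"),
        {"master-name": "A4", "page-height": "297mm",
         "page-width": "210mm", "margin": "20mm"},
    )
    ET.SubElement(page, fo("region-body"))
    sequence = ET.SubElement(root, fo("page-sequence"), {"master-reference": "A4"})
    flow = ET.SubElement(sequence, fo("flow"), {"flow-name": "xsl-region-body"})
    heading = ET.SubElement(flow, fo("block"), {"font-size": "18pt", "font-weight": "bold"})
    heading.text = doc.findtext("title", default="")
    for para in doc.iter("para"):
        ET.SubElement(flow, fo("block"), {"space-before": "6pt"}).text = para.text
    return ET.tostring(root, encoding="unicode")


SOURCE = "<doc><title>Guidance</title><para>First.</para><para>Second.</para></doc>"
document = ET.fromstring(SOURCE)
html_output = to_html(document)
fo_output = to_fo(document)
```

The content lives in one place and each output format gets its own transform, so a mobile HTML variant would just be a third function (or a third stylesheet, in the XSL version).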

2 0 Reply
Tuesday 17th July 2018 10:11 GMT Anonymous Coward

PDF is the work of the devil...

So a natural fit for the civil service.

1 1 Reply
1. Tuesday 17th July 2018 15:13 GMT Arthur the cat
  
  Re: PDF is the work of the devil...
  
  So a natural fit for the civil service.
  
  To get theological, the Civil Service are so "on the one hand, on the other hand …" the Devil would reject them and they'd end up in the Vestibule of Hell, chasing deviceless banners and being stung by hornets.
  
  I want a Dante icon.
  
  1 0 Reply
Tuesday 17th July 2018 10:20 GMT Cem Ayin

If your only tool is a hammer...

Both formats have their strengths and weaknesses; wise guys choose whatever suits the job at hand best.

Yes, PDF /is/ print-oriented - and that's a major advantage for publishing long texts that require attentive reading. A document set in a reader-friendly font with proper paragraph filling and hyphenation is so much easier on the eyes; it lets your mind focus on the content rather than the technicalities of a poor text rendering (which is the norm in HTML). I speak from experience, I do read a lot.

And I'm not alone. I work in an academic setting and at our lab, the computing devices most in demand ("high demand" being defined as "users scream /immediately/ when it fails") are 1. the personal laptop and 2. the workgroup printer - and that's for a reason. /Nobody/ would want to read a scientific paper as HTML on the screen, with the poor rendering constantly distracting the mind from the problem at hand. (Some folks do use rotating monitors for reading papers, but it is PDF they read on the screen in portrait format.)

And I haven't even mentioned the problem of embedded figures yet: good luck with copying the full content of a HTML page (skipping unneeded navigation code) for offline reading...

That is to say, there are use cases where HTML is simply no go.

The optimal use case for HTML (plus JS where that really makes sense) on the other hand is short, frequently changing or short-lived documents that no one would want to read offline or in print; or documents of a highly interactive nature; or reading the same document on a wide range of display sizes (making allowances for the text layout and rendering) - that's what it was designed for after all.

Bottom line: Use a hammer for nails and a screwdriver for screws. Heated ideological debates as to whether screws are outdated and should universally be replaced with nails are frankly daft.

(And yes, both formats have rather more than their fair share of warts. A text format that is versatile enough to cover both use cases would be really nice to have. Good luck with developing something of the kind *and have it widely accepted by your audience*...)

7 0 Reply
1. Thursday 19th July 2018 09:17 GMT Anonymous Coward
  
  Re: If your only tool is a hammer...
  
  15th Standard: https://xkcd.com/927/ :)
  
  0 0 Reply
Tuesday 17th July 2018 10:48 GMT Nick Kew

Not reinventing the wheel

As many commentards have noted, this is a frequently-solved problem. A decent minority of historic HTML/PDF solutions take the accessibility issues seriously.

I expect what the gov.uk chap means is that they'll take some such thing - probably XML-based - and integrate it into their own publishing.

That is, unless and until such a sensible goal gets lost under a weight of empire-builders and PHBs.

1 0 Reply
1. Tuesday 17th July 2018 10:55 GMT Doctor Syntax
  
  Re: Not reinventing the wheel
  
  "That is, unless and until such a sensible goal gets lost under a weight of empire-builders and PHBs."
  
  This is GDS. Of course that will happen.
  
  1 0 Reply
Tuesday 17th July 2018 12:17 GMT SVV

What's the betting

that all the senior civil servants at the Government Digital Service require every document to be printed out for them on paper? Rather than reading them digitally?

1 0 Reply
Tuesday 17th July 2018 12:23 GMT fpx

Offline Reading

I frequently print web pages to PDF for storage and offline reading. In my experience it's not the PDF that goes out of date but the online content disappears or gets modified. The "1984" experience where information is centrally controlled and modified as necessary is easier to pull off every day.

The article says that "most [PDFs] come into existence because designers want total control." Unfortunately that's very much the same for Web content, where every element is arranged down to the pixel, images are deferred-loaded so that they can track when and how far down you scroll, and random ads appear all over the place as you move around.

2 0 Reply
Tuesday 17th July 2018 18:37 GMT Registered Register Registrant

HTML sucks to read

HTML is practical, quick, and dirty, but as Cem Ayin suggests, it sucks to read. A DVI file typeset with TeX in 1982 still reads better on a desktop monitor than any Wikipedia or gov.uk page today, 36 years later, and any serious reader would prefer a well-typeset PDF on the screen. Serious reading is not some artifact of "ingrained print culture".

3 2 Reply
Wednesday 25th July 2018 01:12 GMT David 164

Let's be honest, most of it is because the people writing those PDFs are forced to make them public by law or convention. If it was their choice they would be printed off and locked in some filing cabinet where members of the public or the media or even MPs would have to fight through a pile of bureaucracy to get to them.

0 0 Reply