Internet pioneer Vint Cerf predicts the future, fears Word-DOCALYPSE

Big data may turn out to be a big mystery to future generations, godfather of the internet Vint Cerf has warned. The pioneering computer scientist, who helped design the TCP/IP protocol (along with Robert Kahn) before going on to work as chief internet evangelist for Google, has claimed that spreadsheets, documents and various …

COMMENTS

This topic is closed for new posts.

  1. hplasm
    Meh

    Vint says:-

    "I'm not blaming Microsoft..."

    Why the hell not? It's their fault.

    1. Don Jefe
      Meh

      Re: Vint says:-

      Be realistic, the MS Office formats won a long running battle. If they hadn't won, documents would still be in someone's proprietary format.

      1. JetSetJim

        Re: Vint says:-

        It would be nice/useful if Microsoft provided a function within each Office application to "update all files in selected folder to the latest release of the application", with a tickbox for searching sub-folders and a tickbox for preserving the old document.

        I'm sure there are a lot of ways this can break - e.g. I seem to recall that the implementation of pivot tables in Excel changed in moving to Office 2003, so it might be nice to know about this sort of thing...

        Perhaps a first pass to do a scan to see which documents are there, which have simple conversions and which will break. Then for the ones that break, a dialogue that walks you through what you will lose if the conversion process continues, allowing you to decide if you want to convert or not.
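
The "first pass" scan described above could start as little more than an inventory. What follows is a minimal Python sketch, assuming the only goal is to list Office-looking files and note which container each one uses. The labels, the extension list and the function names are made up for illustration, and it makes no attempt to predict which conversions would break.

```python
import os

# Well-known container signatures; the labels returned below are this sketch's own.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file (binary .doc/.xls/.ppt)
ZIP_MAGIC = b"PK\x03\x04"                       # ZIP container (.docx/.xlsx/.pptx, also ODF)

# Hypothetical list of extensions worth looking at.
OFFICE_EXTENSIONS = {".doc", ".xls", ".ppt", ".docx", ".xlsx", ".pptx"}


def classify(path):
    """Return a rough container label based on the file's first bytes."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    if head.startswith(OLE2_MAGIC):
        return "ole2-binary"
    if head.startswith(ZIP_MAGIC):
        return "zip-xml"
    return "unknown"


def inventory(folder, recurse=True):
    """Map each Office-looking file under folder to its container label."""
    found = {}
    for root, dirs, files in os.walk(folder):
        if not recurse:
            dirs.clear()  # the "search sub-folders" tickbox, unticked
        for name in files:
            if os.path.splitext(name)[1].lower() in OFFICE_EXTENSIONS:
                path = os.path.join(root, name)
                found[path] = classify(path)
    return found
```

Anything that comes back as "unknown" is a candidate for the "walk me through what I will lose" dialogue; the actual conversion would still have to be done by the application itself.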

        1. Roger Greenwood

          Re: Vint says:-

          "update all files"

          e.g. http://hub.sd63.bc.ca/pluginfile.php/98/mod_resource/content/3/Batch%20conversion%20of%20word%20files.pdf

          You did want .odt didn't you?

          1. JetSetJim

            Re: Vint says:-

            Looks lovely - now get MS to write one :)

          2. Anonymous Coward
            Anonymous Coward

            Re: Vint says:- @Roger Greenwood

            "You did want .odt didn't you?"

            -- Since he referred to Microsoft and Office rather than Libre/OpenOffice, it seems reasonable to presume not. Not really a correct response to his wish there, I'm afraid. Regardless of whether you think he should switch to LO/OO.

        2. JLV
          Happy

          ... Then for the ones that break...

          Hah!

          My Word 2007 constantly nags me about compatibility and losing information if I have the audacity to save to .doc rather than .docx. Ditto Excel (my Python scripts don't handle xlsx, which is why I do xls).

          Despite their specific assurance that "3 items are being de-formatted, please see help for details", I have never ever found a way to list what is being broken. In fact, considering that many of those Excels come out of report generators that start them out as xls, I rather doubt anything serious is being lost.

          And you are asking for a universal converter, with running conversion details? From Steve "we listen to our customers" Ballmer?

      2. John Smith 19 Gold badge
        Unhappy

        Re: Vint says:-

        "Be realistic, the MS Office formats won a long running battle. If they hadn't won, documents would still be in someone's proprietary format."

        Instead they are in Microsoft's proprietary format.

        Which keeps changing.

        How well is it documented now?

      3. hplasm
        Meh

        Re: Vint says:-

        If they hadn't won, documents would still be in someone ELSE'S proprietary format.

        Perhaps...

      4. tom dial Silver badge

        Re: Vint says:-

        Like roff/troff? These were quite well documented, as is TeX. And ODF, whatever its claimed limits, probably is decently documented. I've seen some old (pre GUI) Wordstar files that I recall looking much like those of roff, although there probably were differences and the formatting might not have been well documented.

        I don't think it is unreasonable to assign Microsoft a part of the blame, along with other less successful vendors who probably practiced the same type of obfuscation.

        1. Michael Wojcik Silver badge

          Re: Vint says:-

          Like roff/troff? These were quite well documented, as is Tex.

          Indeed. More generally, plain-text-plus-markup document formats are much older than horrible-proprietary-binary ones, and far, far more robust. The roff family dates back to CTSS RUNOFF (1964). SGML-based markup languages, including HTML, go back to IBM SCRIPT and GML, from a few years later. TeX[1] didn't show up for another ten years, but that's still five years before the first version of Microsoft Word. (LaTeX is roughly contemporaneous with the initial versions of Word, and of course long predates Word for Windows, much less Microsoft Office.)

          So the greatly-superior alternative of using plain text with markup was well-known when Word appeared. Of course Word was also not the first document editor to use a binary format. WordStar had been doing it for a few years (though it also offered "non-document mode"); the Wang word processing software was adapted into MultiMate for the IBM PC; WordPerfect had been around since '79 (on Data General machines).[2] But the Word developers still made the wrong choice, when both options were well-established.

          [1] Unfortunately, the Reg's subset of permitted HTML tags won't let me format that correctly. Oh well.

          [2] WordPerfect used a markup system internally, but the tags were formed with non-printable code points, so it wasn't a plain-text-markup design.

    2. Mips
      Childcatcher

      Re: Vint says:-

      Why blame anyone except yourselves?

      Look, we have exactly the same problem understanding ancient English, even Shakespeare is a foreign country to most people.

      Just do a save as plain text and sod the formatting.

      1. Michael Wojcik Silver badge

        Re: Vint says:-

        Look, we have exactly the same problem understanding ancient English,

        Well, we would, since there is no such language.

        Old English is very different from Modern English, true. It also fell out of use many centuries ago, unlike Office '97.

        Middle English (Chaucer, for example) can be easily picked up by anyone highly literate in Modern English. You'll need a glossary for some archaic diction and usage, but most of it is readily obvious from context.

        even Shakespeare is a foreign country to most people.

        Early Modern (Elizabethan) English shouldn't give any competent reader of Modern English any trouble. Again, a glossary or the occasional footnote helps, but that's often true with Modern English as well, as the language has an enormous vocabulary and is highly irregular.

        Even with those irregularities, though, natural languages have sufficient consistency and redundancy that they can be decoded even with very small samples. We figured out how to read cuneiform, for heaven's sake - that's a unique writing system from a linguistic isolate that went out of use thousands of years ago. Binary file formats, on the other hand, tend to be riddled with arbitrary signals, often contain insufficient redundancy, and in many cases are too rare to provide a decent corpus for analysis.

  2. Roger Greenwood

    "I'm not blaming Microsoft,"

    Who else would you blame Vint, yourself?

  3. Steve Knox
    Paris Hilton

    Perhaps...

    ...how his up-to-date version of Microsoft Word can't read Powerpoint files created in 1997.

    he might want to try using Powerpoint for that?

    1. graeme leggett Silver badge

      Re: Perhaps...

      he doesn't even need a copy of Office 97.

      From MS download page

      "Microsoft PowerPoint Viewer lets you view full-featured presentations created in PowerPoint 97 and later versions"

      1. Anonymous Coward
        Pint

        Re: Perhaps...

        Maybe Vint should have Cerfed the internet to find that answer. Anyway, did he try using Open Office?

      2. Richard Gadsden

        Re: Perhaps...

        And if the presentation was created in 1997 using a version of PowerPoint earlier than 97?

        Word file formats are different for Word/DOS, Word/Win1.0, Word 2.0, Word 6.0-95, Word 97-2003 and Word 2007-2013. Recent versions of Word can't read the Word 6/95 format, much less the three earlier ones; I'm sure PowerPoint is the same.

    2. Anonymous Coward
      Anonymous Coward

      Re: Perhaps...

      I'm afraid this is just a case of poor quoting by the Reg. Here's the original text as it appeared in Computerworld.

      Cerf illustrated the problem in a simple way. He runs Microsoft Office 2011 on Macintosh, but it cannot read a 1997 PowerPoint file. "It doesn't know what it is," he said.

      http://www.computerworld.com/s/article/9239790/Cerf_sees_a_problem_Today_s_digital_data_could_be_gone_tomorrow_#disqus_thread

  4. Pen-y-gors

    This is not new

    The problems of being able to access old electronic documents were identified years ago, and various strategies have been suggested.

    One approach is to create multiple copies of a document in different formats:

    a) the original file, a stream of bits;

    b) an updated version of the original file, created as part of a regular migration/update strategy - i.e. use Office 2010 to read an Office 1997 doc while it can and save it in Office 2010 format,

    and c) create a simple version that concentrates on preserving the important content, not worrying too much about precise layout and formatting and clever stuff (but include the meta data), and save that in a fairly plain vanilla format that is likely to be readable for a reasonable time.

    None of the above is perfect, and b) in particular means people have to do a lot of work regularly migrating old documents to a new format.
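
One way to keep copies a), b) and c) tied together is a small manifest stored alongside them. This is only a sketch under assumptions: the field names, the use of SHA-256 and the JSON layout are illustrative choices, not part of any archival standard mentioned in this thread.

```python
import hashlib
import json


def describe(data):
    """Record the size and SHA-256 digest of one copy, so later bit rot can be detected."""
    return {"bytes": len(data), "sha256": hashlib.sha256(data).hexdigest()}


def build_manifest(original, migrated, plain_text, metadata):
    """Describe the three copies of one document as a JSON-ready dict.

    original   -- copy a), the untouched stream of bits
    migrated   -- copy b), the same document re-saved in a newer format
    plain_text -- copy c), the content-only version, as a str
    metadata   -- free-form dict (title, author, dates, ...)
    """
    return {
        "metadata": metadata,
        "copies": {
            "original": describe(original),
            "migrated": describe(migrated),
            "plain": describe(plain_text.encode("utf-8")),
        },
    }


def to_json(manifest):
    """Serialise with sorted keys so the manifest itself is stable plain text."""
    return json.dumps(manifest, indent=2, sort_keys=True)
```

The manifest is deliberately plain text, so it stays readable even if every tool that wrote it has gone.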

    Of course, this all assumes that you can read the file in the first place. How many County Record Offices have an old Amstrad 8256 word processor handy, with the ability to read the old 3" floppies and export them to some sort of network?

    I'm a great fan of the fallback method: print every document out on acid-free paper and store it in a nitrogen-filled vault, or microfilm the printouts - or just print straight to archive-grade microfilm, then all you need to read the document is a magnifying glass and a torch!

    1. Anonymous Coward
      Meh

      Re: This is not new

      "The problems of being able to access old electronic documents were identified years ago, and various strategies have been suggested."

      I think Vint starts from the Google perspective of wanting to mine that data. In the real world, businesses lose and forget anything electronic that is older than eighteen months, and only the legal/property people have any concept of archiving and retrieving documents, and they usually stick to the physical. Give those archive capabilities to the rest of the business, and you find yourself paying Iron Mountain year after year to store Christmas decorations, the unindexed contents of retired employees' desks, or the IT department's original install 3.5 inch floppies for Borland and manuals for applications and operating systems long since gone.

      Printed books, documents, and a whole lot of original data have a half life (and always have had), and with the accelerated creation of more and more new electronic documents, losing them is rarely going to be that much of a loss. Where data is important and it is used, then it will be refreshed, preserved or updated; indeed the point of vellum was to preserve the important, not the routine. For the rest (including much of my own output) it doesn't really matter if it becomes unreadable in five or ten years' time.

      1. Brennan Young

        Re: This is not new

        Yes and no. Documents regarded as routine or ephemera today may turn out to be valuable, or of historical importance tomorrow. The point is we can't make the same judgements about the importance or value of documents in advance that hindsight would lead us to in future. Original letters by famous people are some of the most valuable artefacts to appear at auction. Even if it's just Karl Marx's laundry bill, someone will pay big money for it.

        But MS and others should take some of the blame: their document formats are clearly and simply under-engineered for longevity. Open formats go a long way to solving this problem, but they are eschewed for many other reasons. Potential longevity is not a selling point, unfortunately.

    2. grammarpolice

      Re: This is not new

      Generally the County Record Offices these days are wise to this and busily converting everything to PDF/A, which Vint seems to have completely overlooked.

    3. Phil O'Sophical Silver badge

      Re: This is not new

      > Of course, this all assumes that you can read the file in the first place.

      That is surely the real problem? Reverse engineering a document or file format is easy*, compared to reinventing a 5¼" floppy drive when the only storage you've ever seen is flash memory cards.

      * For certain small values of easy, but even so...

    4. Michael Wojcik Silver badge

      Re: This is not new

      Yes, it's not new, and indeed there's been quite considerable effort put into avoiding and alleviating it, like the ELO's Acid-Free Bits recommendations. The only news here is that Cerf's commented on the problem.

      1. Phil O'Sophical Silver badge
        Coat

        Re: This is not new

        "ELO's Acid-Free Bits"

        that would be Mr Blue Sky research, then?

  5. Anonymous Coward
    Anonymous Coward

    And that's why we should be storing documents using an open standard, like ODF. This guarantees that new generations will know how the file should be interpreted.

    1. Anonymous Coward
      Anonymous Coward

      Assuming people are still making tools that use the format. Just because it's open doesn't necessarily mean anyone will want to write software supporting it. Sure, in theory that makes it easier to write your -own-, but the number of people for whom that's practical is vanishingly small.

      FOSS isn't a panacea; it may not be subject to the same market forces as normal software, but it's still subject to 'market forces' of a sort.

      1. Ru

        Just because it's open doesn't necessarily mean anyone will want to write software supporting it

        But if they do want to, then they can. Compare and contrast with the difficulties involved in dissecting an old binary format which may never have been documented outside of the company who created it, who might not even exist anymore. It is effectively cryptanalysis. In some cases it is cryptanalysis thanks to deliberate efforts by the vendor to obfuscate data or due to the presence of some sort of DRM.

        in theory that makes it easier to write your -own-, but the number of people for whom that's practical is vanishingly small

        If the data in the problem format is valuable to someone, then there is an incentive for someone to write a suitable transcoder. If the data is not valuable, then who cares? The difference is that dealing with open formats is a comparatively cheap job, as the number of people who could write a suitable transcoder is vastly higher than the number of people capable of reverse engineering an undocumented proprietary format.

        1. Anonymous Coward
          Anonymous Coward

          "If the data in the problem format is valuable to someone, then there is an incentive for someone to write a suitable transcoder."

          Not necessarily - the person who values the data doesn't necessarily have anything of value to the person who's capable of writing the transcoder.

          For example, I have a whole ton of music I wrote in an old tracker format. It's extraordinarily important to me. But that doesn't mean that, should I run out of ways to load the files, someone who wants to write a program to read it will spring from the ground, willing to work on terms I could afford.

          Just because something is valuable doesn't mean that there are resources to take advantage of that value.

          Agreed, though, that better (or at-all) documented specs and setup of file formats do help quite a bit. I just have a problem with the, "If it's valued someone will write a decoder" argument; it betrays a fundamental misunderstanding of basic economics.

          1. Anonymous Coward
            Anonymous Coward

            I don't think a new "decoder", as you call it, would even have to be made.

            The difference between, say, ODF and DOC is that the former has many working, open-source implementations for it, while the latter only has the Microsoft implementation which also just works on their Windows platform.

            If Microsoft were to go out of business or people stop using Windows, it gets hard to read these files. This will not happen to ODF. Even if we move to an entirely new platform it would just be a matter of cross-compiling an existing interpreter.

      2. Anonymous Coward
        Anonymous Coward

        and the connection between a file format and open source is?

        ODF helps - it's a freely published standard that anyone can expect access to, has no orphan licensing issues and is, in the worst case, human-interpretable (ish) XML.

        It's a lot more accessible and possible to implement than the competition.

        Meta data - particularly on any dependencies in the document would help too.
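
To make the "human interpretable (ish) XML" point concrete: an .odt file is a ZIP archive whose main text lives in a member called content.xml, with paragraphs stored as text:p elements. The sketch below uses only the Python standard library; the function name is made up, and it ignores headings (text:h), tables, styles and everything else a real reader would need to handle.

```python
import zipfile
import xml.etree.ElementTree as ET

# The ODF "text" XML namespace.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def odt_paragraphs(odt_file):
    """Return the text of each text:p paragraph found in an .odt's content.xml.

    odt_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(odt_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    # itertext() flattens nested spans inside a paragraph into plain text.
    return ["".join(p.itertext()) for p in root.iter("{%s}p" % TEXT_NS)]
```

That is roughly the whole job for recovering the words, which is the sense in which an open, text-based format is cheap to write a transcoder for.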

      3. tom dial Silver badge

        But if the data is important enough, the free-ness will enable someone with enough intelligence and motivation to write it from scratch, as you said. With properly open formats the media deterioration and lack of appropriate devices is likely to be the difficult part, something that may not be true with proprietary formats.

  6. Anonymous Coward
    Anonymous Coward

    Oh, I dunno, how about someone comes up with a format for documents that's open. Maybe even call it Open Document Format.

    There's also that relatively unknown 'xml' way of storing data.

  7. Steve Knox
    Holmes

    It may be that a Ferrari will help my trips to the shops a lot.

    "It may be that the cloud computing environment will help a lot. It may be able to emulate older hardware on which we can run operating systems and applications,"

    We don't need the cloud for that. I've got an 8-year-old desktop that can emulate 20-year-old hardware perfectly fine. Heck, I remember running a Z80 emulator on a Z80.

    1. Michael Wojcik Silver badge

      Re: It may be that a Ferrari will help my trips to the shops a lot.

      Agreed. Emulation-as-a-service might be useful for some people (for convenience and automatic translation), but utility provisioning of IT resources is just an implementation issue. There's no need to invoke the magic "cloud" here, and I have no idea why Cerf did so.

  8. Code Monkey

    Most of the Word documents I produce are complete shite anyway. Lack of backward compatibility is sparing future generations some tremendous dullardery.

    1. Anonymous Coward
      Anonymous Coward

      And if our grandchildren are as clever as Mr Cerf, they'll be trying to open your Word docs in Paint Shop Pro.

    2. Flywheel
      Stop

      Will we really, honestly miss all that PowerPoint shite with the primary-school clip art and tada.wav?

      Is it really worth preserving?

  9. This post has been deleted by its author

  10. Pete 2 Silver badge

    From a historical perspective

    > his up-to-date version of Microsoft Word can't read Powerpoint files created in 1997

    ... nothing of any consequence has ever appeared on a PP presentation.

    Unlike present day archaeology, where making a "find" is a rare event due to the scarcity of old artefacts, I expect the researchers of tomorrow will have the opposite problem: trying to work out which is THE ONE significant piece of work amongst the hundreds of billions of pieces of crap, spam, tweets and pr0n. After that, decoding the format (surely just stripping out all the non-ASCII is 99% of the job) will be a trivial matter.
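
For what it's worth, the "strip out all the non-ASCII" approach is easy to sketch. The version below is an assumption-laden illustration in Python: it keeps runs of printable ASCII of at least four bytes, in the spirit of the Unix strings tool. The function name and the threshold are arbitrary choices.

```python
import re


def ascii_runs(data, min_length=4):
    """Return every run of printable ASCII at least min_length bytes long."""
    # 0x20-0x7E covers space through tilde, i.e. printable 7-bit ASCII.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [run.decode("ascii") for run in pattern.findall(data)]
```

Whether that really is 99% of the job depends on the format: text stored as 16-bit characters, or compressed, would not show up as ASCII runs at all.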

    1. RW
      Boffin

      Re: From a historical perspective

      Not the researchers of tomorrow, but those of today.

      Out of personal interest, I've been cataloging YouTube videos of the March 11, 2011 Tohoku tsunami. The extremes are (a) those videos that have been watched by hundreds of thousands of people and reposted to YouTube by a good many of them; and (b) those that have been watched by very few and exist on YouTube only in one version. The object is to identify the best version of each significant video, best meaning most complete, preferably with a good deshaker applied.

      Of course this is a hopeless task, as there are something on the order of 100,000 tsunami videos, far too large a number to catalog by hand. But even disregarding that minor issue, trying to figure out which version is original and complete is like trying to find a needle in a haystack. Today.

  11. Destroy All Monsters Silver badge
    Mushroom

    PONG! "Letter_to_hobbyists.doc cannot be opened by Microsoft Crudware 3010!"

    "Spreadsheets, documents and various collections of data will be unreadable by future generations."

    And nothing of value was lost.

    "What I'm saying is that backward compatibility is very hard to preserve over very long periods of time."

    No it's not. It's hard when it is made by a whole bunker of losers who foist negative externalities created by clueless primadonna uberdevelopers on the unsuspecting world because "muh bottom line!". Akin to dumping radioactive crap into the nearest river (you hear, government nuke "labs"?)

    1. Anonymous Coward
      Anonymous Coward

      Re: PONG! "Letter_to_hobbyists.doc cannot be opened by Microsoft Crudware 3010!"

      The article was written in quite a reasonable way, I thought. Compare and contrast that with what you've written. If I were looking at a way forward, I don't think I'd be paying much attention to you - it's obvious you have one opinion and by God you won't consider anything else ...

      1. Destroy All Monsters Silver badge
        Thumb Down

        Re: PONG! "Letter_to_hobbyists.doc cannot be opened by Microsoft Crudware 3010!"

        I don't do criticism by ACs

        1. Anonymous Coward
          Anonymous Coward

          Re: PONG! "Letter_to_hobbyists.doc cannot be opened by Microsoft Crudware 3010!"

          "I don't do criticism by ACs"

          Of course not, genius. We do it. "I don't listen to" would have been more valid. As useful as, "I don't believe in evolution", but at least logically consistent :)

  12. Anonymous Coward
    Anonymous Coward

    Cant beat ASCII

    I have 15 years of text docs; whenever I need to, I can always access my data.

    1. Francis Fish
      Happy

      Re: Cant beat ASCII

      Except when it's utf-8 - been burned by the two char space that looks fine in all editors but doesn't render properly when you turn it into a book.

      More seriously - only geeks like us use text. This doesn't help at all. I think the point other people are making about OpenOffice is valid. I've personally had far more success with it opening documents than MS's own software. I even managed to rescue most of the content from a document that Word had completely broken.

      1. Phil O'Sophical Silver badge
        Headmaster

        Re: Cant beat ASCII

        ASCII can't be UTF-8, it's a 7-bit code...
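
A quick way to check whether a "text" file really is 7-bit ASCII is to look for any byte above 0x7F. The sketch below is Python, and the function name is made up. The guess that the "two char space" is a no-break space (U+00A0) is an assumption, but it does fit: that character occupies two bytes in UTF-8.

```python
def non_ascii_positions(data):
    """Return (offset, byte value) for every byte outside the 7-bit ASCII range."""
    return [(offset, value) for offset, value in enumerate(data) if value > 0x7F]


# A no-break space between two words, encoded as UTF-8.
sample = "plain\u00a0text".encode("utf-8")
```

An empty result means the file is pure ASCII, and therefore also valid UTF-8; anything else is worth inspecting before it goes to the printer.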

      2. Michael Wojcik Silver badge

        Re: Cant beat ASCII

        only geeks like us use text

        Quite a few people use HTML, I think you'll find. And more importantly, non-technical users will happily use plain-text-plus-markup if they don't have to know about it. There's no reason why a WYSIVSTWYG[1] GUI can't be slapped on top of a markup file format. That's what WordPerfect did; though its format was not, alas, plain text, there's no reason why it couldn't have been. That's close to what LyX does with LaTeX, except that LyX doesn't pretend to be WYSIWYG and exposes too many technical features for some users' comfort. But there is nothing inherently "geeky" or specialized about plain-text-plus-markup as a file format.

        [1] What You See Is Vaguely Similar To What You Get
