Vint says:-
"I'm not blaming Microsoft..."
Why the hell not? It's their fault.
Big data may turn out to be a big mystery to future generations, godfather of the internet Vint Cerf has warned. The pioneering computer scientist, who helped design the TCP/IP protocol (along with Robert Kahn) before going on to work as chief internet evangelist for Google, has claimed that spreadsheets, documents and various …
It would be nice/useful if Microsoft provided a function within each Office application to "update all files in selected folder to the latest release of the application", with a tickbox for searching sub-folders and a tickbox for preserving the old document.
I'm sure there are a lot of ways this can break - e.g. I seem to recall that the implementation of pivot tables in Excel changed in moving to Office 2003, so it might be nice to know about this sort of thing...
Perhaps a first pass to do a scan to see which documents are there, which have simple conversions and which will break. Then for the ones that break, a dialogue that walks you through what you will lose if the conversion process continues, allowing you to decide if you want to convert or not.
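For what it's worth, the "first pass" scan is the easy part. Here is a rough Python sketch, assuming we only peek at the first few bytes of each file: the 97-2003 era binary Office files are OLE2 compound files that start with D0 CF 11 E0 A1 B1 1A E1, while the newer OOXML files (and ODF, for that matter) are ZIP containers that start with "PK". The function names and category labels are my own invention, and this says nothing about which features would actually break on conversion; that part would still need the application itself.

```python
import os

# Leading bytes of an OLE2 compound file (legacy binary .doc/.xls/.ppt).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Leading bytes of a ZIP local file header (.docx/.xlsx/.pptx, also ODF).
ZIP_MAGIC = b"PK\x03\x04"


def classify(header: bytes) -> str:
    """Guess the container type from the first bytes of a file."""
    if header.startswith(OLE2_MAGIC):
        return "ole2-binary"
    if header.startswith(ZIP_MAGIC):
        return "zip-container"
    return "unknown"


def scan_folder(root: str, recurse: bool = True) -> dict:
    """Map each file path under root to its guessed container type."""
    report = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                report[path] = classify(fh.read(8))
        if not recurse:
            break  # os.walk yields the top folder first, so stop there
    return report
```

Calling scan_folder on a folder, with recurse=False for the "don't search sub-folders" tickbox, gives you the list of candidates for the second pass.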
Hah!
My Word 2007 constantly nags me about compatibility and losing information if I have the audacity to save to .doc rather than .docx. Ditto Excel (my Python scripts don't handle xlsx, which is why I do xls).
Despite their specific assurance that "3 items are being de-formatted, please see help for details", I have never ever found a way to list what is being broken. In fact, considering that many of those Excels come out of report generators that start them out as xls, I rather doubt anything serious is being lost.
And you are asking for a universal converter, with running conversion details? From Steve "we listen to our customers" Ballmer?
Like roff/troff? These were quite well documented, as is TeX. And ODF, whatever its claimed limits, is probably decently documented. I've seen some old (pre-GUI) WordStar files that I recall looking much like those of roff, although there probably were differences and the formatting might not have been well documented.
I don't think it is unreasonable to assign Microsoft a part of the blame, along with other less successful vendors who probably practiced the same type of obfuscation.
Like roff/troff? These were quite well documented, as is TeX.
Indeed. More generally, plain-text-plus-markup document formats are much older than horrible-proprietary-binary ones, and far, far more robust. The roff family dates back to CTSS RUNOFF (1964). SGML-based markup languages, including HTML, go back to IBM SCRIPT and GML, from a few years later. TeX[1] didn't show up for another ten years, but that's still five years before the first version of Microsoft Word. (LaTeX is roughly contemporaneous with the initial versions of Word, and of course long predates Word for Windows, much less Microsoft Office.)
So the greatly-superior alternative of using plain text with markup was well-known when Word appeared. Of course Word was also not the first document editor to use a binary format. WordStar had been doing it for a few years (though it also offered "non-document mode"); the Wang word processing software was adapted into MultiMate for the IBM PC; WordPerfect had been around since '79 (on Data General machines).[2] But the Word developers still made the wrong choice, when both options were well-established.
[1] Unfortunately, the Reg's subset of permitted HTML tags won't let me format that correctly. Oh well.
[2] WordPerfect used a markup system internally, but the tags were formed with non-printable code points, so it wasn't a plain-text-markup design.
Look, we have exactly the same problem understanding ancient English,
Well, we would, since there is no such language.
Old English is very different from Modern English, true. It also fell out of use many centuries ago, unlike Office '97.
Middle English (Chaucer, for example) can be easily picked up by anyone highly literate in Modern English. You'll need a glossary for some archaic diction and usage, but most of it is readily obvious from context.
even Shakespeare is a foreign country to most people.
Early Modern (Elizabethan) English shouldn't give any competent reader of Modern English any trouble. Again, a glossary or the occasional footnote helps, but that's often true with Modern English as well, as the language has an enormous vocabulary and is highly irregular.
Even with those irregularities, though, natural languages have sufficient consistency and redundancy that they can be decoded even with very small samples. We figured out how to read cuneiform, for heaven's sake - that's a unique writing system from a linguistic isolate that went out of use thousands of years ago. Binary file formats, on the other hand, tend to be riddled with arbitrary signals, often contain insufficient redundancy, and in many cases are too rare to provide a decent corpus for analysis.
And if the presentation was created in 1997 using a version of PowerPoint earlier than 97?
Word file formats are different for Word/DOS, Word/Win1.0, Word 2.0, Word 6.0-95, Word 97-2003 and Word 2007-2013. Recent versions of Word can't read the Word 6/95 format, much less the three earlier ones; I'm sure PowerPoint is the same.
I'm afraid this is just a case of poor quoting by the Reg. Here's the original text as it appeared in Computerworld.
Cerf illustrated the problem in a simple way. He runs Microsoft Office 2011 on Macintosh, but it cannot read a 1997 PowerPoint file. "It doesn't know what it is," he said.
http://www.computerworld.com/s/article/9239790/Cerf_sees_a_problem_Today_s_digital_data_could_be_gone_tomorrow_#disqus_thread
The problems of being able to access old electronic documents were identified years ago, and various strategies have been suggested.
One approach is to create multiple copies of a document in different formats:
a) the original file, a stream of bits;
b) an updated version of the original file, created as part of a regular migration/update strategy - i.e. use Office 2010 to read an Office 1997 doc while it can and save it in Office 2010 format;
and c) create a simple version that concentrates on preserving the important content, not worrying too much about precise layout and formatting and clever stuff (but include the meta data), and save that in a fairly plain vanilla format that is likely to be readable for a reasonable time.
None of the above is perfect, and b) in particular means people have to do a lot of work regularly migrating old documents to a new format.
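To make a), b) and c) a bit more concrete, here is a minimal Python sketch of the bookkeeping, assuming you keep a small manifest entry alongside each document. The field names and the two helper functions are hypothetical, not taken from any archiving standard; the only real point is that a checksum of the original bit stream lets you prove later that copy a) has not rotted, whatever happens to the migrated copies.

```python
import hashlib


def manifest_entry(name: str, original: bytes,
                   migrated_format: str, plain_text: str) -> dict:
    """Record what is needed to check the three copies later."""
    return {
        "name": name,
        # a) the original file, treated purely as a stream of bits
        "original_sha256": hashlib.sha256(original).hexdigest(),
        "original_size": len(original),
        # b) which format the most recent migration produced
        "migrated_format": migrated_format,
        # c) the plain-vanilla rendering of the important content
        "plain_text_sha256": hashlib.sha256(
            plain_text.encode("utf-8")).hexdigest(),
    }


def verify_original(entry: dict, original: bytes) -> bool:
    """True if the stored bit stream still matches the manifest."""
    return hashlib.sha256(original).hexdigest() == entry["original_sha256"]
```

Every time b) is redone you would write a fresh entry, but the hash for a) should never change.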
Of course, this all assumes that you can read the file in the first place. How many County Record Offices have an old Amstrad 8256 word processor handy, with the ability to read the old 3" floppies and export them to some sort of network?
I'm a great fan of the fallback method: print every document out on acid-free paper and store it in a nitrogen-filled vault, or microfilm the printouts - or just print straight to archive-grade microfilm, then all you need to read the document is a magnifying glass and a torch!
"The problems of being able to access old electronic documents were identified years ago, and various strategies have been suggested."
I think Vint starts from the Google perspective of wanting to mine that data. In the real world, businesses lose and forget anything electronic that is older than eighteen months, and only the legal/property people have any concept of archiving and retrieving documents, and they usually stick to the physical. Give those archive capabilities to the rest of the business, and you find yourself paying Iron Mountain year after year to store Christmas decorations, the unindexed contents of retired employees' desks, or the IT department's original install 3.5 inch floppies for Borland and manuals for applications and operating systems long since gone.
Printed books, documents, and a whole lot of original data have a half life (and always have had), and with the accelerating creation of more and more new electronic documents, losing them is rarely going to be that much of a loss. Where data is important and it is used, then it will be refreshed, preserved or updated; indeed the point of vellum was to preserve the important, not the routine. For the rest (including much of my own output) it doesn't really matter if it becomes unreadable in five or ten years' time.
Yes and no. Documents regarded as routine or ephemera today may turn out to be valuable, or of historical importance tomorrow. The point is we can't make the same judgements about the importance or value of documents in advance that hindsight would lead us to in future. Original letters by famous people are some of the most valuable artefacts to appear at auction. Even if it's just Karl Marx's laundry bill, someone will pay big money for it.
But MS and others should take some of the blame: their document formats are clearly and simply under-engineered for longevity. Open formats go a long way to solving this problem, but they are eschewed for many other reasons. Potential longevity is not a selling point, unfortunately.
> Of course, this all assumes that you can read the file in the first place.
That is surely the real problem? Reverse engineering a document or file format is easy*, compared to reinventing a 5¼" floppy drive when the only storage you've ever seen is flash memory cards.
* For certain small values of easy, but even so...
Yes, it's not new, and indeed there's been quite considerable effort put into avoiding and alleviating it, like the ELO's Acid-Free Bits recommendations. The only news here is that Cerf's commented on the problem.
Assuming people are still making tools that use the format. Just because it's open doesn't necessarily mean anyone will want to write software supporting it. Sure, in theory that makes it easier to write your -own-, but the number of people for whom that's practical is vanishingly small.
FOSS isn't a panacea; it may not be subject to the same market forces as normal software, but it's still subject to 'market forces' of a sort.
Just because it's open doesn't necessarily mean anyone will want to write software supporting it
But if they do want to, then they can. Compare and contrast with the difficulties involved in dissecting an old binary format which may never have been documented outside of the company who created it, who might not even exist anymore. It is effectively cryptanalysis. In some cases it is cryptanalysis thanks to deliberate efforts by the vendor to obfuscate data or due to the presence of some sort of DRM.
in theory that makes it easier to write your -own-, but the number of people for whom that's practical is vanishingly small
If the data in the problem format is valuable to someone, then there is an incentive for someone to write a suitable transcoder. If the data is not valuable, then who cares? The difference is that dealing with open formats is a comparatively cheap job, as the number of people who could write a suitable transcoder is vastly higher than the number of people capable of reverse engineering an undocumented proprietary format.
"If the data in the problem format is valuable to someone, then there is an incentive for someone to write a suitable transcoder."
Not necessarily - the person who values the data doesn't necessarily have anything of value to the person who's capable of writing the transcoder.
For example, I have a whole ton of music I wrote in an old tracker format. It's extraordinarily important to me. But that doesn't mean that, should I run out of ways to load the files, someone who wants to write a program to read it will spring from the ground, willing to work on terms I could afford.
Just because something is valuable doesn't mean that there are resources to take advantage of that value.
Agreed, though, that better (or at-all) documented specs and setup of file formats do help quite a bit. I just have a problem with the "If it's valued someone will write a decoder" argument; it betrays a fundamental misunderstanding of basic economics.
I don't think a new "decoder", as you call it, would even have to be made.
The difference between, say, ODF and DOC is that the former has many working, open-source implementations, while the latter only has the Microsoft implementation, which also only works on their Windows platform.
If Microsoft were to go out of business or people stop using Windows, it gets hard to read these files. This will not happen to ODF. Even if we move to an entirely new platform it would just be a matter of cross-compiling an existing interpreter.
and the connection between a file format and open source is?
ODF helps - it's a freely published standard that anyone can expect access to, has no orphan licensing issues and is in the worst case human-interpretable (ish) XML.
It's a lot more accessible and possible to implement than the competition.
Metadata - particularly on any dependencies in the document - would help too.
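To show what "human-interpretable (ish) XML" buys you: an ODF text document is a ZIP archive with the body in a member called content.xml, so a very rough text extractor needs nothing beyond the Python standard library. This is only a sketch, assuming all you want is the paragraph and heading text; the function name is mine, and it ignores styles, tables, tabs, line breaks inside paragraphs and everything else that makes a real implementation hard.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# The "text" XML namespace used by OpenDocument.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def odf_paragraphs(data: bytes) -> list:
    """Return the text of every heading and paragraph in an ODF file."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    wanted = (f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h")
    # itertext() also collects text inside nested spans.
    return ["".join(elem.itertext())
            for elem in root.iter() if elem.tag in wanted]
```

Feed it the bytes of an .odt file and you get the words back. Try doing that in a dozen lines for an undocumented binary format.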
But if the data is important enough, the free-ness will enable someone with enough intelligence and motivation to write it from scratch, as you said. With properly open formats the media deterioration and lack of appropriate devices is likely to be the difficult part, something that may not be true with proprietary formats.
"It may be that the cloud computing environment will help a lot. It may be able to emulate older hardware on which we can run operating systems and applications,"
We don't need the cloud for that. I've got an 8-year-old desktop that can emulate 20-year-old hardware perfectly fine. Heck, I remember running a Z80 emulator on a Z80.
Agreed. Emulation-as-a-service might be useful for some people (for convenience and automatic translation), but utility provisioning of IT resources is just an implementation issue. There's no need to invoke the magic "cloud" here, and I have no idea why Cerf did so.
> his up-to-date version of Microsoft Word can't read Powerpoint files created in 1997
... nothing of any consequence has ever appeared on a PP presentation.
Unlike present day archaeology, where making a "find" is a rare event due to the scarcity of old artefacts, I expect the researchers of tomorrow will have the opposite problem: trying to work out which is THE ONE significant piece of work amongst the hundreds of billions of pieces of crap, spam, tweets and pr0n. After that, decoding the format (surely just stripping out all the non-ASCII is 99% of the job) will be a trivial matter.
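The "strip out all the non-ASCII" idea is roughly what the Unix strings tool does, and it is easy to sketch in Python. The function name and the minimum run length of 4 are my own choices. The catch, and the reason the 99% is optimistic, is that a format which stores its text as 16-bit characters puts a zero byte between every letter, so this approach finds nothing at all.

```python
import re


def ascii_runs(data: bytes, min_len: int = 4) -> list:
    """Return runs of at least min_len printable ASCII bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [run.decode("ascii") for run in pattern.findall(data)]
```

For example, ascii_runs(b"\x00\x01Hello\x00ab\xffworld!\x02") picks out "Hello" and "world!", while the same words encoded as UTF-16 come back as an empty list.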
Not the researchers of tomorrow, but those of today.
Out of personal interest, I've been cataloging YouTube videos of the March 11, 2011 Tohoku tsunami. The extremes are (a) those videos that have been watched by hundreds of thousands of people and reposted to YouTube by a good many of them; and (b) those that have been watched by very few and exist on YouTube only in one version. The object is to identify the best version of each significant video, best meaning most complete, preferably with a good deshaker applied.
Of course this is a hopeless task, as there are something on the order of 100,000 tsunami videos, far too large a number to catalog by hand. But even disregarding that minor issue, trying to figure out which version is original and complete is like trying to find a needle in a haystack. Today.
"Spreadsheets, documents and various collections of data will be unreadable by future generations."
And nothing of value was lost.
"What I'm saying is that backward compatibility is very hard to preserve over very long periods of time."
No, it's not. It's hard when it is made by a whole bunker of losers who foist negative externalities created by clueless primadonna uberdevelopers on the unsuspecting world because "muh bottom line!". Akin to dumping radioactive crap into the nearest river (you hear, government nuke "labs"?)
The article was written in quite a reasonable way, I thought. Compare and contrast that with what you've written. If I were looking at a way forward, I don't think I'd be paying much attention to you - it's obvious you have one opinion and by God you won't consider anything else ...
Except when it's utf-8 - been burned by the two char space that looks fine in all editors but doesn't render properly when you turn it into a book.
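A guess at what bit the poster: the no-break space, U+00A0, which is two bytes (C2 A0) in UTF-8 and looks exactly like an ordinary space in most editors. That identification is my assumption; the comment doesn't name the character. Under that assumption, a short Python sketch to hunt for any space-like character that isn't a plain ASCII space (the function name is made up):

```python
import unicodedata


def odd_spaces(text: str) -> list:
    """List (index, code point, name) for each non-ASCII space separator."""
    hits = []
    for index, ch in enumerate(text):
        # Category "Zs" covers U+0020 plus its look-alikes.
        if ch != " " and unicodedata.category(ch) == "Zs":
            hits.append((index, f"U+{ord(ch):04X}",
                         unicodedata.name(ch, "UNKNOWN")))
    return hits
```

Run that over the manuscript before it goes to the typesetter and at least you know where the surprises are.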
More seriously - only geeks like us use text. This doesn't help at all. I think the point other people are making about OpenOffice is valid. I've personally had far more success with it opening documents than with MS's own software. I even managed to rescue most of the content from a document that Word had completely broken.
only geeks like us use text
Quite a few people use HTML, I think you'll find. And more importantly, non-technical users will happily use plain-text-plus-markup if they don't have to know about it. There's no reason why a WYSIVSTWYG[1] GUI can't be slapped on top of a markup file format. That's what WordPerfect did; though its format was not, alas, plain text, there's no reason why it couldn't have been. That's close to what LyX does with LaTeX, except that LyX doesn't pretend to be WYSIWYG and exposes too many technical features for some users' comfort. But there is nothing inherently "geeky" or specialized about plain-text-plus-markup as a file format.
[1] What You See Is Vaguely Similar To What You Get