Inside Internet Archive: 10PB+ of storage in a church... oh, and a little fight to preserve truth

At the Internet Archive's headquarters in San Francisco, California, on Wednesday, technologists, educators, archivists, and other fact-oriented folks gathered to discuss how they and the like-minded can save news from the memory hole – a conceit conjured by George Orwell to describe a political mechanism for altering the truth …

  1. Florida1920

    Truth vs Sheep

    Securing the truth is one thing. Getting the sheeple to believe and act on it is another. The U.S. political ecosystem alone is living proof of that. 1984 was supposed to be a work of fiction, not an instruction manual.

    1. FrankAlphaXII

      Re: Truth vs Sheep

      If you consider a massively contaminated cesspool an ecosystem, you'd be correct. Our current political climate is a lot like Lake Karachay.

      My grandad always said that they're all a bunch of crooks. I used to love listening to him rail at dumbass local politicians who'd call him and he'd always finish with "Y'know what? You're all a bunch of crooks". He was from Chicago and it was indeed true, they were literally all a bunch of crooks, especially there and especially in the era he grew up in.

      However after the past two years, I have to amend Grandpa's wisdom. They're too inept anymore to be crooks. They're all a bunch of fucking yes-man clowns led by psychopaths on both sides with no original ideas in their heads besides threatening to bomb people.

    2. Anonymous Coward

      Re: Truth vs Sheep

      Looking at Aldous Huxley's and George Orwell's backgrounds, you might well conclude that neither of their works was "speculative fiction", but actually an "announcement".

      - A lot of sheeple have an inkling of what's wrong, but are far too afraid to act on it.

      - Other sheeple have an inkling, but refuse to entertain it, because they know it would turn their lives upside down.

      In the end, we are all responsible, because our own small scale corruption enables the huge scale corruption of stealing the entire world from us, to which they are getting pretty damned near. 20 banks, 140 corporations and a smallish number of shadowy figures controlling them globally.

      If there weren't so many people whose compliance you could *buy* with small change, they couldn't recruit a sufficient number of helpers' helpers to pull off this worldwide plan of fraud, mass murder and thinly veiled slavery.

  2. Dan 55 Silver badge

    Loophole

    Does it still take notice of present-day robots.txt files, which can be used to block URLs from the past?

    That's not a particularly good decision.

    1. teknopaul

      Re: Loophole

      You have to listen to robots.txt: some links are things like /delete, and there are other links that website owners don't want followed by spiders, not necessarily because they are hidden pages. If you want to hide data from spiders, there are easy ways to do it that don't use robots.txt. Breaking robots.txt would be a bad idea.

      1. Yet Another Anonymous coward Silver badge

        Re: Loophole

        Spiders aren't really relevant.

        The archive is to preserve what was visible, and to prevent the government changing whether we were at war with Eurasia or Eastasia. It doesn't matter if they remove the old page or block it from spiders, as long as the archive copied what was visible.

      2. L05ER

        Re: Loophole

        The issue is applying a current robots.txt to an archived version of a site.

        For example: Try viewing the official Chevrolet Monte Carlo website from 2006 on archive.org. It reads the current Chevrolet.com robots.txt and disallows access to archived content.

        It has nothing to do with current web standards and is solely about how the Wayback Machine handles robots.txt in relation to previously scraped and archived content.
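
The difference between the two policies discussed in this sub-thread can be sketched with Python's standard `urllib.robotparser`. This is only an illustration, not how the Wayback Machine is implemented: the two robots.txt bodies, the crawler name `example-archiver`, and the URL are all made up for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical robots.txt as it might have looked when the page was captured.
ROBOTS_AT_CAPTURE = """\
User-agent: *
Disallow: /delete
"""

# Hypothetical robots.txt served by the same host today.
ROBOTS_TODAY = """\
User-agent: *
Disallow: /
"""

AGENT = "example-archiver"                        # hypothetical crawler name
URL = "http://example.com/montecarlo/index.html"  # hypothetical archived URL

# Policy A: judge the capture by the rules in force when it was taken.
at_capture = allowed(ROBOTS_AT_CAPTURE, AGENT, URL)
# Policy B: judge the old capture by whatever the host serves now.
today = allowed(ROBOTS_TODAY, AGENT, URL)

print(f"allowed under capture-time rules: {at_capture}")
print(f"allowed under today's rules:      {today}")
```

Under the capture-time rules only `/delete` is off limits, so the page is viewable; under today's blanket `Disallow: /` the same capture is hidden, which is the behaviour L05ER describes.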

  3. Ralph the Wonder Llama
    Mushroom

    More of this sort of thing

    I love it when formerly religious buildings are purposed (I was going to say repurposed, but the prefix seems redundant to me) into something useful, like this. Or, say, pubs.

    1. Unep Eurobats
      Go

      Re: More of this sort of thing

      Yes: cafes, concert halls, art venues... And of course data centres in those nice, cool, secure crypts.

    2. steelpillow Silver badge
      Angel

      Re: More of this sort of thing

      ISTR that Christ himself was well known for his habit of consorting with the "publicans and sinners". I am sure He would approve of your sentiments.

      1. hplasm
        Happy

        Re: More of this sort of thing

        "ISTR that Christ himself was well known for..."

        Jeez, you're old!!

    3. Hollerithevo

      Re: More of this sort of thing

      Christian Science Reading Rooms, to be accurate, were mostly built in an era of cool architectural design. Cadogan Hall in London UK, now turned into a performance hall with its own resident orchestra, has a beautiful auditorium and brilliant acoustics.

      1. Mark 85

        Re: More of this sort of thing

        To be more accurate.. it's a former Christian Science Church they use. The Reading Rooms are a different critter and usually pretty small compared to the churches themselves.

    4. Anonymous Coward

      Re: More of this sort of thing

      Vilnius has a museum of atheism in a former church. Seems sensible to me.

  4. FrankAlphaXII
    Thumb Up

    Interesting that they keep a mirror at the modern reincarnation of the place where centralizing most of the Greco-Roman world's knowledge at that one location paid such great dividends.

    Good on them though, someone has to do what they're doing.

    I especially like the idea of the PDF readers automatically linking the cited papers in the footnotes. I have to read a lot of academic writing since emergency preparedness and response is in a constant state of evolution, and that would save a lot of time and there would likely be less time wasted getting pissed off at Elsevier and the rest of the academic publishing cartel if I could see which journal whatever paper is in (and if we pay for it or not) based on the linked URL alone.

    1. patrickstar

      ITYM "and if it's accessible via SciHub".

      Easy to automatically find PMIDs and DOIs and link them straight there as well...
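
A rough sketch of the "automatically find PMIDs and DOIs" step, using regular expressions. The patterns are heuristics rather than a full parser, the sample identifiers are invented, and the links point at the public doi.org and PubMed resolvers rather than any particular mirror.

```python
import re

# Heuristic patterns: a DOI starts with "10.", a registrant code, then "/".
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
# PMIDs are only recognised here when labelled, e.g. "PMID: 12345678".
PMID_RE = re.compile(r"\bPMID:?\s*(\d{1,8})\b", re.IGNORECASE)


def find_links(text: str) -> list[str]:
    """Return resolver URLs for every DOI and labelled PMID found in `text`."""
    links = []
    for match in DOI_RE.finditer(text):
        # Trailing punctuation usually belongs to the sentence, not the DOI.
        doi = match.group(0).rstrip(".,;)")
        links.append(f"https://doi.org/{doi}")
    for match in PMID_RE.finditer(text):
        links.append(f"https://pubmed.ncbi.nlm.nih.gov/{match.group(1)}/")
    return links


# Made-up footnote text with invented identifiers.
footnote = "See Smith et al., doi:10.1234/example.5678. Also PMID: 12345678."
links = find_links(footnote)
for link in links:
    print(link)
```

Stripping trailing punctuation will occasionally clip a DOI that legitimately ends in a bracket, so a real reader plugin would want to check each candidate against the resolver.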

  5. Adam 1

    distributed knowledge?

    A few months back we read about a whole bunch of early HP documents that were lost to a natural disaster (fire, from memory). It strikes me as quite all eggs in one basket to have such important historical data in one location. How do they back up their data? I know many folk here have a few 10s of GB HDD space. It would be a really interesting project to ask people to donate a few GB storage and a small amount of download/upload bandwidth to truly securing that data. If sharded the right way, you could reasonably have confidence that all information is held in multiple regions, detect where backup nodes are MIA and replicate the at-risk data to new nodes.
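
The bookkeeping behind that idea might look something like the sketch below. It is a toy model under stated assumptions: the node names, regions, and shard ids are invented, "alive" simply means a node answered some health check, and nothing here reflects how the Internet Archive actually replicates data.

```python
def at_risk_shards(placements, alive_nodes, node_region, min_regions=2):
    """Return shard ids held by live nodes in fewer than `min_regions` regions.

    placements:  dict shard_id -> set of node ids that claim to hold it
    alive_nodes: set of node ids that answered the latest health check
    node_region: dict node_id -> region name
    """
    risky = []
    for shard, holders in placements.items():
        live_regions = {node_region[n] for n in holders if n in alive_nodes}
        if len(live_regions) < min_regions:
            risky.append(shard)
    return sorted(risky)


def pick_new_holder(shard, placements, alive_nodes, node_region):
    """Pick a live node in a region not yet covering the shard, or None."""
    covered = {node_region[n] for n in placements[shard] if n in alive_nodes}
    for node in sorted(alive_nodes):
        if node_region[node] not in covered and node not in placements[shard]:
            return node
    return None


# Hypothetical volunteers and where they are.
node_region = {"a": "eu", "b": "eu", "c": "na", "d": "asia"}
placements = {
    "shard-1": {"a", "c"},  # eu + na
    "shard-2": {"a", "b"},  # eu only
    "shard-3": {"c", "d"},  # na + asia
}
alive = {"a", "b", "c"}     # node "d" has gone MIA

risky = at_risk_shards(placements, alive, node_region)
for shard in risky:
    target = pick_new_holder(shard, placements, alive, node_region)
    print(f"{shard}: replicate to node {target}")
```

Counting distinct regions rather than distinct nodes is the point of "sharded the right way": two copies in the same region still share one basket.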

    1. Dan 55 Silver badge

      Re: distributed knowledge?

      About a year ago they were talking about setting up a full backup in Canada.

    2. Hollerithevo

      Re: distributed knowledge?

      I would be happy to donate spare capacity. I am honoured that my very first website, 1998, has several iterations on the Wayback Machine.

      1. anothercynic Silver badge

        Re: distributed knowledge?

        I've got you beat by two years... my first iteration hit the WBM on 20 December 1996... :-D

    3. DuncanLarge Silver badge

      Re: distributed knowledge?

      I was thinking that also. This is very much how freenet works.

    4. phuzz Silver badge
      Thumb Up

      Re: distributed knowledge?

      If you'd like to help keep an additional backup of the Internet Archive (there are several already, they're not daft) there's a project called ia.bak which uses git annex to store a copy of part of the data.

      All you do is decide how much disk space and bandwidth you can spare, and then you can just walk away and leave it.

    5. JeffyPoooh
      Pint

      Re: distributed knowledge?

      My first idea is that your valuable data can be watermarked into pr0nography files (e.g. naughty videos), and then uploaded to the 'net. Within seconds, dozens of thieving/freeloading pr0n servers will steal copies of these files, and host them on their own pr0n servers for fun and ill-gotten profit. So your valuable data, secretly watermarked into the files, will be widely distributed and publicly available. It's really the ultimate free, distributed, crowd-sourced backup system. It'll almost certainly survive nuclear war and asteroid impacts. And it justifies smurfing pr0n during working hours, 'cause ya know, "...just checking the backups."

      My second idea is that precisely all this has already happened. Which would explain a great deal.

    6. Adam 1

      Re: distributed knowledge?

      I'm happy to be downvoted but at least make a point about why my post is wrong or stupid or RTFA or something.

      @phuzz, thanks for the link. It's good to see they are at least making the right noises. I think it's a bit generous to call it an "all you do" set of instructions. Most commentards here could do it but it is hardly folding@home or seti@home level accessible. There is a lot of focus on the great backup but potential distributed restore plans don't seem as developed. Bad actors are mentioned in passing but not strategies to figure out which is truth when for example a TLA pretends to be multiple actors and restores a different truth.

      This would be an interesting application of blockchains or even as a cryptocurrency. Imagine mining by proving that you have the hash of hundreds of random files from random places in the archive.
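
One caveat with "proving that you have the hash": a node could keep the hashes and throw the files away. A common way round that is a challenge-response, where the verifier sends a fresh random nonce and the node must hash the nonce together with the file. The sketch below assumes the verifier has a trusted copy to compare against, which is a real limitation; the function names and sample data are hypothetical.

```python
import hashlib
import secrets


def respond(nonce: bytes, file_bytes: bytes) -> str:
    """Prover side: hash a fresh nonce together with the file contents."""
    return hashlib.sha256(nonce + file_bytes).hexdigest()


def verify(nonce: bytes, reference_bytes: bytes, response: str) -> bool:
    """Verifier side: recompute the answer from a trusted copy and compare."""
    expected = respond(nonce, reference_bytes)
    return secrets.compare_digest(expected, response)


original = b"archived page contents"          # hypothetical archived file
nonce = secrets.token_bytes(16)               # fresh challenge for this round

honest = respond(nonce, original)
tampered = respond(nonce, b"rewritten page contents")

print("honest node passes:  ", verify(nonce, original, honest))
print("tampered node passes:", verify(nonce, original, tampered))
```

Because the nonce changes every round, an answer saved from an earlier challenge is useless, and a node holding a "different truth" fails the check.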

    7. Dan 55 Silver badge

      Re: distributed knowledge?

      Doesn't ipfs.io do this?

      Of course the danger with that is that it's a startup so it's transient.

  6. alain williams Silver badge

    Archive vs right to be forgotten

    How do we square off the two ?

    1. steelpillow Silver badge

      Re: Archive vs right to be forgotten

      There is no right to be permanently forgotten - ask any archaeologist. A data protection filter during a person's lifetime would surely be a good thing, but it would require massive administrative overhead with affected people arguing over what items should be blocked/unblocked, I don't know if it would be feasible.

      1. ponga

        Re: Archive vs right to be forgotten

        I broadly agree with you that, as I conceive of natural rights, there is no absolute right to be forgotten. However, this does not mean that there are no legal problems to address: with the EU General Data Protection Regulation coming into force this spring, you will in fact have a general right to remove records from any organization storing your personal information (with some obvious exceptions for e.g. active business relationships and security). The usual tricky question of jurisdictions then raises its ugly head.

        In the very longest term, we will of course all be forgotten. Isn't that comforting?

      2. Doctor Syntax Silver badge

        Re: Archive vs right to be forgotten

        "A data protection filter during a person's lifetime would surely be a good thing, but it would require massive administrative overhead"

        The best filter is the one that lies on the proximal side of the user's fingers and requires no overhead, just a head.

      3. JimboSmith Silver badge

        Re: Archive vs right to be forgotten

        "There is no right to be permanently forgotten - ask any archaeologist. A data protection filter during a person's lifetime would surely be a good thing, but it would require massive administrative overhead with affected people arguing over what items should be blocked/unblocked, I don't know if it would be feasible."

        I wondered what the difference between grave robbing and archaeology was. Someone did give me a definition, which was basically: archaeologists don't hang onto their finds for profit; grave robbers do.

        I think the Time Team folks are safe under that definition.

        I've managed to find a friend who I had lost contact with using archive.org. His business contact details were on his website which he deleted a few years back. Wouldn't have found him so easily otherwise.

        Whilst they do try to cut out the pr0n sites there are a few on there and hence archive.org is blocked in a few places where they're paranoid about pr0n. I once visited a company where they had blocked access via their internet connection to most adult sites but also Flickr, Instagram, Twitter, Dailymotion, Youtube etc. because they were trying to ban pr0n. A staff member told me it was a massive overkill but they were acting on the advice of lawyers.

  7. Christoph

    "copies of its data out of the US, because it's good to have an offsite backup."

    I would say that it is absolutely vital to have copies of the data outside the USA. And not just with the current regime. The Snowden revelations show clearly that for instance the NSA would have no qualms whatever in hacking in and changing the data, or simply ordering them to change it and forbidding them to say anything about it.

    1. alain williams Silver badge

      NSA & Internet Archive

      "the NSA would have no qualms whatever in hacking in and changing the data, or simply ordering them to change it and forbidding them to say anything about it."

      Correct: so any second copy must be more than a backup copy of the ''master'' in the USA. It must have a certain amount of USA-hands-off autonomy so that it would verify updates from the USA and also scan web sites independently so that it is not blind to the sites that the USA government/judiciary says that the USA archive must not see.

      The big question is where to place the second copy ? The UK is likely too close (politically) to the USA, much of Europe is not a huge amount better. I have China and Russia popping into my head; sure they will censor things but likely in a different way than the USA/Europe.

      Why stop at two backups? If funds allow, the more the better.

      1. Hollerithevo

        Re: NSA & Internet Archive

        Perhaps all the more reason to have the data distributed across a few million people's spare capacity?

  8. Aristotles slow and dimwitted horse

    *cough* Yes, sure... *cough*

    "The Internet Archive isn't so much concerned with preventing the spread of misinformation as with making sure information of all sorts remains accessible."

    Sure, because in 500 years from now when the domestic cats are evolved into our sentient upright overlords, and when we are slavishly subservient to them (even more so than we are already), it's good to know they'll have somewhere to go to revisit their own historical records.

    1. Hollerithevo

      Re: *cough* Yes, sure... *cough*

      And how does this differ from the situation between cats and humans now? They don't even have to be upright. Why bother, when they can be served when at their ease?

    2. JimboSmith Silver badge

      Re: *cough* Yes, sure... *cough*

      "Sure, because in 500 years from now when the domestic cats are evolved into our sentient upright overlords, and when we are slavishly subservient to them (even more so than we are already), it's good to know they'll have somewhere to go to revisit their own historical records."

      Obligatory Sir Pterry Pratchett quote:

      In ancient times cats were worshiped as gods, they have not forgotten this

  9. steelpillow Silver badge
    Angel

    Essential service

    I have to say I already find the Wayback Machine an essential service. I blush to recall that I have even retrieved my own stuff from it on occasion, when my backup system failed.

    Top marks to these geezers, I wonder if they are archiving Wikileaks?

    Somewhere I came across a religious cult which regards information as Divine (taking "God is Truth" quite literally) and its destruction as a sin. Not so much a defunct Church as merely a change of religion, then. I can live with a God like that.

    1. Hollerithevo

      Re: Essential service

      Isn't God the Word (Logos)?

      1. Stuart Castle Silver badge

        Re: Essential service

        No, The Bird Is The Word,

        https://www.youtube.com/watch?v=2WNrx2jq184

        1. Gezza

          Re: Essential service

          @Stuart Castle - thank-you. Brilliant. I needed cheering up and that hit the button. What a classic.

    2. Andrew Orlowski (Written by Reg staff)

      Re: Essential service

      If it's "essential" do you mind that it's partial, and succumbs to corporate pressure? Or is "full of holes" good enough?

      https://forums.theregister.co.uk/forum/1/2017/11/16/head_like_a_memory_hole/#c_3349090

      "a religious cult which regards information as Divine "

      The cult is contemporary and Swedish, but information worship goes back to Gnosticism. After Comte ("Religion of Humanity"), there were a number of religions of positivism. One disciple was Teixeira Mendes who put "Order and Progress" on the Brazilian flag.

  10. Anonymous Coward

    Agreed, and ElReg also should stop modifying posted articles without warning.

    That is all.

  11. 404

    This is why I collect encyclopedias and old books - they tend not to change when the wind blows from different directions*.

    *Russians obviously lol /s

  12. Andrew Orlowski (Written by Reg staff)

    Brewster and Memory Holes

    "We don't see people trying to modify the records that we've stored," Kahle told The Register.

    Archive.org seems very happy to modify the record itself. How do I know?

    Back in 2003, when Carly Fiorina was CEO, HP requested the deletion of material it found embarrassing, and Archive.org happily complied. I recall this made things difficult for us journalists to corroborate previous statements, and so hold the executives to account.

    So I find the Memory Hole competition richly ironic. Archive.org *is* the memory hole.

    Real archives have exceptions for copying and preservation, and the kind of threats HP made could be ignored. Don't mistake Brewster's collection for a real archive.

    1. Hollerithevo

      Re: Brewster and Memory Holes

      Thank you for this. It's like when I found out that the British Library had thrown away a lot of books and old periodicals. Gutted my trust in them forever.

    2. Kiwi

      Re: Brewster and Memory Holes

      "Back in 2003, when Carly Fiorina was CEO, HP requested the deletion of material it found embarrassing, and Archive.org happily complied. I recall this made things difficult for us journalists to corroborate previous statements, and so hold the executives to account.

      So I find the Memory Hole competition richly ironic. Archive.org *is* the memory hole."

      I wonder if, in the 14 or so years since, they've had a change of heart?

  13. EyeballKid

    Whose fault is it?

    Does anyone else think it odd that this "forever archive" is located in a building on a major fault line?

  14. ma1010
    Happy

    Sounds fine to me

    "... we don't archive Facebook very well."

    1. Robert Moore
      Coat

      Re: Sounds fine to me

      "... we don't archive Facebook very well."

      Yes, but if FriendFace went away, nothing of value would be lost.

      (Except the Cuke game, that was awesome.)

  15. Claptrap314 Silver badge

    35PB ? What about the Internet?

    I know one company whose data is measured in a larger scale than that. There is no way that these guys are archiving anything beyond a thin slice of the net. This explains why I wasn't able to pull comments from an old website--they just never archived it.

    1. David Nash Silver badge

      Re: 35PB ? What about the Internet?

      It's better than nothing though. I don't think they claim otherwise.

    2. Korev Silver badge
      Boffin

      Re: 35PB ? What about the Internet?

      The Research division of my company has more than this and we don't consider ourselves that big - CERN has multiple hundred PB of data.

  16. oneeye

    Gee, no bias from the author?

    WRONG! Since he completely ignored the last Democrat President, I only feel it necessary to give one of his famous quotes, which I'm sure has been archived hundreds, if not thousands of times:

    "If you like your Doctor, you can 'KEEP' your Doctor" !!!

    1. Anonymous Coward

      Re: Gee, no bias from the author?

      Yeah, as if only one president has ever lied...

      "I am not a crook."

      "Read my lips, no new taxes."

      "I am not having sexual relations with that woman."

      Just about everything out of Donny's mouth/fingers.

      Face it, if you record everything a person said for four years, you'd have more whoppers than Burger King. The difference is, most politicians try to claim that they were misquoted, misunderstood, or had better "facts", they don't simply erase what they said. The author was pointing out that GWB kept changing the headline as it suited him, since it clearly didn't match reality.

  17. FozzyBear
    Angel

    Having the tech Storage in an alcove in a church, that is so American Gods

  18. Kev99 Silver badge

    Personally, I'd really rather no one knew where their umpteen petabytes of storage were kept. An old church is not the safest place in the world. They'd be better off if they used someone like Iron Mountain to house the storage.

  19. Paratrooping Parrot
    Linux

    Backups

    I hope that they have a good backup system in place. Distributed in different areas around the world. As for the speed of backup, just to paraphrase Tanenbaum: "Don't underestimate the data transfer rate of a van with backup tapes along the motorway."

    1. harmjschoonhoven
      Stop

      Re: Backups

      The problem is not to move the van. The problem is to fill the tapes.
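
Both halves of this exchange can be put into rough numbers. Every figure below is an assumption chosen for round arithmetic (tape count, tape capacity, drive time, write speed, number of drives), not a measurement of any real setup.

```python
def van_throughput_gbit_s(tapes: int, tape_tb: float, drive_hours: float) -> float:
    """Effective throughput in Gbit/s of a van carrying `tapes` full tapes."""
    total_bits = tapes * tape_tb * 1e12 * 8
    return total_bits / (drive_hours * 3600) / 1e9


def fill_hours(tapes: int, tape_tb: float, write_mb_s: float, drives: int) -> float:
    """Hours needed to write all the tapes using `drives` tape drives in parallel."""
    total_mb = tapes * tape_tb * 1e6
    return total_mb / (write_mb_s * drives) / 3600


# Assumed figures, for illustration only.
TAPES = 1000          # tapes in the van
TAPE_TB = 10.0        # capacity per tape, in terabytes
DRIVE_HOURS = 5.0     # time on the motorway
WRITE_MB_S = 300.0    # sustained write speed per tape drive
DRIVES = 10           # tape drives writing in parallel

van = van_throughput_gbit_s(TAPES, TAPE_TB, DRIVE_HOURS)
fill = fill_hours(TAPES, TAPE_TB, WRITE_MB_S, DRIVES)

print(f"van on the motorway: about {van:,.0f} Gbit/s")
print(f"filling the tapes:   about {fill:,.0f} hours ({fill / 24:.0f} days)")
```

With these assumptions the drive itself moves data at thousands of gigabits per second, while writing the tapes beforehand takes weeks, which is harmjschoonhoven's point.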
