storage
should have used zfs
GitHub's website remains broken after a data storage system failed hours ago. Depending on where you are, you may have been working on some Sunday evening programming, or getting up to speed with work on a Monday morning, using resources on GitHub.com – and possibly failing miserably as a result of the outage. From about 4pm …
From https://blog.github.com/2018-10-21-october21-incident-report/
"At 10:52 pm Sunday UTC, multiple services on GitHub.com were affected by a network partition and subsequent database failure resulting in inconsistent information being presented on our website."
Anyone care to talk us through how zfs addresses this?
Network partition followed by database "failure" suggests some sort of clustered system with no split-brain protection. Two halves of one cluster each thinking they are the "one true cluster" and writing data to shared or replicated storage with no protection. It's a typical mistake made by home-grown cluster software that doesn't fail safe.
Plenty of ways to avoid that with decent clustering software, with or without ZFS.
I remember nexenta's zfs solution to that problem was to corrupt the zfs file system and then get into a kernel panic reboot loop until the offending block device (and filesystem) was removed. Support's suggestion was "restore from backup". I tried using zfs debug tools (2012 time frame) to recover the fs. Was shocked how immature they were. Wanted a simple "zero out the bad shit and continue"; it didn't exist at the time. Disabled nexenta HA after that. Left nexenta behind a while later.
I tried using zfs debug tools (2012 time frame) to recover the fs. Was shocked how immature they were
One of the irritating things about ZFS design, even today, is the insistence that the combination of checksumming & copy-on-write mean that it "can't" get into a state of internal corruption, so fsck-like tools aren't necessary.
IMHO, it has never sufficiently taken into account the way that a FS can be externally corrupted by storage problems or administrative misconfiguration.
The ZFS approach to checking if a filesystem (pool, really) is OK used to be (and may still be) to attempt to import that pool. In other words all the consistency checks run in the kernel where any kind of error in the checks (I mean a mistake in the code which checks for on-disk errors) is probably going to cause the machine to fall over in some horrible way. That kind of made it obvious to me that the ZFS people had never been very near a production environment and certainly never should be allowed near one.
They're great and I genuinely think they're the way forward but they're just not 100% there yet. I think that huge strides are being made and these are just teething problems as we all adjust to the new world, however we as customers need to use these services in tandem with our own on-prem, securing our own data and making sure we're fully ready before we make the complete move.
"...'The cloud' is once again overrated.
All eggs, in one basket [even a distributed basket] is not necessarily a good idea.
I like github but I don't stake my business on it always being there. A lot can go wrong between my computer and their servers. A lot..."
Worryingly, that's twice in a single month you've not only made sense, but I find myself generally agreeing with you, bob
However, like I've said before on here, just because something is in "the cloud" doesn't and shouldn't absolve the owners of the data/service of their responsibilities. These are usually the same people who wouldn't bat an eyelid if told - correctly - that you wouldn't trust the data/service to a single point of failure they own themselves.
And yet we still see this "throw it over the fence and it's someone else's issue" mentality time and time again.
"Cloud services" can work well. But they are not a panacea and they still require some levels of simple management and accountability.
It isn't really cloud, though, is it?
Not if one data storage thing going offline causes the whole thing to fall over. It's more like a Drip. Maybe a Puddle.
Whether or not it's "cloud"... where's the failover? And I mean failover, not just "oh, have some stale data and we may be able to restore a backup"... but live storage somewhere else ready to take over. You'd think $7bn might be able to buy something like that, no?
It doesn't matter whether it's cloud or not - it's SHODDY. Storage failures should never get to the point where they affect users, because you should have enough redundant storage, mirrored up to date and on a versioned filesystem (so even a "delete all" command can be undone), for it not to matter.
If you're basing your business on their services, immediately review that decision. From the looks of it, they are just running off stale caches at the moment. That might mean they have no data actually up at all.
'Live storage somewhere else ready to take over' is why banking IT is expensive. My guess is that what most of the cloud people do is, at best, 'storage somewhere else ready to take over, which is in a consistent state and no more than a few transactions (of whatever nature: git commits here) behind the current live storage'. Maybe that's enough.
For GitHub consumers this is one of the lesser cloud deployments since cloning a Git repository by default involves making a full local copy, and all operations are performed locally and then merely synced to remote.
Git doesn't even enforce any sort of topology — e.g. an international company that used GitHub could have local copies of all repositories that act as remote for all local developers and which sync up to GitHub from that single point; GitHub would then be the thing that permits cross-site work, and the authoritative copy.
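A minimal sketch of that topology (hostnames, paths and the GitHub URL below are all made up): an internal server holds the day-to-day origin, and GitHub is just another remote that gets synced when convenient.

    # on the internal server "git.internal" (hypothetical), one-off setup
    git init --bare /srv/git/project.git
    git -C /srv/git/project.git remote add github git@github.com:example-org/project.git

    # what each developer uses as their origin
    git clone ssh://git.internal/srv/git/project.git

    # periodic sync (cron job or post-receive hook): push every ref up to GitHub
    git -C /srv/git/project.git push --mirror github

Lose GitHub for an evening and the internal origin carries on; lose the internal origin and GitHub still has everything up to the last sync.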
What you lose is GitHub's additions to Git: the pull requests, the issue tracking, etc. Or, in this case, I guess you can still see slightly historic versions of those things effectively in read-only mode.
So I don't think I'm ready to jump on the cloud-is-a-bad-thing bandwagon in this particular use case. It's slightly more of an adjunct rather than a full solution, but the downage needn't be an absolute stop to work like it would be if, say, you were in the business of modifying and reviewing legal documents, and were just keeping them all on One Drive/Google Drive/DropBox/whatever, which vanished from sight.
So, ummm, just think about what you're paying for and be sensible?
The problem is that although git can do all that -- you can ship updates by email I'm pretty sure (and not just the git format-patch thing but commits), so the connectivity requirements are tiny in theory -- people (a) really, really want the issue-tracking stuff (b) in practice treat git just the same way they treated subversion and CVS, with a central system which runs everything, and (c) want it to be free. And that central system, for many people, is GitHub, so when it goes away the same doom befalls them that befell them when google code went away and when sourceforge went away before that (I know it, sort of, came back). And there's almost no collective memory -- anything that happened more than a year or so ago is forgotten -- and so the wheel of reinvention turns forever.
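For what it's worth, the email/offline route alluded to above looks roughly like this (a sketch only; the file names and the three-commit range are made up):

    # sender: turn the last few commits into mail-ready patches, or bundle them into a single file
    git format-patch -3 -o outgoing/
    git bundle create latest.bundle main~3..main

    # receiver: apply the patches, or fetch straight from the bundle
    # (the bundle assumes the receiver already has the commits before main~3)
    git am outgoing/*.patch
    git fetch latest.bundle main:refs/heads/incoming-main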
Don't entirely agree here. Even with a master repo (which you may put on github or not, as you wish), git still behaves fundamentally differently to both CVS and Subversion in a number of ways.
CVS is the most obvious: there is little concept of repository-level state unless you tag or branch, so a changeset will involve different revision numbers for different files. History is completely in the central repository, so lose it and all you have is a copy of the code; lose the connection to it and you can't commit or check out different versions.
SVN is a bit better, with the concept of a revision for the state of the code. However, branching is mixed up with the directory structure of your repository. I haven't used SVN for ages, but the merge tracking feature wasn't introduced till 2008, I guess inspired by git. Again, history lives on the central repo only.
Git: locally stored history, even if you don't use it. Branching and merging on a graph-based model, tools that let you diff freely across branches, commits and history. Using it with a central or master repository model doesn't actually detract from this. You can still commit, branch and check out with no connection. Lose the remote and you still have all history and can make a new master. A central repo becomes a useful point of coordination, but it's no longer an Achilles' heel. Lose it and you lose the extra nice stuff that's tied to the web service, but not the actual commit history, as you would with Subversion or CVS.
It's not about what git can do, it's about how people use it, and particularly that they expect there to be a big central system and are lost without it.
I'm also pretty sure that unless you do work to avoid the problem you're also in trouble with git if the central system you rely on goes away because you generally won't have all its commits. The documentation for 'git fetch' says, in part,
Fetch branches and/or tags (collectively, "refs") from one or more other repositories, along with the objects necessary to complete their histories.
which, I think, means it only fetches the commits it needs, and not commits associated with refs you're not fetching. So I think that means that pulls generally don't pull branches &c which you aren't tracking. In a busy repo that could be a lot.
I might be wrong about that but it would be easy to check, I think. I don't know because I'd never use GitHub as my big central repo but have origins which sit on storage I control, and I'm generally very careful about making sure I have complete clones when I need them.
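It is indeed easy to check. One way (the repo URL is an example only): make a mirror clone, which asks the server for every ref it advertises, and compare it with an ordinary clone.

    git clone --mirror https://github.com/example-org/project.git project-mirror.git
    git -C project-mirror.git for-each-ref | wc -l    # everything the server handed over

    git clone https://github.com/example-org/project.git project-normal
    git -C project-normal for-each-ref | wc -l        # what a default clone/fetch tracks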
You should move to gitlab. They never screw up.
One nice thing about gitlab is that, if you don't trust all this cloud malarky, you can host your own instance.
Which you are then free to lose in the datacentre meltdown of your choice, but at least it will be your datacentre meltdown.
(At least it's git, you can still branch and merge locally, right? And pull & push from colleagues.)
That's the theory, but without a master repo things can get a bit hard to manage if you are pushing and pulling between multiple clones. I think this is a big reason for the success of GitHub and GitLab: the workflows you can build around having a master repository to manage the branches and merging. However, yes, just keep making commits and push once things are working again. If the master repo is utterly lost then spin up a new one and push one of your local clones to it (though you'll have lost all issue and merge request history, and will need to set up any CI again).
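The "spin up a new one and push" step is mercifully short; something like this (the replacement URL is made up):

    git remote set-url origin ssh://new-server/srv/git/project.git   # point at the replacement repo (example URL)
    git push origin --all     # every local branch
    git push origin --tags    # and the tags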
"Which you are then free to lose in the datacentre meltdown of your choice, but at least it will be your datacentre meltdown."
You say that as a joke, but I really believe that's a huge advantage to on-prem hosting.
Yes, only half a joke. If you can afford to do it right (including paying for the knowledge as well as equipment), then you know where your data is (hopefully offsite too...), what arrangements you have for backup and recovery, and what events and failures you're ready for. Cloud can be easier, and maybe give better availability than you can achieve for a sane level of cost+complexity, but aside from a number in a contract (optimistically) it's hard to be certain that Google or Amazon have more than one copy of your stuff, or what might bring it down.
So if the 2018 outage was a Microsoft curse, what caused the 2017 ones?
https://m.slashdot.org/story/329399
I'm no MS lover, but for Pete's sake this consistent MS bashing is just sad.
By all means slam them for the utter debacle that is Skype/...for Business/Lync/Teams or for the scrawling mess that is Azure etc. But Cloud Services go down all the bloody time.
Even GitLab - https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/
Have some perspective people. It's a Cloud Service - and that means, irrespective of the owners or prospective owners, dumb shit that affects a lot of people will happen.
>MS bashing is just sad.
Well, technically speaking you are right, since the purchase is still ongoing. But, looking at the headline, I just knew MS was going to get bashed (sorry, powershelled) and was looking forward to it. And also to the logic contortions that the haters were going to use to somehow blame them.
What, you’re always reasonable, fair and mature?
Joking aside, github effups are very high profile, MS best take that into account once they do run it.
The issue is probably human error. Probably a misconfiguration or alerts that were ignored or not received at all, because infrastructure is easy and any developer who can bang two lines of Java together automatically knows everything there is to know about infrastructure as well.
That's the Devops way.
Some people look at HA as protecting against single points of failure and stop there. You don't hear about all the times one thing happened and some point later normal protected running was re-established. You have to plan for more than one event. Three tips to start with:
1 - Don't turn checksums off. If a supplier suggests you turn checksums off on a production system, work out how quickly you can stop using that supplier. Check if any benchmarks or certifications cited involved turning checksums off - and if they did, demand the ones with checksums enabled (see the ZFS example after this list).
2 - A snapshot is not a backup.
3 - A replicated snapshot is not a backup.
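On tip 1, taking ZFS as the example (the pool and dataset names below are hypothetical), the check is a one-liner:

    zfs get -r checksum tank          # "tank" is an example pool; anything reporting "off" deserves an awkward conversation
    zfs set checksum=on tank/data     # per dataset; note only blocks written from now on get checksummed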
What are you protecting against? Snapshots certainly are backups. As is RAID. It doesn't protect against everything certainly.
Just today on an Isilon cluster I was deleting a bunch of shit and I wasn't paying close attention. Realized I had deleted some incorrect things (data was static for months). I just restored the data from a snapshot in a few minutes.
I've been through 3 largish-scale storage outages (12+ hrs of downtime) in the past 15 years. It all depends on what you're trying to protect, and understanding that.
In my ideal world I'd have offline tape (LTFS+NFS) backups of everything critical stored off site (offline being the key word, not online, where someone with compromised access can wipe out your backups etc). This is in addition to any offsite online backups. Something I've been requesting for years. Managers didn't understand but did once I explained it. It's certainly an edge case for data security but something I'd like to see done. Maybe next year..
Understand what you're protecting against and set your backups accordingly.
Snapshots certainly are backups
Not really. They are a frozen point-in-time image, but they work by storing original copies of blocks that get changed after the snapshot time. For any unchanged block in the snapshot, the original filesystem is still the underlying source of the data. Take a snapshot of a filesystem, then remove or corrupt the original filesystem, and your snapshot is worthless.
For any unchanged block in the snapshot, the original filesystem is still the underlying source of the data
I guess it depends on how you understand the terminology. To me what you're describing is an incremental backup.
A snapshot to me is a backup of a moment in time but that backup contains everything needed to rebuild the system. E.g. dd if=/dev/sda of=/dev/sdc is a snapshot/backup, call it what you will, of sda that contains everything you need to rebuild that disc.
I guess it depends on how you understand the terminology. To me what you're describing is an incremental backup.
In a way it's the opposite. An incremental backup is a set of all the data which has changed since a particular moment, which can be added to a full backup to get the latest state. A snapshot is the reverse: it's a record of which data has changed, but it keeps the original, unchanged copies of those blocks, not the changes. The idea with the snapshot is that you can go back to that point in time, even if things have changed in the meantime. In both cases, though, you need the full dataset as well; neither a snapshot nor an incremental backup is of any use without a full copy of the data, since they only reflect changes.
A snapshot to me is a backup of a moment in time but that backup contains everything needed to rebuild the system. E.g. dd if=/dev/sda of=/dev/sdc is a snapshot/backup, call it what you will, of sda that contains everything you need to rebuild that disc.
The thing about a snapshot is that it is instantaneous. A dd of a disk will take a finite and possibly long time to complete, during which you need to block all activity to keep it self-consistent. A snapshot freezes an instant of a running system without any visible impact. You can then, of course, make an offline copy of that snapshot, with dd or anything else and you'll get a complete and consistent copy of the filesystem at that frozen moment. It won't matter if the filesystem is changing while you're doing that, the snapshot will protect you from the changes. That offline copy is certainly an independent backup, but the snapshot alone isn't.
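In ZFS terms (the pool, dataset and host names below are hypothetical): the snapshot is instantaneous but lives on the same pool as the data it protects; it only becomes a backup once it has been sent somewhere independent.

    zfs snapshot tank/data@sunday                                        # instant, same pool - not yet a backup ("tank" is an example pool)
    zfs send tank/data@sunday | zfs receive backup/data                  # full copy on a separate pool
    zfs send tank/data@sunday | ssh backuphost zfs receive vault/data    # better still, a separate machine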
I think a lot of people would define a backup as 'a (possibly partial, in the case of an incremental) copy of something on physically and logically independent storage'. In that sense neither a snapshot nor a RAID system is a backup (a detached mirror might be, in the right circumstances).
"Can you restore your service already?" software dev Saishav Agarwal
FFS, does this guy think the GitHub people are just sat there thinking "oh well, we could get it working by pressing Enter... but hey, let's wait a while until someone asks"? We get it all the time when something is down: "Can't you just get it back up?!"... Oh, that is what we are supposed to be doing. fsck off
This man page?
https://xkcd.com/1597/
Anyone unfamiliar with XKCD should take a moment to appreciate the alt text for that:
"If that doesn't fix it, git.txt contains the phone number of a friend of mine who understands git. Just wait through a few minutes of 'It's really pretty simple, just think of branches as...' and eventually you'll learn the commands that will fix everything."
(Found myself in that position a while back when a merge of a reverted merge merged the reversion - try saying that three times fast - and removed a couple of months of changes. Fine, they're still at the last commit; resetting HEAD to the previous revision got them back, but now the branch had been merged and then wiped out, so it wasn't possible to merge it again. What to do? Discovered that cherry-picking commits from the old branch onto a new one was a solution.)
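Roughly what that rescue looks like, with made-up branch names and commit ids:

    git checkout -b feature-take2 main        # fresh branch from the current tip ("main" and "old-feature" are example names)
    git log --oneline old-feature             # identify the commits whose changes got reverted
    git cherry-pick abc1234^..def5678         # replay them (oldest^..newest, ids made up) as brand-new commits

Because the cherry-picks are new commits, the later merge applies the changes again instead of being skipped. Git's own "revert a faulty merge" howto describes the other way out: revert the revert before merging again.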
For all it's on films and TV, I've never encountered people beeping in traffic jams*. Maybe it's an American or Continental Europe thing, whereas on these isles we tend to resign ourselves to being stuck in traffic.
*(Unless the cause is that car in front that takes forever to move off on green, or someone deliberately blocking lanes/yellow box junctions)
Ah, the wild and wonderful world of the cloud
people tend to forget that cloudy services are still connected to the wibbly wobbly web
if it wobbles, off your data goes
same with the cloud, if it wobbles, off your data goes as well.
Recovery in the first instance depends on the level of damage (power failure, vacuum cleaner, lightning strike, BOFH act, backhoe digging into a crucial backbone) - no data will be lost, just downtime, for which the company has to pay in lost hours if it does not have a redundant internet link in place... (the BOFH act may range from a simple upload of the wrong firmware to the router, to an epoxy-resin welded circuit breaker, to a rm -rf * on the Boss's terminal)...
Recovery in the second instance depends on the cloudy provider's IT crew (and hardware should it be required)... and yes, data may be lost, depending on the severity.
They should already BE on another box. Several other boxes in fact. That's called a fault tolerant system.
Every developer whining that they can't do any work because this system is down for a day should use their time productively by contemplating how their reliance on a single point of failure has been foolhardy. The rest will carry on with their work, having prepared for this situation in advance.
Yeah I wonder what kind of setup all those people who "cannot deploy" have.
So they don't have a local copy of their git repo? Even though git makes this dead easy? Or they have hardcoded all github references so they cannot deploy from their local repo?
I mean, I understand that they cannot access "Issues" and file "Pull requests", but that must be manageable for a single day.
FWIW, I could just pull and push from my Github repo, so it seems only the front-end is facing issues. Not the actual backend git storage. As already mentioned in the article.
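If the worry is hardcoded github.com references, git can rewrite URLs on the fly, so a deploy box can be pointed at an internal mirror without touching the scripts (the mirror host and org name here are made up):

    # any fetch or clone of https://github.com/example-org/... now comes from the internal mirror
    git config --global url."ssh://git.internal/mirrors/example-org/".insteadOf "https://github.com/example-org/"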
It sounds like there are different versions of your data in the repository, and it's not guaranteed which version you will observe if you access it at present. The term, I believe, is "eventual consistency."
That's fine if you understand that limitation, but many won't even be aware that it can happen.
that apparently is also not up to snuff..
If you are going to provide storage for rent, it'd better be backed up, have live failover machines, and be geographically dispersed so disasters (earthquakes, meteorites, tsunamis) have no impact either.
I'd rather lose my data because my stuff is failing than have to growl at someone else because of his hardware failing..
Github was around before anyone was talking about "the cloud". They were (and are) a free git hosting service, where you can pay for upgraded services.
It is senseless to complain about availability on the free side. Pay for it or run your own. (A previous employer of mine was running private GitHub--they sell that.) And shame on you and fie on your business if you have paying customers and are relying on a free service in order to satisfy your contractual obligations.
If their paying customers are having a problem, then this is a much more serious matter. The reports suggest that they are using some bs "eventually consistent" model of data resilience. Best review those contracts.
That's the problem with all this 'free' stuff ... when it disappears due to lack of funding you are shit-out-of-luck, and then will have to pay through the nose to recover.
If you are on a budget: get your own TWO NAS machines, subscribe to a $100/year storage service such as OneDrive, pCloud or Sync, and sync your NAS to the cloud storage.
Your local machine folders sync to the NAS. Your NAS syncs to the cloud. If one of your NAS machines goes down (and they will: drives will fail, motherboards will fry, updates will be botched) you have a failover and a cloud version.
Oh, and: 1 copy is NOT a backup. 2 copies is only half a backup. And don't store stuff in the same geo location...
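A minimal sketch of the two sync legs, assuming ssh access to a NAS called "nas" and an rclone remote already configured for the cloud side (all names hypothetical):

    # workstation -> NAS, e.g. hourly from cron ("nas" and the paths are examples)
    rsync -a --delete ~/projects/ nas:/volume1/backup/projects/

    # NAS -> cloud storage, e.g. nightly on the NAS itself ("remote:" is whatever rclone remote you configured)
    rclone sync /volume1/backup remote:backup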
Indeed the company I work for pays for github access allowing for private repos amongst other things.
Do we get to have a good gurn then?
I thought the whole point of freemium is that it was almost like the shareware of old that never quite expired after the 30 days - get people using it, used to it, then when they need to use the tool professionally they'll pay for it for familiarity.
If github can't offer a reliable free service then they should not offer a free service. Certainly they couldn't offer a reliable paid-for service either.