Data trashed? When RPO 0 isn't enough

World Backup Day came and went – did you notice? It seems the only thing we've learned is that everyone wants Recovery Point Objectives (RPOs) of 0. Unfortunately, aggressive RPO targets are hard. They affect the design of real world environments, and are sometimes not possible. An RPO of 0 means "no data loss". With RPO 0, …

  1. Anonymous Coward

    In my experience

    Trying to achieve an RPO of 0 or near zero depends just as much on application systems design as on the underpinning datacentres and infrastructure.

    Architects beware - 0 RPO is a system problem, not an infrastructure one, although clearly infrastructure design and choices can play a role.

    Applications can be designed to assume some levels of infrastructure failure too...

  2. druck Silver badge
    Unhappy

    Pay per mention?

    How many times did you have to mention 3D Xpoint?

    1. Trevor_Pott Gold badge

      Re: Pay per mention?

      According to ctrl-f, 3 times. If you have a better example of a zoom zoom post-today's-NAND tech that is actually sampling amongst companies, I'd love to know. Or even a term that is better than "zoom zoom post-today's nand tech".

      If I say "3d xpoint" people get what I mean, even if xpoint isn't the actual technology that's relevant there. It's a placeholder. Kleenex, without being Kleenex branded tissues. But I'm open to a better term...

  3. handleoclast

    RPO 0 is good, but...

    Sure, RPO 0 is a good idea. I did something approaching that several years ago using DRBD on Linux to provide an HA Samba share. Same building, but storage at opposite ends of the building with a dedicated link between them. The objective was as much RTO 0 as RPO 0. Some people losing an hour's work is bad, but everyone being unable to do anything else for several hours would be worse. With DRBD, failover took place within 10 to 20 seconds. All data saved to the Samba share immediately prior to the hardware failure would be available.

    But that was no protection against somebody accidentally deleting an important file. Or making an edit that removed a big chunk of a file. Or slow ransomware encrypting old files so you don't notice a problem until it's screwed most of your data. Etc. Which is why backups are essential, even with RPO 0.

    I used rsync for nightly backups. It gives what is effectively a full backup with a data transfer only slightly larger than an incremental backup, so can be (and was, in this case) used to back up to a remote site as well as a local machine. With suitable options, rsync also saves a copy of any file that has been changed or deleted. So for the price of daily incremental backups I had a full backup offsite that was no more than 24 hours old, plus several months of changes. Not quite as good as could be achieved with the hypothetical AI-controlled backups, but good enough.
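For anyone wanting to try something similar, here is a minimal sketch of how a nightly job might assemble such an rsync command. The paths, the host name and the `build_rsync_command` helper are made up for illustration, and the options the poster actually used are not stated; `--archive`, `--delete`, `--backup` and `--backup-dir` are standard rsync options that give the "mirror plus saved changes" behaviour described.

```python
from datetime import date
from typing import List


def build_rsync_command(source: str, dest: str, changes_root: str, day: date) -> List[str]:
    """Build an rsync invocation that mirrors `source` to `dest` and keeps the
    previous version of every changed or deleted file in a dated directory.

    All arguments are hypothetical example values, not the poster's setup.
    """
    backup_dir = f"{changes_root}/{day.isoformat()}"
    return [
        "rsync",
        "--archive",                   # preserve permissions, times, symlinks, etc.
        "--delete",                    # drop files from dest that are gone from source
        "--backup",                    # keep the old copy instead of discarding it
        f"--backup-dir={backup_dir}",  # dated directory (on the receiving side) for old copies
        source,
        dest,
    ]


if __name__ == "__main__":
    cmd = build_rsync_command(
        "/srv/share/",                   # assumed local Samba share path
        "backuphost:/backups/current/",  # assumed remote mirror
        "/backups/changes",              # assumed root for the dated change directories
        date.today(),
    )
    print(" ".join(cmd))
```

Pruning the dated directories after a few months would give the "several months of changes" retention; that part is left out here.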

    I don't know if rsync is available in the latest Windows bits-of-linux-addon but it's long been available in the Cygwin package and works quite well on Windows. Especially if you figure out all the shadow volume stuff so you can back up the files that Windows keeps open all the time, such as Exchange data.

    1. Mark 110

      Re: RPO 0 is good, but...

      Have an upvote. I am getting bored of pointing out to very experienced people that data replication doesn't give you RPO zero in some of the more common scenarios - ransomware, fat fingers, malicious destruction (OK the last one isn't that common fortunately).

      I've mentioned the time an admin anonymised the Prod database for a major financial institution before . . . Backups, not replication, saved the day eventually.

      1. Anonymous Coward

        Re: RPO 0 is good, but...

        It depends on the organisation. We use a combination of RAID, Network raid to an offsite location (but within latency limits), Dual hosts and backups - forever incremental and more traditional for some systems.

        However, for some organisations stale data can mean the data is completely useless and the organisation might no longer be viable, and therefore all their concentration will be on replicating data. You may be able to find the employees' details to pay them their redundancy pay but sometimes if you lose data and have no way of knowing what was lost and no possibility to recover it, all the previous data is pointless. We're obviously not talking about files here but real-time, high transaction databases. Even for modest organisations, recovering the data, working out all the missing transactions and inputting them manually takes time, and new transactions can queue up faster than you can place the older ones into the system, at which point it becomes impossible to recover.

        The key thing is, backups, DR and business continuity aren't easy. If you think they are then you are probably either a) running a very small org with mainly user-created, document-based data to back up or b) you are doing it wrong and probably haven't done a fully realistic scenario to recover all your systems and all the logistics surrounding it.

        Despite the stories about major outages from big companies who struggle to get back up and running, I won't post here and say " err, have they not heard of backups, duh". I will be thinking, I'm so glad it isn't me that has to sort that out because there's probably a number of scenarios where I could be the one struggling to get the systems fully working in a reasonably useful timescale.

        1. Phil O'Sophical Silver badge

          Re: RPO 0 is good, but...

          One thing that many people forget when promising RPO 0 is that it's a double-edged sword.

          True RPO 0 means that no data you write can be lost, i.e. it is always written safely to the secondary data centre as well as the primary. Implicit in that is a requirement that your secondary data centre must always be available. If it goes down, or offline (even for maintenance) any writes at the primary won't get saved at the secondary until it comes back and catches up, so you don't have RPO 0 during that period.

          To put it another way, if you need to guarantee RPO 0, you have to halt primary site operation if the DR site isn't available. It makes the secondary site & network a single point of failure. Is that really what customers want?
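A toy sketch of that trade-off, for illustration only: the `Site` class and `strict_sync_write` function are invented names, and a real system would use proper commit protocols rather than an in-memory dict. The point it shows is that a strict RPO 0 policy has to refuse writes as soon as the secondary cannot take them.

```python
class ReplicaUnavailable(Exception):
    """Raised when a write cannot be made durable at both sites."""


class Site:
    """Hypothetical stand-in for a data centre's storage."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.online = True
        self.data = {}

    def write(self, key: str, value: str) -> None:
        self.data[key] = value


def strict_sync_write(primary: Site, secondary: Site, key: str, value: str) -> None:
    """Accept a write only if both sites can store it (strict RPO 0 policy).

    If either site is down, the write is rejected outright, which is exactly
    the 'halt the primary' behaviour described above.
    """
    for site in (primary, secondary):
        if not site.online:
            raise ReplicaUnavailable(f"{site.name} is offline, write refused")
    primary.write(key, value)
    secondary.write(key, value)


if __name__ == "__main__":
    primary, secondary = Site("primary"), Site("secondary")
    strict_sync_write(primary, secondary, "order-1", "paid")

    secondary.online = False  # e.g. DR site down for maintenance
    try:
        strict_sync_write(primary, secondary, "order-2", "paid")
    except ReplicaUnavailable as exc:
        print("write rejected:", exc)
```

Relaxing the check so the primary carries on alone keeps the service up, but from that moment the RPO is no longer 0.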

          As the article points out, you can get nearly to RPO 0 with 3-site solutions where data is written synchronously to a backup in the metro area, where speed-of-light latency is manageable, and replicated asynchronously in parallel to a distant 3rd site. Lots of banking organizations are doing this to meet regulatory rules. There are some clever tricks you can play with database log shipping so that you don't actually need 3 full copies of all data, one at each site. Even so, I'd be very wary of any salesman who promised me permanent RPO 0. It's a good sign that he doesn't understand what he's selling.

      2. handleoclast

        Re: data replication doesn't give you RPO zero in some of the more common scenarios

        Actually, I'd say that the problem in some common scenarios is that data replication does give you RPO 0. The problem is that you needed RPO -1 because you want to get back to the data you used to have before some idiot accidentally deleted an important file.

        Well, as you and I both said, the problem isn't RPO 0 per se, it's not having backups as well. RPO 0 serves an important purpose, but it's no substitute for having backups.

        Yes, there are business scenarios where only RPO 0 will do and backups aren't enough. And many more scenarios where RPO 0 is no good (malware, etc.) and only backups can help you. And a few fantastic scenarios where having both still won't help.

        On the whole, though, and for many common business models, and given the likelihood of various possibilities (nuclear strike and fat finger, for instance) I'd put a good backup system in place first and RPO 0 (if needed) second. Optimize for the common case...

      3. SimpleMindedGuy
        Coat

        Re: RPO 0 is good, but...

        Data replication (synchronous/RT) absolutely does provide an RPO of 0. Regardless of the state of the data itself at the primary, the goal is to provide the exact same state without any delta in a secondary (or P') location. In the scenario posed here, the proverbial site obliteration would cause an instant failover to P'. Under no circumstances should there be any change in/to the live data presented to applications - even if it is infected by ransomware or contains human error.

        Snapshots are an effective way to mitigate the dangers posed here, but must also be delta-replicated to allow for P' to restore should any of the common scenarios you've mentioned occur. The issue with snapshots is they capture the entirety of the resource at a particular point in time and must be scoured for the particular data desired. A DR plan should include regular testing of snapshots mounted and restored should that ever be necessary, but also to verify their integrity and application consistency.

        Backups are easily the most common form of DP, but as the author points out, do not deliver an RPO of 0 unless they are continuous and (you guessed it), RT replicated to P'. Coupled with snapshots and application integration, backups definitely provide the most seamless and comprehensive way to recover locally and in a metro environment.

        I saw someone mention application-level awareness and involvement. Absolutely. Intelligence is the answer to handling ceilings in latency and throughput. The less data read/written the better. Storage systems provide the intelligence in the form of access pattern recognition, data pattern compression, deduplication, and delta-only growth and transfers in certain cases. Storage systems don't provide data or application-level intelligence or they'd just be servers with storage and you might as well host the apps right there and have each server handle its own protection (good luck with shared link replication).

        Cost is generally the inhibiting factor. CapEx chokes the ability (if distance and latency don't) to create a good DR strategy. Oftentimes it's okay to have slower drives and higher capacity in P' and simply run in a degraded (slower) mode should a disaster strike. The additional capacity and cost savings also allow for DR testing, snapshot testing/purging, application recovery, virus scanning, etc.

        Every environment and organization is unique in most respects. There is no single right answer that covers all. A combination of different techniques and products to create a final strategy which encompasses and captures mission critical apps, essential, non-essentials, RPO(s), RTO(s), data TTL, and scheduled testing is more important than simply stating "RPO-0". A documented DR plan and strategy needs to be created and followed with stated expectations...and then it's all your fault anyway :-)

    2. Stephen 11
      Thumb Up

      Re: RPO 0 is good, but...

      We used to use rsnapshot for a few of our samba shares. Always found that was one of the easiest systems to restore a file that was deleted or corrupted by a user, and it was quite space efficient.

  4. Beech Horn

    Ceph

    Appreciate not everyone can use it but doesn't Ceph solve the RPO 0 problem? Triple replication with writes to 3 different locations by default. Block, file and FS services, etc, etc

    1. Mark 110

      Re: Ceph

      Not if you are replicating deletions. I don't know Ceph so can't comment.

      So let's work this through. If all writes are synchronously replicated to a secondary (or third) physical location then in the scenarios where your primary location is destroyed you achieve RPO 0. However if your primary location is fine, just someone deleted the data, encrypted the data, anonymised the data . . . then particularly if you are using storage level replication, all of that fuck up has been replicated to your other sites. There are ways of mitigating this if not using storage replication but I would not rely on them.
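The same point as a toy model. This is only an illustration and not how Ceph or any particular storage array works; `ToyReplicatedStore` and its methods are invented for the example. Every change to the primary, including a deletion, is applied to the replica straight away, while a point-in-time backup still holds the old data.

```python
import copy


class ToyReplicatedStore:
    """Hypothetical store with one synchronous replica and point-in-time backups."""

    def __init__(self) -> None:
        self.primary = {}
        self.replica = {}
        self.backups = []  # each entry is a frozen copy of the primary

    def write(self, key: str, value: str) -> None:
        self.primary[key] = value
        self.replica[key] = value      # replicated immediately

    def delete(self, key: str) -> None:
        self.primary.pop(key, None)
        self.replica.pop(key, None)    # the mistake is replicated too

    def take_backup(self) -> None:
        self.backups.append(copy.deepcopy(self.primary))


if __name__ == "__main__":
    store = ToyReplicatedStore()
    store.write("customers", "real names")
    store.take_backup()
    store.delete("customers")          # fat finger, ransomware, etc.

    print("replica has it:", "customers" in store.replica)
    print("backup has it:", "customers" in store.backups[-1])
```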

      Therefore. My preferred regime is regular backups. And regular log backups. The technologies and access controls enforced should mean that no one with permission to the production data also has access to the backups. The backups are replicated to a remote location.

      I normally enforce for critical high transactional systems a log backup regime of every 15 mins. This means I can promise an RPO of 15 mins in these scenarios. RTO is a different matter as I need a backup restored and T-logs run before I can restore service. I refuse to promise RPO 0 even though in some scenarios (physical destruction, major network outage, etc) I can achieve this.
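A rough sketch of how a restore under a regime like this could be planned. The `plan_restore` function and the timestamps are hypothetical; the idea is simply that you restore the latest full backup taken before the incident, replay the log backups that follow it, and lose whatever was written after the last completed log backup.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def plan_restore(
    full_backups: List[datetime],
    log_backups: List[datetime],
    incident: datetime,
) -> Tuple[datetime, List[datetime], timedelta]:
    """Return (full backup to restore, log backups to replay in order, data lost)."""
    usable_fulls = [t for t in full_backups if t <= incident]
    if not usable_fulls:
        raise ValueError("no full backup taken before the incident")
    base = max(usable_fulls)
    # Only logs taken after the chosen full backup and before the incident help.
    logs = sorted(t for t in log_backups if base < t <= incident)
    recovered_to = logs[-1] if logs else base
    return base, logs, incident - recovered_to


if __name__ == "__main__":
    midnight = datetime(2017, 4, 3, 0, 0)  # assumed nightly full backup
    logs = [midnight + timedelta(minutes=15 * i) for i in range(1, 41)]  # 00:15 to 10:00
    incident = datetime(2017, 4, 3, 10, 14)

    base, to_replay, lost = plan_restore([midnight], logs, incident)
    print("restore full backup from", base)
    print("replay", len(to_replay), "log backups")
    print("data lost:", lost)
```

The loss is bounded by the log backup interval; the time to restore the full backup and replay the logs is what drives the RTO.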

      As stated by my associate commentard earlier, you can get lower than 15 mins with certain approaches (log shipping) but there's some expense in network bandwidth and security controls. I have yet to work with an architect that went for log shipping in response to my data protection demands.

  5. Shameless Oracle Flack

    Leverage Appliances to Achieve Zero Data Loss for Oracle Databases

    Good review of cost-tradeoffs for achieving RPO of 0 (i.e., zero data loss) or near zero across your full software stack via your standard block storage platforms and related tools. The problem is that without deep integration with the database, where most corporate data lives, you can't achieve this. No block storage device can give you this 100% (although it can come close with crash-consistent recovery).

    START SHAMELESS COMMERCIAL PLUG

    In contrast, Oracle's Recovery Appliance can efficiently provide zero data loss for your Oracle databases, where the most important corporate data resides, with flexible topologies to meet varying redundancy requirements and seamless integration with Oracle and third-party tape. This solves your hardest and most important "zero data loss" problem without a lot of complex development, integration, and testing work on your part, while integrating directly into Oracle's OEM management tool platform already being used by your DBA's today.

    END SHAMELESS COMMERCIAL PLUG

    But as the first comment pointed out, devices such as Recovery Appliance and others mentioned by Trevor must be considered in the light of the complete application software stack to achieve the right service levels for data resilience and data availability back to users and other services.

    1. Anonymous Coward

      Re: Leverage Appliances to Achieve Zero Data Loss for Oracle Databases

      Refreshing to see a shameless plug made openly... we'll give you this one.

    2. Mark 110

      Re: Leverage Appliances to Achieve Zero Data Loss for Oracle Databases

      By far the best data protection regime I worked on was on an Oracle Maximum Availability cluster stretched across two data centres. Hopefully I got the terminology right - was a few years ago. I threw a bunch of scenarios at it in OAT and it did what it said on the tin.

      Currently working on some SQL AlwaysOn stuff which seems to try to get to the same place but there's no appetite (budget) for the level of testing I threw at that Oracle cluster.

    3. Mark 110

      Re: Leverage Appliances to Achieve Zero Data Loss for Oracle Databases

      And I have to say that whilst your technology is amazing, I struggle to recommend. Your licensing is a ballache and seems designed to allow people to fall into awful exposure situations without realising what they are doing.

      I know you need to make money but put some effort into making it easy for customers to just buy what they need please. And make it easy for them to check their license compliance, know what it's costing them and pay what they owe you. I do not want to have to sort another shit storm out.

      1. Shameless Oracle Flack

        Re: Leverage Appliances to Achieve Zero Data Loss for Oracle Databases

        You're right, totally hear you on that.

  6. Anonymous Coward

    An underground bunker would solve most of the natural disaster problems. You could even put earthquake protection into it as well. Let's face it, if you're paying silly money for the equipment you might as well be silly with the building.

    Maybe quantum entanglement could one day solve the latency problem.

    1. Mark 110

      It's underground. Vulnerable to Daemons . . .

  7. Velv
    Boffin

    Third site...

    And if you want to do this properly, you're going to have a fourth site.

    Like it or not, maintenance must be done at some point. And while you're conducting maintenance that element of your service is offline and your protection is at risk. One incident elsewhere during maintenance and you've potentially lost data.

    As said, to really do this properly is expensive. Alternatively the business need to sign off that there are failure scenarios they are not protected against. You'd be amazed what they will sign off when you put a £50,000,000 bill in front of them. "oh, if that's the cost, I guess we can risk losing 15 minutes of data"

    1. Mark 110

      Re: Third site...

      Spot on. Everyone can afford to lose 15 mins when they see the bill for going that bit better.

  8. elil

    Since most issues that happen to data are logical ones (human error, application failure, cyber attack and in particular ransomware), if you could take a snapshot every second, you'd be able to have RPO 0.

    Our storage product offers exactly that - we allow customers to go back to any second in the past and create an immediate clone.

    Ping me for additional details Eli@reduxio.com

    This of course does not protect against physical damage to the storage device, data center, site etc...

    For that, replication is the only alternative

    1. Doctor Syntax Silver badge

      "if you could take a snapshot every second, you'd be able to have RPO 0."

      In a high transaction rate application you'd lose quite a bit of data in that time.
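To put a rough number on that: the worst-case exposure is simply the transaction rate multiplied by the snapshot interval. The 5,000 transactions per second below is an illustrative assumption, not a figure from the thread, and `transactions_at_risk` is a made-up helper.

```python
def transactions_at_risk(tx_per_second: float, snapshot_interval_s: float) -> float:
    """Worst-case number of transactions written since the last snapshot."""
    return tx_per_second * snapshot_interval_s


if __name__ == "__main__":
    # Assumed load: 5,000 transactions per second, one snapshot per second.
    print(transactions_at_risk(5000, 1.0), "transactions could be lost")
```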

    2. SimpleMindedGuy
      Coat

      "This of course does not protect against physical damage to the storage device, data center, site etc...

      For that, replication is the only alternative"

      Enterprise data center requirements around RPO-0 implicitly require replication and data center FT. Reduxio appears to be a nice point-product/solution, but certainly not enterprise-class storage that can meet the requirements of Tier 0 and 1 applications/data. Reduxio still relies on backup software or other form of host-based protection to deliver recovery and RPO expectations.

      I looked at the product briefs and it sounds as though Reduxio is a server chassis with ?2 nodes? and didn't see mention of active/active controllers. I'm led to believe the cache then is not mirrored between the 2 controllers which could also be an intolerable point of failure should there be a power outage - unless you are employing some type of super capacitor to drain the cache before emergency shutdown. Regardless, whatever in-memory logic is applied must be flushed to stable storage first before acknowledging the host to facilitate any controller failover or NDU. Mixed workloads and OLTP will crush this type of platform in an enterprise environment especially since there is non-stop tracking logic and metadata updates for time traversal - sounds as though flash will eventually suffer the speed of the spinning disks at some point (even with intelligent destaging logic) and introduce tremendous latency. I also noticed there were no FC host interfaces, which are still prevalent in most enterprise shops.

      Seems like a neat product with some application, but I don't think talking about RPO-0 in this context makes sense for Reduxio.

  9. hellwig

    Space-based hosting.

    Why not host Pseudo RPO 0 data in space? Sure, cost is significant, but rather than worry about your distributed data centers being overcome by some earth-based disaster, imagine being the only company to have data survive the next biblical flood or massive solar-flare/EMP (assuming that your space-based repository would be properly shielded from such dangers).

  10. Anonymous Coward

    Distance no problem

    As mentioned by an earlier poster, distance isn't really the issue, it's the tolerance of your application to latency and how much you are willing to pay. How does an 800km round trip sound?

    https://www-935.ibm.com/services/uk/cio/pdf/gben-00.pdf
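As a back-of-envelope check on what that distance costs, here is a small sketch. It assumes light travels at roughly 200,000 km/s in optical fibre (about two thirds of its speed in a vacuum) and ignores switches, protocol overhead and the fact that fibre routes are longer than straight lines, so real figures will be worse. The function names are made up for the example.

```python
FIBRE_KM_PER_S = 200_000.0  # assumption: approximate speed of light in optical fibre


def round_trip_ms(round_trip_km: float) -> float:
    """Propagation delay in milliseconds for a given round-trip fibre distance."""
    return round_trip_km * 1000.0 / FIBRE_KM_PER_S


def max_serial_sync_writes_per_s(round_trip_km: float) -> float:
    """Upper bound on writes per second if each write waits for the previous ack."""
    return 1000.0 / round_trip_ms(round_trip_km)


if __name__ == "__main__":
    km = 800.0
    print(f"{km:.0f} km round trip: about {round_trip_ms(km):.1f} ms per synchronous write")
    print(f"at most about {max_serial_sync_writes_per_s(km):.0f} strictly serial writes per second")
```

Writes issued in parallel are not limited this way, which is why the application's tolerance to latency matters more than the raw distance.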

  11. Mike Timbers

    Much of what's being discussed here is for dealing with physical failure but it's good to see virtual failure scenarios also being discussed. I once was involved with an Oracle database that got a logical block corruption. Not only would it not re-start, we couldn't do a restore because we'd done bit-level backups so they contained the same logically-corrupted block. The block had been written four days earlier so to go back before that point would have meant losing four days' worth of data.

    It's not therefore a question of instantaneous data replication across n+1 sites at zero latency. True RPO-0 means being able to write the same raw data to two totally different storage (software AND hardware) solutions across multiple locations. No-one is going to pay for that.

  12. DrM

    Thanks

    Good article, provoked my thoughts.

    My customer's main solution tends to be printing paper trails.
