Well now you node: They're not known for speed, but Ceph storage systems can fly

It has been revealed that open source Ceph storage systems can move a little faster than you might expect. For those who need, er, references, it seems a four-node Ceph cluster can serve 2.277 million random read IOPS using Micron NVMe SSDs – high performance by any standard. Micron has devised a 31-page Reference Architecture …
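
For a sense of what a figure like that involves, here is a toy Python probe against a Ceph pool using the python3-rados bindings. The pool name, object size and read count are made up for illustration, and this is nothing like the full benchmark harness behind the headline number; a single synchronous client only shows per-operation latency, not cluster throughput.

    # Toy probe of small-object read latency against a Ceph pool.
    # Assumes python3-rados is installed, /etc/ceph/ceph.conf is readable,
    # and a pool named "testpool" exists (hypothetical name).
    import time
    import rados

    OBJ = "latency-probe"   # hypothetical object name
    READS = 1000

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("testpool")
    try:
        # Seed one 4 KiB object, then time synchronous reads of it.
        ioctx.write_full(OBJ, b"\0" * 4096)
        start = time.perf_counter()
        for _ in range(READS):
            ioctx.read(OBJ, length=4096, offset=0)
        elapsed = time.perf_counter() - start
        print(f"avg read latency: {elapsed / READS * 1000:.3f} ms, "
              f"~{READS / elapsed:,.0f} IOPS from one synchronous client")
    finally:
        ioctx.remove_object(OBJ)
        ioctx.close()
        cluster.shutdown()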

  1. baspax

    6ms+ w NVMe

    The only way to achieve such abysmal write latency is to have the world’s slowest data protection algos.

    Are they sending the acks by carrier pigeon?

    1. Korev Silver badge
      Alien

      Re: 6ms+ w NVMe

      You mean RFC 1149?

    2. Korev Silver badge
      Joke

      Re: 6ms+ w NVMe

      Maybe Ceph is short for Cephalopod

      1. Crypto Monad Silver badge

        Re: 6ms+ w NVMe

        The article wrongly states that the reference architecture requires 3TB (!) of RAM in each node.

        If you read the document, you find that the servers are *capable* of 3TB, but the reference configuration uses 12 x 32GB DIMMs = 384GB.

        (Still quite a lot though)

      2. Crypto Monad Silver badge

        Re: 6ms+ w NVMe

        > Maybe Ceph is short for Cephalopod

        Err, yes it is. The company was called "Inktank" before being bought by Red Hat.

      3. Anonymous Coward
        Anonymous Coward

        Re: 6ms+ w NVMe

        Isn't Cthulhu a cephalopod?

    3. CheesyTheClown

      Re: 6ms+ w NVMe

      I was thinking the same thing. But when you're working with asynchronous writes, it's actually not an issue; the real issue is how many writes can be queued. If you look at most of the block-based storage systems (NetApp, for example), they all have insanely low write latency, but their scalability is horrifying. I would never consider Ceph for block storage, since that's just plain stupid. Block storage is dead and only for VM losers who insist on having no clue what is actually using the storage.
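
      To put numbers on the queue-depth point, a toy Little's Law calculation in Python (the 6 ms figure is the one in the thread title; the queue depths are made up for illustration):

        # Little's Law: throughput (IOPS) ~= outstanding I/Os / average latency.
        LATENCY_S = 0.006  # ~6 ms per acknowledged write, as quoted above

        for outstanding in (1, 32, 32 * 100):  # one writer, one deep queue, 100 deep queues
            iops = outstanding / LATENCY_S
            print(f"{outstanding:>5} outstanding writes -> ~{iops:,.0f} IOPS")

      One synchronous writer caps out under 200 IOPS at that latency, while deep queues across many clients still push the aggregate into the hundreds of thousands, which is the queuing point above.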

      I would have been far more interested in seeing database performance tests running on the storage cluster. I also think that things like erasure coding are just a terrible idea in general. File or record replication is the only sensible solution for modern storage.

      A major issue, which most people ignore on modern storage and which is why block storage is just plain stupid, is transaction management on power loss. Write times tend to take a really long time when writes are entirely transactional. NVMe as a fabric protocol is a really, really bad idea because it removes any intelligence from the write process.
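
      The transactional-write cost is easy to see even on a local disk. A minimal sketch contrasting buffered writes with writes forced to stable media via fsync, which is roughly what any storage system has to do before acknowledging a durable write (file name and counts are arbitrary):

        # Compare buffered 4 KiB writes with writes fsync'd to stable media
        # before being considered done; the durable path is dramatically slower.
        import os, time

        def timed_writes(path, count=200, durable=False):
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
            block = b"\0" * 4096
            start = time.perf_counter()
            for _ in range(count):
                os.write(fd, block)
                if durable:
                    os.fsync(fd)  # do not return until the data is on stable media
            elapsed = time.perf_counter() - start
            os.close(fd)
            os.unlink(path)
            return elapsed / count * 1000  # ms per write

        print(f"buffered: {timed_writes('probe.dat'):.3f} ms/write")
        print(f"fsync'd:  {timed_writes('probe.dat', durable=True):.3f} ms/write")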

      The main problem with write latency for block storage on a system like Ceph is that it's basically reliably storing blocks as files, and that has a really high cost. It's a great design, but again, block storage is just so wrong on so many levels that I wish they would kill it off.

      So if Micron wants to impress me, I'd much rather see a cluster of much, much smaller nodes running something like MongoDB or Couchbase. A great test would be performance across a cluster of LattePanda Alpha nodes with a single Micron SSD each. Use gigabit network switches and enable QoS and multicast. I suspect they would see quadruple the performance they are publishing here for substantially less money.

      Better yet, how about a similar design providing high-performance object storage for photographs? When managing map/reduce cluster storage, add hot and cold tiers as well; it would be dozens of times faster per transaction.

      This is a design that uses new tools to solve an old problem which no one should be wasting more money on. Big servers are soooooo 2015.

      1. Anonymous Coward
        Anonymous Coward

        Re: 6ms+ w NVMe

        > Use gigabit network switches ...

        You forgot the joke icon... for most of the points in your post.

      2. disk iops

        Re: 6ms+ w NVMe

        > I also think that things like erasure coding are just a terrible idea in general.

        > File or record replication is the only sensible solution for modern storage.

        Hardly. Triple replication for small objects and EC for large are specifically intended to guarantee integrity and availability of data when things get silently corrupted or nodes become unavailable. The obvious tradeoff is "wasted" space for duplicates and CPU time to compute EC on both write and read.
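
        The space side of that tradeoff is simple arithmetic; a quick sketch (the 4+2 profile is just a common example, not anything taken from the Micron document):

          # Usable fraction of raw capacity: replication stores N full copies,
          # erasure coding with k data + m coding chunks keeps k/(k+m) useful.
          def replication_usable(copies):
              return 1 / copies

          def ec_usable(k, m):
              return k / (k + m)

          print(f"3x replication: {replication_usable(3):.0%} usable of raw")
          print(f"EC 4+2:         {ec_usable(4, 2):.0%} usable of raw")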

        Using Ceph to run live VMDK (as opposed to storing the initial bootstrap and subsequent snapshots) is nuts, I agree.

        NetApp has fast "write ACK" times because the write simply lands in battery/flash-backed RAM and can be de-staged at its leisure. Until, of course, the write load overruns the ability to checkpoint and flush said first-level cache, and even the second tier if so equipped.

        An object/Ceph store for RDBMS workloads would be appalling. Operating system disks are pretty much write-never, so if there were a way to implement NFS-root on top of Ceph without a lot of work, that might be interesting. Or replace EXTFS with native in-kernel CephFS; that might be something.

  2. Anonymous Coward
    Anonymous Coward

    Still not enough CPU

    Not sure how many cores (48 or 56?), but it's still not enough: half the drives would still max out the CPUs. Don’t design storage systems like this.
