Walk with me... through a billion files. Slow down – admire the subset

If you ask your notebook's filesystem how many MP3 files it is storing that haven’t been opened in 30 days, you can get the answer reasonably quickly. But ask the same question of an enterprise file system holding a million files and you have a big problem. Ask it of a file system that holds a billion files and your day just …

  1. Doctor Syntax Silver badge

    As I started to read the article I found myself thinking "these guys need a proper database" and then - whadya know - they invent a database for the job, B-trees and all.

    I always reckoned that the Unix file system design reflected the database technology of its day and, although the implementation has changed somewhat to allow larger discs, journalling, remote storage etc, the design of the interface has been more or less frozen since then. Perhaps it's time to move forward and at the same time build in some protection against malware and its effects.

  2. Lysenko

    Is this a press release or actual journalism?

    There's nothing remotely new or innovative about storing metadata in a database to speed up file searches. In Linux terms, [find] does tree walking and [locate] leverages an indexed DB to do the same job. Filesystems that incrementally [updatedb] in real time (as against batch mode) aren't new either.
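
    To make the pattern concrete, here's a minimal Python sketch of the same idea - an updatedb-style indexer plus a locate-style query, using SQLite in place of locate's own database format (the function names and schema are my own invention, purely for illustration):

      import os
      import sqlite3
      import time

      # updatedb-style pass: walk the tree once, indexing each file's path and atime.
      def build_index(root, db_path="files.db"):
          con = sqlite3.connect(db_path)
          con.execute("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, atime REAL)")
          for dirpath, _dirnames, filenames in os.walk(root):
              for name in filenames:
                  full = os.path.join(dirpath, name)
                  try:
                      st = os.stat(full)
                  except OSError:
                      continue  # vanished or unreadable; skip it
                  con.execute("INSERT OR REPLACE INTO files VALUES (?, ?)",
                              (full, st.st_atime))
          con.commit()
          con.close()

      # locate-style query: answer the article's "MP3s untouched for 30 days"
      # question from the index, without touching the directory tree at all.
      def stale_mp3s(db_path="files.db", days=30):
          cutoff = time.time() - days * 86400
          con = sqlite3.connect(db_path)
          rows = con.execute("SELECT path FROM files WHERE path LIKE '%.mp3' AND atime < ?",
                             (cutoff,)).fetchall()
          con.close()
          return [r[0] for r in rows]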

    Now, I'm not saying these guys are "Emperor's new clothes", but leading with trivial features and ridiculously brain-dead tree-walk time calculations doesn't do them any favours. There were indexed file searches on NetWare in the early '90s. How about covering something a bit more impressive, like their capability (assuming it exists) to dedupe files that differ only in terms of metadata?

  3. Phil Bennett

    Object storage? Alternatives?

    This article reads a lot like an advert, but being charitable and assuming it's just a regurgitated press release, how does QFS compare to other modern large scale storage architectures?

    While there are tasks that are slow on traditional file systems, like the slightly artificial example given, the feature that is most requested and causes the most hassle is search within files, and you shouldn't be using the filesystem for that - you should have something external like Elasticsearch. Once you have external management software, keep the smarts there and let the filesystem store files. If you need something using the metadata and it's going to take ages to run, you've not designed your infrastructure correctly.

  4. Christoph

    "Assume each node access requires a disk access and this takes 10 milliseconds; then a 10-node file system would take 100ms roughly plus the access needed to walk back up the tree; say 150ms being simplistic."

    Well yes - if you first switch off all the caching. With modern memory sizes most of your reads will be to stuff already cached in memory.

    1. Mage Silver badge

      reads will be to stuff already cached in memory

      No, they won't. Only for stuff you accessed recently. Certainly not for all the files that haven't been accessed in over a week.

      1. Christoph

        Re: reads will be to stuff already cached in memory

        A directory read will pull in at least a whole disk sector, with other directories included. When you go down the tree you don't throw away the directory bits you've already read. Any decent system will optimise all this as much as it can. It's not perfect but it's far better than accessing the disk every single time.

        1. Anonymous Coward

          Re: reads will be to stuff already cached in memory

          The problem is that with a billion files at 256 bytes per inode, even ignoring directories, that's 256GB of inode data. That's a lot of RAM to dedicate to caching inodes, and then you have to cache the directories as well, so you're probably up to 300GB.

          Sure, you can get servers with enough RAM to make that feasible, but even if you can configure the OS to cache that much, the OS itself may become a limiting factor. It might not search a cache that large very well, as it will have been designed and tested against inode caches orders of magnitude smaller.

          I'm not excusing the advertisement article, which acts as if a B-tree-based filesystem is some sort of new innovation. Personally, if I needed to know how long it had been since a given file was accessed, I'd have a background process walk the directory tree, look for changes and build a database. Using the last-access inode entry you can tell which subtrees you don't need to bother walking and updating, making the process far more efficient than what is described in the advertisement article.
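
          A rough Python/SQLite sketch of that background process (names and schema are illustrative, not from any real product):

            import os
            import sqlite3

            # Incremental rescan: remember each directory's mtime from the last
            # pass and skip recursing into subtrees whose entries haven't changed.
            # Caveat: a directory's mtime only changes when entries are added or
            # removed directly inside it, so this prunes creates/deletes/renames
            # but can miss in-place changes to files further down the tree.
            def rescan(root, db_path="scan.db"):
                con = sqlite3.connect(db_path)
                con.execute("CREATE TABLE IF NOT EXISTS dirs (path TEXT PRIMARY KEY, mtime REAL)")
                stack = [root]
                while stack:
                    d = stack.pop()
                    try:
                        mtime = os.stat(d).st_mtime
                    except OSError:
                        continue
                    row = con.execute("SELECT mtime FROM dirs WHERE path = ?", (d,)).fetchone()
                    if row is not None and row[0] == mtime:
                        continue  # unchanged since the last pass: prune this directory
                    con.execute("INSERT OR REPLACE INTO dirs VALUES (?, ?)", (d, mtime))
                    try:
                        with os.scandir(d) as entries:
                            for entry in entries:
                                if entry.is_dir(follow_symlinks=False):
                                    stack.append(entry.path)
                                # a real indexer would record entry.stat() for files here
                    except OSError:
                        pass
                con.commit()
                con.close()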

        2. phord

          Re: reads will be to stuff already cached in memory

          But who caches directory data from network storage systems? Multiple writers invalidate caches readily.

  5. Mage Silver badge

    can’t retrofit … metadata generation, storage and access to an existing file system

    Yes you can, though it might take quite a while.

    IBM had the idea, before OS/2, that really an SQL type Relational database should be part of a filesystem. I always thought that in principle it was a good idea.

    Using file extensions and file names to identify content is itself a bit doomed anyway. Taking the IBM idea, I worked on a multi-dimensional document space that used a database and then, at the lowest level, used any arbitrary OS and filesystem to actually implement the object storage. So "revision / version" control would be managed, and you could navigate by time, kind of content, author and revision. I started in 1990 and gave up in 1992.

    This doesn't seem new or innovative. There have been library/document/management/database systems maintaining this sort of indexing for years.

    Existing applications are more a problem than existing storage. Both are solvable.

    1. DJO Silver badge

      Re: can’t retrofit … metadata generation, storage and access to an existing file system

      IBM had the idea, before OS/2, that really an SQL type Relational database should be part of a filesystem.

      Yes, I always thought MS missed the boat there. They should have baked SQL Server into Windows and used it to manage the file system; it couldn't have been worse than FAT.

      1. Lyndon Hills 1

        Re: can’t retrofit … metadata generation, storage and access to an existing file system

        Yes, I always thought MS missed the boat there

        They did have plans a while ago....

        WinFS at Wikipedia

      2. Ken Hagan Gold badge

        Re: can’t retrofit … metadata generation, storage and access to an existing file system

        Umm, whilst MS support FAT, it hasn't been their preferred choice for over 25 years.

        Since you've clearly been away, you might want to check out other "recent" developments such as the web.

    2. mistersaxon

      Re: can’t retrofit … metadata generation, storage and access to an existing file system

      IBM i (née AS/400) has had a fully object-oriented, strongly object-typed file system with a relational database built in since it launched in 1988. Whether it scales to billions of files is a different question, but the whole "database for the OS" idea is not new.

      It also has a single address space for memory, disk and other file systems but that's a different matter, as is the virtualisation from hardware that's been baked in for generations.

  6. 27escape

    Also in the "it's not new" camp

    I seem to recall that BeOS had this, and I remember being impressed at the time by what a useful idea it was.

  7. AbleBakerCharlie

    The article implies that QFS can instantly answer the original question about old MP3s, but this is not the case. While it can instantly tell you things like how many files are located beneath a directory, you still have to crawl the filesystem if you want specifics on individual files. It does this a good bit faster than Isilon, but it still takes a while at the scales mentioned.
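
    For what it's worth, the instant per-directory counts are achievable with rollup aggregates: each directory carries a subtree total that is bumped along the ancestor chain on every create or delete. A toy Python sketch of the idea (not Qumulo's actual mechanism, just an illustration):

      # Toy rollup aggregate: every directory node keeps a count of files in
      # its whole subtree, updated along the ancestor chain on each create.
      class DirNode:
          def __init__(self, parent=None):
              self.parent = parent
              self.children = {}
              self.files_below = 0  # aggregate over the entire subtree

          def add_file(self):
              node = self
              while node is not None:  # bump every ancestor's aggregate
                  node.files_below += 1
                  node = node.parent

      root = DirNode()
      music = root.children["music"] = DirNode(parent=root)
      music.add_file()
      print(root.files_below)  # 1 - answered without walking the tree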

    1. pgodman

      Hi! Thanks for the comment. I posted below a link to a blog about how Qumulo QF2 answers these types of questions statistically: https://qumulo.com/blog/getting-people-to-delete-data/. Appreciate the comments.

    2. This post has been deleted by its author

  8. I Am Spartacus

    Name confusion: Quantcast File System

    I saw QFS and thought of: Quantcast File System

    Different job, but still QFS.

  9. pgodman

    Some more details

    Hi everyone. Lots of great feedback in this thread. Yes, some file systems have used external (and internal) databases to manage metadata for a long time. The trick is how to do that in a way that delivers real file system performance with some level of database queryability. External indices are usually deal-killers because every file operation then has to update both the file system and the index. Some folks build bolt-on systems to manage file system metadata, but it ends up being yet another thing to manage, scale, and program. Managing a great deal of metadata is a big problem for folks with billions or tens of billions of files, and by and large modern NAS lets them down.

    I wrote this blog post a while back that explains how we do this at Qumulo: https://qumulo.com/blog/getting-people-to-delete-data/. I think it'll answer some of your questions.

    I appreciate all your comments.

    Peter Godman

    1. Doctor Syntax Silver badge

      Re: Some more details

      Peter,

      It seems you're dealing with a particular use case. But I'd have thought that in that particular use case, where you're dealing with digital assets of some value, the initial approach should be to start with a database which not only provides the asset management but also stores the files themselves as blobs. A general purpose file system is just that - general purpose: it stores executables, configuration data, whole databases as single files (unless you take the approach of having the database access disk partitions), text documents, spreadsheets, etc. It is never going to be optimised for specific use cases and is never going to deal with the situation of "we've not touched this file for 2 years but it's still part of the final build of $VERY_VALUABLE_PRODUCT, so even if we delete stuff unaccessed for over a year we still keep this".

    2. Michael Wojcik Silver badge

      Re: Some more details

      Thanks for the link. The article is definitely light on technical detail, and the blog post, though it's specifically about data accumulation[1], is more useful.

      Obviously B-trees have been used in filesystems for a long time. Anyone with even a passing understanding of major contemporary filesystems knows B-trees are used by NTFS (just as they were used by HPFS before it). Pointing that out in a comment is not an interesting critique; Qumulo wouldn't exist if all they did was add a B-tree to a filesystem. So there's something more to see here (or it's entirely snake oil, but even a quick skim of the blog post shows that's not the case).

      I'm always interested to see some innovation in filesystem design, so I'll give the blog post a closer read.

      [1] Which of course has been a concern in computer science at least since Brillouin.

  10. LewisCowles1986

    Sorry, but this seems to be a problem with logic. When you have a sufficiently large file-set you just chop it down by what makes sense and partition it. There is no need to access 1 billion files at once; it's nonsense from the mind of an imbecile. Look at how your DB handles itself. Very few files, sometimes memory mapped. Copy that pattern.

    1. Doctor Syntax Silver badge

      "Look at how you're DB handles itself. Very few files"

      And preferably, at least in the Unix world, those few files are in /dev. After all, everything's a file.

    2. pgodman

      Whether you wish them to or not, many people have users that create file systems with billions of files, and then struggle with the management of said data. This happens frequently when thousands of individuals are collaborating on creating something. Imagine a movie, or an apartment building, or an FPGA, or the sequencing of thousands of genomes, or a scale-out file system.

      People also aggregate many workloads onto storage platforms for the sake of resource sharing. It allows for organizational flexibility and agility. It makes sense to do it, and the end result is hard to manage.

      As a case in point, check out how many files you have on your laptop. For me the number is a couple of million.

      As you point out, sometimes file storage is used for block storage workloads, like databases. That's a totally valid use, but a majority of all storage capacity in the world is not used for workloads that look like this.

      1. Doctor Syntax Silver badge

        "Whether you wish them to or not, many people have users that create file systems with billions of files, and then struggle with the management of said data."

        Database Management System. The clue's in the name.

        If you're going to create data on that scale you should work out the management scheme from the start or, at worst, as soon as you realise that you're creating more than you intended. When you do that you'll find that there's a choice of engines available off the shelf, commercial and FOSS. The database can hold the metadata that your application needs - probably more than the file system provides - and the data itself. If your data is in file system files, anyone with the right (or wrong!) permissions can just delete some of them, irrespective of the metadata saying, via business rules, that it should be kept.

        Example: upstream provides files containing up to 1,000 documents in XML format* to be printed on an industrial scale. Any XML file can contain different sorts of documents that require different base stock or printing hardware, some running to tens of pages. S/W splits them up into the individual documents, transforms each into a form ready for the print formatter and stores it as a database blob. The batching engine gathers up documents with common properties (base stock and printer) ready for the print room operators. Only when a batch is selected for printing is a file system file generated, and it contains 1,000 documents or whatever the batch parameters call for. Spoiled documents have their database record sent back for re-batching. Once despatched, documents can be purged from the database. The system might have thousands of documents going through it at any one time, but there are few actual bulk files involved and only at the input and output. Multiply that a few times for different contracts being handled. (A sketch of the batching tables follows below.)

        * Yes, technically the file is a document in XML terminology. As each file contains multiple documents in the application domain terminology, it's easier to stick with the latter and call the file an XML file.
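
        A minimal SQLite sketch of that batching scheme (table, column and function names are invented for illustration, not taken from any real system):

          import sqlite3

          # Each split-out document is stored as a blob, keyed by the properties
          # the batching engine groups on (base stock and printer).
          con = sqlite3.connect("printshop.db")
          con.execute("""CREATE TABLE IF NOT EXISTS documents (
              id INTEGER PRIMARY KEY,
              base_stock TEXT NOT NULL,
              printer TEXT NOT NULL,
              state TEXT NOT NULL DEFAULT 'ready',  -- ready / batched / despatched
              body BLOB NOT NULL)""")

          def next_batch(base_stock, printer, size=1000):
              """Gather up to `size` ready documents with common properties and mark them batched."""
              ids = [r[0] for r in con.execute(
                  "SELECT id FROM documents WHERE state='ready' AND base_stock=? AND printer=? LIMIT ?",
                  (base_stock, printer, size))]
              con.executemany("UPDATE documents SET state='batched' WHERE id=?",
                              [(i,) for i in ids])
              con.commit()
              return ids

          def respool(doc_id):
              """Spoiled document: send its record back for re-batching."""
              con.execute("UPDATE documents SET state='ready' WHERE id=?", (doc_id,))
              con.commit()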

        1. pgodman

          Yep, you're 100% right that there are many workflows that involve billions of assets that are absolutely best handled by a DBMS. The rest seems like a hasty generalization. There exist blue cars. Therefore all cars are blue.

          In the world of industrial creativity, innovation, or research, hundreds or thousands of people collaborate on data with machines. In that world people use file systems, and those file systems grow very large. Finite element analysis. Animation and special effects. Genomic sequencing and collaboration. Research imaging. Architecture and engineering. Chip design. Large scale software development, etc. etc. I met up with someone from the chip design space recently who had roughly 70 billion files under management. They don't do it because they haven't considered your approach. They do it because they've tried many things and this is the thing that works for them.

    3. pgodman

      See above response. The job of a file system vendor isn't to tell people how to build an application. It's to create infrastructure that works for them. Lots of very smart people build enormous file systems with billions of files. They're not imbeciles. They just solve different problems than the ones that you work on.
