How do you copy 60m files?

Recently I copied 60 million files from one Windows file server to another. Tools to move files from one system to another are integrated into every operating system, and there are third-party options too. The tasks they perform are so common that we tend to ignore their limits. Many systems administrators are guilty of not …

COMMENTS

This topic is closed for new posts.


  1. Anonymous Coward
    Anonymous Coward

    cygwin + bash + rsync

    You could have set up cygwin with bash and rsync on the Windows machines as well. Rsync is better than cp when it comes to moving files between systems.

    1. Anonymous Coward
      Pint

      Right ...

      So you've tried this and it works with 60 million files have you?

      1. Anonymous Coward
        Troll

        Re: Right

        "So you've tried this and it works with 60 million files have you?"

        So you've tried this with 60 million files and it hasn't? Have you?

        1. Anonymous Coward
          Pint

          Nope

          Nope, not tried it, that's why I'm not suggesting it as a solution ... What's your reason?

      2. Joe Montana
        WTF?

        So you've tried this and it works with 60 million files have you?

        I have used rsync extensively; one of the things I commonly use it for is duplicating entire system installs from one system to another... I can do an initial copy while the source system is live to get 99% of the data, and then only need to copy the differences once I have shut down the source system. I have a busy mailserver which uses the maildir format (one file per message), and that had more than 60 million small files on it when I migrated it to new hardware.

        1. Anonymous Coward
          Pint

          You're missing the important point

          You've tried this in cygwin on a windows box have you? And it works?

          No one is disputing that rsync works on Linux. The article is about how to copy that many files from a Windows box. If someone suggests doing it in cygwin, it's a more useful suggestion if they actually know whether it works, rather than leaving someone else to do their work for them.

          In theory any of the other tools discussed in the article should also work. In practice they didn't.

  2. Trygve Henriksen

    Robocopy...

    It has options for 'synchronising' two folders, including file permissions.

    Dump the text output into a file then search for 'ERROR', or better, pass it through FINDSTR to filter out anything but lines beginning with 'ERROR'

    Sure, it bombs out on folder/filenames longer than 256 characters, but those are an abomination and the users who made them (usually by using filenames suggested by MS Office apps) really need to be punished anyway.

    Been pushing around a few million files this spring and summer...
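
    A sketch of that workflow in Windows CMD (untested here; the server names, shares and log paths are placeholders):

```bat
rem Mirror the source share to the destination, logging everything;
rem low /R and /W values keep one failed file from stalling the whole job
robocopy \\oldserver\share \\newserver\share /MIR /COPYALL /R:1 /W:1 /LOG:C:\copy.log

rem Pull just the error lines out of the log for review
findstr /C:"ERROR" C:\copy.log > C:\copy-errors.log
```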

    1. Steven Hunter
      FAIL

      Wrong...

      Robocopy DOES support paths longer than 256 characters... In fact there's a flag ( /256 ) that *disables* "very long path (> 256 characters) support".

    2. Velv

      robocopy /create

      robocopy also has the /create option, which copies the files with 0 data. OK, the total operation takes 2+ passes, but it has several advantages:

      1. 60 million files WILL cause the MFT to expand. If you are filling the disk with data as well as MFT then you can end up fragmenting the MFT, which can lead to performance problems. If the disk is only writing MFT (such as during a create), then the MFT can expand into adjacent space.

      2. Since there is no data, the operation completes in a fraction of the time, so if you log the operation, you can see where any failures will occur, and fix them for the final copy.

      For planned migrations, you can run the same job several times (only run the /create the first time), and therefore the final sync should take a lot less time :)
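
      Under those assumptions, the two passes might look like this (CMD sketch, untested here; drive letters and log paths are placeholders):

```bat
rem Pass 1: directory tree and zero-length files only, so the MFT
rem can grow into contiguous space before any data is written
robocopy D:\data E:\data /E /CREATE /LOG:C:\pass1.log

rem Pass 2 (re-runnable): fill in the data; /MIR also picks up
rem anything that changed since the last pass
robocopy D:\data E:\data /MIR /LOG:C:\pass2.log
```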

  3. The BigYin

    Or...

    ...use Cygwin? Gives a lot of the *nix style commands and abilities right within Windows.

    If one is going to go to the bother of having a Linux box sitting about just to show the Windows servers how it's done, one has to question why one even has the Windows servers in the first place. :-)

    1. Anonymous Coward
      Pint

      Hmmm ...

      Perhaps it's because the windows servers aren't used simply to copy files?

      And you've tried copying 60 million files in cygwin have you? How did that go?

      1. The BigYin
        Flame

        Get a grip

        I asked if "cygwin" had been considered; I did not say "Use cygwin! It's the wins! L0LZ!!11!" The author had looked at various tools and not mentioned "cygwin", so my question seems perfectly reasonable to me (others have asked the same question).

        You seem to have taken umbrage at a few "cygwin" related posts. Do you have an issue with this tool (I use it for some light-weight ssh and rsync work and really like it)? If you do, have you filed bugs, got involved?

        Or do you know something about "cygwin" and large jobs? "Oh, you can't use cygwin for that because its job index will overflow, see bug-1234".

        Or are you just some reactionary pillock who can't see through the Windows? "!Microsoft==Bad"

        I know which conclusion I am drawing at the moment...

        1. Anonymous Coward
          Pint

          Right

          You're jumping to the wrong conclusions. The point is simply that, before trying, you'd expect half the tools tried in the article to work on Windows. However, they didn't. So it's helpful to make clear whether you know the solution you're suggesting works or not. It sounds like you don't. Also, you didn't ask whether cygwin had been considered - re-read what you posted. It's nothing to do with Windows / Linux / OS of choice. I work mostly on BSD and Linux derivatives for a living. Oh, and I like cygwin.

    2. Daniel B.
      Boffin

      SFU

      Easier: Use Services For Unix, that one uses an Interix subsystem to run all your UNIXy stuff.

  4. Russell Howe

    Fastest way?

    Pull out hard drive.

    Move hard drive to other server.

    Put in hard drive.

    Or backup to tape and then restore.

    Using something like tar or rsync may well have been better than cp

    1. Anonymous Coward
      Anonymous Coward

      The title is required, and must contain letters and/or digits.

      All well and good until you have a RAID array

      1. Anonymous Coward
        Thumb Up

        Agreed.

        And you do know the price of a tape drive, right? Company I worked for bought a pair of Ultrium-4 drives from IBM. The bloody thing itself costs a little over US$3000 a pop, and you need two. If you work in a company where accounts is a /b/tard, you'll know how painful it is to get them to approve the upgrade for a drive, let alone two drives.

    2. Pete 2 Silver badge

      and some added bonuses ...

      +1 This has the most desirable side effect of stopping any bugger from trying to update the "wrong" file, or creating new ones while the copy is in progress.

      p.s. Don't forget the second part of any professional data copying activity is to VERIFY that what you copied actually did turn out to be the same as what you copied from. Many a backup has turned out to be just a blank tape and an error message without this stage.

    3. Trevor_Pott Gold badge

      @Russell Howe

      Did I mention that the systems were both live and in use during the copy?

      Also, RAID card on the destination server was full.

    4. Velv
      FAIL

      Err, how do you do that with a SAN or NAS?

      Clearly you've never worked in a decent sized enterprise.

      There are LOTS of scenarios where you can't just move the disks or use backup hardware. Maybe not all the data on one disk is being moved. Maybe the data is on different SANs or NASes and can't be swapped or zoned. Maybe you can't attach the same type of tape drive to each server.

  5. Anonymous Coward
    Happy

    Most obvious conclusion ever

    "So the best way to move 60 million files from one Windows server to another turns out to be: use Linux."

    D'oh? On a serious note, you should also give rsync a go. And try also _storing_ the data on an OS that is not windows, you might find that you are then able to do things you need to do, like say copy 60m files, without resorting to booting virtual machines of another OS just to use that OS's utilities to manage your main OS. Just saying. Please don't flame me.

  6. zaax
    Thumb Up

    xcopy

    How about a straight cmd-line tool like xcopy?

  7. Anomalous Cowturd
    Linux

    You're learning, Trevor!

    > So the best way to move 60 million files from one Windows server to another turns out to be: use Linux.

    It's only a matter of time before you see the light!

    Tux. Obviously.

  8. ZapB
    Pint

    Easy

    screen -R big-copy

    rsync -avzPe ssh user1@box1:/src-path user2@box2:/dst-path

    <Ctrl>-<a>-<d>

    Go for a beer or 500 ;-)

    1. Ammaross Danan
      FAIL

      Wrong

      Not easy once you try doing a file transfer via rsync through an ssh tunnel, like you're suggesting, but the destination server isn't running an ssh server... let alone using / as a path convention.

      1. Nigel 11
        Go

        Wrong ... you mean, it's running Windoze.

        "Not easy once you try doing a file transfer via rsync through an ssh tunnel, like your suggesting, but the destination server isn't running an ssh server....let alone use / as a path convention."

        Well, if the target system is running Linux, then turn on its sshd service!

        If it's running Windoze ... well, borrow another PC, boot an appropriate Linux live CD, mount the MS Shared folder as CIFS, start the ssh daemon, and rsync through the temporary Linux system to the MS system.

        Linux: the system that provides answers and encourages creativity.

        Windoze: the system that erects obstacles and encourages stupidity.

  9. Psymon

    large file transfers are always a challenge

    Personally, I swear by Directory Opus, by GPsoftware.

    I can attest to its incredibly reliable performance, error handling and insanely flexible advanced features.

    Aside from being able to copy vast quantities of data, handle errors, log all actions, migrate NTFS properties, automatically unprotect restricted files and re-copy files if the source is modified, it also has built-in FTP, an advanced synchronisation feature (useful for mopping up failed files after you've fixed the problem that stopped them being copied), and a truly unparalleled batch renaming system which, among other things, can use Regular Expressions.

    It also has tabbed browsing (you can save groups of tabs), duplicate file finding, built in Zip management, custom toolbar command creation, file/folder listing and printing....

    Strangely, not a lot of sysadmins know about DOpus. I learnt of it during my Amiga days, what seems like a lifetime ago. I always have a copy installed on my workstation, and on at least one of my servers.

    1. Alan Bourke
      Unhappy

      Directory Opus is great ...

      ... but I doubt it would copy 60 million files.

  10. blah 5

    I second (third? fourth?) the vote for cygwin.

    Rsync is your friend, especially when you've got the ssh tools loaded so you can use scp. :)

    Come to the light, Trevor! :)

  11. Gary F
    Thumb Up

    Good story - quite an eye opener

    I like it! I'm often frustrated dealing with 10,000 small files in Windows, never mind 60m! On a desktop Windows 7 is painfully slow displaying 2000 photos in a single folder. It shows them straight away (detailed view, not thumbs) but then takes 20 secs to sort them by date modified! Aah!

    But why do you have 60m files? Could you store that data in a better way? Could it be put into a database for example?

    1. Anonymous Coward
      Anonymous Coward

      why have 2000 in a single folder

      Surely you could come up with some kind of organisation - holidays, family, etc. The file system is hierarchical for a reason!

      are you one of those people who has hundreds of files dropped directly into the root of c: too?

      1. Anonymous Coward
        Anonymous Coward

        desktop

        Or cluttered the desktop with icons and icons and more icons?

      2. Anonymous Coward
        Anonymous Coward

        Why? Because.

        Are you one of those people who wastes your life endlessly taxonomising?

        And hierarchical taxonomies are largely at odds with reality. Here is a photo of an interesting building I saw on holiday. Does it go in the "building" folder, or the "holiday" folder?

        1. Ezekiel Hendrickson
          Boffin

          Reply to post: Why? Because.

          > Here is a photo of an interesting building I saw on holiday.

          > Does it go in the "building" folder, or the "holiday" folder?

          Nah. It goes in the 2010-08-24 folder, tagged with 'holiday' and 'architecture'

      3. Annihilator
        Alert

        2000 in one folder

        Because 2000 can quite easily fit on one memory card these days, and a long enough holiday can also generate that many.

    2. Ammaross Danan
      FAIL

      Database?

      Store 60m files in a database? SharePoint perhaps? Not quite as easy to access/control/backup as an NTFS storage tree. Sorry.

  12. Anonymous Coward
    Boffin

    Hmm

    From Windows to Windows I'd use robocopy.

    For Unix to Unix use rsync.

    Both of these are very fast and support all sorts of failures where a copy is restarted (the existing files are skipped).

    1. Trevor_Pott Gold badge

      @AC

      From the article:

      I wanted to give several command-line tools a go as well. XCopy and Robocopy most likely would have been able to handle the file volume but - like Windows Explorer - they are bound by the fact that NTFS can store files with longer names and deeper paths than CMD can handle. I tried ever more complicated batch files, with various loops in them, in an attempt to deal with the path depth issues. I failed.

      1. Anonymous Coward
        Anonymous Coward

        I've used Robocopy for 60m files

        Many times. It is fast, and the /MIR option is awesome since a failure for any bizarre under-the-hood reason can be left and corrected at the end. Also, if users are updating files whilst you copy, the final run picks up all those changes.

      2. frymaster

        robocopy can handle large filenames

        ....which makes me very interested in what he was doing, and why robocopy wasn't an option

        I've managed to use robocopy to create files I couldn't delete from windows before (because I'd gone from c:\ to d:\somefolder\someotherfolder and that pushed the bottom of the folders past the filepath limit)

  13. Colin Miller

    rsync

    I'd be tempted to use rsync - its failure behaviour is more reliable than cp's.

    It can copy over ssh (or rsh), its own network protocol, as well as between local directories (or NFS / Samba mounted shares).

  14. Tone
    FAIL

    Robocopy

    Ability to copy file and folder names exceeding 256 characters — up to a theoretical limit of 32,000 characters — without errors

  15. Anonymous Coward
    Anonymous Coward

    Its all about preferences?

    These things tend to end up as personal preferences, but whether on Linux or Windows my favourite tool is rsync, and if I'm copying more than a single file within a Linux box I choose it over cp any day.

  16. mego

    Richcopy

    I use a tool called Richcopy. It can do multi-threaded copying (copying a couple of files at a time), supports x number of retries, and gives you a handy readout at the end on what files failed/had an issue/etc. It also has a compare feature to compare the two locations once you're done.

    1. Daniel B.
      Happy

      Yes!

      You did read the article, didn't you? Richcopy is not only mentioned, it was the only tool that worked under Windows for Trevor.

  17. slack

    LOL

    "Richcopy can handle large quantities of files, but can multi-thread the copy, and so is several hours faster than using a Linux server as an intermediary"

    "Richcopy is not so fast that there is time to defragment an NTFS partition with 60 million files on it before CP would have finished."

    Wut?

    1. Trevor_Pott Gold badge

      @Slack

      Richcopy copies more than one file at a time.

      If your destination is a Windows server, then doing this causes massive fragmentation. So the total time to finish the copy is "time to copy" + "time to defrag". The goal is not just to get files from server A to server B, but to get them there in a fashion that ensures that server B is ready for prime time.

    2. Adam Williamson 1

      pretty simple

      Pretty simple: Richcopy's multi-threading approach is faster but results in a massively fragmented filesystem, which you then have to defrag. So in total the Richcopy route is slower, because Richcopy + defrag takes longer than cp.

  18. Paul 25

    One word...

    rsync

    I don't know what the state of rsync servers on windows is, but on unix systems it's the one true way for copying files over a network.

    It's fast, handles failure well, doesn't get its knickers in a twist when doing large recursive copies, and will run over ssl with a bit of work. It can also give you plenty of feedback about progress if you need it.

    The idea of copying that many files using something like a file manager or cp, no matter how good, just fills me with horror.

  19. kevin biswas
    Thumb Up

    Roadkil's Unstoppable Copier

    Is great too.

    http://www.roadkil.net/program.php?ProgramID=29
