We previously made our file copy sparse-aware, so it only copies the data rather than the holes between the data, which as well as being more correct is also a lot faster for files which happen to be sparse.

Sparse files also tend to be large, since they are usually some form of disk image. Even when we copy only the data there can still be a lot of it, so it would be convenient if we could do this more quickly.

rename(2) was fast. Can we make copying the data just as fast?

That depends.

rename(2) and link(2) get to be fast because they don't copy any data, they just change the directory metadata while keeping the data the same.

Some filesystems have an extra level of indirection which lets multiple files share the same data though.

Cloning files with btrfs

btrfs and ZFS are filesystems which support multiple files sharing data.

Because of ZFS' interesting legal position I am more familiar with how btrfs operates so I'm not going to talk about ZFS.

If you want to copy a file between two btrfs subvolumes, or two mounts of the same btrfs file system, then you can "clone" the file's contents into the new file, which is nearly as fast as a rename(2). The two files must live in the same btrfs file system for their data to be shared.

#include <errno.h>       /* errno, EINVAL */
#include <linux/btrfs.h> /* BTRFS_IOC_CLONE */
#include <sys/vfs.h>     /* fstatfs, struct statfs */
#include <sys/stat.h>    /* fstat, struct stat */
#include <sys/ioctl.h>   /* ioctl */
#include <linux/magic.h> /* BTRFS_SUPER_MAGIC */

int btrfs_clone_contents(int srcfd, int tgtfd) {
    struct statfs stfs;
    struct stat st;
    int ret;

    /* Behaviour is undefined unless called on a btrfs file,
      so ensure we're calling on the right file first. */
    ret = fstatfs(tgtfd, &stfs);
    if (ret < 0)
        return ret;
    if (stfs.f_type != BTRFS_SUPER_MAGIC) {
        errno = EINVAL;
        return -1;
    }

    ret = fstat(tgtfd, &st);
    if (ret < 0)
        return ret;
    if (!S_ISREG(st.st_mode)) {
        errno = EINVAL;
        return -1;
    }

    return ioctl(tgtfd, BTRFS_IOC_CLONE, srcfd);
}

int copy_file(const char *source, const char *target, bool no_clobber) {
    int srcfd = -1;
    int tgtfd = -1;
    int ret = -1;
    srcfd = open(source, O_RDONLY);
    if (srcfd == -1) {
        perror("Open source file");
        return srcfd;
    }
    tgtfd = open(target, O_WRONLY|O_CREAT|(no_clobber ? O_EXCL : 0), 0600);
    if (tgtfd == -1) {
        perror("Open target file");
        close(srcfd);
        return tgtfd;
    }

    ret = btrfs_clone_contents(srcfd, tgtfd);
    if (ret < 0 && errno == EINVAL) {
        /* EINVAL means cloning isn't applicable to these files,
           so fall back to the sparse-aware copy. */
        ret = sparse_copy_contents(srcfd, tgtfd);
    }
    if (ret < 0 && errno == EINVAL) {
        /* The sparse-aware copy isn't applicable either,
           so fall back to copying everything. */
        ret = naive_contents_copy(srcfd, tgtfd);
    }
    if (ret < 0) {
        /* Any other error is a real failure,
           so a different copy method would not help. */
        perror("Copy file");
    }

    /* Close both descriptors on every path so none are leaked. */
    close(srcfd);
    close(tgtfd);
    return ret;
}

What if I'm not using a filesystem that decouples files and their contents?

You probably don't get the benefit of being able to share contents so you are unlikely to be able to copy a file without duplicating its data.

It is however still possible to reduce the amount of effort involved.

You may have noticed that the general pattern is to read the contents into memory and then write the contents from memory into the new file.

This means that to copy a file you are copying the data twice: first out of the old file into your process's memory, then again into the new file. (The exact details are complicated. If you've not opened the file with O_DIRECT then the data may already be cached, in which case it's copied from kernel memory; if it isn't cached then it's read from disk into kernel memory first. When you write without O_DIRECT the data is queued up to be written to disk at some point in the near future.)

In some circumstances it may be possible to cut this down to one copy, straight from the source file into the target file.

There are a handful of system calls for copying data from one file into another.

Historically there have been sendfile(2) and splice(2), which copy data between two file descriptors without reading it into userspace first.

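#include <sys/sendfile.h> /* sendfile */
ssize_t sendfile_copy_range(int srcfd, int tgtfd, size_t range) {
    size_t to_copy = range;
    while (to_copy) {
        /* Ask only for what is left, not the whole range again */
        ssize_t ret = sendfile(tgtfd, srcfd, NULL, to_copy);
        if (ret < 0)
            return ret;
        if (ret == 0) /* End of file, so stop rather than loop forever */
            break;
        to_copy -= ret;
    }
    return range - to_copy;
}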

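/* splice needs _GNU_SOURCE and <fcntl.h>. It fails with EINVAL
   unless one of the two descriptors is a pipe. */
ssize_t splice_copy_range(int srcfd, int tgtfd, size_t range) {
    size_t to_copy = range;
    while (to_copy) {
        /* Ask only for what is left, not the whole range again */
        ssize_t ret = splice(srcfd, NULL, tgtfd, NULL, to_copy, 0);
        if (ret < 0)
            return ret;
        if (ret == 0) /* End of input, so stop rather than loop forever */
            break;
        to_copy -= ret;
    }
    return range - to_copy;
}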

These were originally designed for speeding up web servers serving static files by copying data between files and pipes or sockets.
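
Because splice(2) requires one of its two descriptors to be a pipe, calling it directly on two regular files fails with EINVAL. One way round this, sketched here on the assumption that both descriptors are regular files and the target was not opened with O_APPEND, is to route the data through a pipe created for the purpose. splice_via_pipe is a name made up for this example and is not part of my-mv.c.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* splice is a GNU/Linux extension */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy up to `range` bytes from srcfd to tgtfd by
   splicing file -> pipe -> file, so the data never enters userspace. */
ssize_t splice_via_pipe(int srcfd, int tgtfd, size_t range) {
    int pipefd[2];
    size_t copied = 0;
    bool failed = false;

    if (pipe(pipefd) < 0)
        return -1;

    while (!failed && copied < range) {
        /* File -> pipe: moves at most as much as the pipe can hold. */
        ssize_t in = splice(srcfd, NULL, pipefd[1], NULL, range - copied, 0);
        if (in < 0) {
            failed = true;
            break;
        }
        if (in == 0) /* end of the source file */
            break;

        /* Pipe -> file: drain everything that was just queued. */
        while (in > 0) {
            ssize_t out = splice(pipefd[0], NULL, tgtfd, NULL, (size_t)in, 0);
            if (out <= 0) {
                if (out == 0) /* should not happen while data is queued */
                    errno = EIO;
                failed = true;
                break;
            }
            in -= out;
            copied += (size_t)out;
        }
    }

    /* close() may overwrite errno, so preserve the interesting value. */
    int saved_errno = errno;
    close(pipefd[0]);
    close(pipefd[1]);
    errno = saved_errno;

    return failed ? -1 : (ssize_t)copied;
}
```

Like the other helpers it returns the number of bytes copied, which is less than range if the source file ends early, or -1 with errno set on failure.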

copy_file_range(2) was added later to copy the contents of one file to another. It followed a failed attempt to get the same functionality in under the name reflink, and the intention was that it would be a filesystem-independent way to share the contents of files, like btrfs's clone ioctl.

#if !HAVE_DECL_COPY_FILE_RANGE

#ifndef __NR_copy_file_range
#  if defined(__x86_64__)
#    define __NR_copy_file_range 326
#  elif defined(__i386__)
#    define __NR_copy_file_range 377
#  endif
#endif

/* Returns ssize_t rather than int so large byte counts are not truncated */
static inline ssize_t copy_file_range(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out, size_t len, unsigned int flags) {
    return syscall(__NR_copy_file_range, fd_in, off_in, fd_out, off_out, len, flags);
}
#endif

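ssize_t cfr_copy_range(int srcfd, int tgtfd, size_t range) {
    size_t to_copy = range;
    while (to_copy) {
        /* Ask only for what is left, not the whole range again */
        ssize_t ret = copy_file_range(srcfd, NULL, tgtfd, NULL, to_copy, 0);
        if (ret < 0)
            return ret;
        if (ret == 0) /* End of file, so stop rather than loop forever */
            break;
        to_copy -= ret;
    }
    return range - to_copy;
}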

To make use of these different ways to copy a file more quickly we need a way to dispatch between them.

An errno of ENOSYS means that the system call isn't supported, so there's no value in ever calling that again, but EINVAL just means it's the wrong kind of file so you just need to fall back to a different method of copying.
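
The ENOSYS case can also be checked up front instead of on first use. The following is a rough sketch of such a probe for copy_file_range, not something my-mv.c does: it makes the call with descriptors that cannot be valid, so a kernel which implements it should fail with EBADF while one which doesn't fails with ENOSYS. kernel_has_copy_file_range is a name invented for this example, and treating every error other than ENOSYS as "present" is an assumption.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* syscall */
#endif
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: returns false only when the kernel reports
   ENOSYS, or when the build headers don't know the call's number. */
bool kernel_has_copy_file_range(void) {
#ifdef SYS_copy_file_range
    /* Invalid descriptors and a zero length mean nothing can be copied;
       we only care which error comes back. */
    long ret = syscall(SYS_copy_file_range, -1, NULL, -1, NULL,
                       (size_t)0, 0u);
    if (ret < 0 && errno == ENOSYS)
        return false;
    return true;
#else
    /* Without the syscall number we cannot make the call at all. */
    return false;
#endif
}
```

This asks the running kernel directly, so it gives the right answer on kernels with back-ported features where a version comparison would not.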

ssize_t naive_copy_range(int srcfd, int tgtfd, size_t range) {
    char buf[4 * 1024 * 1024];
    size_t copied = 0;
    while (range > copied) {
        size_t to_copy = range - copied;
        ssize_t n_read;
        n_read = TEMP_FAILURE_RETRY(read(srcfd, buf,
                to_copy > sizeof(buf) ? sizeof(buf) : to_copy));
        if (n_read < 0) {
            perror("Read source file");
            return n_read;
        }
        if (n_read == 0)
            break;

        char *out = buf;
        while (n_read > 0) {
            ssize_t n_written = TEMP_FAILURE_RETRY(write(tgtfd, out, n_read));
            if (n_written < 0) {
                perror("Write to target file");
                return n_written;
            }

            /* A short write leaves data behind, so resume after it */
            out += n_written;
            n_read -= n_written;
            copied += n_written;
        }
    }
    return copied;
}

ssize_t copy_range(int srcfd, int tgtfd, size_t range) {
    ssize_t copied;
    static int have_cfr = true, have_sendfile = true, have_splice = true;

    if (have_cfr) {
        copied = cfr_copy_range(srcfd, tgtfd, range);
        if (copied >= 0) {
            return copied;
        } else if (errno == ENOSYS) {
            have_cfr = false;
        } else if (errno != EINVAL) {
            return copied;
        }
    }

    if (have_sendfile) {
        copied = sendfile_copy_range(srcfd, tgtfd, range);
        if (copied >= 0) {
            return copied;
        } else if (errno == ENOSYS) {
            have_sendfile = false;
        } else if (errno != EINVAL) {
            return copied;
        }
    }

    if (have_splice) {
        copied = splice_copy_range(srcfd, tgtfd, range);
        if (copied >= 0) {
            return copied;
        } else if (errno == ENOSYS) {
            have_splice = false;
        } else if (errno != EINVAL) {
            return copied;
        }
    }

    return naive_copy_range(srcfd, tgtfd, range);
}
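
The three blocks in copy_range differ only in which function they call and which flag they clear, so the same rules could be written once as a loop over a table. This is a minimal sketch of that idea rather than the code in my-mv.c: copy_fn, struct copy_strategy, dispatch_copy and the three stub functions are all names invented for the illustration, and the stubs exist only so the dispatcher can be exercised without any particular kernel support.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical type: any function shaped like cfr_copy_range,
   sendfile_copy_range or splice_copy_range. */
typedef ssize_t (*copy_fn)(int srcfd, int tgtfd, size_t range);

struct copy_strategy {
    copy_fn fn;
    bool unavailable; /* set once the call has failed with ENOSYS */
};

/* Try each strategy in order. ENOSYS disables a strategy for good,
   EINVAL skips it for this pair of files, any other error is final. */
ssize_t dispatch_copy(struct copy_strategy *strategies, size_t n,
                      int srcfd, int tgtfd, size_t range) {
    for (size_t i = 0; i < n; i++) {
        if (strategies[i].unavailable)
            continue;
        ssize_t copied = strategies[i].fn(srcfd, tgtfd, range);
        if (copied >= 0)
            return copied;
        if (errno == ENOSYS)
            strategies[i].unavailable = true;
        else if (errno != EINVAL)
            return copied;
    }
    /* Nothing was applicable: tell the caller to use its fall-back. */
    errno = EINVAL;
    return -1;
}

/* Stand-in strategies, used only to exercise the dispatcher. */
static int enosys_calls, einval_calls;

static ssize_t stub_enosys(int srcfd, int tgtfd, size_t range) {
    (void)srcfd; (void)tgtfd; (void)range;
    enosys_calls++;
    errno = ENOSYS; /* pretend the system call does not exist */
    return -1;
}

static ssize_t stub_einval(int srcfd, int tgtfd, size_t range) {
    (void)srcfd; (void)tgtfd; (void)range;
    einval_calls++;
    errno = EINVAL; /* pretend these are the wrong kind of file */
    return -1;
}

static ssize_t stub_ok(int srcfd, int tgtfd, size_t range) {
    (void)srcfd; (void)tgtfd;
    return (ssize_t)range; /* pretend everything was copied */
}
```

With the real functions the table would hold cfr_copy_range, sendfile_copy_range and splice_copy_range, and a return of -1 with errno set to EINVAL would be the cue to call naive_copy_range.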

For convenience of testing, the full my-mv.c source file and Makefile, including the new copy functions, can be downloaded.

Conclusion

So now we've got a slightly slower fall-back for when rename(2) fails, right?

Well, not quite. We're copying the data as quickly as we can, but files also have associated metadata that we're not even thinking about yet.

Pretty cool stuff!

Two things occur to me though, the first is that according to my man page, for splice(2) to be applicable one of the fds has to be a pipe, which probably rules it out for file copying. The second is that none of these calls seem to be standard, they're linux specific and my man pages here suggest that prototypes and semantics differ across platforms such that these calls can't be used portably. This makes me wonder whether ENOSYS is really useful here, since if we write code using these calls then we're already presumably tied to a particular platform (linux), so we can safely assume these calls are implemented?

Comment by Gravious Sun Feb 26 18:55:58 2017

Yeah, splice is included because at the time I had the notion of writing a generic copy routine between two file descriptors.

We're assuming Linux, but with some flexibility of which version.

  • sendfile was introduced in Linux 2.2, but the output file descriptor must be a socket. Since 2.6.33 it may be any file.
  • splice was introduced in Linux 2.6.17 and might some day support any file descriptor.
  • copy_file_range was introduced in Linux 4.5

Checking versions would be inappropriate since you see frankenkernels where newer features have been back-ported, and sometimes system calls are optional, depending on kernel configuration, such as name_to_handle_at.

Comment by Richard Maw Mon Aug 21 10:13:11 2017