In our previous article, we established that if rename(2) fails then we have to fall back to copying the file then removing the old one.

We copied the data by just reading blocks from one file and writing them to the new one.

For most files this is sufficient, but there are files for which it is a bad idea.

By reading and writing blocks, the new file has exactly the same contents as the old one, right?

Not quite. Files can have "holes", where there is no data, but reading that range returns blocks of zeroes, so a naïve copy will produce a file that has no holes.

This makes it take up more disk space, and other hole-aware software will treat it differently.

An example would be a tool for writing disk images to disks. It could make things quicker by only writing the data in the disk image, but if you gave it a copy without holes it would write a whole bunch of zeroes that it didn't need to, reducing the life of the drive.
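The extra disk space can be observed through stat(2). As a rough sketch, assuming st_blocks is counted in 512-byte units (as it is on Linux), a file whose allocated size is smaller than its length probably has holes. The helper names here are made up for illustration, and a compressing filesystem could give the same result without any holes being involved.

```c
#include <stdbool.h>
#include <sys/stat.h>

/* Bytes actually allocated on disk; st_blocks is in 512-byte units. */
static long long allocated_bytes(const struct stat *st) {
    return (long long)st->st_blocks * 512;
}

/* Heuristic only: fewer allocated bytes than the file length suggests
   holes (or compression, which this sketch cannot tell apart). */
static bool looks_sparse(const struct stat *st) {
    return allocated_bytes(st) < (long long)st->st_size;
}
```

Comparing the answer for a file before and after a naïve copy is a quick way to see whether the holes survived.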

To do this properly you need to either use FIEMAP, or SEEK_{DATA,HOLE}.

FIEMAP is available in earlier kernels, but since it exposes more information than is needed for copying a file and has historically been a source of disk corruption, we're going to proceed with SEEK_{DATA,HOLE}.

Copying sparsely with lseek(2)

The basis of the algorithm is to use lseek(2) with SEEK_HOLE to find the end of a block of data and copy it, then use lseek(2) with SEEK_DATA to find the end of the following hole.
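Stripped of the copying and most of the error handling, the alternating seeks look roughly like this. It is only a sketch: count_data_extents is a hypothetical helper, not part of my-mv.c, and it always starts from the beginning of the file rather than the current offset.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* SEEK_DATA and SEEK_HOLE on glibc */
#endif
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: count the runs of data in fd, or return -1 on
   error. Leaves the file offset wherever the last seek put it. */
static int count_data_extents(int fd) {
    int extents = 0;
    off_t pos = 0;
    for (;;) {
        off_t data = lseek(fd, pos, SEEK_DATA);
        if (data == (off_t)-1)
            return errno == ENXIO ? extents : -1; /* ENXIO: no more data */
        off_t hole = lseek(fd, data, SEEK_HOLE);
        if (hole == (off_t)-1)
            return -1;
        /* [data, hole) is one run of data; a copy would happen here. */
        extents++;
        pos = hole;
    }
}
```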

The difficulty, as always, is in the details.

ssize_t copy_range(int srcfd, int tgtfd, size_t range);
ssize_t naive_contents_copy(int srcfd, int tgtfd);

ssize_t sparse_copy_contents(int srcfd, int tgtfd) {
    size_t copied = 0;
    off_t srcoffs = (off_t)-1;
    off_t nextoffs = (off_t)-1;

Finding the start

The first thing we need to do is to find out whether we started in a data block or a hole block.

To do that we need to know where "here" is though, the way to do this is to call lseek(fd, 0, SEEK_CUR), which logically means move the position forward 0 bytes, but has the side-effect of returning the current position.

    srcoffs = TEMP_FAILURE_RETRY(lseek(srcfd, 0, SEEK_CUR));
    if (srcoffs == (off_t)-1) {
        perror("Find current position of file");
        /* Can't seek file, could be file isn't seekable,
           or that the current offset would overflow. */
        return -1;
    }

Starting with data or a hole?

Now that we've got the current offset, we lseek(fd, offset, SEEK_DATA). If the returned value is the same as the provided offset, then it was a data block, but if it moved then we were in a hole and it returned the start of the data.

There's also the ever-present possibility that there is no more data, which sets errno(3) to ENXIO.

    nextoffs = TEMP_FAILURE_RETRY(lseek(srcfd, srcoffs, SEEK_DATA));
    if (nextoffs == (off_t)-1) {
        if (errno == ENXIO) {
            /* ENXIO means EOF, there is no data to copy,
               but we may need to make a hole to the end of the file */
            goto end_hole;
        }
        perror("Find data or hole at beginning of file");
        /* Error seeking, must not support sparse seek */
        return -1;
    }

    if (srcoffs != nextoffs)
        /* Seeked to the end of a hole, can skip a data copy. */
        goto hole;

Copying data and holes

Depending on whether we started in data or in a hole, we either copy the contents of the data, or use ftruncate(2) to extend the file without providing data.

Because ftruncate(2) does not advance the file offset, we have to use lseek(2) to do it manually.

As before, we can reach the end of the file when seeking, which breaks us out of the copy data then copy hole loop.

    for (;;) {
        ssize_t ret;
        /* In data, so we must find the end of the data then copy it,
           could pread/write. */
        nextoffs = TEMP_FAILURE_RETRY(lseek(srcfd, srcoffs, SEEK_HOLE));
        if (nextoffs == (off_t)-1) {
            if (errno != ENXIO) {
                perror("Find end of data");
                return -1;
            }

            /* EOF after data, but we still need to copy */
            goto end_data;
        }

        srcoffs = TEMP_FAILURE_RETRY(lseek(srcfd, srcoffs, SEEK_SET));
        if (srcoffs == (off_t)-1) {
            /* Rewinding failed, something is *very* strange. */
            perror("Rewind back to data");
            return -1;
        }

        ret = copy_range(srcfd, tgtfd, nextoffs - srcoffs);
        if (ret < 0) {
            return -1;
        }
        copied += ret;
        srcoffs = nextoffs;

        nextoffs = TEMP_FAILURE_RETRY(lseek(srcfd, srcoffs, SEEK_DATA));
        if (nextoffs == (off_t)-1) {
            if (errno == ENXIO) {
                /* ENXIO means EOF, there is no data to copy,
                   but we may need to make a hole to the end of the file */
                goto end_hole;
            }
            perror("Find end of hole");
            /* Error seeking, must not support sparse seek */
            return -1;
        }
hole:
        /* Is a hole, extend the file to the offset */
        ret = TEMP_FAILURE_RETRY(ftruncate(tgtfd, nextoffs));
        if (ret < 0) {
            perror("Truncate file to add hole");
            return -1;
        }

        /* Move file offset for target to after the newly added hole */
        nextoffs = TEMP_FAILURE_RETRY(lseek(tgtfd, nextoffs, SEEK_SET));
        if (nextoffs == (off_t)-1) {
            /* Something very strange happened,
               either some race condition changed the file,
               or the file is truncatable but not seekable
               or some external memory corruption,
               since EOVERFLOW can't happen with SEEK_SET */
            perror("Move to after newly added hole");
            return -1;
        }

        srcoffs = nextoffs;
    }

Filling it to the end

When finished with the copy-hole copy-data loop, we still have to fill the rest of the file, either by truncating it to fill in the final hole, or copying the rest of the data.

end_hole:
    nextoffs = TEMP_FAILURE_RETRY(lseek(srcfd, 0, SEEK_END));
    if (nextoffs == (off_t)-1) {
        perror("Seek to end of file");
        return -1;
    }
    if (srcoffs != nextoffs) {
        /* Not already at EOF, need to extend */
        int ret = TEMP_FAILURE_RETRY(ftruncate(tgtfd, nextoffs));
        if (ret < 0) {
            perror("Truncate to add hole at end of file");
            return -1;
        }
    }
    return copied;

end_data:
    {
        ssize_t ret = naive_contents_copy(srcfd, tgtfd);
        if (ret < 0)
            return ret;
        copied += ret;
    }
    return copied;
}

Integrating sparse copying into the program

Now that we've got our sparse_copy_contents, we need to amend our copy_file function to call it.

We can detect whether sparse copying is not possible by whether errno(3) gets set to EINVAL, so there's no harm in trying to sparsely copy a file first.

int copy_file(const char *source, const char *target, bool no_clobber) {
    int srcfd = -1;
    int tgtfd = -1;
    int ret = -1;
    srcfd = open(source, O_RDONLY);
    if (srcfd == -1) {
        perror("Open source file");
        return srcfd;
    }
    tgtfd = open(target, O_WRONLY|O_CREAT|(no_clobber ? O_EXCL : 0), 0600);
    if (tgtfd == -1) {
        perror("Open target file");
        return tgtfd;
    }

    ret = sparse_copy_contents(srcfd, tgtfd);
    if (ret >= 0)
        return ret;

    if (ret < 0 && errno != EINVAL) {
        /* Some error that wasn't from a sparse copy,
      so we can't fall back to something that would work */
        perror("Copy file");
        return -1;
    }

    return naive_contents_copy(srcfd, tgtfd);
}

I have omitted the definitions of copy_range and naive_contents_copy since they are not relevant to handling holes, but the full program listing may be downloaded from my-mv.c.

Conclusion

Now when we move a file we don't end up taking more space, we don't copy data that we don't need so it goes faster, and files that treat holes specially, like disk images, will behave properly.

But files that may have holes are also likely to contain a lot of data, and even only copying the data we need will take a while.

Can we make this as fast as rename(2)?

Posted Wed Aug 3 11:00:07 2016 Tags:
Daniel Silverstone Reduce, Reuse, Recycle

There is a "popular" theme among environmentalists to encourage everyone to reduce waste, reuse items rather than throwing them away, and recycle items to new purpose wherever possible. These three verbs can rather easily be applied to the world of software engineering, and the concepts have been the topic of a number of blog posts over time.


It's often a very hard thing for professional software engineers to say, but every line of code you produce is a cost, not an asset. The easiest code to maintain is code which was never written, since it can never go wrong. As such, simply reducing the quantity of code being written can be an effective way to improve the quality of software overall.

As programmers we have the concept of code-reuse drummed into us almost from the get-go. We're encouraged to use libraries which already exist, rather than reinventing the same thing again, and we're encouraged to abstract our code as much as is sensible to encourage its reuse within a project or even across projects. By reusing code already written, we don't end up duplicating code which might have unseen subtleties, corner cases, or bugs.

When a programmer writes a nice piece of code for solving a particular problem, it can sometimes be repurposed to solve another similar problem with very little effort. This doesn't necessarily mean that the code in question would react well to being abstracted out into a library, but by recycling known-good code we can get a little closer to the ideal of having the least possible number of lines of code without losing the advantage of code already having been written.


The programming is terrible blog has a wonderful article on this, which I recommend that you all read and internalise. It's called Write code that is easy to delete, not extend and while it isn't short, it's well worth it.

Posted Wed Aug 10 11:00:10 2016

We previously made our file copy sparse-aware, so it only copies the data rather than the holes between the data, which as well as being more correct is also a lot faster for files which happen to be sparse.

Files which are sparse also tend to be large, since they are usually some form of disk image, so while we are only copying the data there can still be a lot to copy. It would be convenient if we could do this more quickly.

rename(2) was fast, can we make copying the data faster?

That depends.

rename(2) and link(2) get to be fast because they don't copy any data, they just change the directory metadata while keeping the data the same.

Some filesystems have an extra level of indirection which lets multiple files share the same data though.

Cloning files with btrfs

btrfs and ZFS are filesystems which support multiple files sharing data.

Because of ZFS' interesting legal position I am more familiar with how btrfs operates so I'm not going to talk about ZFS.

If you want to copy a file between two btrfs file systems that happen to be stored on the same physical hard disks then you can "clone" the file's contents into the new file which is nearly as fast as a rename(2).

#include <linux/btrfs.h> /* BTRFS_IOC_CLONE */
#include <sys/vfs.h>     /* fstatfs, struct statfs */
#include <sys/stat.h>    /* struct stat */
#include <sys/ioctl.h>   /* ioctl */
#include <linux/magic.h> /* BTRFS_SUPER_MAGIC */

int btrfs_clone_contents(int srcfd, int tgtfd) {
    struct statfs stfs;
    struct stat st;
    int ret;

    /* Behaviour is undefined unless called on a btrfs file,
      so ensure we're calling on the right file first. */
    ret = fstatfs(tgtfd, &stfs);
    if (ret < 0)
        return ret;
    if (stfs.f_type != BTRFS_SUPER_MAGIC) {
        errno = EINVAL;
        return -1;
    }

    ret = fstat(tgtfd, &st);
    if (ret < 0)
        return ret;
    if (!S_ISREG(st.st_mode)) {
        errno = EINVAL;
        return -1;
    }

    return ioctl(tgtfd, BTRFS_IOC_CLONE, srcfd);
}

int copy_file(const char *source, const char *target, bool no_clobber) {
    int srcfd = -1;
    int tgtfd = -1;
    int ret = -1;
    srcfd = open(source, O_RDONLY);
    if (srcfd == -1) {
        perror("Open source file");
        return srcfd;
    }
    tgtfd = open(target, O_WRONLY|O_CREAT|(no_clobber ? O_EXCL : 0), 0600);
    if (tgtfd == -1) {
        perror("Open target file");
        return tgtfd;
    }

    ret = btrfs_clone_contents(srcfd, tgtfd);
    if (ret >= 0)
        return ret;

    if (ret < 0 && errno != EINVAL) {
        /* Some error that wasn't from a btrfs clone,
      so we can't fall back to something that would work */
        perror("Copy file");
        return -1;
    }

    ret = sparse_copy_contents(srcfd, tgtfd);
    if (ret >= 0)
        return ret;

    if (ret < 0 && errno != EINVAL) {
        /* Some error that wasn't from a sparse copy,
      so we can't fall back to something that would work */
        perror("Copy file");
        return -1;
    }

    return naive_contents_copy(srcfd, tgtfd);
}

What if I'm not using a filesystem that decouples files and their contents?

You probably don't get the benefit of being able to share contents so you are unlikely to be able to copy a file without duplicating its data.

It is however still possible to reduce the amount of effort involved.

You may have noticed that the general pattern is to read the contents into memory, then write the contents from memory into the new file.

This means that to copy a file you are copying the data twice, first out of the old file into your process' memory, then again into the new file. (The exact details are complicated. If you've not opened the file with O_DIRECT then the data may be cached so it's copied from kernel memory, but if it's not then it's read from disk into kernel memory first. When you write without O_DIRECT it queues up the data to be written to disk at some point in the near future.)

In some circumstances it may be possible to cut this down to one copy, straight from the source file into the target file.

There are a handful of system calls for copying data from one file into another.

Historically there have been sendfile(2) and splice(2), which copy data between two files without reading them into userspace first.

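#include <sys/sendfile.h> /* sendfile */
ssize_t sendfile_copy_range(int srcfd, int tgtfd, size_t range) {
    size_t to_copy = range;
    while (to_copy) {
        /* Ask only for what is left, not the whole range again */
        ssize_t ret = sendfile(tgtfd, srcfd, NULL, to_copy);
        if (ret < 0)
            return ret;
        if (ret == 0) /* Source ended early, don't loop forever */
            break;
        to_copy -= ret;
    }
    return range - to_copy;
}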

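/* splice is declared in <fcntl.h> when _GNU_SOURCE is defined */
ssize_t splice_copy_range(int srcfd, int tgtfd, size_t range) {
    size_t to_copy = range;
    while (to_copy) {
        /* Ask only for what is left, not the whole range again */
        ssize_t ret = splice(srcfd, NULL, tgtfd, NULL, to_copy, 0);
        if (ret < 0)
            return ret;
        if (ret == 0) /* Source ended early, don't loop forever */
            break;
        to_copy -= ret;
    }
    return range - to_copy;
}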

These were originally designed for speeding up web servers serving static files by copying data between files and pipes or sockets.
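One consequence of that heritage is that splice(2) requires one of the two file descriptors to be a pipe, so when both ends are regular files it fails with EINVAL. A file-to-file copy with splice(2) therefore has to go through an intermediate pipe. The following is a sketch of that idea rather than the code in my-mv.c; splice_via_pipe is a hypothetical name.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* splice(2) is Linux-specific */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy up to range bytes from srcfd to tgtfd by
   splicing through a pipe. Returns bytes copied, or -1 on error. */
static ssize_t splice_via_pipe(int srcfd, int tgtfd, size_t range) {
    int pipefd[2];
    size_t copied = 0;
    int saved_errno;

    if (pipe(pipefd) < 0)
        return -1;

    while (copied < range) {
        /* Fill the pipe from the source file... */
        ssize_t in = splice(srcfd, NULL, pipefd[1], NULL, range - copied, 0);
        if (in < 0)
            goto fail;
        if (in == 0) /* source ended before range was reached */
            break;
        /* ...then drain everything that went in out to the target. */
        while (in > 0) {
            ssize_t out = splice(pipefd[0], NULL, tgtfd, NULL, (size_t)in, 0);
            if (out == 0)
                errno = EIO; /* pipe unexpectedly empty */
            if (out <= 0)
                goto fail;
            in -= out;
            copied += (size_t)out;
        }
    }
    close(pipefd[0]);
    close(pipefd[1]);
    return (ssize_t)copied;

fail:
    saved_errno = errno;
    close(pipefd[0]);
    close(pipefd[1]);
    errno = saved_errno;
    return -1;
}
```

The data still never enters userspace, but it now moves twice inside the kernel, so this is not automatically faster than sendfile(2).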

copy_file_range(2) was added to copy the contents of one file to another, after a failed attempt to get it in under the name reflink, with the intention that it would be a filesystem independent way to share the contents of files like btrfs's clone ioctl.

#if !HAVE_DECL_COPY_FILE_RANGE

#ifndef __NR_copy_file_range
#  if defined(__x86_64__)
#    define __NR_copy_file_range 326
#  elif defined(__i386__)
#    define __NR_copy_file_range 377
#  endif
#endif

static inline ssize_t copy_file_range(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out, size_t len, unsigned int flags) {
    return syscall(__NR_copy_file_range, fd_in, off_in, fd_out, off_out, len, flags);
}
#endif

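ssize_t cfr_copy_range(int srcfd, int tgtfd, size_t range) {
    size_t to_copy = range;
    while (to_copy) {
        /* Ask only for what is left, not the whole range again */
        ssize_t ret = copy_file_range(srcfd, NULL, tgtfd, NULL, to_copy, 0);
        if (ret < 0)
            return ret;
        if (ret == 0) /* Source ended early, don't loop forever */
            break;
        to_copy -= ret;
    }
    return range - to_copy;
}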

To make use of these different ways to copy a file more quickly we need a way to dispatch between them.

An errno of ENOSYS means that the system call isn't supported, so there's no value in ever calling that again, but EINVAL just means it's the wrong kind of file so you just need to fall back to a different method of copying.
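That rule can be written down as a small classification function. This is just one way of expressing it; the enum and function names are invented for this sketch and do not appear in my-mv.c.

```c
#include <errno.h>

/* Hypothetical names: what to do after a fast copy method failed. */
enum copy_verdict {
    COPY_FAILED,      /* a real error: report it and give up */
    COPY_TRY_NEXT,    /* wrong kind of file: fall back this time */
    COPY_NEVER_AGAIN, /* syscall missing: fall back and stop trying it */
};

static enum copy_verdict classify_copy_errno(int err) {
    switch (err) {
    case ENOSYS:
        return COPY_NEVER_AGAIN;
    case EINVAL:
        return COPY_TRY_NEXT;
    default:
        return COPY_FAILED;
    }
}
```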

ssize_t naive_copy_range(int srcfd, int tgtfd, size_t range) {
    char buf[4 * 1024 * 1024];
    size_t copied = 0;
    while (range > copied) {
        size_t to_copy = range - copied;
        ssize_t n_read;
        n_read = TEMP_FAILURE_RETRY(read(srcfd, buf,
                to_copy > sizeof(buf) ? sizeof(buf) : to_copy));
        if (n_read < 0) {
            perror("Read source file");
            return n_read;
        }
        if (n_read == 0)
            break;

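        /* Track how far into buf we have written, so that a short
           write resumes where it stopped instead of at the start */
        char *out = buf;
        while (n_read > 0) {
            ssize_t n_written = TEMP_FAILURE_RETRY(write(tgtfd, out, n_read));
            if (n_written < 0) {
                perror("Write to target file");
                return n_written;
            }

            out += n_written;
            n_read -= n_written;
            copied += n_written;
        }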
    }
    return copied;
}

ssize_t copy_range(int srcfd, int tgtfd, size_t range) {
    ssize_t copied;
    static int have_cfr = true, have_sendfile = true, have_splice = true;

    if (have_cfr) {
        copied = cfr_copy_range(srcfd, tgtfd, range);
        if (copied >= 0) {
            return copied;
        } else if (errno == ENOSYS) {
            have_cfr = false;
        } else if (errno != EINVAL) {
            return copied;
        }
    }

    if (have_sendfile) {
        copied = sendfile_copy_range(srcfd, tgtfd, range);
        if (copied >= 0) {
            return copied;
        } else if (errno == ENOSYS) {
            have_sendfile = false;
        } else if (errno != EINVAL) {
            return copied;
        }
    }

    if (have_splice) {
        copied = splice_copy_range(srcfd, tgtfd, range);
        if (copied >= 0) {
            return copied;
        } else if (errno == ENOSYS) {
            have_splice = false;
        } else if (errno != EINVAL) {
            return copied;
        }
    }

    return naive_copy_range(srcfd, tgtfd, range);
}

For convenience of testing, the full my-mv.c source file and Makefile, including the new copy functions, can be downloaded.

Conclusion

So now we've got a slightly slower fall-back for when rename(2) fails, right?

Well, not quite. We're copying the data as quickly as we can, but files also have associated metadata that we're not even thinking about yet.

Posted Wed Aug 17 11:00:06 2016 Tags:
Daniel Silverstone Workflow and tools

This week will be a little bit of a reprise of some things I've spoken about before on the topic of IDEs and some tangentially associated things.

When we write software, we are using what some of the more flowery among us would like to call a workflow but that most of us simply refer to as our tools. We like to think that we use best practice in our work too, and that the best practice actually is worth using.

As I described before, there are some integrated development environments which aim to fulfil all the needs we have from our tools and in doing so, they often fix a particular workflow into place. In addition, some tools are more powerful (or, rather, more flexible) than others and that can create friction when we try to combine techniques and tools to produce a workflow which suits us.

As you may have guessed by now, I am writing about this topic because I was recently made exceedingly angry by incomplete integration between some very popular tools which I tried to make work together. I was attempting to write some Java code and these days (apparently) the build system to use for Java apps is Gradle and since I'm not a "native" Java programmer, I went with the recommendation. I prepared my codebase, wrote my gradle file, built up a basic set of classes, got everything settled the way I wanted and then decided it was time to properly write some code.

Now, in Java world, it's exceedingly rare to write code without the assistance of an IDE because, quite simply, Java is incredibly verbose in some places and IDEs make life a lot easier for dealing with the moment-to-moment boilerplate and refactoring pain. One of the two major IDEs for Java is JetBrains' IntelliJ IDEA and IDEA has integration for Gradle projects. I fired up IDEA, pointed it at my build.gradle and it chugged away for a bit and then opened up a project workspace which looked perfect. Sadly it turned out to be entirely worthless because while IDEA could read the Gradle file, examine my filesystem and decide what sources comprised my program, it was actually doing this in a way which didn't take into account that Gradle is in fact able to embed arbitrary Java and I happened to be using a feature of that which IDEA simply had no way of implementing.

This resulted in a project which I could edit reasonably easily in IDEA, but which had to be built at the command line with gradle itself. If I attempted to build and run the project inside IDEA, the code simply wouldn't work. Since I had picked up this project on a whim, I'm sorry to say that I simply put it back down again, stepped away from the directory, and I doubt I'll be going back to it any time soon. I can work around IDEA's inability to handle the feature I was using; but it'll be awkward and I am too annoyed at a pair of supposedly integrated tools entirely failing to do-the-right-thing. Sadly turing-complete build systems are intrinsically hard to reproduce without simply using the build system itself.

The message I'm hoping you will take away from this little cautionary tale is that just because you're following what you think might be best practice and using tools which purport to function well together and support one another, you may find yourself hitting a brick wall within moments of wanting to actually produce some software. I expect I'll get back to this project of mine at some point, and so perhaps also take away that it's okay to walk away from something which annoys you, and you can always come back to it later, especially if you remembered to commit it and push it to your Git server.

Posted Wed Aug 24 11:00:07 2016

So we've copied everything from the file now, right?

Well, we've copied all the data, and depending on your application that might be enough, but files also have metadata to worry about.

We can see this by example, by checking the output of ls -l, before and after moving the file to a different filesystem.

$ touch /run/user/$(id -u)/testfile
$ ls -l /run/user/$(id -u)/testfile
-rw-rw-r-- 1 richardmaw richardmaw 0 Aug  8 19:40 /run/user/1000/testfile
$ ./my-mv /run/user/$(id -u)/testfile testfile
$ ls -l testfile
-rw------- 1 richardmaw richardmaw 0 Aug  8 19:41 testfile

You should be able to see that the -rw-rw-r-- mode string, which represents readable for everyone and writable for the user and group, has become -rw-------, which represents read and write for the user only.

This is because ls(1) uses stat(2), which is returning different data for the file.

Setting mode

stat(2) provided the mode of the file, in the st_mode field.

chmod(2) can be used to set the mode of the new file.

The st_mode field from stat(2) isn't in exactly the format that chmod(2) takes, since it also includes bits saying what type of file it is. chmod(2) can't change what type a file is, so it is only interested in the portion of the mode that is the permission bits.

int copy_contents(int srcfd, int tgtfd) {
    int ret = -1;
    ret = btrfs_clone_contents(srcfd, tgtfd);
    if (ret >= 0)
        return ret;

    if (ret < 0 && errno != EINVAL) {
        /* Some error that wasn't from a btrfs clone,
           so we can't fall back to something that would work */
        perror("Copy file");
        return -1;
    }

    ret = sparse_copy_contents(srcfd, tgtfd);
    if (ret >= 0)
        return ret;

    if (ret < 0 && errno != EINVAL) {
        /* Some error that wasn't from a sparse copy,
           so we can't fall back to something that would work */
        perror("Copy file");
        return -1;
    }

    return naive_contents_copy(srcfd, tgtfd);
}


int copy_file(char *source, char *target, bool no_clobber) {
    int srcfd = -1;
    int tgtfd = -1;
    int ret = -1;
    struct stat source_stat;

    ret = open(source, O_RDONLY);
    if (ret == -1) {
        perror("Open source file");
        goto cleanup;
    }
    srcfd = ret;

    ret = open(target, O_WRONLY|O_CREAT|(no_clobber ? O_EXCL : 0), 0600);
    if (ret == -1) {
        perror("Open target file");
        goto cleanup;
    }
    tgtfd = ret;

    ret = copy_contents(srcfd, tgtfd);
    if (ret < 0)
        goto cleanup;

    ret = fstat(srcfd, &source_stat);
    if (ret < 0)
        goto cleanup;

    ret = fchmod(tgtfd, source_stat.st_mode);
    if (ret < 0)
        goto cleanup;
cleanup:
    close(srcfd);
    close(tgtfd);
    return ret;
}

User and Group

User and Group are numeric IDs that ls(1) looks up in /etc/passwd and /etc/group to turn into a human readable name.

The chown(1) and chgrp(1) commands take a name, but the chown(2) system call does both using the numeric ID.

The user and group can be found in the stat(2) result in the st_uid and st_gid fields.
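Ignoring the setgid handling discussed next, copying both IDs across is a short operation. This is a minimal sketch with a made-up helper name, copy_owner. Bear in mind that changing the owning user to somebody else normally needs elevated privileges, so an unprivileged move can only be expected to preserve IDs that the caller already holds.

```c
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: give tgtfd the same numeric user and group as
   srcfd. Returns 0 on success, -1 with errno set on failure. */
static int copy_owner(int srcfd, int tgtfd) {
    struct stat st;
    if (fstat(srcfd, &st) < 0)
        return -1;
    /* fchown(2) takes numeric IDs, no name lookup is involved. */
    return fchown(tgtfd, st.st_uid, st.st_gid);
}
```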

setgid bits

If the setgid bit is set on a directory then newly created files have the group of the directory rather than the user that created them, but if files are moved in, then they have the group they had before.

Depending on your application, it may make more sense to inherit the group or to preserve it from the original file.

enum setgid {
    SETGID_AUTO,
    SETGID_NEVER,
    SETGID_ALWAYS,
};


static int fix_owner(char *target, struct stat *source_stat, enum setgid setgid, int tgtfd) {
    struct stat target_stat;
    struct stat dirname_stat;
    char *target_dirname;
    int ret = 0;

    if (setgid == SETGID_NEVER)
        return fchown(tgtfd, source_stat->st_uid, source_stat->st_gid);

    ret = fstat(tgtfd, &target_stat);
    if (ret < 0) {
        perror("Stat target file");
        return ret;
    }

    target_dirname = dirname(target);
    ret = stat(target_dirname, &dirname_stat);
    if (ret < 0) {
        perror("Stat target directory");
        return ret;
    }

    /* The setgid bit lives in the directory's st_mode, not its st_gid */
    if ((setgid == SETGID_ALWAYS
         || (setgid == SETGID_AUTO && (dirname_stat.st_mode & S_ISGID)))
        && target_stat.st_gid != dirname_stat.st_gid) {
        ret = fchown(tgtfd, target_stat.st_uid, dirname_stat.st_gid);
        if (ret < 0)
            perror("Chown target");
    }

    return ret;
}

static int fix_rename_owner(char *target, struct stat *source_stat, enum setgid setgid) {
    int tgtfd = -1;
    int ret = -1;

    ret = open(target, O_RDWR);
    if (ret == -1) {
        perror("Open target file");
        goto cleanup;
    }
    tgtfd = ret;

    ret = fix_owner(target, source_stat, setgid, tgtfd);
cleanup:
    close(tgtfd);
    return ret;
}

int move_file(char *source, char *target, bool no_clobber, enum setgid setgid) {
    int ret;
    struct stat source_stat;
    bool have_source_stat = false;
    if (setgid == SETGID_NEVER) {
        ret = stat(source, &source_stat);
        if (ret < 0)
            return ret;
        have_source_stat = true;
    }

    ret = renameat2(AT_FDCWD, source, AT_FDCWD, target, no_clobber ? RENAME_NOREPLACE : 0);
    if (ret == 0)
        return fix_rename_owner(target, &source_stat, setgid);
    if (errno == EXDEV)
        goto xdev;
    if (errno != ENOSYS) {
        perror("rename2");
        return ret;
    }
    /* Have to skip to copy if unimplemented since rename can't detect EEXIST */
    if (no_clobber)
        goto xdev;
rename:
    ret = rename(source, target);
    if (ret == 0)
        return fix_rename_owner(target, &source_stat, setgid);
    if (errno == EXDEV)
        goto xdev;
    perror("rename");
    return ret;
xdev:
    if (!have_source_stat) {
        ret = stat(source, &source_stat);
        if (ret < 0)
            return ret;
    }

    ret = copy_file(source, target, &source_stat, no_clobber, setgid);
    if (ret != 0)
        return ret;
    ret = unlink(source);
    if (ret < 0)
        perror("unlink");
    return ret;
}

Modification time

mtime and atime are the "last modification time" and "last access time".

This has classically been set with the utimes(2) system call, but this does not support nanosecond precision, so the futimens(2) system call is used.

This takes a pair of struct timespecs, and the times from the stat(2) result can be retrieved in struct timespec format in the st_atim and st_mtim fields.

int copy_file(char *source, char *target, struct stat *source_stat, bool no_clobber, enum setgid setgid) {
    int srcfd = -1;
    int tgtfd = -1;
    int ret = -1;

    ret = open(source, O_RDONLY);
    if (ret == -1) {
        perror("Open source file");
        goto cleanup;
    }
    srcfd = ret;

    ret = open(target, O_WRONLY|O_CREAT|(no_clobber ? O_EXCL : 0), 0600);
    if (ret == -1) {
        perror("Open target file");
        goto cleanup;
    }
    tgtfd = ret;

    ret = copy_contents(srcfd, tgtfd);
    if (ret < 0)
        goto cleanup;

    ret = fchmod(tgtfd, source_stat->st_mode);
    if (ret < 0)
        goto cleanup;

    ret = fix_owner(target, source_stat, setgid, tgtfd);
    if (ret < 0)
        goto cleanup;

    {
        struct timespec times[] = { source_stat->st_atim, source_stat->st_mtim, };
        ret = futimens(tgtfd, times);
        if (ret < 0)
            goto cleanup;
    }
cleanup:
    close(srcfd);
    close(tgtfd);
    return ret;
}

For convenience of testing, the full my-mv.c source file and Makefile, including the new copy functions, can be downloaded.

Unfixable data

Link count

The stat data returns how many other directory entries point to the same file in the st_nlink field.

We could only copy this correctly by making the same number of links, but this is unlikely to matter, and can't be fixed, unless we're copying a whole directory tree.

Creation/change time

There's another time in the stat(2) result, ctime. This is an unchangeable last changed time. It can only be set to an approximate value, by changing the system clock and modifying the file.

This is not worth the effort, as it requires elevated privileges and can cause problems for other programs.

Device and inode

There are two other fields called st_dev and st_ino, which identify which filesystem and file on that filesystem the file is.

It doesn't tell you much, other than whether the file is the same as another. This can be used to detect whether you would accidentally trash a file if you were to copy the contents of one file into another, or in the case of tar(1), whether a file was replaced in between it being created and its metadata being updated.
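The comparison itself is tiny. As a sketch, with same_file being a name made up for this example:

```c
#include <stdbool.h>
#include <sys/stat.h>

/* Two stat results describe the same file when both the device and
   the inode number match. */
static bool same_file(const struct stat *a, const struct stat *b) {
    return a->st_dev == b->st_dev && a->st_ino == b->st_ino;
}
```

A copy routine could fstat(2) both file descriptors and refuse to continue when this returns true.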

st_dev on its own has also classically been used to determine whether two files are on the same filesystem, but btrfs can provide different st_dev values for files on the same filesystem, but in different subvolumes, and bind-mounts may have the same st_dev for logically different mounts.

The btrfs weirdness can be solved by using statfs(2) to determine whether the files are on btrfs, and using BTRFS_IOC_FS_INFO and BTRFS_IOC_DEV_INFO to find out which block device the filesystem was mounted from then stat(2) on the device node to find its st_dev.

The bind-mounts can be solved by getting the mount ID, either using name_to_handle_at(2), or opening the file and reading /proc/self/fdinfo/$fd to read the mnt_id field, and comparing the mnt_id of the two files.
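The /proc approach could be sketched as follows. It assumes /proc is mounted and that the kernel is new enough to include the mnt_id line in fdinfo; both function names are hypothetical.

```c
#include <stdio.h>

/* Scan an fdinfo-style stream for a "mnt_id:" line and return its
   value, or -1 if there is no such line. */
static int parse_mnt_id(FILE *f) {
    char line[256];
    int id;
    while (fgets(line, sizeof line, f))
        if (sscanf(line, "mnt_id: %d", &id) == 1)
            return id;
    return -1;
}

/* Hypothetical helper: the mount ID of an open file descriptor,
   or -1 if it could not be determined. */
static int fd_mount_id(int fd) {
    char path[64];
    FILE *f;
    int id;

    snprintf(path, sizeof path, "/proc/self/fdinfo/%d", fd);
    f = fopen(path, "r");
    if (!f)
        return -1;
    id = parse_mnt_id(f);
    fclose(f);
    return id;
}
```

Two open files are then on the same mount when fd_mount_id returns the same non-negative value for both.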

So we've made stat as similar as we can, that's all the metadata, right?

Not quite, there's some less common metadata to apply.

Posted Wed Aug 31 11:00:07 2016 Tags: