In our previous article, we established that if rename(2) fails then we have to fall back to copying the file then removing the old one.

We copied the data by just reading blocks from one file and writing them to the new one.

For most files this would be sufficient, but for those that aren't this is a bad idea.

By reading and writing blocks the new file has exactly the same contents as the old one right?

Not quite. Files can have "holes", where there is no data, but reading that range returns blocks of zeroes, so a naïve copy will produce a file that has no holes.

This makes it take up more disk space, and other hole-aware software will treat it differently.

An example would be a tool for writing disk images to disks, it could make things quicker by only writing the data in the disk images but if you created a copy without holes it would write a whole bunch of zeroes that it didn't need to and reduce the life of the hard-drive.

To do this properly you need to either use FIEMAP, or SEEK_{DATA,HOLE}.

While FIEMAP is available in earlier kernels, since it exposes more information than is needed for copying a file and has historically been a source of disk corruption, we're going to proceed with SEEK_{DATA,HOLE}.

Copying sparsely with lseek(2)

The basis of the algorithm is to use lseek(2) with SEEK_HOLE to find the end of a block of data and copy it, then use lseek(2) with SEEK_DATA to find the end of the following hole.

The difficulty, as always, is in the details.

ssize_t copy_range(int srcfd, int tgtfd, size_t range);
ssize_t naive_contents_copy(int srcfd, int tgtfd);

ssize_t sparse_copy_contents(int srcfd, int tgtfd) {
    size_t copied = 0;
    off_t srcoffs = (off_t)-1;
    off_t nextoffs = (off_t)-1;

Finding the start

The first thing we need to do is to find out whether we started in a data block or a hole block.

To do that we need to know where "here" is though, the way to do this is to call lseek(fd, 0, SEEK_CUR), which logically means move the position forward 0 bytes, but has the side-effect of returning the current position.

    srcoffs = TEMP_FAILURE_RETRY(lseek(srcfd, 0, SEEK_CUR));
    if (srcoffs == (off_t)-1) {
        perror("Find current position of file");
        /* Can't seek file, could be file isn't seekable,
           or that the current offset would overflow. */
        return -1;
    }

Starting with data or a hole?

Now that we've got the current offset, we lseek(fd, offset, SEEK_DATA). If the returned value is the same as the provided offset, then it was a data block, but if it moved then we were in a hole and it returned the start of the data.

There's also the ever-present possibility that there is no more data, which sets errno(3) to ENXIO.

    nextoffs = TEMP_FAILURE_RETRY(lseek(srcfd, srcoffs, SEEK_DATA));
    if (nextoffs == (off_t)-1) {
        if (errno == ENXIO) {
            /* NXIO means EOF, there is no data to copy,
               but we may need to make a hole to the end of the file */
            goto end_hole;
        }
        perror("Find data or hole at beginning of file");
        /* Error seeking, must not support sparse seek */
        return -1;
    }

    if (srcoffs != nextoffs)
        /* Seeked to the end of a hole, can skip a data copy. */
        goto hole;

Copying data and holes

Depending on whether we started in data or in a hole, we either copy the contents of the data, or use truncate(2) to extend the file without providing data.

Because truncate(2) does not advance the file offset, we have to use lseek(2) to do it manually.

As before, we can reach the end of the file when seeking, which breaks us out of the copy data then copy hole loop.

    for (;;) {
        ssize_t ret;
        /* In data, so we must find the end of the data then copy it,
           could pread/write. */
        nextoffs = TEMP_FAILURE_RETRY(lseek(srcfd, srcoffs, SEEK_HOLE));
        if (nextoffs == (off_t)-1) {
            if (errno != ENXIO) {
                perror("Find end of data");
                return -1;
            }

            /* EOF after data, but we still need to copy */
            goto end_data;
        }

        srcoffs = TEMP_FAILURE_RETRY(lseek(srcfd, srcoffs, SEEK_SET));
        if (srcoffs == (off_t)-1) {
            /* Rewinding failed, something is *very* strange. */
            perror("Rewind back to data");
            return -1;
        }

        ret = copy_range(srcfd, tgtfd, nextoffs - srcoffs);
        if (ret < 0) {
            return -1;
        }
        copied += ret;
        srcoffs = nextoffs;

        nextoffs = TEMP_FAILURE_RETRY(lseek(srcfd, srcoffs, SEEK_DATA));
        if (nextoffs == (off_t)-1) {
            if (errno == ENXIO) {
                /* NXIO means EOF, there is no data to copy,
                   but we may need to make a hole to the end of the file */
                goto end_hole;
            }
            perror("Find end of hole");
            /* Error seeking, must not support sparse seek */
            return -1;
        }
hole:
        /* Is a hole, extend the file to the offset */
        ret = TEMP_FAILURE_RETRY(ftruncate(tgtfd, nextoffs));
        if (ret < 0) {
            perror("Truncate file to add hole");
            return -1;
        }

        /* Move file offset for target to after the newly added hole */
        nextoffs = TEMP_FAILURE_RETRY(lseek(tgtfd, nextoffs, SEEK_SET));
        if (nextoffs == (off_t)-1) {
            /* Something very strange happened,
               either some race condition changed the file,
               or the file is truncatable but not seekable
               or some external memory corruption,
               since EOVERFLOW can't happen with SEEK_SET */
            perror("Move to after newly added hole");
            return -1;
        }

        srcoffs = nextoffs;
    }

Filling it to the end

When finished with the copy-hole copy-data loop, we still have to fill the rest of the file, either by truncating it to fill in the final hole, or copying the rest of the data.

end_hole:
    nextoffs = TEMP_FAILURE_RETRY(lseek(srcfd, 0, SEEK_END));
    if (nextoffs == (off_t)-1) {
        perror("Seek to end of file");
        return -1;
    }
    if (srcoffs != nextoffs) {
        /* Not already at EOF, need to extend */
        int ret = TEMP_FAILURE_RETRY(ftruncate(tgtfd, nextoffs));
        if (ret < 0) {
            perror("Truncate to add hole at end of file");
            return -1;
        }
    }
    return copied;

end_data:
    {
        ssize_t ret = naive_contents_copy(srcfd, tgtfd);
        if (ret < 0)
            return ret;
        copied += ret;
    }
    return copied;
}

Integrating sparse copying into the program

Now that we've got our sparse_copy_contents, we need to amend our copy_file function to call it.

We can detect whether sparse copying is not possible by whether errno(3) gets set to EINVAL, so there's no harm in trying to sparsely copy a file first.

int copy_file(const char *source, const char *target, bool no_clobber) {
    int srcfd = -1;
    int tgtfd = -1;
    int ret = -1;
    srcfd = open(source, O_RDONLY);
    if (srcfd == -1) {
        perror("Open source file");
        return srcfd;
    }
    tgtfd = open(target, O_WRONLY|O_CREAT|(no_clobber ? O_EXCL : 0), 0600);
    if (tgtfd == -1) {
        perror("Open target file");
        return tgtfd;
    }

    ret = sparse_copy_contents(srcfd, tgtfd);
    if (ret >= 0)
        return ret;

    if (ret < 0 && errno != EINVAL) {
        /* Some error that wasn't from a sparse copy,
      so we can't fall back to something that would work */
        perror("Copy file");
        return -1;
    }

    return naive_contents_copy(srcfd, tgtfd);
}

I have omitted the definitions of copy_range and naive_contents_copy since they are not relevant to handling holes, but full program listing may be downloaded from my-mv.c.

Conclusion

Now when we move a file we don't end up taking more space, we don't copy data that we don't need so it goes faster, and files that treat holes specially, like disk images, will behave properly.

But files that may have holes are also likely to contain a lot of data, and even only copying the data we need will take a while.

Can we make this as fast as rename(2)?

thanks for another massively useful article. :)

I didn't know about SEEK_HOLE and SEEK_DATA at all. As far as I was aware the traditional way to do a sparse copy is set some bound on the number of zeroes that will constitute a hole and when n or more such zeroes are read, instead of copying the zeroes we seek past the current destination offset by n. This has the obvious disadvantage that any holes present in the source file that are smaller than n bytes will not be preserved as holes in the destination.

My man page for lseek(2) says that SEEK_HOLE and SEEK_DATA are only supported for more recent linux kernels and even then just for a selection of file systems: btrfs (3.1), OCFS (3.2), XFS (3.5), ext4 (3.8) and tmpfs (3.8). I wonder then whether it's worthwhile making a sparse copy function that falls back to the traditional sparse copy method if the new way is unsupported? (or maybe I'm living in the past having only made the switch from ext3 in the past year or so :D )

Comment by Gravious Sun Jan 1 14:33:17 2017

thanks for another massively useful article. :)

I didn't know about SEEK_HOLE and SEEK_DATA at all. As far as I was aware the traditional way to do a sparse copy is set some bound on the number of zeroes that will constitute a hole and when n or more such zeroes are read, instead of copying the zeroes we seek past the current destination offset by n. This has the obvious disadvantage that any holes present in the source file that are smaller than n bytes will not be preserved as holes in the destination.

There's also the disadvantage of taking a lot longer to read since you have to copy those zeroes into userspace and scan through them to see if they are indeed all zeroes. I think this is what tar, rsync and cp --sparse=always does, but I'd have to check.

It's also not fun for disk images, beyond the size difference, since you want to avoid writing what you don't need to when writing a disk image to a real disk, but you also need to write zeroes that are in the disk image, since you can end up with an invalid file system if you don't write those zeroes. https://source.tizen.org/documentation/reference/bmaptool/introduction was written by the Tizen guys for dealing with this. It produces and uses a separate file for the map of which sectors to copy, possibly because I know of no serialisation formats other than a disk image that preserves holes without dropping zero data sectors.

My man page for lseek(2) says that SEEK_HOLE and SEEK_DATA are only supported for more recent linux kernels and even then just for a selection of file systems: btrfs (3.1), OCFS (3.2), XFS (3.5), ext4 (3.8) and tmpfs (3.8). I wonder then whether it's worthwhile making a sparse copy function that falls back to the traditional sparse copy method if the new way is unsupported? (or maybe I'm living in the past having only made the switch from ext3 in the past year or so :D )

As far as I'm concerned those kernel versions are old enough that I can't feasibly test them, so I can't be sure my fallback would work right. I would accept a patch if provided one.

There are a couple of alternative fallbacks. The FIEMAP ioctl can be used to get information about which parts of the file contain what types of data, and the FIBMAP ioctl can tell you where on disk the data is. FIBMAP is old, so more widely available, but of limited use since it requires elevated privileges. FIEMAP is older than SEEK_{DATA,HOLE} and newer than FIBMAP, but has been historically the cause of file corruption. I don't recall whether old versions were just used improperly or whether there were bugs with it, but I steered clear of it and opted for the simpler interface of SEEK_{HOLE,DATA}.

Comment by Richard Maw Tue Jan 3 13:50:11 2017