I recently had cause to debug some string handling problems in a python program.

The root cause turned out to be a library function returning a file path as a unicode string.

I believe this to be fundamentally wrong, so I decided to write this article in the same vein as the "Falsehoods programmers believe" articles.

I have no illusions about this being an exhaustive list, so if you have any to add, please comment.

  • Paths fit in PATH_MAX.

    This seductively named constant does not bound the length of every path that can exist.

    It only bounds the length of a path passed to a single system call. Longer paths can still be built with relative operations such as chdir(2) or openat(2).

  • Path components fit in NAME_MAX

    This constant is the compile-time limit on an individual component, but the real limit depends on the filesystem.

    You can use pathconf(3) to determine this limit at run time, but you cannot rely on it, since a different filesystem may be mounted, or a symlink changed, at any time between checking this and using it.

    It is better to dynamically allocate memory for strings on the heap in a realloc(3) loop.
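
    As a rough illustration, the limits can be queried from Python, assuming a Unix-like system where os.pathconf is available. The helper name path_limits is made up for this sketch, and the answer only describes the filesystem that the path resolved to at the moment of the call.

```python
import os


def path_limits(path):
    """Return (NAME_MAX, PATH_MAX) as reported for the filesystem holding path."""
    # Both values are per-filesystem, so another mount point may report
    # different numbers, and the answer can be stale by the time it is used.
    name_max = os.pathconf(path, "PC_NAME_MAX")
    path_max = os.pathconf(path, "PC_PATH_MAX")
    return name_max, path_max


if __name__ == "__main__":
    print(path_limits("/"))
```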

  • Two files with different paths refer to different files

    Apart from /.. resolving to the same directory as /, hard links, symbolic links and bind mounts all allow different file names to resolve to the same file.
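
    One way to check for this is to compare device and inode numbers instead of path strings. The sketch below assumes a filesystem that supports hard links; same_file and demo are illustrative names, and the standard library's os.path.samefile performs an equivalent comparison.

```python
import os
import tempfile


def same_file(a, b):
    """True if both paths currently resolve to the same device and inode."""
    sa, sb = os.stat(a), os.stat(b)
    return (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "original")
        alias = os.path.join(tmp, "alias")
        with open(original, "w") as f:
            f.write("data")
        # A hard link is a second name for the same inode.
        os.link(original, alias)
        return same_file(original, alias)
```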

  • Two files with the same path refer to the same file

    The same process at different times may see a different file, since another process may have moved it, changed a symbolic link, or mounted a filesystem.

    The process itself may have used chroot(2) to change its view of the filesystem.

    Different processes can have different views of the filesystem at the same time, since they may occupy different mount namespaces, have different roots, or refer to paths in proc(5), which shows a different /proc/self to each process and a different set of processes in each PID namespace.

  • All files have visibly distinct file paths

    When file paths are interpreted as unicode, different sequences of bytes can produce identical-looking strings, for example a precomposed é and an e followed by a combining acute accent.
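
    A small sketch of the effect, using two spellings of the same visible name. The names composed, decomposed and same_after_nfc are made up here, and normalising to NFC is only one possible policy for comparing names.

```python
import unicodedata

composed = "caf\u00e9"     # "é" as a single code point
decomposed = "cafe\u0301"  # "e" followed by a combining acute accent


def same_after_nfc(a, b):
    """Compare two names after normalising both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    # The two names render alike but are different byte sequences.
    print(composed.encode("utf-8"), decomposed.encode("utf-8"))
    print(same_after_nfc(composed, decomposed))
```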

  • File paths are case insensitive

  • File paths are case sensitive

    Paths are case sensitive on typical POSIX file systems, and case insensitive by default on Mac and Windows file systems.
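
    Since the answer depends on the filesystem, not the operating system, one option is to probe the directory you care about. This is a sketch under the assumption that the directory is writable; is_case_sensitive is a hypothetical helper, and the result holds only for that directory's filesystem.

```python
import os
import tempfile


def is_case_sensitive(directory):
    """Probe directory by creating a file and looking it up under another case."""
    with tempfile.TemporaryDirectory(dir=directory) as scratch:
        probe = os.path.join(scratch, "CaseProbe")
        with open(probe, "w"):
            pass
        # On a case-insensitive filesystem the lower-case name finds the same file.
        return not os.path.exists(os.path.join(scratch, "caseprobe"))


if __name__ == "__main__":
    print(is_case_sensitive(tempfile.gettempdir()))
```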

  • File paths are unicode

  • File paths have an encoding

    Under Linux, file paths are any sequence of bytes terminated with a NUL.
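
    If such a path has to pass through a text string, one approach is to decode it in a way that preserves undecodable bytes. The sketch below uses Python's surrogateescape error handler with UTF-8; to_text and to_bytes are illustrative names, and as far as I know os.fsdecode and os.fsencode apply the same handler on POSIX systems.

```python
raw = b"report-\xff.txt"  # a legal Linux file name that is not valid UTF-8


def to_text(path_bytes):
    # Undecodable bytes become lone surrogates instead of raising an error.
    return path_bytes.decode("utf-8", "surrogateescape")


def to_bytes(path_text):
    # The same handler turns those surrogates back into the original bytes.
    return path_text.encode("utf-8", "surrogateescape")


if __name__ == "__main__":
    print(to_bytes(to_text(raw)) == raw)
```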

  • File paths have no encoding

    Under Windows, file paths are Unicode.

  • File paths cannot contain whitespace

    Shell scripts often get this wrong, but there's nothing preventing you putting spaces in.

  • File paths cannot contain : characters.

    POSIX shells use : to separate the entries of PATH. make uses : to separate build targets from their prerequisites. Old versions of tar would interpret file paths with : in the file name as a remote tape address.

  • File paths cannot contain * or ? characters

    If you make use of your shell's globbing feature, you need to escape or quote glob characters.
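
    The same concern applies outside the shell. As an illustration with Python's glob and fnmatch modules (not the shell itself), a file name used directly as a pattern can match other files unless it is escaped first. The example file names are made up.

```python
import fnmatch
import glob

name = "what?*.txt"  # a legal file name on Linux

# Used as a pattern, the raw name also matches an unrelated file.
matches_other = fnmatch.fnmatchcase("whatXY.txt", name)

# glob.escape wraps the special characters in brackets so they match literally.
escaped = glob.escape(name)

if __name__ == "__main__":
    print(matches_other, escaped)
```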

  • File paths can contain * or ? characters

    Windows does not allow you to create files with glob characters in their names.

  • Paths may contain only printable characters.

    You can have fun and embed a newline character in a file name, which makes old versions of ls(1) appear to print two file names.

  • File path components can contain any string

    The path separator cannot be part of a path component (excepting filesystem corruption).

    The names . and .. are reserved for the current directory and its parent.

  • File path components can contain any string except ".", "..", or "/".

    Windows reserves CON, PRN, AUX, NUL, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9.

  • File paths have 1 . separating the name from the file extension.

    You may have a file without an extension, and you may have a name containing multiple . characters.
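
    A quick sketch with pathlib shows how varied this is. The helper name extensions and the sample file names are illustrative only.

```python
from pathlib import PurePosixPath


def extensions(name):
    """Return every dot-suffix pathlib recognises in a file name."""
    return PurePosixPath(name).suffixes


if __name__ == "__main__":
    for sample in ("README", "archive.tar.gz", ".bashrc"):
        print(sample, extensions(sample))
```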

  • File name extensions are 3 characters long.

  • Path components are separated with /.

    Windows used to support only \; it now accepts / in addition. More obscure operating systems like RISC OS use . as a path separator.

  • Absolute paths start with a /.

    Windows has drive-letter paths (e.g. C:\), and UNC paths, which start with \\.
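
    To see how much the answer depends on the platform's rules, the sketch below asks pathlib's pure path classes the same question under both conventions. The function name is_absolute_on and the sample paths are made up for illustration.

```python
from pathlib import PurePosixPath, PureWindowsPath


def is_absolute_on(text):
    """Return (posix_answer, windows_answer) for whether text is absolute."""
    return PurePosixPath(text).is_absolute(), PureWindowsPath(text).is_absolute()


if __name__ == "__main__":
    for sample in (r"C:\Users\me", r"\\server\share\file.txt", "/usr/bin"):
        print(sample, is_absolute_on(sample))
```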

  • foo and foo/../foo always point to the same directory.

    If the first foo is a symbolic link, then following it takes you to the directory it points to. The .. takes you to the parent directory of that target, which may contain an entirely different directory called foo.
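
    The following sketch reproduces this in a temporary directory, assuming a platform that allows creating symbolic links. The directory names here, elsewhere and target are invented for the demonstration.

```python
import os
import tempfile


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        tmp = os.path.realpath(tmp)
        os.makedirs(os.path.join(tmp, "elsewhere", "target"))
        os.makedirs(os.path.join(tmp, "elsewhere", "foo"))
        os.makedirs(os.path.join(tmp, "here"))
        # here/foo is a symbolic link to elsewhere/target.
        os.symlink(os.path.join(tmp, "elsewhere", "target"),
                   os.path.join(tmp, "here", "foo"))
        direct = os.path.realpath(os.path.join(tmp, "here", "foo"))
        via_dotdot = os.path.realpath(os.path.join(tmp, "here", "foo", "..", "foo"))
        # Report both results relative to the temporary directory.
        return os.path.relpath(direct, tmp), os.path.relpath(via_dotdot, tmp)


if __name__ == "__main__":
    print(demo())
```

    Note that os.path.normpath would collapse foo/../foo to foo purely lexically, which is exactly the assumption this item warns about.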

  • Symbolic links may not be empty

  • Symbolic links point to a file that exists.

    You can put any text that is also a valid file path in a symbolic link. That text may not refer to a file that currently exists. This is called a dangling symbolic link.
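
    A minimal sketch of a dangling link, assuming symbolic links can be created in the temporary directory. The link name and its target text are arbitrary.

```python
import os
import tempfile


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        link = os.path.join(tmp, "dangling")
        os.symlink("no-such-target", link)
        # lexists checks the link itself; exists follows it to the missing target.
        return os.path.lexists(link), os.path.exists(link), os.readlink(link)


if __name__ == "__main__":
    print(demo())
```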

  • Symbolic links that don't point to a file that exists are dangling.

    There are magic symlinks in proc(5) that, when read with readlink(2), display an ID and a type, such as pipe:[12345]. Passing that text to open(2) with O_CREAT would create a new file, but calling open(2) on the symlink itself gives you something else: the object the file descriptor refers to.
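
    A sketch of reading one of these links, assuming a Linux system with /proc mounted. describe_pipe_link is a hypothetical helper that returns None where no such entry exists.

```python
import os


def describe_pipe_link():
    """Return the readlink(2) text for a pipe's /proc entry, or None without /proc."""
    read_fd, write_fd = os.pipe()
    try:
        link = f"/proc/self/fd/{read_fd}"
        if not os.path.islink(link):
            return None  # no Linux-style /proc available here
        # The text names a type and an ID, not a path that exists on disk.
        return os.readlink(link)
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print(describe_pipe_link())
```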

Mac OS X filesystems might be case-sensitive, but the default is not to be, and the system stuff does not assume case sensitivity.

(side-note: even the UNIX layer seems to work surprisingly well given this. The one clash I've discovered in 2 years of use is head (the pager) versus HEAD (the convenience binary for libwww-perl).)

Comment by Jon Dowland Wed Jan 20 13:49:40 2016