I recently had cause to debug some string handling problems in a Python program.
The root cause turned out to be a library function returning a file path as a Unicode string.
I believe this to be fundamentally wrong, so I decided to write this article in the same vein as the "Falsehoods programmers believe" articles.
I have no illusions about this being an exhaustive list, so if you have any to add, please comment.
Paths fit in
This seductively named constant does not mean the full length of a path.
It was only ever meant to be the maximum size of an individual component.
Path components fit in
We no longer live in those times.
You can use pathconf(3) to determine this limit, but you cannot rely on it, since a different filesystem may be mounted, or a symlink changed, at any time between checking this and using it.
It is better to dynamically allocate memory for strings on the heap in a realloc(3) loop.
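As a rough illustration in Python (used here only because the bug that prompted this article was in a Python program), os.pathconf exposes the same per-path limits as pathconf(3). The helper name limits_for and the two limit names it queries are choices made for this sketch, and whatever it returns is only a snapshot, for the reasons given above.

```python
import os

def limits_for(path):
    """Return (name_max, path_max) as reported for the filesystem holding path.

    Hypothetical helper for illustration. A value of -1 means the system
    reports no fixed limit.
    """
    name_max = os.pathconf(path, "PC_NAME_MAX")
    path_max = os.pathconf(path, "PC_PATH_MAX")
    return name_max, path_max

if __name__ == "__main__":
    # The answers can differ per mount point, and can change after you ask.
    for location in ("/", os.getcwd()):
        print(location, limits_for(location))
```

In Python the string grows as needed, so the practical advice is simply never to size a buffer or validate a path against these numbers.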
Two files with different paths refer to different files
/.. being the same path as
/, hard links, symbolic links and bind-mounts all allow different file names to resolve to the same file.
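A small sketch of the hard link and symbolic link cases, assuming a POSIX system where both can be created in the temporary directory. The function name demo_aliases is invented for the example. Comparing path strings says nothing here; os.path.samefile compares device and inode numbers instead.

```python
import os
import tempfile

def demo_aliases():
    """Create one file reachable by three names and compare them."""
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "original")
        hard = os.path.join(d, "hardlink")
        soft = os.path.join(d, "symlink")
        with open(original, "w") as f:
            f.write("data")
        os.link(original, hard)        # a second name for the same inode
        os.symlink("original", soft)   # a relative symbolic link
        # Three different strings, one underlying file.
        return os.path.samefile(original, hard), os.path.samefile(original, soft)

if __name__ == "__main__":
    print(demo_aliases())
```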
Two files with the same path refer to the same file
The same process at different times may see a different file, since another process may have moved it, changed a symbolic link, or mounted a filesystem.
The process itself may have used chroot(2) to change its view of the filesystem.
Different processes can have different views of the filesystem at the same time, since they may occupy different mount namespaces, different roots, or refer to different paths in proc(5), which shows different values of
/proc/self to each process, and shows different processes in different process namespaces.
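The first case, a file being moved underneath you, can be sketched in a single process. Here os.replace stands in for the other process, and demo_replaced is a made-up name. The open handle and the path start out naming the same file and end up naming different ones.

```python
import os
import tempfile

def demo_replaced():
    """Show one path naming two different files over time."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "config")
        newer = os.path.join(d, "config.new")
        for name in (path, newer):
            with open(name, "w") as f:
                f.write(name)
        with open(path) as handle:
            before = os.fstat(handle.fileno())
            os.replace(newer, path)   # stands in for "another process moved it"
            after = os.stat(path)
            # The handle still refers to the old file; the path no longer does.
            return os.path.samestat(before, after)

if __name__ == "__main__":
    print(demo_replaced())
```

If you need to keep talking about the same file, hold on to the file descriptor rather than the path.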
All files have visibly distinct file paths
When file paths are interpreted as Unicode, different sequences of bytes can produce identical-looking strings.
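One common way this happens is Unicode normalization. The two names below are invented for this sketch: they render identically in most fonts but differ as strings and as bytes. Whether a filesystem treats them as one file or two depends on the filesystem.

```python
import unicodedata

composed = "caf\u00e9.txt"      # "é" as a single code point
decomposed = "cafe\u0301.txt"   # "e" followed by a combining acute accent

def same_after_normalizing(a, b, form="NFC"):
    """Compare two names after applying the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

if __name__ == "__main__":
    print(composed == decomposed)
    print(composed.encode("utf-8"), decomposed.encode("utf-8"))
    print(same_after_normalizing(composed, decomposed))
```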
File paths are case insensitive
File paths are case sensitive
Paths are case sensitive on typical POSIX file systems, and case insensitive by default on Mac and Windows file systems.
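Since the answer depends on the filesystem rather than the operating system, the only reliable approach is to probe the directory you care about. This is a minimal sketch under that assumption. The name is_case_sensitive is hypothetical, and the result only holds for the directory tested.

```python
import os
import tempfile

def is_case_sensitive(directory):
    """Probe whether names in directory are matched case sensitively."""
    # The prefix contains upper case letters, so the lower-cased name differs.
    with tempfile.NamedTemporaryFile(prefix="CaseProbe", dir=directory) as probe:
        folded = os.path.join(directory, os.path.basename(probe.name).lower())
        # On a case insensitive filesystem the folded name finds the probe.
        return not os.path.exists(folded)

if __name__ == "__main__":
    print(is_case_sensitive(tempfile.gettempdir()))
```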
File paths are unicode
File paths have an encoding
Under Linux, file paths are any sequence of bytes terminated with a
File paths have no encoding
Under Windows, file paths are Unicode.
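For the Linux case, Python copes with arbitrary bytes by smuggling undecodable ones through str as lone surrogates. The sketch below assumes a POSIX system, and the file name is made up. It checks that a name which is not valid UTF-8 survives the round trip through os.fsdecode and os.fsencode.

```python
import os

raw = b"report-\xff.txt"   # a legal Linux file name that is not valid UTF-8

def round_trips(name_bytes):
    """Check that bytes -> str -> bytes preserves the file name exactly."""
    text = os.fsdecode(name_bytes)   # undecodable bytes become lone surrogates
    return os.fsencode(text) == name_bytes

if __name__ == "__main__":
    print(round_trips(raw))
    print(ascii(os.fsdecode(raw)))
```

The resulting str is only safe to hand back to filesystem functions. Printing it or encoding it strictly as UTF-8 can fail, which is why passing bytes paths to the os functions is often the simpler option.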
File paths cannot contain whitespace
Shell scripts often get this wrong, but there's nothing preventing you putting spaces in.
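The same mistake is easy to make when building shell commands from another language. In this sketch the file name and the rm command are only examples, and nothing is executed. shlex.split is used to show how a POSIX shell would split each command line into words.

```python
import shlex

name = "my holiday photos.txt"

def shell_words(command):
    """Split a command line roughly the way a POSIX shell would."""
    return shlex.split(command)

unquoted = "rm " + name
quoted = "rm " + shlex.quote(name)

if __name__ == "__main__":
    print(shell_words(unquoted))   # the name falls apart into three words
    print(shell_words(quoted))     # the name stays one argument
```

Better still, avoid the shell and pass an argument list, for example to subprocess.run, so that no quoting is needed at all.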
File paths cannot contain
POSIX shells use
: as the path separator.
: to separate build targets from dependent rules. Old versions of
tar would interpret file paths with
: in file names as a remote tape address.
File paths cannot contain
If you make use of your shell's globbing feature, you need to escape or quote glob characters.
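Python's glob module has the same hazard and offers glob.escape for it. The following sketch, with invented file names and a made-up demo function, creates a file whose name contains a glob character and compares an unescaped pattern with an escaped one.

```python
import glob
import os
import tempfile

def demo():
    """Return (names matched unescaped, names matched after escaping)."""
    with tempfile.TemporaryDirectory() as d:
        for name in ("a*.txt", "ab.txt"):
            open(os.path.join(d, name), "w").close()
        wanted = "a*.txt"   # meant as a literal file name
        base = glob.escape(d)
        naive = glob.glob(os.path.join(base, wanted))
        literal = glob.glob(os.path.join(base, glob.escape(wanted)))
        return (sorted(os.path.basename(p) for p in naive),
                [os.path.basename(p) for p in literal])

if __name__ == "__main__":
    print(demo())
```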
File paths can contain
Windows does not allow you to create files with glob characters in their names.
Paths may contain only printable characters.
You can have fun and embed a newline character in a file name, which makes old versions of ls(1) appear to print two file names.
File path components can contain any string
The path separator cannot be part of a path component (excepting filesystem corruption).
. and .. are reserved for the current directory and its parent.
File path components can contain any string except
Windows reserves CON, PRN, AUX, NUL, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9.
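A portable program may want to refuse these names up front. This is a hand-rolled sketch rather than any library's API. It assumes the traditional Win32 behaviour in which case and any extension are ignored, so that nul.txt is as reserved as NUL. The names RESERVED and is_reserved_component are made up for the example.

```python
RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)

def is_reserved_component(name):
    """Return True if a path component collides with a reserved device name."""
    # Assumption: only the part before the first dot matters, ignoring case
    # and trailing spaces.
    stem = name.split(".", 1)[0].rstrip(" ").upper()
    return stem in RESERVED

if __name__ == "__main__":
    for candidate in ("con.txt", "NUL", "console", "COM10"):
        print(candidate, is_reserved_component(candidate))
```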
File paths have 1
. separating the name from the file extension.
You may have a file without an extension, and you may have multiple
File name extensions are 3 characters long.
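A quick check with Python's pathlib shows how loosely names and extensions relate. The sample names are invented for this sketch, and pathlib's idea of a suffix is only one convention among several.

```python
from pathlib import PurePosixPath

def extensions(name):
    """Return the list of suffixes pathlib sees in a file name."""
    return PurePosixPath(name).suffixes

samples = ["Makefile", "notes.txt", "archive.tar.gz", "index.html", ".bashrc"]

if __name__ == "__main__":
    for name in samples:
        print(f"{name!r}: {extensions(name)}")
```

The samples cover no extension, several extensions, an extension longer than three characters, and a leading dot that is not an extension at all.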
Path components are separated with
Windows used to only support
\, now it supports it in addition. More obscure operating systems like RISC OS use
. as a path separator.
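Python's pathlib has pure path classes for the POSIX and Windows conventions (it has nothing for RISC OS), so the difference can be shown on any platform. The path string is made up for this sketch. The same text splits into different components depending on whose rules you apply.

```python
from pathlib import PurePosixPath, PureWindowsPath

mixed = r"dir\sub/file.txt"

def components(path_text):
    """Split one string under Windows rules and under POSIX rules."""
    return PureWindowsPath(path_text).parts, PurePosixPath(path_text).parts

if __name__ == "__main__":
    windows_parts, posix_parts = components(mixed)
    print(windows_parts)   # both separators are honoured
    print(posix_parts)     # the backslash is an ordinary name character
```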
Absolute paths start with a
Windows has drive paths (e.g.
C:\), and UNC paths which start with
foo/../foo always point to the same directory.
If the first
foo is a symbolic link, then following it takes you to the directory it points to. The
.. takes you to the parent directory of that target, which may contain an entirely different directory called
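The symbolic link case can be reproduced in a temporary directory. This sketch assumes a POSIX system, and the directory names elsewhere and target are invented. os.path.normpath collapses foo/../foo textually, while os.path.realpath follows the link first, as the kernel would.

```python
import os
import tempfile

def demo_dotdot():
    """Return where foo and foo/../foo really lead, relative to the sandbox."""
    with tempfile.TemporaryDirectory() as d:
        d = os.path.realpath(d)
        os.makedirs(os.path.join(d, "elsewhere", "target"))
        os.makedirs(os.path.join(d, "elsewhere", "foo"))
        # foo is a symbolic link into a different directory.
        os.symlink(os.path.join("elsewhere", "target"), os.path.join(d, "foo"))
        tricky = os.path.join(d, "foo", "..", "foo")
        lexical = os.path.normpath(tricky)    # textual cleanup gives d/foo
        via_text = os.path.realpath(lexical)  # where plain foo leads
        via_links = os.path.realpath(tricky)  # where foo/../foo leads
        return os.path.relpath(via_text, d), os.path.relpath(via_links, d)

if __name__ == "__main__":
    print(demo_dotdot())
```

This is why simplifying a path as text before using it can silently change which file it names.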
Symbolic links point to a file that exists.
You can put any text that is also a valid file path in a symbolic link. That text may not refer to a file that currently exists. This is called a dangling symbolic link.
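A dangling link is easy to create and easy to misread, because some checks follow the link and others do not. A minimal sketch, with an invented target name and a made-up function name:

```python
import os
import tempfile

def demo_dangling():
    """Create a symbolic link to a nonexistent target and inspect it."""
    with tempfile.TemporaryDirectory() as d:
        link = os.path.join(d, "dangling")
        os.symlink("no-such-target", link)   # the target is never checked
        return (
            os.path.islink(link),    # the link itself is there
            os.path.lexists(link),   # exists, without following the link
            os.path.exists(link),    # follows the link and finds nothing
            os.readlink(link),       # the stored text, returned verbatim
        )

if __name__ == "__main__":
    print(demo_dangling())
```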
Symbolic links that don't point to a file that exists are dangling.
There are magic symlinks in proc(5) that, when read with readlink(2), display an ID and type rather than a usable path. Passing that text to open(2) would create a new file, but calling open(2) on the symlink directly would give you something else.
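One place to see this on Linux is the entry for a pipe under /proc/self/fd. The sketch below is Linux-specific and assumes proc is mounted. The function name pipe_link_target is made up.

```python
import os

def pipe_link_target():
    """Return the readlink(2) text for the read end of a fresh pipe."""
    read_end, write_end = os.pipe()
    try:
        return os.readlink(f"/proc/self/fd/{read_end}")
    finally:
        os.close(read_end)
        os.close(write_end)

if __name__ == "__main__":
    target = pipe_link_target()
    print(target)                   # a type and an ID, not a path
    print(os.path.exists(target))   # no file of that name exists
```

While the descriptor is open, opening the /proc/self/fd entry itself reaches the pipe, whereas opening the text it reports would refer to an ordinary file of that name in the current directory.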