Richard Maw File systems

File systems are your interface to store your data. Modern file systems offer a hierarchical view of your data, though historical file systems have been flat.

However, your computer's hardware just knows about blocks of data. Your operating system is responsible for translating the nice, human-friendly hierarchy into blocks of data that are written to a disk.

On Linux, you have the advantage of being able to choose from a variety of file systems, suitable for different workloads. When installing your operating system you may be given the opportunity to pick one. It's important to know what you're choosing, so I'm going to describe a bit of terminology, then describe some common options.

Terminology

Journalling

In the case of a crash, it's helpful if you can determine what state your file system is in. One approach is to keep a journal: changes are written to the journal before they are written to the rest of the storage, so that after a failure the journal can be replayed to get the file system back into a consistent state.
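To make the idea concrete, here is a toy journal in Python. It is only a sketch: the class and method names are invented for illustration, and real file systems journal on-disk structures in far more elaborate ways.

```python
# Toy write-ahead journal. All names here are hypothetical and do not
# correspond to any real file system's on-disk format.

class JournalledStore:
    def __init__(self):
        self.blocks = {}    # the "main" storage: block number -> data
        self.journal = []   # intended changes, recorded before the blocks change

    def begin_write(self, changes):
        """Record the intended changes in the journal first."""
        self.journal.append(dict(changes))

    def apply_journal(self):
        """Replay every journal entry into main storage, then clear the journal.

        Replaying an entry twice gives the same result as replaying it once,
        so this is safe to run after a crash at any point.
        """
        for entry in self.journal:
            self.blocks.update(entry)
        self.journal.clear()


store = JournalledStore()
store.begin_write({0: b"inode", 1: b"data"})
# Simulate a crash after only part of the change reached main storage.
store.blocks[0] = b"inode"
# On the next mount, replaying the journal completes the change.
store.apply_journal()
print(store.blocks)
```

The important property is the ordering: because the journal entry exists before main storage is touched, a half-finished change can always be completed afterwards.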

Copy-on-write

When new data is written to a file, a copy-on-write file system does not overwrite the old data in place. Instead it creates a copy of the data, writes to that, then updates the original file pointers to point at the new data.

This allows similar atomicity guarantees to journalling, since either the pointers point to the valid new data, or they point to the old data.
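A minimal sketch of that pointer switch, in Python. The `CowFile` class and its `crash_before_commit` flag are made up for illustration; a real copy-on-write file system works on trees of blocks rather than a single pointer.

```python
# Toy copy-on-write file. Names are hypothetical, for illustration only.

class CowFile:
    def __init__(self, data):
        self.storage = {0: data}  # block id -> contents
        self.next_id = 1
        self.pointer = 0          # the block the file currently refers to

    def read(self):
        return self.storage[self.pointer]

    def rewrite(self, new_data, crash_before_commit=False):
        # Write the new data to a fresh block; the old block is untouched.
        new_id = self.next_id
        self.next_id += 1
        self.storage[new_id] = new_data
        if crash_before_commit:
            return  # the pointer still refers to the valid old data
        # The only step that changes what readers see.
        self.pointer = new_id


f = CowFile(b"old contents")
f.rewrite(b"new contents", crash_before_commit=True)
print(f.read())   # still the old data
f.rewrite(b"new contents")
print(f.read())   # now the new data
```

At no point does a reader see a mixture of old and new data, which is the guarantee described above.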

fsck vs scrub

fsck stands for "File System ChecK". This is a step that can be performed on a file system before it is actively used, to check its integrity and fix errors.

This has the down-side that the file system can't be used while it's being checked, so for routine maintenance the file system must be un-mounted.

Some file systems instead offer a "scrub" operation, which can be performed while the file system is being used, and offers the same functionality.

File systems

FAT

FAT stands for File Allocation Table. It is a relatively old file system: files are limited to 4GiB in size and file names are case insensitive.
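The 4GiB limit can be checked with a little arithmetic. This sketch assumes the FAT32 variant, which records a file's size in a 32-bit field; the constant and function names are invented for illustration.

```python
# Assumption: FAT32 stores each file's size in a 32-bit unsigned field.
SIZE_FIELD_BITS = 32
MAX_FILE_SIZE = 2**SIZE_FIELD_BITS - 1   # largest size the field can hold
GIB = 2**30


def fits_on_fat32(size_in_bytes):
    """Return True if a file of this size can be described by the size field."""
    return 0 <= size_in_bytes <= MAX_FILE_SIZE


print(MAX_FILE_SIZE)
print(fits_on_fat32(4 * GIB - 1))
print(fits_on_fat32(4 * GIB))
```

Strictly speaking the largest file is one byte short of 4GiB, since a file of exactly 4GiB would need a thirty-third bit.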

It has the benefit of being widely portable, being available on many platforms. For this reason storage devices are often pre-formatted as FAT, just so less technical users don't assume the device is broken and return it.

Its portability makes it useful for USB flash drives and the partition you use for /boot.

It would be a poor choice for your root file system.

NTFS

This is Microsoft's primary file system. Its data structures don't limit the maximum file size to 4GiB.

As a commenter pointed out, NTFS is case sensitive, but not through the Windows API, which maintains case insensitivity for compatibility with older software that assumed insensitivity, since it was dealing with FAT.

On Windows it is case preserving, so if you create a file called "Foo", you can read it as "foo", but when you list the contents of the directory, it is shown as "Foo", rather than "FOO", as FAT traditionally does.
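The difference between case preserving and case insensitive can be sketched with a toy directory in Python. This is not how NTFS or the Windows API implement it; `str.casefold()` stands in for whatever case-folding rules the real system uses, and the class name is made up.

```python
# Toy directory that is case insensitive on lookup but case preserving
# on listing. Hypothetical names; casefold() is a stand-in for the real rules.

class CasePreservingDir:
    def __init__(self):
        self._entries = {}  # folded name -> (name as created, contents)

    def create(self, name, contents):
        self._entries[name.casefold()] = (name, contents)

    def read(self, name):
        # Lookup ignores case.
        return self._entries[name.casefold()][1]

    def listing(self):
        # Listing shows the name exactly as it was created.
        return [stored for stored, _ in self._entries.values()]


d = CasePreservingDir()
d.create("Foo", b"hello")
print(d.read("foo"))
print(d.listing())
```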

This makes it a better choice for storage media, as Linux is also able to read it.

It is still inadvisable to use NTFS as your root file system on Linux, since its primary use is reading disks that are also used by Windows machines, rather than being an installation's root file system.

ext2, ext3 and ext4

The ext file systems are Linux's primary file systems and are usually the default option when installing Linux distributions.

Despite having a similar name, they are different beasts. ext2 is rather primitive, only really useful with old bootloaders.

ext3 is more advanced, though active development has moved on to ext4.

ext4 supports journalling, uses extents for its storage and supports extended attributes, where additional metadata can be assigned to a file.

There are third-party tools to read ext file systems from Windows, but NTFS support in Linux is better than ext support in Windows.

XFS

XFS is a development from Silicon Graphics. It exceeds ext4's features, including the ability to take snapshots of the logical state of the file system, and was the source of extended attributes.

It is available on IRIX and Linux, so portability is not its strong point and it would not be useful on a USB flash drive, but it is an excellent choice for your root file system.

ZFS

ZFS is a product of Sun Microsystems, later bought by Oracle. It is a very advanced file system, offering all the features mentioned above plus more.

This is a copy-on-write file system, unlike the above, which were either journalling, or wrote to the blocks directly.

Being copy-on-write allows for deduplication, since if multiple files have the same data, the file system can point both files to the same data, since if it's changed in one file, the other one won't have its data changed.
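Here is a rough Python sketch of the idea, storing data by its hash so that identical contents are kept only once. It is an illustration under simplifying assumptions (whole files rather than blocks, no reference counting), not a description of how ZFS does it; the class name is invented.

```python
import hashlib

# Toy deduplicating store. Hypothetical names; real file systems
# deduplicate at the block level and track references to each block.

class DedupStore:
    def __init__(self):
        self.blocks = {}  # digest -> data, stored once per distinct content
        self.files = {}   # file name -> digest of its current contents

    def write(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        # Identical data is stored only once.
        self.blocks.setdefault(digest, data)
        # Writing never modifies an existing block, it only moves the pointer.
        self.files[name] = digest

    def read(self, name):
        return self.blocks[self.files[name]]


store = DedupStore()
store.write("a.txt", b"same data")
store.write("b.txt", b"same data")
print(len(store.blocks))      # both files share one stored copy
store.write("a.txt", b"changed")
print(store.read("b.txt"))    # b.txt is unaffected by the change to a.txt
```

Because a write never alters data that another file may still point at, sharing is safe, which is why copy-on-write and deduplication fit together so well.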

Live file system checking is possible with the scrub command, so downtime is not needed to perform maintenance.

It can use multiple physical devices as its storage pool.

Its license is incompatible with that of the Linux kernel, so kernel support is provided by a third-party module. This makes it possible that a kernel update could leave your file system unreadable, because the ZFS kernel module cannot be loaded by the new kernel.

Loading an external kernel module is slower than it being built in, so this impacts boot speed.

Despite its complexity, ZFS is also available on the Solaris and BSD unices.

BTRFS

Work on BTRFS was initiated by Oracle to be a competitor to ZFS. This is no longer the motivating factor, since Oracle acquired Sun, but BTRFS is likely to become the default Linux file system in the future.

It is nearly as featureful as ZFS, only missing the online deduplication, which the BTRFS developers expect to complete in a couple of Linux kernel releases.

Its design allows for a transition between ext and btrfs with the btrfs-convert tool, by saving the ext metadata elsewhere and re-mapping ext's data to btrfs extents. It still offers the original file system's data as a read-only disk image that can be mounted. Reverting back to ext is done by reinstating the ext metadata.

Unfortunately, it has a reputation of being unstable, corrupting your data and becoming unusable. This is a natural stage of file system maturity though, and BTRFS is my preferred root file system.

Posted Wed Feb 5 12:00:09 2014 Tags:

This series of articles (The Truisms) is aimed at imparting some of the things we find to hold true no matter the project one undertakes in the space carved out by our interests in Open Source and Free Software engineering. The first article in the series was If you don't know why you're doing it, you shouldn't be doing it.

When I first set out to build Gitano, I knew that I was doing it because I missed the ease with which I could publish bzr repositories on a server and I'd looked at Gitolite and knew I didn't want to use that. I'd taken one look at Github and discounted that too. But I didn't start with a plan for how to achieve any of this. Indeed I started with a name which ended up being taken by another project well before I was ready to release, meaning I had to rename my project before I'd even made one release.

Being a seasoned hacker I immediately started down a path of yak shaving by writing some utility code I'd need to let me interact with Git repositories from the language I'd chosen to implement Gitano in, without any kind of cohesive plan of what such a beast would look like, or work like. It took me a long time to dig my way back out of that hole. Eventually I made a plan for what I'd need to achieve a minimum viable product for myself and that really focussed me toward the right path.

As with reasons for writing things, plans don't have to be fully formed but they do need to be present to prevent you from wandering off the path into places which might seem interesting or useful but often don't help you achieve your goals. A plan can be as nebulous as "I will just hack and see what happens" but if it is, you need to be aware of what that means for how likely you are to achieve your ends in a timely fashion. I strongly recommend that you plan your projects at least at a simple bullet point list kind of level. For Gitano my plan was approximately:

  1. Build git abstraction for reading config
  2. Add push and pull support over ssh
  3. Add admin commands
  4. Add gitweb support

Things didn't necessarily go in that order exactly, but all in all, I achieved those basic goals and I had my minimal-viable-product. Naturally I didn't release this, but I did start using it myself. In addition to actions to perform, plans might include architectural decisions such as what language to implement your project in, or tooling decisions associated with the project. Just remember that no plan survives contact with the enemy and as such you should always be prepared to adjust your plan when new information comes to light.

So, here's your homework for this truism: Pick one project you're currently floundering on -- it might be related to open source, or it might simply be something you need to get done around the house. Write down a two or three point plan to achieve your goal and see if that helps you to get there any faster.

Posted Wed Feb 12 12:00:10 2014 Tags:
Lars Wirzenius Naming projects and stuff

Every project and program needs a name, but naming can be very difficult:

  • A name should be unique, so people don't confuse what you're doing with what someone else is doing. A UUID4 is thus a pretty good name.
  • A name should be easily pronounceable so that people feel comfortable using it in discussion. Likewise, it should be easy to write correctly. A UUID4 is thus a pretty bad name.
  • A name should not be a word or phrase that is commonly used already, so that search engines make it easy to find your project.
  • A name shouldn't be offensive or make people uncomfortable, in any language you care about.

As examples, I'll describe three of my own projects.

In the early-to-mid 1990s I wrote a GUI text editor for Linux, running under X11, back when fvwm2 was the new hot stuff, and having any graphical user interface at all for any task was a marvellous thing to be spoken of in awe. My editor was inspired by an article I read about sam, the text editor in Plan 9, though completely different. One of the things I liked about sam was its elegant simplicity. Thus, I named my editor the "Simple editor for X", or "sex" for short. Oh my youthful days, how lame I was. That was a name that made it impossible to find my editor with most search engines. On the other hand, it was unique (nobody else was as lame as I was), and easily pronounceable. However, people didn't really enjoy talking about "using sex", for whatever reason.

In the mid 2000s I wanted to host some mailing lists on my own, and didn't like any of the existing solutions, so I wrote my own. I quite liked the movie Dead Men Don't Wear Plaid. One of the features of the film was lists of names, titled either "friends of Carlotta" or "enemies of Carlotta". Due to the way the plot twisted, I named my mailing list manager "Enemies of Carlotta" (or "eoc" for short). It was fairly easy to find with a search engine, but you had to wade through lots of hits about the movie. Other than that, its only problem was that people didn't understand the plot of the movie, or hadn't seen the movie, and wanted to know why it wasn't friends instead of enemies.

A bit later, I started writing my own program for making backups. Having learnt from the above two mistakes, I wanted a name that was short, unique, pronounceable, and didn't make people look at me weirdly. After a lot of candidates, I eventually chose obnam as the name, not because I particularly liked it, but because having a name was obligatory. Thus, I chose an obligatory name.

For a while I was satisfied. Then Barack Obnam, sorry, Obama got elected as the president of the USA.

So, whatever name you choose, it's not going to be unproblematic, but if you can avoid a name that makes you look like a lame geek or an unfriendly person, you're doing OK.

A number of projects have names generated by a password generator such as pwgen. A program that produces pronounceable passwords works fairly well here. Even such names should be checked using search engines, just in case it's in use already. It might, for example, be a word in Finnish for "your fish tail has a funny hat", which makes no sense at all.
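If you don't have such a program to hand, a crude version is easy to write. The following Python sketch simply alternates consonants and vowels; it is not pwgen's algorithm, and the letter sets and function name are arbitrary choices for illustration.

```python
import random

# Arbitrary letter sets, chosen to avoid awkward sounds.
CONSONANTS = "bdfgklmnprstvz"
VOWELS = "aeiou"


def pronounceable_name(syllables=3, rng=random):
    """Build a name from consonant-vowel pairs, e.g. three pairs give six letters."""
    return "".join(
        rng.choice(CONSONANTS) + rng.choice(VOWELS) for _ in range(syllables)
    )


# Print a handful of candidates to check against a search engine.
for _ in range(5):
    print(pronounceable_name())
```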

After you have a name, you can create the project directory, and start hacking. You can also register domain names, set up mailing lists, etc, but you probably shouldn't. It's better to get something minimally usable done first, before committing to a name. You might want to change a name, for example, because while you're hacking, someone else may have decided to use the same name. (See Daniel's Truism 2 article for an example.)

Posted Wed Feb 19 12:00:14 2014 Tags:

This series of articles (The Truisms) is aimed at imparting some of the things we find to hold true no matter the project one undertakes in the space carved out by our interests in Open Source and Free Software engineering. The first article in the series was If you don't know why you're doing it, you shouldn't be doing it.

A long long time ago, in the dim and distant past, revision control consisted entirely of making a tarball of your code tree from time to time and calling that "releases". We passed from there to "local" revision control such as RCS and SCCS and from there to ways to collaborate with tools like CVS and thence to SVN, TLA, BZR, git etc.

If you'd like to know more about revision control, Lars wrote a very good article on the topic.

When I started writing open source code seriously, waaaay back in the dim and distant past, I used the tarball method (although they were zip files because I was stuck on Windows in those days since Linux hadn't really gained enough popular traction to work on my computer). The best way to "push" to that kind of revision control was to put it up on a website and persuade others to download it -- that way when your hard drive broke you could re-download it and not lose too much work. Since then, with CVS, SVN, etc I would have a server online somewhere and push my code to that. Said server was also backed up, but even if it hadn't been I'd have had two copies of my code.

It's possible, even now, for me to look back at some of the code I wrote a long long time ago (and cringe) thanks to conversions of those revision control repositories over time. And this is one advantage to having pushed the code -- it does still exist. From time to time I'm amused enough by something old to rewrite it in a modern way. For example the time I played with my countdown program (originally written in Pocket-C for PalmOS) and then ended up making a bad Haskell rewrite which didn't quite work, but was fun. That replacement is also pushed to revision control -- I wonder if any of you can find it and fix my bugs?

Keeping older versions of code can also be instructive. I'm very glad for the revision history for my Gitano software (mentioned in truism 2) because I can look back at different ways I did things and remember why things are the way they are now. This holds true for so many projects I'm part of that I sometimes wonder how I coped before I kept everything in revision control.

There are days when I am sad that some of my very early work is no longer available to me because it only ever existed on floppy discs which I can no longer access, even if I knew where they were. I'm hopeful that one day I'll clear a box in a house somewhere and find them all again and have a joyful time doing data archaeology. I'd love to find the statistical analysis of the random number generator I wrote for my Dad's computer back when I was 5 years old.

However, enough reminiscing. I am only trying to illustrate the point that if you have not kept your code in revision control then you miss out on so much that revision control provides, but that more importantly, if you have not pushed that revision controlled content elsewhere in some manner then you lose out just as much because data is fragile.

Your homework this time depends on your attitude to revision control. If you subscribe to this approach already then you should go find something you wrote a while ago and browse the code and the revision history and have a lovely time reminiscing over your past self and how clever/stupid they were. If you do not currently subscribe to this approach then your homework is to go find something you wrote which you're very proud of and love to bits, and then imagine how you'd feel if you lost it forever. Then you can put it under revision control and push it somewhere for everyone else to admire.

Posted Wed Feb 26 12:00:13 2014 Tags: