What is version control?
Version control. Revision control. Software configuration management. These are all names for the same thing: keeping track of all the changes you make to your program's source code.
Because night-time coding
Version control is useful so you can remember what you've changed over time. For example, if you publish version 1.0 of your program, and later on version 2.0, and someone asks you what you changed, version control is what you need to answer that.
Seeing what changed is also important so you can figure out what caused your program to break. You release version 2.0, and now your frobniter no longer cogitates. You can't remember making any change to the cogitation module. Indeed, you could swear you haven't. But looking at the differences reveals that you did, indeed, make a change. What's more, it was 4 am in the night after your birthday party when you did that, which explains why you don't remember doing it.
Version control, when used properly, remembers every change you've made: not just releases, but at a much finer grain. It can keep a snapshot of your work from every few minutes, whereas archived releases usually happen only fairly rarely.
Collaboration
Imagine a college or university terminal room. There's a few dozen computer terminals, or microcomputers, each one with someone working on something. In one corner, there's a group of students working together on a group project. Every few minutes one of them asks if it's safe to edit such and such a file. Every hour or two, there's a wail of anguish.
What they're doing is working together using a shared directory. Each of them is editing one file in the directory, and asking the others for permission. If two are editing the same file, they'll overwrite each other's changes. Sometimes they make mistakes and forget to ask for permission to edit a file.
Version control tools make collaboration easier. Everyone edits files on their own computer, and the version control tools synchronise the changes mostly automatically. The tools prevent anyone's changes from being overwritten.
Important concepts
There are many version control systems, but they share a few key concepts.
- A repository is where the version control system stores all the versions of all the files.
- When you've finished making some set of changes, you tell the version control system you've done that by making a commit.
- You can create branches, which isolate work. Changes made to one branch don't affect any other branch. This allows you to do some experimental changes in one branch, without ruining the main line of development.
- Branches can be merged, which means you take all the changes made in one branch and add them to another branch. The result contains everything from both branches. If your experimental changes turn out to be good, you can merge them into the main line of development. If they turn out not to be good, you can just drop the experimental branch, no harm done.
- You can look at a diff (difference, list of changes) made from one version to another, or between branches. The diff is usually in the form of a unified diff, which looks slightly weird to begin with, but quickly becomes a very efficient way to see what's changed.
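To make the diff concept concrete, here is a sketch of producing one with git in a throwaway repository (the file name and contents are invented for the example):

```shell
# Create a throwaway repository, commit a file, change it, and look at
# the resulting unified diff.
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
git config user.email you@example.com
git config user.name "You"
printf 'hello\nworld\n' > greeting.txt
git add greeting.txt
git commit -q -m 'first version'
printf 'hello\nthere\nworld\n' > greeting.txt
git diff    # in the output, added lines start with +, removed lines with -
```

The `@@` lines in the output locate each change in the file; once you're used to that, a unified diff is a very quick read.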
Version control systems are broadly classified into centralised and distributed systems. In a centralised one, every commit you make is immediately published to a repository on a server, and everyone collaborating on that project is using the same repository on the same server.
With a distributed system, there can be any number of repositories on any number of servers, and they need to be manually synchronised, using push and pull operations. Push sends your changes to the server, and pull retrieves others' changes from the server.
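A minimal sketch of push and pull, using a local bare repository to stand in for the server (all paths and names here are invented for the example):

```shell
# A bare repository acts as the "server"; two clones stand in for two
# developers' computers.
base=$(mktemp -d)
git init -q --bare "$base/shared.git"
git clone -q "$base/shared.git" "$base/alice" 2>/dev/null
cd "$base/alice"
git config user.email alice@example.com
git config user.name Alice
echo 'hello' > notes.txt
git add notes.txt
git commit -q -m 'add notes'
git push -q origin HEAD    # publish the commit to the "server"
git clone -q "$base/shared.git" "$base/bob" 2>/dev/null   # a collaborator retrieves it
```

In real use the bare repository would live on a remote machine, and the clone would use an ssh or https URL instead of a local path.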
The two important practical differences between centralised and distributed systems are that distributed systems are typically much, much better at merging, and individual developers are not at the mercy of whoever grants commit access. The latter is very important for free software development.
With centralised systems, every commit requires write access to the repository on the server. For reasons of safety, security, and control, the set of people allowed to commit is usually quite restricted. This means that other developers are at a disadvantage: they can't commit their changes. This makes development awkward.
This is why distributed version control systems are replacing centralised ones, in free software development, but also in general.
Popular version control systems
There are many popular version control systems. Here's a short list of the free software ones:
- git is a distributed version control system originally developed by Linus Torvalds for use with the Linux kernel. It is fairly efficient, and is used by a large number of free software projects now.
- Mercurial is another distributed version control system. It's not as popular as git, but a number of well-known projects use it, for example Python.
- Bazaar is also distributed, but failed to become popular outside Canonical and its Ubuntu distribution.
- Subversion is a centralised system, which is quite popular and has been used by a large number of popular projects. The tide is changing in favour of git, however.
- CVS is the grand-daddy of version control systems in the modern sense. It is outdated and archaic, but some projects still use it.
There are many more; Wikipedia has a list.
Which one should you learn? All the ones that are used by any of the projects you might want to contribute to.
Which one should you use for new projects? My vote is git, but if I tell you to use git, I'll be flamed by fans of other systems, so I won't do that. Use your own judgement and preferences.
Example, with git
Here's an example of using git, for a project of your own. It doesn't show how to use a server for sharing code with others, only how to use it locally.
To start with, you should create a project directory, and initialise it.
mkdir ~/my-project
cd ~/my-project
git init .
This creates the .git subdirectory, where git keeps its own data about your source code.
After this, you can create some files. You can then add them to version control.
emacs foo.c
vi bar.c
git add foo.c bar.c
You can now commit the files to version control. Git will open an editor for you to write a commit message. The message should describe the changes you are committing.
git commit
You can now make further changes, and then look at what you've changed since the last commit.
emacs bar.c
git diff
When you're ready with a new set of changes, and you've reached a point where you want to commit, you do just that. You can do this by using git add again, or you can simplify this by using the -a option to git commit. You need to git add every new file, but -a will catch changes to files git already knows about.
git commit -a
You can then look at all the commits you've made.
git log
With various options to git log you can add more output. For example, the -p option will add a diff of the changes in each commit.
git log -p
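The branching and merging concepts from earlier can be tried out locally in the same way. Here is a sketch, with invented file names:

```shell
# Do some experimental work on a branch, then merge it back.
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
git config user.email you@example.com
git config user.name "You"
echo stable > notes.txt
git add notes.txt
git commit -q -m 'main line'
git checkout -q -b experiment   # create a branch and switch to it
echo experimental >> notes.txt
git commit -q -a -m 'try something out'
git checkout -q -               # back to the main line of development
git merge -q experiment         # the experiment worked: merge it in
```

Had the experiment failed, you could have simply deleted the branch instead of merging, leaving the main line untouched.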
For more information, see the git tutorial.
What should version control be used for?
Version control is most often used for program source code. However, you can use it for all sorts of things:
- system configuration files: etckeeper
- personal configuration files: vcshome
- web site content: ikiwiki
- sharing files between computers: git-annex
The very basics
What is a network
A network is typically considered to be two or more computers (or networkable devices) joined together such that they can communicate with one another in a defined and logical fashion. Networks vary in size dramatically between something as simple as the network formed by your computer and your ADSL router at home, all the way up to the Internet which is a globe-spanning network whose purpose is, in part, to allow you to read this article.
How does my computer find networks?
Depending on your operating system, and the particular choices you have made when installing it, there are a number of ways in which it might be keeping track of, connecting to, and making use of networks. Also you might have different physical kinds of network you could join, such as wired Ethernet or Wi-Fi networks.
Most modern Linux-based desktop operating systems tend to use a piece of software called Network Manager which looks after the details of connecting to networks for you. Under the bonnet, Network Manager handles the choice of network interface, acquires an address on the network, and finds out how to locate other systems on the network.
Under the bonnet
Network Interfaces
Your computer, particularly if it is a laptop, may have many network interfaces. There may be an Ethernet port on the computer, into which you can plug a cable the other end of which is plugged into another networking device such as a switch or router. Alternatively your computer may be fitted with a Wi-Fi interface which allows your computer to attach to a Wi-Fi access point or router without needing a cable.
Some network interfaces are virtual. In a common UNIX system you will have at least one virtual network interface -- the local loopback interface. This interface exists so that every UNIX computer in isolation is still capable of being a network. This simplifies network software design considerably since it never has to consider the case of there being no network at all.
Addresses, what they are and how to get them
On the network there needs to be a way to identify each device. This is called an address. Each network interface automatically comes with one address called its MAC address which uniquely identifies the network interface hardware itself. At the next level up, the protocols which run on the network itself define addressing schemes. The two common protocols you will hear of are IP version 4 and IP version 6 (or IPv4 and IPv6 for short).
In IPv4, an address is four small (less than 256) numbers, separated by dots. For example, the address 127.0.0.1 is one which every computer has (the local loopback address). There are a number of well known IPv4 addresses such as 8.8.8.8 which we will come to later.
Most small networks operate a protocol to allocate addresses to devices when they turn up. This is commonly the dynamic host configuration protocol, or DHCP. This protocol allows a device to connect to a previously entirely unknown network and obtain the information it needs to be a good citizen of the network (an address) and details on how to find access to the wider world (the address of the network's routers).
Name resolution
If all we had were numeric addresses then our lives would be in a very sad and difficult place. Fortunately there exist a number of mechanisms for turning more easily remembered names into the numeric addresses they stand for. This process is called name resolution, and almost every networked system in existence uses it to find the addresses of systems it needs to talk to.
In the early days of networking, this name-to-address mapping was simply maintained in a text file on every host. This file still exists on many systems as /etc/hosts, although typically it contains nothing more than localhost and possibly the computer's name.
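You can look at this file on your own system; on most machines it contains little more than the loopback entries:

```shell
# /etc/hosts maps names to addresses directly, with no DNS involved.
cat /etc/hosts
```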
These days we use a system called the DNS, which means computers no longer need to know in advance the addresses of everything they wish to connect to.
The wider world
The DNS and common record types
The Domain Name System (DNS) is a distributed (but not decentralised) system for turning names into addresses (and back again). The DNS is centralised by virtue of there being a well known (and agreed upon) set of root name servers whose addresses are built into most software associated with serving the DNS.
The DNS is essentially a distributed database. The data stored within it is sharded, and the workload (and authority) is distributed according to ownership information encoded in the configuration of the root name servers (or other name servers further down the chain). This delegation of service is done by splitting the desired lookup at the dots. For example, a name in the DNS might be yakking.branchable.com, which splits into yakking, branchable and com. The authority for com can be looked up from the root name servers; those can then be queried for who knows about branchable, which will be another name server; that in turn can be queried for yakking, which will (hopefully) result in an address which can then be connected to, to retrieve useful articles containing information you wish to know. These sharded names are the 'domains' which give rise to the name: the Domain Name System.
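You can watch name resolution happen from the command line. getent(1) asks the system resolver, which consults /etc/hosts and then the DNS:

```shell
# Resolve a name using whatever the system resolver is configured with.
getent hosts localhost
# To follow the delegation chain from the root name servers downwards,
# you could try (this needs network access and the dig tool from the
# dnsutils/bind-utils package):
#   dig +trace yakking.branchable.com
```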
The DNS database consists of a number of different record types. The most commonly encountered ones are:
- NS: Name server records -- these state the name servers for a given DNS domain. For example, the NS records for pieni.net list the names ns1.pieni.net and ns2.pieni.net as being the name servers for the domain pieni.net.
- A: Address records -- these give the address of a given name. For example, a DNS entry may connect pieni.net to the address 95.142.166.37.
- CNAME: Canonical name records -- these give the canonical name of an alias name. For example, you may have a DNS entry which says that the name www.pieni.net is more correctly known as simply pieni.net.
- MX: Mail Exchanger records -- these indicate, for a given domain name, where the computers are which provide the mail service for that domain. For example, you may have an entry which says that the MX for pieni.net is 10 hrun.pieni.net., which means that at priority 10, hrun.pieni.net handles email for anything@pieni.net.
Given these different record types, it's possible that a given name may have many records. All record types can coexist with one another to a greater or lesser extent, although in practice, CNAME records do not co-exist very well with most other record types.
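You can query each record type yourself with dig(1). These commands need network access, and use the same example domain as above:

```shell
# Ask the DNS for each record type in turn; +short trims the output
# down to just the answers.
dig +short NS pieni.net
dig +short A pieni.net
dig +short MX pieni.net
```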
The Whois system
Along with the DNS, there is a mechanism for mapping these domain names (and indeed addresses) to their owners. The whois system links together domain names and address blocks with their legal owning entities. There are a number of well known whois servers; these are operated by the regional organisations (the regional Internet registries, or RIRs) charged with maintaining the DNS and whois data.
Via the RIR servers a whois client can find out who owns various network entities such as addresses, names or network blocks. For example, if you issue the command whois pieni.net at the command line, you may get output including who registered the name, who is technically in charge of it, and which internet registry is providing registration services.
Protection (Firewalls)
Fundamentally a network is an intrinsically open world. If you can connect to the network (which might involve either physically being able to plug into it or perhaps knowing the password for the Wi-Fi network) then you can determine the other users of the network and connect to them indiscriminately.
In order to protect devices on the network there is a class of software called a firewall. A firewall might protect one network from another network, or one device from a network, or some combination thereof. Firewalls essentially limit who can connect to whom, and for what purposes. They exist at many levels of the networking stack and have many features and operations they can perform.
Further reading
If you wish to know more about networking, you might look up information about:
- The TCP/IP Model
- The OSI stack
- The name service switch
- The resolv.conf file.
You might want to play with these command-line tools:
If you're interested in firewalls, you might investigate
And if you simply wish to know more about networking in a general sense then you could do worse than to look at:
- http://www.computerhope.com/jargon/n/network.htm
- http://www.igcseict.info/theory/4/netw/
- http://en.wikipedia.org/wiki/Computer_network
- http://www.webopedia.com/TERM/N/network.html
Each file and directory in the filesystem carries several bits of metadata:
- the (numeric) user id that owns the file
- the (numeric) group id that owns the file
- some permission bits to specify who can do what with the file
- some additional metadata bits
The permission and other metadata bits form the mode of the file. For a full list, see the chmod(2) and stat(2) manual pages. This article summarises the most important bits.
Permissions
There are three groups of permissions: one for the owner of the file, one for the group, and one for everyone else. Each group consists of three possible permissions: read, write, and execute. There are two common ways to represent the permissions: octal and "ls long form".
For example, here is what ls(1) shows for this article draft before I started this sentence:
-rw-rw-r-- 1 liw liw 836 Nov 10 16:33 drafts/liw-permissions.mdwn
The first column is the permission bits, plus the file type. Let's open that up:
- the leading dash (-) indicates it is a regular file; the other common file type is the letter d for directories, but there are several others, which we'll skip here
- then there are three groups of three letters: in the example above they are rw-, rw-, and r--, for the owner, group, and others, respectively
- r means read permission
- w means write permission
- x means execute permission
- a dash (-) means lack of the permission that would be at that position
In other words, the article draft is readable and writeable by the owner and group, and readable by others, and not executable by anyone.
Reading and writing regular files is pretty obvious. Executability means the kernel will (try to) execute the file as a program. This works for both actual binaries, and for scripts in various languages. For an example of executable permissions, try ls -l /bin/*.
If you have read permissions in a directory, you can list its contents. If you have write permission, you can create or remove files in the directory. Removing a file requires modifying the directory it is in: the permissions of the file itself do not matter.
Execute permission for directories is different from files. A directory can't be meaningfully executed as a program. Instead, execute permission means whether you can access files (or subdirectories) in the directory. Accessing means using the directory in the path to a file: if you have read permission to a file, but not execute permission to its directory, you can't read the file.
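A quick experiment in a scratch directory shows the effect (note that if you run this as root, permission checks are bypassed and the failing step will succeed):

```shell
# Remove execute permission from a directory and try to use a file in it.
tmp=$(mktemp -d)
cd "$tmp"
mkdir box
echo secret > box/file
chmod 644 box            # read, but no execute, on the directory
ls box || true           # listing the names still works (read permission)...
cat box/file || true     # ...but accessing a file inside is refused (unless root)
chmod 755 box            # restore execute permission
cat box/file             # now the file can be read again
```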
Octal representation
Octal representation uses base-8 numbers, because there are three bits in each subset of permissions. Read permission is represented by 4, write by 2, and execute by 1. Thus, the article draft's permissions can be concisely represented as 0664 (where the leading 0 indicates octal: this is a Unix convention). After a while, this becomes easy to read and write.
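For example, you can set and inspect the octal mode from the shell (the file name here is invented):

```shell
# Set permissions using the octal form, then read them back both ways.
tmp=$(mktemp -d)
touch "$tmp/draft.mdwn"
chmod 0664 "$tmp/draft.mdwn"
ls -l "$tmp/draft.mdwn"        # shows -rw-rw-r--
stat -c %a "$tmp/draft.mdwn"   # shows 664 (GNU stat)
```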
Umask
The octal representation is used in a few corners of the Unix world, without a cleartext form available at all. Primary among these is the umask, which is a bitmask of permissions to remove when a file is created. Properly behaving Unix programs create new files with a mode of 0666, unless there's a reason to use another mode, e.g., for security. The mode when creating a file is ANDed with the complement of the umask. A common umask is 0022 (i.e., the write bits for group and others), which means that files are created so that the group and others can read the file, but only the owner can write. The point of this complication is to give the user the flexibility to easily control the permissions of new files, which becomes important when several people need direct access to the files. See the umask(2) manual page for more information. To change the umask, you have to use the shell's built-in umask command (see your shell's manual page).
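You can see the umask at work by creating files under different masks:

```shell
# touch creates files with mode 0666, which the umask then masks down.
tmp=$(mktemp -d)
cd "$tmp"
umask 0022
touch via-0022
umask 0077              # stricter: nothing for group or others
touch via-0077
stat -c %a via-0022     # 644: 0666 with the group/other write bits removed
stat -c %a via-0077     # 600: only the owner keeps any access
```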
Manipulation
Permissions are manipulated with the chmod(1) command, which understands the octal form, but also has a mini-language for setting or changing the bits. See the manual page for details.
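A small taste of that mini-language, starting from a known mode so the result is predictable:

```shell
# Start at 0666, add execute for the owner, remove write from group and
# others: u+x,go-w.
tmp=$(mktemp -d)
cd "$tmp"
touch script.sh
chmod 0666 script.sh
chmod u+x,go-w script.sh
stat -c %a script.sh    # 744
```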
Other mode bits
For extra fun and games, look up sticky, setuid and setgid bits. These change how permission bits are interpreted. The details are intricate enough that you should read the manpages (chmod(2), stat(2)) to understand them correctly.
Post Scriptum
At this point, the metadata for this article draft looks like this:
$ stat drafts/liw-permissions.mdwn
File: `drafts/liw-permissions.mdwn'
Size: 4562 Blocks: 16 IO Block: 4096 regular file
Device: fe01h/65025d Inode: 786678 Links: 1
Access: (0664/-rw-rw-r--) Uid: ( 1000/ liw) Gid: ( 1000/ liw)
Access: 2013-11-10 16:27:54.092026928 +0000
Modify: 2013-11-10 17:02:04.994196802 +0000
Change: 2013-11-10 17:02:04.994196802 +0000
Birth: -
(Note the stat(1) command, which is another handy command line utility.)
Programming languages come in two major flavours -- those which are run directly from the source code as written (interpreted or scripting languages) and those which are first passed through a process which renders them in a form that the computer can execute directly (compiled languages). We have previously discussed some of the scripting languages and in this article we'll tackle some of the more common compiled languages.
It's worth noting here that many of the interpreted languages such as Python, Perl or Lua have a compilation step, but these languages usually compile to a virtual machine's bytecode.
What languages will you find
C
On an average Linux-based system (or BSD for that matter) you will find a lot of code written in the language called C. Indeed the Linux kernel and most BSD kernels are written in C. Most of the core operating system utilities such as cp, ls etc. are likely written in C unless you're on a very odd operating system.
C is considered the canonical programming language for UNIX systems and it's recommended that everyone who writes code on UNIX be aware of and familiar with C, even if you do not actively program in it on a day-to-day basis.
Many manual pages for system calls etc are written with their examples in C.
C++
In 1979, Bjarne Stroustrup started work on a new programming language which was eventually called C++. C++ adds a lot of new syntax and semantic elements to the language of C and these days is considered an entirely separate language which happens to have some similarities to C. C++ is favoured by some large projects run by people who subscribe to the object-oriented programming paradigms very strongly.
C++ is favoured by many GUI authors. The toolkit Qt is written in a meta-language built on top of C++. The KDE desktop environment is in turn written on top of Qt, and as such is written in C++. C++ is also favoured on Windows since it's Microsoft's chosen systems programming language.
Objective C
Objective C marries the C language with some of the properties of the Smalltalk syntax and semantics. It is favoured by object-oriented programmers who don't like C++ but want more than C provides by default. The GNUStep project is written mostly in Objective C. Objective C is also favoured on the Mac OS platforms (Mac OS, Mac OS X and the various iOS variants).
Java
While Java is, strictly speaking, a compiled language it has similarities to scripting languages in that the compiler targets a virtual machine rather than the underlying machine code of the computer it is to run on. This forms the basis of Java's claim of "Write once, run everywhere".
Java is favoured by enterprise programmers and can also be found as the language underneath Android applications. Tools such as the Jenkins CI controller or the Eclipse integrated development environment are written in Java.
C#
Many people think C# is just a Microsoft .NET language, but with the mono toolchain, the "Common Language Runtime" is available on more than just Microsoft platforms. At one point mono was very popular for writing GTK+ based desktop applications such as Tomboy.
Go
An increasing number of small utilities and tools are written in a language, invented by (among others) some Googlers, called Go. Go is gaining traction among systems programmers who, jaded by C++, C# and the like, are drawn to it by language features seemingly designed with them in mind. While there have been a great many projects written in Go, none of them seem to be mainstream applications at the time of writing this article.
Haskell
Haskell is a very different beast of a programming language in comparison to the above. For a start, Haskell is a functional programming language rather than an imperative language of some kind.
Haskell has been around for a long time, but is only recently gaining traction as a full-power systems programming language. A surprising number of tools available on a modern UNIX system are written in Haskell. Perhaps the best known of these are Pandoc, by John MacFarlane, and git-annex, written by Joey Hess.
Other compiled languages
There are many, many more compiled programming languages.
Explore the wonderful world of compiled languages and marvel at the range and variety of syntaxes all of which are meant to compile down to the same kind of machine code.