What is version control?
Version control. Revision control. Software configuration management. These are all names for the same thing: keeping track of all the changes you make to your program's source code.
Because night-time coding
Version control is useful so you can remember what you've changed over time. For example, if you publish version 1.0 of your program, and later on version 2.0, and someone asks you what you changed, version control is what you need to answer that.
Seeing what changed is also important so you can figure out what caused your program to break. You release version 2.0, and now your frobniter no longer cogitates. You can't remember making any change to the cogitation module. Indeed, you could swear you haven't. But looking at the differences reveals that you did, indeed, make a change. What's more, it was 4 am in the night after your birthday party when you did that, which explains why you don't remember doing it.
Version control, when used properly, remembers every change you've made: not just releases, but at a much finer grain. It can keep a snapshot of your work from every few minutes, whereas archived releases usually happen only fairly rarely.
Collaboration
Imagine a college or university terminal room. There's a few dozen computer terminals, or microcomputers, each one with someone working on something. In one corner, there's a group of students working together on a group project. Every few minutes one of them asks if it's safe to edit such and such a file. Every hour or two, there's a wail of anguish.
What they're doing is working together using a shared directory. Each of them is editing one file in the directory, and asking the others for permission. If two are editing the same file, they'll overwrite each other's changes. Sometimes they make mistakes and forget to ask for permission to edit a file.
Version control tools make collaboration easier. Everyone edits files on their own computer, and the version control tools synchronise the changes mostly automatically. The tools prevent anyone's changes from being overwritten.
Important concepts
There are many version control systems, but they share a few key concepts.
- A repository is where the version control system stores all the versions of all the files.
- When you've finished making some set of changes, you tell the version control system you've done that by making a commit.
- You can create branches, which isolate work. Changes made to one branch don't affect any other branch. This allows you to do some experimental changes in one branch, without ruining the main line of development.
- Branches can be merged, which means you take all the changes made in one branch and add them to another branch. The result contains everything from both branches. If your experimental changes turn out to be good, you can merge them into the main line of development. If they turn out not to be good, you can just drop the experimental branch, no harm done.
- You can look at a diff (difference, list of changes) made from one version to another, or between branches. The diff is usually in the form of a unified diff, which looks slightly weird to begin with, but quickly becomes a very efficient way to see what's changed.
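To make the diff concept concrete, here is a sketch of producing one with git in a throwaway repository (the file name and contents are invented for the example):

```shell
# Create a throwaway repository, commit a file, change it, and look at
# the resulting unified diff.
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
git config user.email you@example.com
git config user.name "You"
printf 'hello\nworld\n' > greeting.txt
git add greeting.txt
git commit -q -m 'first version'
printf 'hello\nthere\nworld\n' > greeting.txt
git diff    # in the output, added lines start with +, removed lines with -
```

The `@@` lines in the output locate each change in the file; once you're used to that, a unified diff is a very quick read.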
Version control systems are broadly classified into centralised and distributed systems. In a centralised one, every commit you make is immediately published to a repository on a server, and everyone collaborating on that project is using the same repository on the same server.
With a distributed system, there can be any number of repositories on any number of servers, and they need to be manually synchronised, using push and pull operations. Push sends your changes to the server, and pull retrieves others' changes from the server.
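A minimal sketch of push and pull, using a local bare repository to stand in for the server (all paths and names here are invented for the example):

```shell
# A bare repository acts as the "server"; two clones stand in for two
# developers' computers.
base=$(mktemp -d)
git init -q --bare "$base/shared.git"
git clone -q "$base/shared.git" "$base/alice" 2>/dev/null
cd "$base/alice"
git config user.email alice@example.com
git config user.name Alice
echo 'hello' > notes.txt
git add notes.txt
git commit -q -m 'add notes'
git push -q origin HEAD    # publish the commit to the "server"
git clone -q "$base/shared.git" "$base/bob" 2>/dev/null   # a collaborator retrieves it
```

In real use the bare repository would live on a remote machine, and the clone would use an ssh or https URL instead of a local path.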
The two important practical differences between centralised and distributed systems are that distributed systems are typically much, much better at merging, and individual developers are not at the mercy of whoever grants commit access. The latter is very important for free software development.
With centralised systems, every commit requires write access to the repository on the server. For reasons of safety, security, and control, the set of people allowed to commit is usually quite restricted. This means that other developers are at a disadvantage: they can't commit their changes. This makes development awkward.
This is why distributed version control systems are replacing centralised ones, in free software development, but also in general.
Popular version control systems
There are many popular version control systems. Here's a short list of the free software ones:
- git is a distributed version control system originally developed by Linus Torvalds for use with the Linux kernel. It is fairly efficient, and is used by a large number of free software projects now.
- Mercurial is another distributed version control system. It's not as popular as git, but a number of well-known projects use it, for example Python.
- Bazaar is also distributed, but failed to become popular outside Canonical and its Ubuntu distribution.
- Subversion is a centralised system, which is quite popular and has been used by a large number of popular projects. The tide is changing in favour of git, however.
- CVS is the grand-daddy of version control systems in the modern sense. It is outdated and archaic, but some projects still use it.
There are many more; Wikipedia has a list.
Which one should you learn? All the ones that are used by any of the projects you might want to contribute to.
Which one should you use for new projects? My vote is git, but if I tell you to use git, I'll be flamed by fans of other systems, so I won't do that. Use your own judgement and preferences.
Example, with git
Here's an example of using git, for a project of your own. It doesn't show how to use a server for sharing code with others, only how to use it locally.
To start with, you should create a project directory, and initialise it.
mkdir ~/my-project
cd ~/my-project
git init .
This creates the .git subdirectory, where git keeps its own data about your source code.
After this, you can create some files. You can then add them to version control.
emacs foo.c
vi bar.c
git add foo.c bar.c
You can now commit the files to version control. Git will open an editor for you to write a commit message. The message should describe the changes you are committing.
git commit
You can now make further changes, and then look at what you've changed since the last commit.
emacs bar.c
git diff
When you're ready with a new set of changes, and you've reached a point where you want to commit, you do just that. You can do this by using git add again, or you can simplify this by using the -a option to git commit. You need to git add every new file, but -a will catch changes to files git already knows about.
git commit -a
You can then look at all the commits you've made.
git log
With various options to git log you can add more output. For example, the -p option will add a diff of the changes in each commit.
git log -p
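The branching and merging concepts from earlier can be tried out locally in the same way. Here is a sketch, with invented file names:

```shell
# Do some experimental work on a branch, then merge it back.
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
git config user.email you@example.com
git config user.name "You"
echo stable > notes.txt
git add notes.txt
git commit -q -m 'main line'
git checkout -q -b experiment   # create a branch and switch to it
echo experimental >> notes.txt
git commit -q -a -m 'try something out'
git checkout -q -               # back to the main line of development
git merge -q experiment         # the experiment worked: merge it in
```

Had the experiment failed, you could have simply deleted the branch instead of merging, leaving the main line untouched.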
For more information, see the git tutorial.
What should version control be used for?
Version control is most often used for program source code. However, you can use it for all sorts of things:
- system configuration files: etckeeper
- personal configuration files: vcshome
- web site content: ikiwiki
- sharing files between computers: git-annex
The very basics
What is a network
A network is typically considered to be two or more computers (or networkable devices) joined together such that they can communicate with one another in a defined and logical fashion. Networks vary in size dramatically between something as simple as the network formed by your computer and your ADSL router at home, all the way up to the Internet which is a globe-spanning network whose purpose is, in part, to allow you to read this article.
How does my computer find networks?
Depending on your operating system, and the particular choices you have made when installing it, there are a number of ways in which it might be keeping track of, connecting to, and making use of networks. Also you might have different physical kinds of network you could join, such as wired Ethernet or Wi-Fi networks.
Most modern Linux-based desktop operating systems tend to use a piece of software called Network Manager which looks after the details of connecting to networks for you. Under the bonnet, Network Manager handles the choice of network interface, acquires an address on the network, and finds out how to locate other systems on the network.
Under the bonnet
Network Interfaces
Your computer, particularly if it is a laptop, may have many network interfaces. There may be an Ethernet port on the computer, into which you can plug a cable the other end of which is plugged into another networking device such as a switch or router. Alternatively your computer may be fitted with a Wi-Fi interface which allows your computer to attach to a Wi-Fi access point or router without needing a cable.
Some network interfaces are virtual. In a common UNIX system you will have at least one virtual network interface -- the local loopback interface. This interface exists so that every UNIX computer in isolation is still capable of being a network. This simplifies network software design considerably since it never has to consider the case of there being no network at all.
Addresses, what they are and how to get them
On the network there needs to be a way to identify each device. This is called an address. Each network interface automatically comes with one address called its MAC address which uniquely identifies the network interface hardware itself. At the next level up, the protocols which run on the network itself define addressing schemes. The two common protocols you will hear of are IP version 4 and IP version 6 (or IPv4 and IPv6 for short).
In IPv4, an address is four small (less than 256) numbers, separated by dots. For example, the address 127.0.0.1 is one which every computer has (the local loopback address). There are a number of well known IPv4 addresses such as 8.8.8.8 which we will come to later.
Most small networks operate a protocol to allocate addresses to devices when they turn up. This is commonly the dynamic host configuration protocol, or DHCP. This protocol allows a device to connect to a previously entirely unknown network and obtain the information it needs to be a good citizen of the network (an address) and details on how to find access to the wider world (the address of the network's routers).
Name resolution
If all we had were numeric addresses then our lives would be in a very sad and difficult place. Fortunately there exist a number of mechanisms for turning more easily remembered names into the numeric addresses they stand for. This process is called name resolution, and almost every networked system in existence uses it to find the addresses of systems it needs to talk to.
In the early days of networking, this name-to-address mapping was simply maintained in a text file on every host. This file still exists on many systems as /etc/hosts, although typically it contains nothing more than localhost and possibly the computer's name.
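You can look at this file on your own system; on most machines it contains little more than the loopback entries:

```shell
# /etc/hosts maps names to addresses directly, with no DNS involved.
cat /etc/hosts
```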
These days we use a system called the DNS, which means computers no longer need to know in advance the addresses of everything they wish to connect to.
The wider world
The DNS and common record types
The Domain Name System (DNS) is a distributed (but not decentralised) system for turning names into addresses (and back again). The DNS is centralised by virtue of there being a well known (and agreed upon) set of root name servers whose addresses are built into most software associated with serving the DNS.
The DNS is essentially a distributed database. The data stored within it is sharded, and the workload (and authority) is distributed according to ownership information encoded in the configuration of the root name servers (or other name servers further down the chain). This delegation of service is done by splitting the desired lookup at the dots. For example, a name in the DNS might be yakking.branchable.com, which splits into yakking, branchable and com. The authority for com can be looked up from the root name servers; those can then be queried for who knows about branchable, which will be another name server; that in turn can be queried for yakking, which will (hopefully) result in an address which can then be connected to, to retrieve useful articles containing information you wish to know. These sharded names are the 'domains' which give rise to the name: the Domain Name System.
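You can watch name resolution happen from the command line. getent(1) asks the system resolver, which consults /etc/hosts and then the DNS:

```shell
# Resolve a name using whatever the system resolver is configured with.
getent hosts localhost
# To follow the delegation chain from the root name servers downwards,
# you could try (this needs network access and the dig tool from the
# dnsutils/bind-utils package):
#   dig +trace yakking.branchable.com
```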
The DNS database consists of a number of different record types. The most commonly encountered ones are:
- NS: Name server records -- these state the name servers for a given DNS domain. For example, the NS records for pieni.net list the names ns1.pieni.net and ns2.pieni.net as being the name servers for the domain pieni.net.
- A: Address records -- these give the address of a given name. For example, a DNS entry may connect pieni.net to the address 95.142.166.37.
- CNAME: Canonical name records -- these give the canonical name of an alias name. For example, you may have a DNS entry which says that the name www.pieni.net is more correctly known as simply pieni.net.
- MX: Mail Exchanger records -- these indicate, for a given domain name, where the computers are which provide the mail service for that domain. For example, you may have an entry which says that the MX for pieni.net is 10 hrun.pieni.net., which means that at priority 10, hrun.pieni.net handles email for anything@pieni.net.
Given these different record types, it's possible that a given name may have many records. All record types can coexist with one another to a greater or lesser extent, although in practice, CNAME records do not co-exist very well with most other record types.
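You can query each record type yourself with dig(1). These commands need network access, and use the same example domain as above:

```shell
# Ask the DNS for each record type in turn; +short trims the output
# down to just the answers.
dig +short NS pieni.net
dig +short A pieni.net
dig +short MX pieni.net
```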
The Whois system
Along with the DNS, there is a mechanism for mapping these domain names (and indeed addresses) to their owners. The whois system links together domain names and address blocks with their legal owning entities. There are a number of well known whois servers; these are operated by the regional organisations (the regional Internet registries, or RIRs) charged with maintaining the DNS and whois data.
Via the RIR servers a whois client can find out who owns various network entities such as addresses, names or network blocks. For example, if you issue the command whois pieni.net at the command line, you may get output including who registered the name, who is technically in charge of it, and which internet registry is providing registration services.
Protection (Firewalls)
Fundamentally a network is an intrinsically open world. If you can connect to the network (which might involve either physically being able to plug into it or perhaps knowing the password for the Wi-Fi network) then you can determine the other users of the network and connect to them indiscriminately.
In order to protect devices on the network there is a class of software called a firewall. A firewall might protect one network from another network, or one device from a network, or some combination thereof. Firewalls essentially limit who can connect to whom, and for what purposes. They exist at many levels of the networking stack and have many features and operations they can perform.
Further reading
If you wish to know more about networking, you might look up information about:
- The TCP/IP Model
- The OSI stack
- The name service switch
- The resolv.conf file.
You might want to play with these command-line tools:
If you're interested in firewalls, you might investigate
And if you simply wish to know more about networking in a general sense then you could do worse than to look at:
- http://www.computerhope.com/jargon/n/network.htm
- http://www.igcseict.info/theory/4/netw/
- http://en.wikipedia.org/wiki/Computer_network
- http://www.webopedia.com/TERM/N/network.html
Each file and directory in the filesystem carries several bits of metadata:
- the (numeric) user id that owns the file
- the (numeric) group id that owns the file
- some permission bits to specify who can do what with the file
- some additional metadata bits
The permission and other metadata bits form the mode of the file. For a full list, see the chmod(2) and stat(2) manual pages. This article summarises the most important bits.
Permissions
There are three groups of permissions: one for the owner of the file, one for the group, and one for everyone else. Each group consists of three possible permissions: read, write, and execute. There are two common ways to represent the permissions: octal and "ls long form".
For example, here is what ls(1) shows for this article draft before I started this sentence:
-rw-rw-r-- 1 liw liw 836 Nov 10 16:33 drafts/liw-permissions.mdwn
The first column is the permission bits, plus the file type. Let's open that up:
- the leading dash (-) indicates it is a regular file; the other common file type is the letter d for directories, but there are several others, which we'll skip here
- then there are three groups of three letters: in the example above they are rw-, rw-, and r--, for the owner, group, and others, respectively
- r means read permission
- w means write permission
- x means execute permission
- a dash (-) means lack of the permission that would be at that position
In other words, the article draft is readable and writeable by the owner and group, and readable by others, and not executable by anyone.
Reading and writing regular files is pretty obvious. Executability means the kernel will (try to) execute the file as a program. This works for both actual binaries, and for scripts in various languages. For an example of executable permissions, try ls -l /bin/*.
If you have read permissions in a directory, you can list its contents. If you have write permission, you can create or remove files in the directory. Removing a file requires modifying the directory it is in: the permissions of the file itself do not matter.
Execute permission for directories is different from files. A directory can't be meaningfully executed as a program. Instead, execute permission means whether you can access files (or subdirectories) in the directory. Accessing means using the directory in the path to a file: if you have read permission to a file, but not execute permission to its directory, you can't read the file.
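A quick experiment in a scratch directory shows the effect (note that if you run this as root, permission checks are bypassed and the failing step will succeed):

```shell
# Remove execute permission from a directory and try to use a file in it.
tmp=$(mktemp -d)
cd "$tmp"
mkdir box
echo secret > box/file
chmod 644 box            # read, but no execute, on the directory
ls box || true           # listing the names still works (read permission)...
cat box/file || true     # ...but accessing a file inside is refused (unless root)
chmod 755 box            # restore execute permission
cat box/file             # now the file can be read again
```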
Octal representation
Octal representation uses base-8 numbers, because there are three bits in each subset of permissions. Read permission is represented by 4, write by 2, and execute by 1. Thus, the article draft's permissions can be concisely represented as 0664 (where the leading 0 indicates octal: this is a Unix convention). After a while, this becomes easy to read and write.
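For example, you can set and inspect the octal mode from the shell (the file name here is invented):

```shell
# Set permissions using the octal form, then read them back both ways.
tmp=$(mktemp -d)
touch "$tmp/draft.mdwn"
chmod 0664 "$tmp/draft.mdwn"
ls -l "$tmp/draft.mdwn"        # shows -rw-rw-r--
stat -c %a "$tmp/draft.mdwn"   # shows 664 (GNU stat)
```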
Umask
The octal representation is used in a few corners of the Unix world, without a cleartext form available at all. Primary among these is the umask, which is a bitmask of permissions to remove when a file is created. Properly behaving Unix programs create new files with a mode of 0666, unless there's a reason to use another mode, e.g., for security. The mode when creating a file is ANDed with the complement of the umask. A common umask is 0022 (i.e., the write bits for group and others), which means that files are created so that the group and others can read the file, but only the owner can write. The point of this complication is to give the user the flexibility to easily control the permissions of new files, which becomes important when several people need direct access to the files. See the umask(2) manual page for more information. To change the umask, you have to use the shell's built-in umask command (see your shell's manual page).
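You can see the umask at work by creating files under different masks:

```shell
# touch creates files with mode 0666, which the umask then masks down.
tmp=$(mktemp -d)
cd "$tmp"
umask 0022
touch via-0022
umask 0077              # stricter: nothing for group or others
touch via-0077
stat -c %a via-0022     # 644: 0666 with the group/other write bits removed
stat -c %a via-0077     # 600: only the owner keeps any access
```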
Manipulation
Permissions are manipulated with the chmod(1) command, which understands the octal form, but also has a mini-language for setting or changing the bits. See the manual page for details.
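A small taste of that mini-language, starting from a known mode so the result is predictable:

```shell
# Start at 0666, add execute for the owner, remove write from group and
# others: u+x,go-w.
tmp=$(mktemp -d)
cd "$tmp"
touch script.sh
chmod 0666 script.sh
chmod u+x,go-w script.sh
stat -c %a script.sh    # 744
```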
Other mode bits
For extra fun and games, look up sticky, setuid and setgid bits. These change how permission bits are interpreted. The details are intricate enough that you should read the manpages (chmod(2), stat(2)) to understand them correctly.
Post Scriptum
At this point, the metadata for this article draft looks like this:
$ stat drafts/liw-permissions.mdwn
File: `drafts/liw-permissions.mdwn'
Size: 4562 Blocks: 16 IO Block: 4096 regular file
Device: fe01h/65025d Inode: 786678 Links: 1
Access: (0664/-rw-rw-r--) Uid: ( 1000/ liw) Gid: ( 1000/ liw)
Access: 2013-11-10 16:27:54.092026928 +0000
Modify: 2013-11-10 17:02:04.994196802 +0000
Change: 2013-11-10 17:02:04.994196802 +0000
Birth: -
(Note the stat(1) command, which is another handy command line utility.)
Programming languages come in two major flavours -- those which are run directly from the source code as written (interpreted or scripting languages) and those which are first passed through a process which renders them in a form that the computer can execute directly (compiled languages). We have previously discussed some of the scripting languages and in this article we'll tackle some of the more common compiled languages.
It's worth noting here that many of the interpreted languages such as Python, Perl or Lua have a compilation step, but these languages usually compile to a virtual machine's bytecode.
What languages will you find
C
On an average Linux-based system (or BSD for that matter) you will find a lot of code written in the language called C. Indeed the Linux kernel and most BSD kernels are written in C. Most of the core operating system utilities such as cp, ls etc. are likely written in C unless you're on a very odd operating system.
C is considered the canonical programming language for UNIX systems and it's recommended that everyone who writes code on UNIX be aware of and familiar with C, even if you do not actively program in it on a day-to-day basis.
Many manual pages for system calls etc are written with their examples in C.
C++
In 1979, Bjarne Stroustrup started work on a new programming language which was eventually called C++. C++ adds a lot of new syntax and semantic elements to the language of C and these days is considered an entirely separate language which happens to have some similarities to C. C++ is favoured by some large projects run by people who subscribe to the object-oriented programming paradigms very strongly.
C++ is favoured by many GUI authors. The toolkit Qt is written in a meta-language built on top of C++. The KDE desktop environment is in turn written on top of Qt, and as such is written in C++. C++ is also favoured on Windows since it's Microsoft's chosen systems programming language.
Objective C
Objective C marries the C language with some of the properties of the Smalltalk syntax and semantics. It is favoured by object-oriented programmers who don't like C++ but want more than C provides by default. The GNUStep project is written mostly in Objective C. Objective C is also favoured on the Mac OS platforms (Mac OS, Mac OS X and the various iOS variants).
Java
While Java is, strictly speaking, a compiled language it has similarities to scripting languages in that the compiler targets a virtual machine rather than the underlying machine code of the computer it is to run on. This forms the basis of Java's claim of "Write once, run everywhere".
Java is favoured by enterprise programmers and can also be found as the language underneath Android applications. Tools such as the Jenkins CI controller or the Eclipse integrated development environment are written in Java.
C#
Many people think C# is just a Microsoft .NET language, but with the mono toolchain, the "Common Language Runtime" is available on more than just Microsoft platforms. At one point mono was very popular for writing GTK+ based desktop applications such as Tomboy.
Go
An increasing number of small utilities and tools are written in a language, invented by (among others) some Googlers, called Go. Go is gaining traction among systems programmers who, jaded by C++, C# and the like, are drawn to it by language features seemingly designed with them in mind. While there have been a great many projects written in Go, none of them seem to be mainstream applications at the time of writing this article.
Haskell
Haskell is a very different beast of a programming language in comparison to the above. For a start, Haskell is a functional programming language rather than an imperative language of some kind.
Haskell has been around for a long time, but is only recently gaining traction as a full-power systems programming language. A surprising number of tools available on a modern UNIX system are written in Haskell. Perhaps the best known of these are Pandoc, by John MacFarlane, and git-annex, written by Joey Hess.
Other compiled languages
There are many, many more compiled programming languages.
Explore the wonderful world of compiled languages and marvel at the range and variety of syntaxes all of which are meant to compile down to the same kind of machine code.