A few weeks ago I introduced you to three scripting languages -- Python, Perl, and Lua. We talked about their REPLs and looked at the simple expression and output syntax along with how to exit the REPLs.

At the end of the article I encouraged you to play around with some simple expressions and the print statement and see what you could achieve with your chosen language(s).

This time, I'd like to introduce you to some simple data structures available in the three scripting languages and some simple ways to interact with them. We've already seen numbers and strings of characters, but most languages have at least one or two more basic data structures available. Almost everything else in the language will be built out of the basic data structures, so mastering them is important, even if it takes a while.

Python

In Python, there are a number of core data structures, many of which have their own syntax. All of them are classes, but the numbers and strings that we've encountered already are built in to the interpreter and language, along with lists, dicts, tuples, and sets. (There are a few more such as booleans, but I trust you can look those up for yourselves.)

>>> 12 # Number
12
>>> "string"
'string'
>>> ["list","of","strings"]
['list', 'of', 'strings']
>>> {"dict":"of","things":14}
{'things': 14, 'dict': 'of'}
>>> set([1,2,3,4,5])
set([1, 2, 3, 4, 5])
>>> ("two","tuple")
('two', 'tuple')
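
If you'd like a starting point for poking at these, here is a small sketch of a few simple interactions with each structure. It is written in Python 3 syntax, where print is a function rather than the statement used in the transcripts here, and all of the variable names are just ones I made up for illustration.

```python
# Python 3 syntax: print is a function here, unlike in the Python 2 transcripts.
words = ["list", "of", "strings"]
words.append("grows")            # lists are ordered and can be changed in place
first_word = words[0]            # indexing starts at zero

things = {"dict": "of", "things": 14}
things["more"] = True            # add (or replace) a key
count = things["things"]         # look a value up by its key

numbers = {1, 2, 3, 4, 5}        # a set literal; duplicates are discarded
numbers.add(3)                   # already present, so nothing changes

pair = ("two", "tuple")          # tuples cannot be changed once made
left, right = pair               # but they unpack neatly into names

print(words, first_word)
print(sorted(things), count)
print(len(numbers), 3 in numbers)
print(left, right)
```

Try changing an element of pair and see what Python has to say about it.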

Lists, dictionaries, sets, tuples, etc. are all objects which are instances of the corresponding class (list, dict, set, tuple), and you can define your own classes and construct instances of them like so:

>>> class Foo:
...     def demo(self):
...         print "Hello World"
...
>>> Foo().demo()
Hello World
>>>

You can investigate what an object can do in Python in a few ways. The most useful way from the point of view of the REPL is simply to ask for help:

>>> help([])
......paged documentation about the list class here......
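
Asking for help is not the only way to investigate. As a rough sketch (Python 3 syntax again, with variable names of my own choosing), the built-in type() and dir() functions and the __doc__ attribute give you the same information in a form you can use from a script:

```python
sample = []

# type() tells you which class an object belongs to.
print(type(sample).__name__)

# dir() lists attribute names; skip the underscore-prefixed ones for brevity.
public_methods = [name for name in dir(sample) if not name.startswith("_")]
print(public_methods)

# Built-in methods carry a docstring, which is the text help() formats for you.
print(list.append.__doc__)
```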

Perl

Perl has a smaller set of fundamental types. Perl has numbers, strings, lists, dictionaries (which Perl calls hashes), file handles, and a few other bits and bobs. Perl is slightly different from Python in how it presents those types to the developer. Perl has what are called type sigils which, for the most part, can be used to control the kind of value you're talking about when you use variables in expressions.

$_ my $foo = 1;  # The $ indicates a scalar (single) value
1$_
$_ my @bar = (1,2,3); # The @ indicates a list of values
$VAR1 = 1;
$VAR2 = 2;
$VAR3 = 3;
$_ my %baz = ("foo" => 1, "bar" => 2); # The % indicates a hash
$VAR1 = 'foo';
$VAR2 = 1;
$VAR3 = 'bar';
$VAR4 = 2;
$_

Notice how the style of re.pl output for lists and hashes is similar? That is because the initialiser for a hash is simply a list of the form (key1,value1,key2,value2,...) and the => operator in Perl is just a fancy way of spelling the comma (,).
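
To make the pairing-up concrete, here is a sketch in Python 3 of turning a flat (key1,value1,key2,value2,...) list into a dict. The function pairs_to_dict is a hypothetical helper of mine, not anything Perl or Python ships with, and raising an error for an odd-length list is my own choice rather than what Perl does.

```python
def pairs_to_dict(flat):
    """Pair up alternating keys and values from a flat list."""
    if len(flat) % 2 != 0:
        # Assumption for this sketch: refuse a list with a key but no value.
        raise ValueError("expected an even number of items")
    keys = flat[0::2]      # items 0, 2, 4, ...
    values = flat[1::2]    # items 1, 3, 5, ...
    return dict(zip(keys, values))


baz = pairs_to_dict(["foo", 1, "bar", 2])
print(baz)
```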

I mentioned file handles before, but I'm sure you can go and investigate for yourself if you're fancying Perl. Perl is exceedingly widely used and very commonly supported on Linux-based operating systems; so it is a very good idea to get at least passingly familiar with how to read it.

Perl's classes are built out of the hash data type and involve a process which Perl calls blessing. Have fun with that.

Lua

Lua has a number of fundamental data types, some of which we've already seen and a few which we've not. Along with its strings, numbers, booleans, functions, etc., Lua has a single data type which joins together the concept of a list and a dict, and which it calls a table.

> = {foo = "bar"}
table: 0xbab440
> = {1, 2, 3}
table: 0xbab800

Sadly Lua's REPL doesn't expand tables for us, so we can't see inside them easily, but I'm sure you can go and look up how to look inside them if you try.

Lua's classes are built out of tables, which are a very powerful data type once you add in a little more Lua magic called metatables. Enjoy looking those up.
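
If you are coming from Python, one rough way to picture a table is as a single dict which does both jobs. The sketch below (Python 3, and only a model of the idea, not how Lua implements tables) mirrors the two tables from the transcript; dump is a hypothetical helper of mine.

```python
# A rough Python model of the two Lua tables shown earlier.
named = {"foo": "bar"}            # stands in for {foo = "bar"}
listy = {1: 1, 2: 2, 3: 3}        # stands in for {1, 2, 3}; Lua lists count from 1

# Nothing stops one table from holding both kinds of key at once.
mixed = {**listy, **named}


def dump(table):
    """Return a readable 'key = value' listing of a table-like dict."""
    return ", ".join(f"{key!r} = {value!r}" for key, value in table.items())


print(dump(mixed))
```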

Challenge

This time, I'd like you to take on your favourite of the three languages we've been discussing (or perhaps all three of them if you're feeling adventurous) and have a play with the basic data types. Next time we'll talk about how you can break up your program into reusable chunks, typically called functions, procedures, or methods.

Posted Wed Jan 6 12:00:07 2016
Daniel Silverstone Chunks of scripting

Previously we discussed data structure basics in our three scripting languages Python, Perl, and Lua. By now you should be familiar with basic expressions, outputting information, and how to structure data in your chosen scripting language. This week we're going to look into how to split scripts into reusable sections of code, which are commonly called functions, methods, subroutines, or chunks.

These structures are useful because they allow you to create integral units of functionality with well defined inputs and outputs (if you so choose) and then reuse these units over and over to achieve a goal. In our example we're going to create such a unit of code which can greet people. We shall call it greet.

Python

Python uses the keyword def to indicate that the programmer is defining a new function or method:

>>> def greet(person):
...     print "Hello " + person
...
>>> greet("Yakker")
Hello Yakker
>>> greet("Geoff")
Hello Geoff
>>>
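
The point about well defined inputs and outputs is easier to see if the function hands its result back rather than printing it. Here is a sketch of that variation in Python 3 syntax; returning the string is my own tweak, not something the transcript above does.

```python
def greet(person):
    """Build the greeting and return it, leaving the caller to decide what to do."""
    return "Hello " + person


print(greet("Yakker"))

# Because the greeting is a value, it can be fed into further expressions.
shout = greet("Geoff").upper()
print(shout)
```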

Perl

Perl uses the keyword sub to indicate that the programmer is defining a new subroutine:

$_ sub greet { my $person = shift; print "Hello $person\n"; }
$_ greet "Yakker"
Hello Yakker
1$_ greet "Geoff"
Hello Geoff
1$_ 

You may notice the apparently spurious 1s in that... Well, Perl subroutines always have a result value of some kind: the value of the last expression evaluated, which here is the print call, and print returns 1 when it succeeds.
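
For comparison, Python functions always have a result value too, but a function which never reaches a return statement hands back None rather than whatever its last statement produced. A small sketch in Python 3 syntax:

```python
def greet(person):
    print("Hello " + person)    # prints, but there is no return statement


result = greet("Yakker")
print(result is None)
```

The Python REPL simply stays quiet about a None result, which is why no stray value showed up in the Python transcript.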

Lua

Lua uses the keyword function to indicate that the programmer is defining a new function:

> function greet(person)
>> print("Hello " .. person)
>> end
> greet("Yakker")
Hello Yakker
> greet("Geoff")
Hello Geoff
>

Challenge

Functions on their own are useful, but they really come into their own when you combine them with further syntactical structures which we will explore next time. Until then, see how complex a program you can create using only the syntax we've explored thus far. If you create anything impressive, leave a comment.

Posted Wed Jan 13 12:00:07 2016

I recently had cause to debug some string handling problems in a Python program.

The root cause was diagnosed to be a library function returning a file path as a unicode string.

I believe this to be fundamentally wrong, so I decided to write this article in the same vein as the Falsehoods programmers believe articles.

I have no illusions about this being an exhaustive list, so if you have any to add, please comment.

  • Paths fit in PATH_MAX.

    This seductively named constant does not mean the full length of a path.

    It only limits the length of a path that can be passed to a single system call; longer paths can exist, and can be reached by using relative paths.

  • Path components fit in NAME_MAX

    We no longer live in those times.

    You can use pathconf(3) to determine this limit, but you cannot rely on it, since a different filesystem may be mounted, or a symlink changed, at any time between checking this and using it.

    It is better to dynamically allocate memory for strings on the heap in a realloc(3) loop.

  • Two files with different paths refer to different files

    Apart from /.. being the same path as /, hard links, symbolic links and bind-mounts all allow different file names to resolve to the same file.

  • Two files with the same path refer to the same file

    The same process at different times may see a different file, since another process may have moved it, changed a symbolic link, or mounted a filesystem.

    The process itself may have used chroot(2) to change its view of the filesystem.

    Different processes can have different views of the filesystem at the same time, since they may occupy different mount namespaces, different roots, or refer to different paths in proc(5), which shows different values of /proc/self to each process, and shows different processes in different process namespaces.

  • All files have visibly distinct file paths

    When file paths are interpreted as Unicode, different sequences of bytes can produce identical looking strings.

  • File paths are case insensitive

  • File paths are case sensitive

    Paths are case sensitive on most POSIX file systems, and case insensitive by default on Mac and Windows file systems.

  • File paths are Unicode

  • File paths have an encoding

    Under Linux, file paths are any sequence of bytes terminated with a NUL.

  • File paths have no encoding

    Under Windows, file paths are Unicode.

  • File paths cannot contain whitespace

    Shell scripts often get this wrong, but there's nothing preventing you putting spaces in.

  • File paths cannot contain : characters.

    POSIX shells use : to separate the entries of PATH. make uses : to separate build targets from their prerequisites. Old versions of tar would interpret a file path with : in its name as a remote tape address.

  • File paths cannot contain * or ? characters

    If you make use of your shell's globbing feature, you need to escape or quote glob characters.

  • File paths can contain * or ? characters

    Windows does not allow you to create files with glob characters in their names.

  • Paths may contain only printable characters.

    You can have fun and embed a newline character in a file name, which makes old versions of ls(1) appear to print two file names.

  • File path components can contain any string

    The path separator cannot be part of a path component (excepting filesystem corruption).

    The names . and .. are reserved for the current directory and its parent.

  • File path components can contain any string except ".", "..", or "/".

    Windows reserves CON, PRN, AUX, NUL, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9.

  • File paths have 1 . separating the name from the file extension.

    You may have a file without an extension, and a name may contain more than one . character (archive.tar.gz, for example).

  • File name extensions are 3 characters long.

  • Path components are separated with /.

    Windows used to support only \; now it supports / in addition. More obscure operating systems like RISC OS use . as a path separator.

  • Absolute paths start with a /.

    Windows has drive paths (e.g. C:\), and UNC paths which start with \\.

  • foo and foo/../foo always point to the same directory.

    If the first foo is a symbolic link to a directory, then following it takes you to the directory it points to. The .. takes you to the parent of that directory, which may contain an entirely different entry called foo, or none at all.

  • Symbolic links may not be empty

  • Symbolic links point to a file that exists.

    You can put any text that is also a valid file path in a symbolic link. That text may not refer to a file that currently exists. This is called a dangling symbolic link.

  • Symbolic links that don't point to a file that exists are dangling.

    There are magic symlinks in proc(5) which, when read with readlink(2), yield an ID and type rather than a path. If you were to pass that text to open(2) you could create a new file, but if you open(2) the symlink directly you get something else.
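
Several of these are easy to see for yourself. The sketch below is in Python 3 and assumes a Linux-like system where you are allowed to create hard links and symbolic links in the temporary directory. All of the file names, and the demo_paths function itself, are made up for illustration. It checks that two different paths can name one file, that foo and foo/../foo can disagree, that a symbolic link can dangle, and that a file name made of arbitrary bytes survives a trip through Python's text strings.

```python
import os
import tempfile


def demo_paths():
    """Return four observations made inside a throwaway directory."""
    with tempfile.TemporaryDirectory() as top:
        # Different paths, same file: a hard link is a second name for one file.
        original = os.path.join(top, "original")
        alias = os.path.join(top, "alias")
        with open(original, "w") as handle:
            handle.write("data")
        os.link(original, alias)
        same_file = os.path.samefile(original, alias)

        # foo and foo/../foo: make foo a symlink to a directory elsewhere.
        os.makedirs(os.path.join(top, "elsewhere", "target"))
        foo = os.path.join(top, "foo")
        os.symlink(os.path.join("elsewhere", "target"), foo)
        # The .. is resolved after following the link, so this names
        # elsewhere/foo, which was never created.
        dotdot_exists = os.path.exists(os.path.join(foo, "..", "foo"))

        # A dangling symbolic link: the link exists, its target does not.
        dangling = os.path.join(top, "dangling")
        os.symlink("no-such-file", dangling)
        is_dangling = os.path.islink(dangling) and not os.path.exists(dangling)

    # Paths as bytes: this name is not valid UTF-8, yet it round-trips because
    # os.fsdecode smuggles the odd byte through as a surrogate escape.
    raw_name = b"caf\xe9"
    round_trips = os.fsencode(os.fsdecode(raw_name)) == raw_name

    return same_file, dotdot_exists, is_dangling, round_trips


print(demo_paths())
```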

Posted Wed Jan 20 12:00:06 2016

If you are writing a game, a simulation, a statistical model or, heaven forbid, rolling your own encryption, then you will probably want some random numbers. There is no way for your computer to provide random numbers in the same way that a fair coin toss or dice roll can, but there are many ways to get what are called pseudorandom numbers. Pseudorandom numbers are not truly random but they are chaotic and difficult to predict. Linux provides random numbers in two ways, via /dev/random and /dev/urandom.

The character device /dev/urandom will give a constant stream of pseudorandom numbers on demand. Try executing od /dev/urandom in your terminal to see the random bytes it outputs. The device can output a stream of numbers indefinitely.
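
If you would rather look at the bytes from a script than from od, a minimal sketch in Python 3 might look like this. It assumes a system which provides /dev/urandom, and read_random_bytes is a hypothetical helper of my own naming.

```python
def read_random_bytes(count, device="/dev/urandom"):
    """Read `count` bytes from a kernel random device."""
    with open(device, "rb") as source:
        return source.read(count)


sample = read_random_bytes(16)
print(sample.hex())    # sixteen bytes shown as thirty-two hex digits
```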

The other random number device /dev/random is used differently. If you do od /dev/random you will notice it outputs numbers for a while and then stops. This is because it generates its numbers by drawing from a 'pool' of randomness that is kept in the kernel. This pool collects events which are considered random, such as time between keystrokes and mouse movements. When /dev/random wants to generate a number it takes some entropy from the pool and uses it to make the pseudorandom number it generates less predictable. However, once the pool has run out of entropy /dev/random will stop outputting numbers until some more entropy is available. Try running od /dev/random until the numbers stop and then watch them start up again by jiggling your mouse around. (Kernels from Linux 5.6 onwards only block until the pool has first been initialised, so there you will not see the numbers stop.) If you have ever generated a large gpg key then you will know what it is like to run out of entropy and need to jiggle your mouse around for a while. There do exist other ways to refill your entropy pool.

Incidentally, /dev/urandom also uses entropy from the same pool if entropy is available, but if none is available then numbers will be generated using an iterative process with no entropy involved.

It is useful to have an idea of how Linux does random numbers, but when writing a program it is unlikely that you will use /dev/random or /dev/urandom directly; practically all modern programming languages provide you with some way of getting random numbers. Python, for example, asks the kernel for random bytes to seed the random module, and its random.SystemRandom class draws every number from that same source. It is also worth mentioning that a lot of programming languages will have their own pseudorandom number generators and not rely on the Linux kernel at all. Such generators are many and varied and will be covered in a future article.
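
As a small taste of the difference, here is a sketch in Python 3 using the standard random module. A generator given a seed is entirely repeatable, whereas random.SystemRandom goes back to the operating system for every number. The seed 2016 and the dice-roll range are arbitrary choices of mine.

```python
import random

# A language-level generator: once seeded, the sequence is fully determined.
first = random.Random(2016)
second = random.Random(2016)
first_rolls = [first.randint(1, 6) for _ in range(5)]
second_rolls = [second.randint(1, 6) for _ in range(5)]
same_rolls = first_rolls == second_rolls

# SystemRandom asks the operating system for every number and ignores seeds.
system_roll = random.SystemRandom().randint(1, 6)

print(same_rolls, system_roll)
```

Repeatability is handy for simulations and tests, and exactly what you do not want for anything security related.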

Posted Wed Jan 27 12:00:07 2016