A few weeks ago I introduced you to three scripting languages -- Python, Perl, and Lua. We talked about their REPLs and looked at the simple expression and output syntax along with how to exit the REPLs.
At the end of the article I encouraged you to play around with some simple expressions and the print statement and see what you could achieve with your chosen language(s).
This time, I'd like to introduce you to some simple data structures available in the three scripting languages and some simple ways to interact with them. We've already seen numbers and strings of characters, but most languages have at least one or two more basic data structures available. Almost everything else in the language will be built out of the basic data structures, so mastering them is important, even if it is a long-winded process.
Python
In Python, there are a number of core data structures, several of which have their own literal syntax. All of them are classes, but built in to the interpreter and language are the numbers and strings that we've encountered already, along with lists, dicts, tuples, and sets. (There are a few more such as booleans, but I trust you can look those up for yourselves.)
>>> 12 # Number
12
>>> "string"
'string'
>>> ["list","of","strings"]
['list', 'of', 'strings']
>>> {"dict":"of","things":14}
{'things': 14, 'dict': 'of'}
>>> set([1,2,3,4,5])
set([1, 2, 3, 4, 5])
>>> ("two","tuple")
('two', 'tuple')
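As a minimal sketch of a few simple ways to interact with these structures, the snippet below indexes a list, looks up a dict key, tests set membership, and unpacks a tuple. It is written in Python 3 syntax (an assumption; the sessions above come from Python 2), and the variable names are made up for illustration.

```python
# Hypothetical example values, chosen to mirror the REPL session above.
words = ["list", "of", "strings"]
lookup = {"dict": "of", "things": 14}
unique = {1, 2, 3, 3}          # the duplicate 3 is stored only once
pair = ("two", "tuple")

first_word = words[0]          # lists are indexed from zero
things = lookup["things"]      # dicts are indexed by key
has_three = 3 in unique        # sets answer membership questions
left, right = pair             # tuples can be unpacked into names

words.append("grows")          # lists can change after creation

print(first_word, things, has_three, left, right)
print(words)
```

Tuples, unlike lists, cannot be changed once built, which is why only the list is appended to here.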
Lists, dictionaries, sets, tuples, etc. are all objects which have the requisite class (list, dict, set, tuple), and you can define your own classes and construct instances of them like so:
>>> class Foo:
...     def demo(self):
...         print "Hello World"
...
>>> Foo().demo()
Hello World
>>>
You can investigate what an object can do in Python in a few ways. The most useful way from the point of view of the REPL is simply to ask for help:
>>> help([])
......paged documentation about the list class here......
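Besides help, two other built-ins are handy for poking at an object. The sketch below, in Python 3 syntax (an assumption), uses type to name an object's class and dir to list what it can do; the shopping list is an invented example.

```python
# Hypothetical object to inspect.
shopping = ["eggs", "milk"]

# type() reports the class an object belongs to.
type_name = type(shopping).__name__

# dir() lists attribute names; skip the underscore-prefixed internals.
public_methods = [name for name in dir(shopping) if not name.startswith("_")]

print(type_name)
print(public_methods)
```

Any name that shows up in the list, such as append, can then be passed to help for the details, for example help(shopping.append).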
Perl
Perl has a smaller set of fundamental types. Perl has numbers, strings, lists, dictionaries (which Perl calls hashes), file handles, and a few other bits and bobs. Perl is slightly different from Python in how it presents those types to the developer. Perl has what are called type sigils which, for the most part, can be used to control the kind of value you're talking about when you use variables in expressions.
$_ my $foo = 1; # The $ indicates a scalar (single) value
1$_
$_ my @bar = (1,2,3); # The @ indicates a list of values
$VAR1 = 1;
$VAR2 = 2;
$VAR3 = 3;
$_ my %baz = ("foo" => 1, "bar" => 2); # The % indicates a hash
$VAR1 = 'foo';
$VAR2 = 1;
$VAR3 = 'bar';
$VAR4 = 2;
$_
Notice how the style of re.pl output for lists and hashes is similar? That is because the initialiser for a hash is simply a list of the form (key1,value1,key2,value2,...) and the => operator in Perl is just a fancy way of spelling the comma (,).
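For readers following along in Python, a rough analogue of that flat key/value initialiser can be sketched as below. This is only an illustration in Python 3 of the pairing idea, not a description of what Perl does internally, and the names flat and baz are chosen just to mirror the Perl example.

```python
# A flat list in the shape (key1, value1, key2, value2, ...).
flat = ["foo", 1, "bar", 2]

# Pair every even-positioned item (a key) with the item after it (a value).
baz = dict(zip(flat[0::2], flat[1::2]))

print(baz)
```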
I mentioned file handles before, but I'm sure you can go and investigate for yourself if you're fancying Perl. Perl is exceedingly widely used and very commonly supported on Linux-based operating systems; so it is a very good idea to get at least passingly familiar with how to read it.
Perl's classes are built out of the hash data type and involve a process
which Perl calls blessing. Have fun with that.
Lua
Lua has a number of fundamental data types we've already seen, and a few which
we've not. Along with its strings, numbers, booleans, functions, etc., Lua
has a single data type which joins together the concepts of a list and a dict.
Lua calls this type table.
> = {foo = "bar"}
table: 0xbab440
> = {1, 2, 3}
table: 0xbab800
Sadly Lua's REPL doesn't expand tables for us, so we can't see inside them easily, but I'm sure you can go and look up how to look inside them if you try.
Lua's classes are built out of tables, which are a very powerful data type
once you add in a little more Lua magic called metatables. Enjoy looking
those up.
Challenge
This time, I'd like you to take on your favourite of the three languages we've been discussing (or perhaps all three of them if you're feeling adventurous) and have a play with the basic data types. Next time we'll talk about how you can break up your program into reusable chunks, typically called functions, procedures, or methods.
Previously we discussed data structure basics in our three scripting languages: Python, Perl, and Lua. By now you should be familiar with basic expressions, outputting information, and how to structure data in your chosen scripting language. This week we're going to look into how to split scripts into reusable sections of code, which are commonly called functions, methods, subroutines, or chunks.
These structures are useful because they allow you to create integral units of
functionality with well defined inputs and outputs (if you so choose) and then
reuse these units over and over to achieve a goal. In our example we're going
to create such a unit of code which can greet people. We shall call it greet.
Python
Python uses the keyword def to indicate that the programmer is defining
a new function or method:
>>> def greet(person):
...     print "Hello " + person
...
>>> greet("Yakker")
Hello Yakker
>>> greet("Geoff")
Hello Geoff
>>>
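The greet function above prints its result directly. As a minimal sketch of the "well defined inputs and outputs" idea, the variant below returns the greeting instead, so the caller decides what to do with it. It uses Python 3 syntax (an assumption), and make_greeting is a made-up name, not part of the original example.

```python
def make_greeting(person):
    """Build the greeting and hand it back instead of printing it."""
    return "Hello " + person


# The returned string can be stored, printed, or processed further.
message = make_greeting("Yakker")
print(message)                          # prints: Hello Yakker
print(make_greeting("Geoff").upper())   # prints: HELLO GEOFF
```

Returning a value rather than printing it tends to make a function easier to reuse, since the same unit can feed a file, a web page, or another function.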
Perl
Perl uses the keyword sub to indicate that the programmer is defining a new
subroutine:
$_ sub greet { my $person = shift; print "Hello $person\n"; }
$_ greet "Yakker"
Hello Yakker
1$_ greet "Geoff"
Hello Geoff
1$_
You may notice the apparently spurious 1s in that... Well, Perl subroutines
always have a result value of some kind (here, that of the last statement),
and the print function returns 1 when it succeeds.
Lua
Lua uses the keyword function to indicate that the programmer is defining a
new function:
> function greet(person)
>> print("Hello " .. person)
>> end
> greet("Yakker")
Hello Yakker
> greet("Geoff")
Hello Geoff
>
Challenge
Functions on their own are useful, but they really come into their own when you combine them with further syntactical structures which we will explore next time. Until then, see how complex a program you can create using only the syntax we've explored thus far. If you create anything impressive, leave a comment.
I recently had cause to debug some string handling problems in a Python program.
The root cause was diagnosed to be a library function returning a file path as a unicode string.
I believe this to be fundamentally wrong, so I decided to write this article in the same vein as the "Falsehoods programmers believe" articles.
I have no illusions about this being an exhaustive list, so if you have any to add, please comment.
Paths fit in PATH_MAX.
This seductively named constant does not mean the full length of a path.
It was only ever meant to be the maximum size of an individual component.
Path components fit in PATH_MAX
We no longer live in those times.
You can use pathconf(3) to determine this limit, but you cannot rely on it, since a different filesystem may be mounted, or a symlink changed, at any time between checking this and using it.
It is better to dynamically allocate memory for strings on the heap in a realloc(3) loop.
Two files with different paths refer to different files
Apart from /.. being the same path as /, hard links, symbolic links and bind-mounts all allow different file names to resolve to the same file.
Two files with the same path refer to the same file
The same process at different times may see a different file, since another process may have moved it, changed a symbolic link, or mounted a filesystem.
The process itself may have used chroot(2) to change its view of the filesystem.
Different processes can have different views of the filesystem at the same time, since they may occupy different mount namespaces, different roots, or refer to different paths in proc(5), which shows different values of /proc/self to each process, and shows different processes in different process namespaces.
All files have visibly distinct file paths
When file paths are interpreted as unicode, different sequences of bytes can produce identical-looking strings.
File paths are case insensitive
File paths are case sensitive
Paths are case sensitive in POSIX file systems, and insensitive on Mac and Windows file systems.
File paths are unicode
File paths have an encoding
Under Linux, file paths are any sequence of bytes terminated with a NUL.
File paths have no encoding
Under Windows, file paths are Unicode.
File paths cannot contain whitespace
Shell scripts often get this wrong, but there's nothing preventing you putting spaces in.
File paths cannot contain : characters.
POSIX shells use : to separate the entries in PATH. make uses : to separate build targets from dependent rules. Old versions of tar would interpret file paths with : in file names as a remote tape address.
File paths cannot contain * or ? characters
If you make use of your shell's globbing feature, you need to escape or quote glob characters.
File paths can contain * or ? characters
Windows does not allow you to create files with glob characters in.
Paths may contain only printable characters.
You can have fun and embed a newline character in a file name, which makes old versions of ls(1) appear to print two file names.
File path components can contain any string
The path separator cannot be part of a path component (excepting filesystem corruption).
The names . and .. are reserved for the current directory and its parent.
File path components can contain any string except ".", "..", or "/".
Windows reserves CON, PRN, AUX, NUL, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9.
File paths have 1 . separating the name from the file extension.
You may have a file without an extension, and you may have multiple . characters.
File name extensions are 3 characters long.
Path components are separated with /.
Windows used to only support \, now it supports / in addition. More obscure operating systems like RISC OS use . as a path separator.
Absolute paths start with a /.
Windows has drive paths (e.g. C:\), and UNC paths which start with \\.
foo and foo/../foo always point to the same directory.
If the first foo is a symbolic link, then following it takes you to the directory it points to. The .. takes you to the parent directory of that, which may contain an entirely different directory called foo.
Symbolic links point to a file that exists.
You can put any text that is also a valid file path in a symbolic link. That text may not refer to a file that currently exists. This is called a dangling symbolic link.
Symbolic links that don't point to a file that exists are dangling.
There are magic symlinks in proc(5) which, when read with readlink(2), display an ID and type. If you were to pass that text to open(2) it would create a new file, but if you open(2)'d the symlink directly you would get something else.
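A few of these falsehoods can be checked in a throwaway directory. The sketch below is in Python 3 and assumes a Linux-like system whose temporary directory allows hard links and symbolic links. The function name demonstrate_path_surprises and the file names are invented for illustration.

```python
import os
import tempfile


def demonstrate_path_surprises():
    """Return a dict of observations made in a temporary directory."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "original.txt")
        with open(original, "w") as handle:
            handle.write("data")

        # Two different paths, one file: a hard link is a second name
        # for the same underlying file.
        alias = os.path.join(tmp, "alias.txt")
        os.link(original, alias)
        results["different_paths_same_file"] = os.path.samefile(original, alias)

        # A symbolic link may hold a path to something that does not exist.
        dangling = os.path.join(tmp, "dangling")
        os.symlink(os.path.join(tmp, "missing"), dangling)
        results["is_symlink"] = os.path.islink(dangling)
        # os.path.exists follows the link, so it reports the missing target.
        results["target_exists"] = os.path.exists(dangling)

        # Asking with a bytes path returns bytes names, with no decoding.
        names = os.listdir(os.fsencode(tmp))
        results["names_are_bytes"] = all(isinstance(n, bytes) for n in names)
    return results


observations = demonstrate_path_surprises()
print(observations)
```

Working with bytes paths throughout, as in the last step, is one way for a Python program to avoid assuming that file names have an encoding.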
If you are writing a game, a simulation, a statistical model or, heaven
forbid, rolling your own encryption, then you will probably want some
random numbers. There is no way for your computer to provide random numbers in
the same way that you can from a fair coin-toss or dice-roll but there are
many ways to get what are called pseudorandom numbers. Pseudorandom numbers
are not truly random but they are chaotic and difficult to predict.
Linux provides random numbers in two ways, via /dev/random and
/dev/urandom.
The character device /dev/urandom will give a constant stream of
pseudorandom numbers on demand. Try executing od /dev/urandom in your
terminal to see the random bytes it outputs. The device can output a stream of
numbers indefinitely.
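The same stream can be sampled from a program. As a minimal sketch in Python 3, os.urandom asks the operating system for random bytes; on Linux these come from the kernel generator that also backs /dev/urandom. The name sample is just for illustration.

```python
import os

# Ask the operating system for 16 random bytes.
sample = os.urandom(16)

# Show them as hexadecimal digits, in the spirit of the od output.
print(sample.hex())
```

Each run prints a different 32-digit string, so no expected output is shown here.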
The other random number device, /dev/random, is used differently. If you run
od /dev/random you will notice it outputs numbers for a while and then
stops. This is because it generates its numbers by drawing from a 'pool' of
randomness that is kept in the kernel. This pool collects events which are
considered random, such as the time between keystrokes and mouse movements.
When /dev/random wants to generate a number it takes some entropy from the
pool and uses it to make the pseudorandom number it generates less
predictable. However, once the pool has run out of entropy /dev/random
will stop outputting numbers until some more entropy is available. Try running
od /dev/random until the numbers stop and then watch them start up again
by jiggling your mouse around. If you have ever generated a large GPG key
then you will know what it is like to run out of entropy and need to jiggle
your mouse around for a while. There do exist other ways to refill your
entropy pool. (Kernels from 5.6 onwards behave differently: there /dev/random
only blocks until the pool has first been initialised.)
Incidentally, /dev/urandom also uses entropy from the same pool if entropy
is available, but if none is available then numbers will be generated using an
iterative process with no entropy involved.
It is useful to have an idea of how Linux does random numbers, but when writing
a program it is unlikely that you will use /dev/random or /dev/urandom
directly; practically all modern programming languages provide you with some
way of getting random numbers. Python, for example, seeds the generator in its
random module from the operating system's random source, and its os.urandom
function returns bytes from that source directly. It is also worth
mentioning that a lot of programming languages will have their own
pseudorandom number generators and not rely on the Linux kernel at all. Such
generators are many and varied and will be covered in a future article.
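To illustrate the difference between a language's own pseudorandom generator and numbers drawn from the operating system, here is a minimal Python 3 sketch. The seed value 42 and the dice-rolling helper are arbitrary choices made for this example.

```python
import random


def roll_dice(generator, count=5):
    """Roll a six-sided die `count` times using the given generator."""
    return [generator.randint(1, 6) for _ in range(count)]


# Python's own generator is deterministic: the same seed always
# replays the same sequence, which is useful for repeatable simulations.
first_run = roll_dice(random.Random(42))
second_run = roll_dice(random.Random(42))
print(first_run == second_run)   # prints: True

# SystemRandom draws from the operating system instead, so it cannot
# be seeded or replayed.
system_rolls = roll_dice(random.SystemRandom())
print(system_rolls)
```

The seeded generator suits games and statistical models where you want to reproduce a run. For anything security related, the operating system's source is the safer default.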