Daniel Silverstone
Processing input

Computer programs typically need some input on which to perform their purpose. In order to ascribe meaning to the input, programs will perform a process called parsing. Depending on exactly how the author chooses to develop their program, there are a number of fundamentally different ways to convert a byte sequence to something with more semantic information layered on top.

Lexing and Parsing

Lexical analysis is the process by which a program takes a stream of bytes and converts it to a stream of tokens. Tokens have a little more meaning, such as taking the byte sequence "Hello" and representing it as a token of the form STRING whose value is Hello. Once a byte stream has been turned into a token stream, the program can then parse the token stream.
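
As a sketch of how this stage might look, here is a toy lexer in Python (the token kinds and regular expressions are illustrative, not taken from any particular tool) which recognises both bare words and quoted strings as STRING tokens, plus a little punctuation:

```python
import re

# Toy lexical analyser: split a character stream into (kind, value) tokens.
# Both bare words and quoted strings become STRING tokens, mirroring the
# STRING(Hello) notation used above.
TOKEN_SPEC = [
    ("STRING", r'"[^"]*"|[A-Za-z_]\w*'),  # quoted string or bare word
    ("PUNCT",  r"[();]"),                 # single-character punctuation
    ("SKIP",   r"\s+"),                   # whitespace, discarded
]
TOKEN_RE = re.compile("|".join(f"(?P<{k}>{p})" for k, p in TOKEN_SPEC))

def lex(text):
    tokens = []
    for m in TOKEN_RE.finditer(text):
        if m.lastgroup != "SKIP":
            # Strip surrounding quotes so STRING("Hello") carries value Hello.
            tokens.append((m.lastgroup, m.group().strip('"')))
    return tokens
```

Running lex('println("Hello");') yields the token stream [("STRING", "println"), ("PUNCT", "("), ("STRING", "Hello"), ("PUNCT", ")"), ("PUNCT", ";")].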

Typically, the parsing process consumes the token stream and produces as its output something like an abstract syntax tree. This AST layers enough semantic meaning onto the input to allow the program to make use of the input properly. As an example, in the right context, a parser might take a token stream of the form STRING(println) '(' STRING(Hello) ')' ';' and turn it into an AST node of the form FunctionInvocation("println", [ "Hello" ]). As you can see, that would be far more useful if the program in question is a compiler.
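
To make that concrete, here is a Python sketch (the AST node and parsing function are invented for illustration, not any real compiler's API) which consumes exactly such a token stream and builds the corresponding AST node:

```python
from dataclasses import dataclass

# Illustrative AST node: a function call with a name and an argument list.
@dataclass
class FunctionInvocation:
    name: str
    arguments: list

def parse_invocation(tokens):
    """Parse tokens of the shape STRING '(' STRING* ')' ';' into an AST node.

    tokens is a list of (kind, value) pairs as produced by a lexer.
    """
    kind, name = tokens[0]
    assert kind == "STRING", "expected a function name"
    assert tokens[1] == ("PUNCT", "("), "expected an opening parenthesis"
    arguments = []
    i = 2
    while tokens[i] != ("PUNCT", ")"):
        _, value = tokens[i]
        arguments.append(value)
        i += 1
    assert tokens[i + 1] == ("PUNCT", ";"), "expected a trailing semicolon"
    return FunctionInvocation(name, arguments)
```

Feeding it the token stream [("STRING", "println"), ("PUNCT", "("), ("STRING", "Hello"), ("PUNCT", ")"), ("PUNCT", ";")] produces FunctionInvocation(name="println", arguments=["Hello"]).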

Parsing in this way is commonly applied when the grammar in question meets certain rules which allow a token stream to be unambiguously converted to the AST with no more than one "look-ahead" token. Such grammars can be parsed "left-to-right", i.e. unidirectionally along the token stream, and are usually called LALR(1) grammars (Look-Ahead, Left-to-right scan, Rightmost derivation, with 1 token of look-ahead).

To facilitate easy lexical analysis and the generation of LALR(1) parsers, there exist a number of generator programs such as flex and bison, or re2c and lemon. Indeed such generators are available for non-C languages such as alex and happy for Haskell, or PLY for Python.

Parsing Expression Grammars

PEGs are a class of grammar whose parsers typically end up represented as recursive descent parsers. PEGs sometimes allow a parser to be expressed in a way which is more natural for the language definer. Further, there is effectively unlimited capability for look-ahead when using PEGs, allowing them to parse grammars which a more traditional LALR(1) parser would be unable to handle.
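
A toy recursive descent parser in the PEG style might look like this in Python (the grammar is invented for illustration); note how a failed alternative simply backtracks, so look-ahead is bounded only by the input:

```python
# Toy grammar, PEG style (invented for illustration):
#   greeting <- ("hello " / "hi ") name
#   name     <- "world" / "there"
# Each rule is a function returning (value, new_pos) on success or None on
# failure; an ordered choice tries each alternative from the same position.
def parse_name(text, pos):
    for word in ("world", "there"):
        if text.startswith(word, pos):
            return word, pos + len(word)
    return None

def parse_greeting(text, pos):
    for salutation in ("hello ", "hi "):
        if text.startswith(salutation, pos):
            result = parse_name(text, pos + len(salutation))
            if result is not None:
                name, end = result
                return (salutation.strip(), name), end
    return None  # no alternative matched
```

For example, parse_greeting("hi there", 0) tries the "hello " alternative, fails, backtracks, and succeeds with (("hi", "there"), 8).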

Combinatory Parsing

Parser combinators take advantage of higher-order functions in programming languages to allow a parser to be built up by combining smaller parsers into more complex ones, until a full parser for the input has been built. The lowest-level building blocks of such parsers are often called terminal recognisers, and they recognise the smallest possible building block of the input (which could be a token from a lexical analyser, or a byte, or a Unicode character). Most parser combinator libraries offer a number of standard combinators, such as one which will recognise one or more repetitions of the passed-in parser, returning the recognised elements as a list.
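
A minimal sketch of the idea in Python (the names char and many1 are conventional in combinator libraries, but this is not any particular library's API): here a parser is a function from (text, position) to either (value, new position) or None on failure:

```python
def char(c):
    """Terminal recogniser: match exactly the single character c."""
    def parse(text, pos):
        if pos < len(text) and text[pos] == c:
            return c, pos + 1
        return None
    return parse

def many1(parser):
    """Combinator: one or more repetitions of parser, collected in a list."""
    def parse(text, pos):
        result = parser(text, pos)
        if result is None:
            return None  # "one or more" needs at least one match
        values = []
        while result is not None:
            value, pos = result
            values.append(value)
            result = parser(text, pos)
        return values, pos
    return parse
```

For example, many1(char("a"))("aaab", 0) returns (["a", "a", "a"], 3), while many1(char("b"))("aaab", 0) fails with None.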

Sadly, due to the strongly functional nature of combinators, it's often very hard to statically analyse the parser to check for ambiguities or inconsistencies in the grammar. These issues tend to become obvious only at runtime, so if you're using parser combinators to build your parser, it's recommended that you carefully write your grammar first, and convert it to code second.


Find a program which you use which consumes input in a form specific to the program itself (or find a library which is meant to parse some format), and take a deep look at how it performs lexical analysis and parsing.

Posted Thu Feb 1 12:00:09 2018
Lars Wirzenius
Don't burn that bridge!

You may be familiar with some variant of this scenario:

You're on a mailing list (or web forum or Google group or whatever), where some topic you're interested in is being discussed. You see someone saying something you think is wrong. You fire off a quick reply telling them they're wrong, and move on to the next topic.

Later on, you get a reply, and for some reason they are upset at you telling them they're wrong, and you get upset at how rude they are, so you send another quick reply, putting them in their place. Who do they think they are, spouting off falsehoods and being rude about it?

The disagreement spirals and becomes hotter and more vicious each iteration. What common ground there was in the beginning is soon ruined by trenches, bomb craters, and barbed wire. Any bridges between the parties are on fire. There's no hope for peace.

This is called a flame war. It's not a good thing, but it's not uncommon in technical discussions on the Internet. Why does it happen and how can you avoid it?

As someone covered in scars of many a flame war, here are my observations (entirely unsubstantiated by sources):

  • Flame wars happen because people try to be seen as more correct than others, or to be seen to win a disagreement. This often happens online because the communication medium lacks emotional bandwidth. It is difficult to express subtle emotions and cues over a text-only channel, especially a one-way one.

    Disagreements spiral out of control more rarely in person, because in-person communication contains a lot of unspoken signals, which indicate things like someone becoming upset, before the thing blows up entirely. In text-only communication, one needs to express such cues more explicitly, and be careful when reading to spot the subtler cues.

  • In online discussions around free software there are also often no prior personal bonds between participants. Basically, they don't know each other. This makes it harder to understand each other.

  • The hottest flame wars tend to happen in contexts where the participants have the least to lose.

Some advice (again, no sources):

  • Try hard to understand the other parties in a disagreement. The technical term is empathy. You don't need to agree with them, but you need to try to understand why they say what they say and how they feel. As an example, I was once in a meeting where a co-worker arrived badly late, and the boss was quite angry. It was quickly spiralling into a real-life flame war, until someone pointed out that the boss was upset because he needed to get us developers to do certain things, and people being late was making that harder to achieve, and at the same time the co-worker who was late was mourning his dog, who'd been poorly for years and had recently committed suicide by forcing open a 6th floor window and jumping out.

  • Try even harder to not express anger and other unconstructive feelings, especially by attacking the other parties. Instead of "you're wrong, and you're so stupid that the only reason you don't suffocate is because breathing is an autonomous action that doesn't require the brain, go jump into a frozen lake", say something like "I don't agree with you, and I'm upset about this discussion so I'm going to stop participating, at least for a while". And then don't participate further.

  • Do express your emotions explicitly, if you think that'll mean others will understand you better.

  • Try to find at least something constructive to say, and some common ground. Just because someone is wrong about what the colour of the bike shed should be, doesn't mean you have to disagree whether a bike shed is useful.

  • Realise that shutting up doesn't mean you agree with the other parties in a disagreement, and it doesn't mean you "lose" the argument.

  • Apply rule 6 vigorously: write angry responses if it helps you deal with your emotions, but don't send them. You can then spend the rest of your life being smug about how badly other people have been humiliated and shown to be wrong.

Your homework for this week, should you choose to accept it, is to find an old flame war, read through it, and see where the participants could've said something different and defused the situation. You get bonus points if it's one which you've participated in yourself.

Posted Wed Feb 7 12:00:09 2018

The title of this article is intentionally provocative.

Git is a flexible tool that allows many kinds of workflow for using it. Here is the workflow I favour for teams:

  • The master branch is meant to be always releasable.

  • Every commit on master MUST pass the full test suite, though not every commit within a merged change set needs to.

  • Changes are done in dedicated branches, which get merged to master frequently - avoid long-lived branches, since they tend to result in much effort having to be spent on resolving merge conflicts.

    • If frequent merging is, for some reason, not an option, at least rebase the branch onto current master frequently: at least daily. This keeps conflicts fairly small.

  • Before merging a branch into master, rebase it onto master and resolve any conflicts - also rebase the branch so it tells a clean story of the change.

    • git rebase -i master is a very powerful tool. Learn it.

    • A clean story doesn't contain commits that fix mistakes made earlier in the branch-to-be-merged; it introduces changes within the branch in chunks of a suitable size, and in an order that makes sense to the reader. Clean up "Fix typo in previous commit" type of commits.

  • Update the NEWS file when merging into master. Also update the Debian packaging files, if those are included in the source tree.

  • Tag releases using PGP signed, annotated tags. I use a tool called bumper, which updates NEWS, version.py, debian/changelog, tags a release, and updates the files again with +git appended to the version number.

    • Review and update NEWS and debian/changelog before running bumper, to make sure they're up to date.

  • Name branches and tags with a prefix foo/, where foo is your username, handle, or other identifier.

  • If master is broken, fixing it has highest priority for the project.

  • If there is a need for the project to support older releases, create a branch for each such release, when needed, starting from the release's tag. Treat each release branch as the master for that release.

Posted Wed Feb 21 12:00:11 2018