Computer programs typically need some input on which to perform their purpose. To ascribe meaning to that input, a program performs a process called parsing. Depending on exactly how the author chooses to develop their program, there are a number of fundamentally different ways to convert a byte sequence into something with more semantic information layered on top.
Lexing and Parsing
Lexical analysis is the process by which a program takes a stream of bytes and converts it to a stream of tokens. Tokens carry a little more meaning, such as taking the byte sequence "Hello" and representing it as a token of the form STRING whose value is Hello. Once a byte stream has been turned into a token stream, the program can then parse the token stream. Typically, the parsing process consumes the token stream and produces as its output something like an abstract syntax tree (AST). This AST layers enough semantic meaning onto the input to allow the program to make use of the input properly. As an example, in the right context, a parser might take a token stream of the form STRING(println) '(' STRING(Hello) ')' ';' and turn it into an AST node of the form FunctionInvocation("println", [ "Hello" ]). As you can see, that would be far more useful if the program in question is a compiler.
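To make the two stages concrete, here is a minimal sketch in Python. All of the names (lex, parse, the token shapes) are hypothetical illustrations rather than any particular library's API: the lexer turns the input println("Hello"); into the token stream above, and the parser turns that token stream into the FunctionInvocation node.

```python
# A minimal sketch of the two stages described above; the names here
# are hypothetical, not any particular library's API.
import re

def lex(source):
    # Lexical analysis: quoted strings and bare words become STRING
    # tokens; punctuation becomes single-character tokens.
    tokens = []
    for match in re.finditer(r'"[^"]*"|\w+|[();]', source):
        text = match.group()
        if text in '();':
            tokens.append((text, text))
        else:
            tokens.append(('STRING', text.strip('"')))
    return tokens

def parse(tokens):
    # Parsing: expect exactly STRING '(' STRING ')' ';' and produce an
    # AST node of the form FunctionInvocation(name, arguments).
    kinds = [kind for kind, _ in tokens]
    if kinds != ['STRING', '(', 'STRING', ')', ';']:
        raise SyntaxError('expected a function invocation')
    return ('FunctionInvocation', tokens[0][1], [tokens[2][1]])

print(parse(lex('println("Hello");')))
# -> ('FunctionInvocation', 'println', ['Hello'])
```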
Parsing in this way is commonly applied when the language's grammar meets certain rules which allow it to be expressed in such a way that a token stream can be unambiguously converted to the AST with no more than one "look-ahead" token. Parsers for such languages scan "left-to-right", i.e. unidirectionally along the token stream, and we usually call those languages LALR(1).
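Strictly, an LALR(1) parser is table-driven, but the essence of the single look-ahead token can be sketched by hand. Below is a minimal, hypothetical example in recursive-descent style: a grammar with two productions, stmt <- 'print' NUMBER and stmt <- 'skip', where peeking at exactly one token is always enough to choose the right production.

```python
# A minimal hand-written sketch of one-token look-ahead. This is
# recursive-descent style rather than a table-driven LALR parser, and
# the grammar (stmt <- 'print' NUMBER | 'skip') is hypothetical, but it
# shows how peeking at exactly one token is enough to pick a production.
def parse_stmt(tokens, pos):
    lookahead = tokens[pos]  # the single look-ahead token
    if lookahead == 'print':
        # stmt <- 'print' NUMBER
        return ('Print', tokens[pos + 1]), pos + 2
    if lookahead == 'skip':
        # stmt <- 'skip'
        return ('Skip',), pos + 1
    raise SyntaxError('unexpected token %r' % (lookahead,))

print(parse_stmt(['print', '42'], 0))  # (('Print', '42'), 2)
print(parse_stmt(['skip'], 0))         # (('Skip',), 1)
```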
To facilitate easy lexical analysis and the generation of LALR(1) parsers, there exist a number of generator programs such as flex and bison, or re2c and lemon. Indeed such generators are available for non-C languages, such as alex and happy for Haskell, or PLY for Python.
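As a taste of what these generators look like in use, here is a minimal sketch using PLY (assuming the ply package is installed). It follows the shape of PLY's calculator examples: a lexer and an LALR(1) parser for addition expressions such as "1 + 2 + 3".

```python
# A minimal PLY sketch: token rules drive the generated lexer, and
# docstring grammar rules drive the generated LALR(1) parser.
import ply.lex as lex
import ply.yacc as yacc

tokens = ('NUMBER', 'PLUS')

t_PLUS = r'\+'
t_ignore = ' \t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    print('Illegal character %r' % t.value[0])
    t.lexer.skip(1)

def p_expr_plus(p):
    'expr : expr PLUS NUMBER'
    p[0] = p[1] + p[3]

def p_expr_number(p):
    'expr : NUMBER'
    p[0] = p[1]

def p_error(p):
    print('Syntax error')

lexer = lex.lex()
parser = yacc.yacc()

print(parser.parse('1 + 2 + 3'))  # -> 6
```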
Parsing Expression Grammars
Parsing Expression Grammars (PEGs) are a class of grammar whose parsers typically end up implemented as recursive descent parsers. PEGs sometimes allow for a parser to be represented in a way which is more natural for the language definer. Further, there is effectively infinite capability for look-ahead when using PEGs, allowing them to parse grammars which a more traditional LALR(1) parser would be unable to.
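The hallmark of a PEG is ordered choice with backtracking, which is where that unbounded look-ahead comes from. Here is a minimal hand-written sketch (the rule and names are hypothetical): the rule greeting <- 'hello world' / 'hello' tries the longer alternative first and backtracks on failure, so the effective look-ahead is as long as the alternative itself rather than a fixed number of tokens.

```python
# A minimal hand-written sketch of PEG-style ordered choice with
# backtracking, for the hypothetical rule:
#   greeting <- 'hello world' / 'hello'
# Alternatives are tried in order; if one fails part-way through, the
# parser backtracks to where it started and tries the next.
def parse_greeting(text, pos):
    for alternative in ('hello world', 'hello'):
        if text.startswith(alternative, pos):
            return alternative, pos + len(alternative)
    return None  # no alternative matched

print(parse_greeting('hello world!', 0))  # ('hello world', 11)
print(parse_greeting('hello there', 0))   # ('hello', 5)
```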
Combinatory Parsing
Parser combinators take advantage of higher-order functions in programming languages to allow a parser to be built up by combining smaller parsers into more complex ones, until a full parser for the input has been built. The lowest-level building blocks of such parsers are often called terminal recognisers, and they recognise the smallest possible building block of the input (which could be a token from a lexical analyser, or could be a byte or Unicode character). Most parser combinator libraries offer a number of standard combinators, such as one which will recognise one or more of the passed-in parser, returning the recognised elements as a list.
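A minimal sketch of the idea follows; char and many1 are hypothetical names, though most combinator libraries offer something similar. Here a parser is any function taking (text, pos) and returning (value, new_pos) on success, or None on failure: char builds a terminal recogniser, and many1 is the "one or more" combinator described above.

```python
# A minimal parser-combinator sketch with hypothetical names. A parser
# is a function of (text, pos) returning (value, new_pos) or None.
def char(predicate):
    # Terminal recogniser: matches one character satisfying `predicate`.
    def parse(text, pos):
        if pos < len(text) and predicate(text[pos]):
            return text[pos], pos + 1
        return None
    return parse

def many1(parser):
    # Combinator: recognises one or more of `parser`, returning the
    # recognised elements as a list.
    def parse(text, pos):
        result = parser(text, pos)
        if result is None:
            return None  # at least one match is required
        items = []
        while result is not None:
            value, pos = result
            items.append(value)
            result = parser(text, pos)
        return items, pos
    return parse

digit = char(str.isdigit)
number = many1(digit)
print(number('123abc', 0))  # (['1', '2', '3'], 3)
```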
Sadly, due to the strong functional programming nature of combinators, it's often very hard to statically analyse the parser to check for ambiguities or inconsistencies in the grammar. These issues only tend to become obvious at runtime, meaning that if you're using parser combinators to build your parser, it's recommended that you carefully write your grammar first, and convert it to code second.
Homework
Find a program which you use that consumes input in a form specific to the program itself (or find a library which is meant to parse some format), and take a deep look at how it performs lexical analysis and parsing.