Computer programs written in high- or medium-level languages are most often represented as strings of characters in text files. Software cannot process raw sequences of characters without some knowledge of what the various combinations of characters forming keywords, identifiers and so on (known as "tokens") mean in terms of the language in which the program has been written. In order to do anything useful with the contents of these files, the characters need to be analysed and built into some abstract representation of the program, describing its structure, which can then be processed[7][8]. Often this abstract representation is translated into a different textual representation, e.g. the conversion from C source code to binary object code, or from Java source code to Java byte code.
Parsing involves grouping the tokens of a source program into grammatical phrases. The parser takes as input the textual representation of the source program and constructs from it an abstract hierarchical representation of the same program, usually in the form of a tree. There are two stages to this conversion. The first is lexical analysis, which takes the stream of input characters from the textual representation of the program and converts it into a stream of tokens. The second stage is syntax analysis, where a parse tree representing the program is built from the stream of tokens. This tree can be processed by further stages either to perform semantic analysis, for instance to check the correctness of the program, or to generate code (either machine code for a specific architecture in the case of a compiler, or source code for another high-level language in the case of a translator).
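The first of these stages can be sketched with a small hand-written lexer for a toy language of integers and '+'/'-' operators. The Token and Lexer names below are illustrative, not taken from any particular tool:

```java
import java.util.ArrayList;
import java.util.List;

public class Lexer {
    public enum Kind { NUMBER, PLUS, MINUS }

    // A token pairs a kind (the grammatical category) with the matched text.
    public static final class Token {
        public final Kind kind;
        public final String text;
        public Token(Kind kind, String text) { this.kind = kind; this.text = text; }
    }

    // Lexical analysis: convert a stream of characters into a stream of tokens.
    public static List<Token> tokenize(String source) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                               // whitespace separates tokens
            } else if (Character.isDigit(c)) {
                int start = i;                     // consume a maximal run of digits
                while (i < source.length() && Character.isDigit(source.charAt(i))) i++;
                tokens.add(new Token(Kind.NUMBER, source.substring(start, i)));
            } else if (c == '+') {
                tokens.add(new Token(Kind.PLUS, "+")); i++;
            } else if (c == '-') {
                tokens.add(new Token(Kind.MINUS, "-")); i++;
            } else {
                throw new IllegalArgumentException("unexpected character: " + c);
            }
        }
        return tokens;
    }
}
```

A syntax analyser would then consume this token stream, rather than raw characters, when building the parse tree.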
As the translation from text to an abstract program representation is a problem which has been encountered many times before, it is now well understood how to construct programs which parse text into meaningful words and symbols and form so-called Abstract Syntax Trees (ASTs) representing the program. In fact, this problem is so common that software has been developed which, given a description of the language, will automatically generate a program to parse a text file into an AST. Such software is called a "parser generator" or, less accurately, a "compiler compiler". The description of the language to be parsed is usually given in the form of a "grammar", which defines what constructs (or "productions") can be used in the language, and what tokens comprise them.
There are many different pieces of parser-generating software available, some of which are described in the following paragraphs.
Lex and Yacc[9] are fairly old UNIX programs which split the parsing problem into two parts. Lex is essentially a lexical analyser (or "lexer"): it splits the source file into tokens. Yacc (Yet Another Compiler Compiler) finds the hierarchical structure of the program. The Yacc user specifies the structure of the input, together with code to be invoked as each such structure is recognised.
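As an illustration, a Yacc rule pairs a production with an action to be run when that production is recognised. The fragment below is a simplified sketch rather than a complete Lex/Yacc specification; it evaluates additions as they are parsed, using Yacc's `$$` (the rule's own value) and `$n` (the value of the n-th symbol on the right-hand side):

```yacc
expr : expr '+' term   { $$ = $1 + $3; }
     | term            { $$ = $1; }
     ;
```

In a full specification, Lex would supply the tokens (such as the numbers making up `term`) and Yacc would apply rules like this one to recover the program's structure.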
ANTLR[10] (Another Tool for Language Recognition) was written by Terence Parr, who previously wrote PCCTS[11]; where PCCTS generates parsers written in the C language, ANTLR produces parsers written in Java or C++.
SableCC[12] is another parser generator which produces Java code. An advantage of SableCC is that the code it generates uses the Visitor Pattern[13]. This means that a new operation on the AST can be added, without changing the classes belonging to the AST, simply by writing a new class containing the methods to be applied to each kind of node in the AST. An instance of this class is then passed to each node's apply() method.
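The pattern can be sketched as follows, assuming a toy AST with number and addition nodes. The class names and the caseNumber/casePlus methods are invented for illustration and are not SableCC's generated names:

```java
public class VisitorDemo {
    // One method per node kind; a new operation is a new implementation of this.
    public interface Visitor {
        void caseNumber(NumberNode n);
        void casePlus(PlusNode n);
    }

    // Each node's apply() dispatches to the visitor method matching its own type.
    public interface Node { void apply(Visitor v); }

    public static final class NumberNode implements Node {
        public final int value;
        public NumberNode(int value) { this.value = value; }
        public void apply(Visitor v) { v.caseNumber(this); }
    }

    public static final class PlusNode implements Node {
        public final Node left, right;
        public PlusNode(Node left, Node right) { this.left = left; this.right = right; }
        public void apply(Visitor v) { v.casePlus(this); }
    }

    // A new operation on the AST, added without modifying the node classes above.
    public static final class Evaluator implements Visitor {
        public int result;
        public void caseNumber(NumberNode n) { result = n.value; }
        public void casePlus(PlusNode n) {
            Evaluator l = new Evaluator(), r = new Evaluator();
            n.left.apply(l);
            n.right.apply(r);
            result = l.result + r.result;
        }
    }

    public static int evaluate(Node root) {
        Evaluator e = new Evaluator();
        root.apply(e);
        return e.result;
    }
}
```

A type checker or pretty-printer over the same AST would be another Visitor implementation; the NumberNode and PlusNode classes never change.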
The parser is generated from a grammar for the language, given as a text file. The grammar defines the tokens and productions of the language which need to be recognised.
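For illustration, the fragment below sketches the general shape of such a grammar file in a SableCC-like notation. The token and production names are invented, and the syntax is simplified rather than a complete working grammar:

```
Tokens
  number = ['0' .. '9']+;
  plus = '+';
  blank = ' ';

Ignored Tokens
  blank;

Productions
  expr = {add} expr plus number |
         {lit} number;
```

From a file of this shape, the parser generator produces the lexer (from the Tokens section) and the parser and AST node classes (from the Productions section).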