Translation


A large part of the work in developing the Kenya system was developing a way to translate Kenya code into Java code. Although this is a mechanical translation, I wanted the Java code that was produced to look as "human generated" as possible. It should be as close as possible to the code that a human programmer would write to solve the problem in Java.

There should perhaps be a caveat here. Java is an object-oriented language, and the way in which "good" (or certainly large) programs are written in Java normally makes much use of this. Kenya is not an object-oriented language, as although there is provision for defining classes and creating instances of these classes, there is no inheritance, and therefore no type polymorphism. (As it happens, this is the same as Visual Basic, which Microsoft claims is object-oriented, but I would argue against this). It is suggested that novice programmers should not be introduced immediately to the object-oriented programming paradigm, but that they should begin with a more traditional, structured approach. Because of this, the Kenya system will generate the Java code that a programmer would write to solve a problem if they were approaching it in a structured way, rather than an object-oriented way. There is not a straightforward mapping between a structured program and an object-oriented program to solve the same problem, as they are two different ways of thinking about solving the problem. This makes the task of writing a mechanical translator from structured code to object-oriented code extremely difficult. It also means that, were such a translation achievable, a novice programmer looking at the generated code would not easily be able to see, based on the way that they had been thinking about the problem, how the Java code worked.

To carry out the translation, first the text representing the Kenya program must be processed to form an abstract representation. Then the abstract representation can be processed, generating the relevant Java code from each construct and expression in the program.

Parsing

The parser for Kenya was generated using SableCC[12], as it outputs Java code which is easy to incorporate with the rest of the code for the project. In addition, SableCC's use of the Visitor pattern allows the functionality defined over the AST to be extended without modifying any of the generated code by hand, and leads to an elegant solution.

Writing the Grammar

A grammar for SableCC is written using Extended Backus-Naur Form (EBNF)[17]. The grammar has the following sections (examples are given of the sorts of definition which might appear in each section):

Helpers
  digit = ['0'..'9'];
  
Tokens
  minus = '-';
  plus = '+';
  blank = ' ';
  number = digit+;

Ignored Tokens
  blank;

Productions

  expression =
    {plus}   expression plus number  |
    {minus}  expression minus number |
    {number} number; 

Each construct with an equals sign in the middle defines the name on the left hand side to stand for whatever appears on the right hand side. In the case of digit, this is the set of digits 0 to 9. The tokens give names to the pieces of text which may appear in the program source. Operators such as + can be used, as in the case of number; the + says that a number is a sequence of one or more digits.

Productions represent the way in which tokens can be combined to write programs. They can be defined recursively as shown in the example. There are several alternatives for what an expression may be. It may either be a number or an expression to which a number is added or subtracted.

Once a grammar has been written, SableCC will try to generate a lexical analyser and a parser for the language. Quite often during the writing of a grammar SableCC will report that it cannot generate a parser as it cannot resolve precisely which production a set of tokens represents. This is a shift/reduce or reduce/reduce error and is caused by an ambiguity in the grammar.

SableCC produces a so-called "bottom up parser". This means that the parser proceeds from left to right along a stream of tokens, pushing the tokens on to a stack. At each stage it tries to determine whether the elements at the top of the stack represent an entire production, or whether more tokens need to be examined in order to make an unequivocal decision as to what the string of tokens represents. If the top elements of the stack do represent a production, those elements are "reduced", exchanging the set of tokens for a symbol representing the production as a whole. If there is not yet enough information to make a decision then the next token in the stream is "shifted" on to the top of the stack and the process continues.
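As an illustration, using the small expression grammar above, a bottom-up parse of the input 1 + 2 proceeds roughly as follows (the states used by the generated parser are more detailed than this, but the shifts and reduces are the same):

Stack                      Remaining input    Action
(empty)                    1 + 2              shift the number 1
number                     + 2                reduce number to expression
expression                 + 2                shift +
expression plus            2                  shift the number 2
expression plus number     (end)              reduce to expression ({plus} alternative)
expression                 (end)              accept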

If at any point the parser is in a situation where it has multiple options and cannot make a decision based only on the contents of the stack and the next token, then there is a "shift/reduce conflict" (if the parser is unsure whether to shift or reduce) or a "reduce/reduce conflict" (if there are several possible reductions which could be made). See the Dragon book[8] for more details on bottom up parsing techniques.

When SableCC generates a parser, it checks whether either of these conflicts can arise. If they can then it will not complete the generation of the parser. The productions in the grammar must be rewritten until the parser will no longer be able to enter such a conflict state. Such conflicts are generally very frustrating and not straightforward to remove or prevent. A great deal of time was spent during this project working on the grammar to remove conflicts.
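As a hypothetical illustration of how such a conflict arises (assuming suitable identifier and semicolon tokens have been defined), consider a grammar fragment in which two productions can both derive a lone identifier:

  statement =
    {expr} expression semicolon |
    {decl} declaration semicolon;

  expression = identifier;

  declaration = identifier;

Having read an identifier and seeing a semicolon as the next token, the parser cannot tell whether to reduce the identifier to an expression or to a declaration, so SableCC reports a reduce/reduce conflict. The fragment has to be rewritten, for example by merging the two alternatives into a single production and distinguishing the two cases later during analysis.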

Analysis

SableCC takes a grammar definition and produces from it a Node class, together with a set of subclasses representing each of the operators and constructs which can appear as nodes in the syntax tree. It also produces a lexical analyser and a parser which together process the textual representation of the source code to produce this tree. Everything which is a Node has a method with the signature public void apply( Object o ). This allows visitor objects to be passed to each node to perform a particular type of analysis or processing on them.

SableCC generates a class called DepthFirstAdapter which will perform a depth first traversal of the parse tree, without performing any operations on the nodes as it goes. This class can be extended and its methods overridden to process the tree.

The methods in DepthFirstAdapter have a signature of the following form:

public void caseAVariableDeclaration( AVariableDeclaration node )
{
    // do something
}

AVariableDeclaration is a subclass of Node. There is a caseXY method corresponding to each type of node which may be encountered during the traversal of the tree. When a DepthFirstAdapter is passed to a node's apply() method, the case method corresponding to that node's type is called.
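To give a feel for how the generated classes fit together, a parser can be driven and a visitor applied roughly as follows. This is only a sketch: the file name and the Translator visitor are placeholders, exception handling is reduced to a throws clause, and the import statements for java.io and the generated packages are omitted.

public static void main ( String[] args ) throws Exception
{
    PushbackReader reader = new PushbackReader( new FileReader( "Program.k" ), 1024 );
    Parser parser = new Parser( new Lexer( reader ) );
    Start tree = parser.parse();       // root of the abstract syntax tree
    tree.apply( new Translator() );    // walk the tree with a visitor
}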

Most Nodes in the tree will have children. What children they have depends on the grammar. For example, the grammar might contain the following definition:

variable_declaration =  type identifier initialiser? semicolon;

Here we can expect a node of type AVariableDeclaration to have children representing exactly one type, one identifier, zero or one initialisers and a semicolon. This might be processed in the following way:

public void caseAVariableDeclaration ( AVariableDeclaration node )
{
    addToSymbolTable( node.getIdentifier() , node.getType() );

    if  ( node.getInitialiser() != null ) {
         node.getInitialiser().apply( this );
    }

    // control returns here after the tree representing
    // the initialiser has been processed
}

Here we see that get methods are provided for all the possible children of the node. In the case of the initialiser, as it is optional, a test is done to see whether there is one. If there is, the visitor is passed on to process the tree representing the initialiser (an initialiser will probably consist of at least an assignment operator and an expression, which may itself be a complicated set of nodes) by calling the node's apply() method and passing it the DepthFirstAdapter (i.e. this). Programmers unfamiliar with the Visitor pattern may find it slightly counterintuitive that the this reference refers to the DepthFirstAdapter rather than to the Node being processed. After the tree representing the initialiser has been processed, the flow of control returns to this node, as indicated by the comment in the code fragment above, showing that the tree is processed recursively.

Code Generation

Once the abstract syntax tree has been generated, it can be processed by the code generator to produce equivalent code in a different language, in this case Java (with GJ extensions, see the language design section).

The main difference between a Kenya program and a Java program is that a Kenya program comprises a list of statements, whereas a Java program consists of a set of class definitions, and has the following structure:
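class ClassName
{
    attribute declarations

    method definitions
}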

One of the methods defined must be called main. This is the point of entry into the program. In Kenya, the point of entry is simply the top of the program.

The main issues of concern when generating Java code from a Kenya program, together with other pertinent issues, are discussed in the sections which follow.

Style

One of the most difficult things to decide was how to write Java code in the style of a human. Different people have different coding styles, and there is no one "best" style. Style is concerned both with the layout of keywords on the screen (i.e. what the code looks like) and the way that constructs are defined and combined.

In terms of layout, a consistent convention was used, always placing the opening curly bracket of a block on a new line after an if or a class keyword. The way in which the program itself is structured is largely determined by the way in which the user has solved their problem. However, there are a few Java idioms which can be applied when generating the Java code.
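For example, an if statement in the generated code is laid out as follows (the condition and statement are chosen purely for illustration):

if ( count > 0 )
{
    count = count - 1;
}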

In a procedural language, a constant would be declared just like any other variable (possibly at the top of the file). A Java programmer is more likely to define their constants as static final members of the class which they are writing. If a direct, line by line translation were made from Kenya to Java then constants would be defined in the body of the main method. The Java code would still compile, but it would not be a good example of how to write Java programs. To avoid this, in the Kenya implementation all of the constant declarations are pulled out of a Kenya program and defined as attributes of the class.
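For instance, a hypothetical constant representing a maximum score, declared at the top of a Kenya program, would appear in the generated Java as an attribute of the class rather than as a local variable of main:

public class Program
{
    static final int MAX_SCORE = 100;

    public static void main ( String[] args )
    {
        statements
    }
}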

The placement of different elements within a Java program is also a matter of style. In the generated code the following pattern is followed:

class Classname {

  constants

  main method

  other methods
}

class Record
{
   attributes
}

Having an order for generating the code for different elements which is not necessarily the same as the order in which they are defined in the Kenya program means that the nodes of the abstract syntax tree cannot simply be processed in order. Account must be taken of what the nodes represent and the tree must be rearranged to meet the required structure.

Implementing the Translator

The translator was implemented by writing a visitor which could be passed to the apply() method of each node of the syntax tree produced by the parser. The class Translator extends the class DepthFirstAdapter created by SableCC. The DepthFirstAdapter performs a depth first traversal of the tree, doing nothing to the nodes as it walks over them. By extending this class and overriding its methods, a visitor was created which traverses the tree and generates the Java code relevant to each node.
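As an illustration of the pattern used (the node class, child names and buffer are hypothetical, but the shape is representative), a single overridden case method appends the Java text for its construct to a buffer and passes the visitor on to the node's children:

public class Translator extends DepthFirstAdapter
{
    private StringBuffer output = new StringBuffer();

    public void caseAWhileStatement ( AWhileStatement node )
    {
        output.append( "while ( " );
        node.getCondition().apply( this );    // generate code for the condition
        output.append( " )\n{\n" );
        node.getBody().apply( this );         // generate code for the loop body
        output.append( "}\n" );
    }
}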

Initially the approach taken was to simply generate the code

public class Program
{
    public static void main ( String[] args )
    {

and then traverse the tree, generating the code for each statement and appending it to the program. When the traversal was complete, the program was finished off by closing the method and the class

    } 
}

I decided to call the main class Program in all cases. Class names are generally decided by the purpose of the code being written. It is not possible for a machine to infer from the Kenya code what the intent of the program is and summarise it with a suitable name, so I decided to stick with Program.

Function Definitions

For simple programs the above approach works well. However, when a function is defined in the Kenya program, its definition cannot simply be written out in the flow of the Java code. Doing this would put one method definition nested inside another. Java does not allow this as all methods have to be members of a class, even if they are static and can therefore be called without creating an instance of the class. The code which needs to be generated is of the following form:

public class Program
{
    public static void main ( String[] args )
    {
         statements        
    }

    static void doSomething()
    {
         statements
    }
}

To achieve this, whenever a function definition is encountered, instead of continuing down that branch of the tree and generating the code for the method definition, the branch is copied into a list of function definitions, and the code generator moves on to deal with the next branch of the tree. At the end, the main is closed, but before the class definition is closed, the translator object processes each function definition which has been accumulated in the list individually.
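A sketch of how this deferral might look in the translator is given below. The node class name and helper method are hypothetical, and the collection classes come from java.util.

private List functionDefinitions = new LinkedList();

public void caseAFunctionDefinition ( AFunctionDefinition node )
{
    // do not descend into this branch now; remember it for later
    functionDefinitions.add( node );
}

// called after the main method has been closed, but before the
// closing brace of the class is written
private void generateFunctions ()
{
    for ( Iterator i = functionDefinitions.iterator(); i.hasNext(); )
    {
        generateFunction( (AFunctionDefinition) i.next() );
    }
}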

All the methods for which code is generated are declared to be static, as an object of type Program will never be created. However, only main is declared as being public, as it is the only method which will be called from outside the package (by the Java Virtual Machine when the program is invoked). All other methods will be called from inside the same class, or the same package, and so can (and should) be left "friendly", without an access modifier.

Class Definitions

A similar issue to that raised by function definitions is raised by class definitions. In Java it is permitted to nest one class definition inside another, producing a so-called "inner class". However, it seems clearer (to me at least) to have all classes defined at the same level. There are situations in which inner classes are useful, chiefly callbacks and cases in which multiple implementation inheritance is required (see ch 8 of Eckel[18] for a discussion of these), but such issues are far more complicated than anything that should concern an introductory programmer, and I feel that the use of inner classes should be reserved for these more sophisticated purposes rather than everyday problems. To keep all classes at the same level, class declarations are added to a separate list and the code generator is applied to them at the end, as is the case with function definitions.

Java programmers often define each class in a separate file. The compiler insists that every public class is defined in its own file, and that the file is named after the class. Programmers often take this to mean that all classes should be put in their own files, a habit reinforced by the tendency to declare public any class that is not private. The public modifier is only necessary when a class needs to be accessed by a class in a different package. Any class that is only accessed by classes in the same package (the majority of cases) can be declared "friendly", with no access modifier, and can be defined in the same file as other classes. It is my belief that, particularly for simple classes, it makes much more sense to include several classes in a file. This saves having to constantly switch between files or have multiple editor windows open, and allows the whole program to be seen at once. I believe this is valuable when trying to learn how a program works.

Class Names

As mentioned above, each translation of a Kenya program to Java produces a public class called Program. In order for the Java compiler to process this class, it must be saved in a file called Program.java. This only becomes a problem when the user gets to the stage where they want to work with the Java. They will end up with multiple files all called Program.java. If they change the name then the file will not compile, but if they do not change the name then they cannot keep the files in the same directory.

The user could change the name of the class to something other than Program and change the name of the file accordingly, but even if they do this there may be other problems. When a Java program is compiled, a .class file is produced for each class that has been defined. If two Kenya programs are written which each define a class Dog, these will translate to two Java programs which each define a class Dog. The user can change the name of the main class from Program to something else so that they can keep the two programs in the same directory for convenience. However, when they compile one of the programs using the Java compiler, this will create a file called Dog.class which will overwrite any Dog.class files produced by previous compilation of the other program. This will cause the first program to execute incorrectly unless the definitions of the Dog class in both programs are the same.

These difficulties can be overcome by using Java's package system. By defining a package corresponding to each Kenya program that is written, all the .java and .class files generated from that Kenya program can be put in the same package. In the filesystem this is implemented by creating a directory, so it does not allow multiple files called Program.java in the same directory, but it does provide a neat organisation, and certainly avoids the conflicts indicated by the Dog example above. The best way to get a sensible package name is to ask the user to provide one. A package statement needs to be added to the top of the Java program to indicate which package the class is in.
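For example, if the user supplied the package name zoo (a name chosen here purely for illustration), the generated Program.java would be placed in a directory called zoo and would begin:

package zoo;

public class Program
{
    ...
}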

Input

Kenya provides the facility for obtaining input in programs through the functions readInt(), readReal() and readString(). These translate to somewhat complex functions in Java, as can be seen in the language design section. Although in general in designing the Kenya system it has been my principle to present the user with all the code necessary to achieve the same functionality in Java, in this case I have decided to place the functions into library classes and import them into the namespace. This does not go against the principle of generating the code that a human Java programmer would write, as a human programmer would not rewrite classes that they had written before, but would reuse the code, importing it from another package. The code for the input routines will be supplied in the language reference for those who wish to know how they were written.

As the translator traverses the syntax tree, every time it comes across a call to an input function it adds an entry to a set. If only one type of input has been used (e.g. just readInt()), that class is imported explicitly by placing the following statement at the top of the program:

import kenya.io.intReader;

If more than one type of input has been used then the whole of the kenya.io package is imported into the namespace instead, using the statement:

import kenya.io.*;
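The bookkeeping for this can be sketched as follows. The helper names are assumed purely for illustration, the set classes come from java.util, and the translator would call recordInputUse() whenever it generates a call to one of the input functions.

private Set inputClassesUsed = new HashSet();

// remember which reader class a call such as readInt() relies on
private void recordInputUse ( String readerClass )
{
    inputClassesUsed.add( readerClass );
}

// produces the import statements for the top of the generated program
private String generateImports ()
{
    if ( inputClassesUsed.size() == 1 )
    {
        return "import " + inputClassesUsed.iterator().next() + ";\n";
    }
    if ( inputClassesUsed.size() > 1 )
    {
        return "import kenya.io.*;\n";
    }
    return "";
}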