Thursday, August 18, 2011

Manipulating Java Class Files with BCEL - Part Two: Expressions

This is the second article in the series of articles on Apache BCEL. If you have not read part 1, read it here.

Expression Processing: Expressions are key part of a language. In this article, I will discuss how expressions are compiled into java byte code. I will also cover a little bit about compilation process. At the end, I will go through the steps and create a compiler for numerical expression.

How Java Codes Expressions: As I discussed in part 1, java works with an operand stack. A simple expression 2+3 will turn into the following (assuming 2 is stored in constant number 3 and 3 is stored in constant number 3).
  1. ldc #3
  2. ldc #6
  3. iadd

The iadd instruction will pop its two operands, add them and them push them back into the stack. The following program does exactly this (but also contains code to do the output).

If you run the generated class now, it should output 5. More complex expressions are processed through properly manipulating the stack operations.

Prefix, Infix, Postfix Operators: Infix operators are the ones we normally use in normal calculation. For example 2+3. A prefix operator comes before the operands, like this: + 2 3. Similarly, a postfix operator comes after the operands, 2 3 +. The advantage of using prefix or postfix operators is that there is no need for parentheses. For example, the expression 2+3*8+(9-2) becomes
2 3 8 * + 9 2 - +
with postfix operators. Note that even operator precedence is not required. You may be wondering why I am talking about all these here. The simple reason is that if you write your expression in postfix notation, it will represent the JVM instructions needed to process the expression. The following shows the instructions need to do the above calculation. The parentheses represents a constant pool index that contains the value written inside.
  • ldc (2)
  • ldc (3)
  • ldc (8)
  • imul
  • iadd
  • ldc (9)
  • ldc (2)
  • isub
  • iadd

This exactly represents the postfix operator expression shown above. But postfix operator notations are difficult to understand for humans. So, how do we convert infix operator notation to prefix operator notation? For that we need to learn a little about how a compiler works.

Little Bit on Compilation: Although, steps may vary from compiler to compiler, a compilation process involve pretty much the following steps.
  1. Lexing
  2. Parsing
  3. Semantic Analysis
  4. Intermediate Code Generation
  5. Optimization
  6. Code Generation
This is not an article on compilation, so I will keep it as simple as possible.

Lexing: Lexing is the process of dividing the expression in a number of tokens. tokens are like words in a sentence. For example, 12+3*114 will be divided into tokens 12,+,3,*,114. All lexers are invariably deterministic finite state atomata or DFA. A DFA is a simple program (or any other thing) that has a set of posible states, and a set of posible transitions from one state to another triggered by an input symbol. A DFA thus processes a stream of input symbols, hopping from one state to another in the process. If a symbol is encountered for which no transitions are defined, its an error.

A state can be either an accepting state or a non-accepting state (but not a rejecting state). Whenever the DFA hits an accepting state, a token is created from the symbols already read. If at the end of the symbol stream, a non-accepting state is reached, it is an error.

The above picture shows the DFA for our lexer. The states Begin and Space are accepting states shown by double elipse. It might look a little too complicated at the begining, but if seen carefully, its pretty simple. Alphabets and $,_ start a variable; aplhanumeric and $,_ continue it. 0-9, . start and continue creating a number. Everything else is even simpler. The lex() function in the program given below implements this logic.

Parsing: Parsing is the process of arranging the tokens in something called a parse tree. The parse tree represents the order of evaluation. The leaves are evaluated first, and then the branches. Hence, the parse tree is later post-order traversed to create the instruction set. This is same as using the postfix operators.

A program's structure can be defined using a context free grammer (CFG). A CFG is a set of terminals, non-terminals and productions. Terminals are the symbols having one-to-one correspondence with the tokens generated in the lexer. Note here, the symbols in the parser are the tokens in the lexer. Non-terminals are the symbols created to facilitate creation of the grammer. Productions are rules for combining terminals and non-terminals to create non-terminals. The following set of productions represent our grammer.

The non-terminal at left hand side of the first production will be used as the root of the parse tree. For example, let us consider the following set of production.
  1. E->E+E
  2. E->num
Here num is a terminal symbol and E is non-terminal. Let us assume, the num symbol represents all integers. Then the symbol 2+2 can easily be formed as follows:
                                 |       |        |
                                 E       +        E
                                 |                |
                                 2                2

However, in this grammer, the symbols 2+3+4 can be written in two different ways, as shown below.
                                 |       |        |
                                 E       +        E
                                 |          ______|______
                                 2          |     |     |
                                            E     +     E
                                            |           |
                                            3           4

                                 |       |        |
                                 E       +        E
                           ______|______          |
                           |     |     |          4
                           E     +     E
                           |           |
                           2           3

Hence, this grammer is called ambiguous. The grammer can be made unambiguous by making the following change. However, notice that the language generated by both the grammers are same.
  1. E->E+num
  2. E->num

Making an unambiguous grammer is very important for parser generators. The following show the grammer used by us. However we will make the parser by hand coding.
  1. E->E1
  2. E->E+E1
  3. E->E-E1
  4. E1->E2
  5. E1->E1*E2
  6. E1->E1/E2
  7. E2->(E)
  8. E2->num
  9. E2->var
  10. E2->+E2
  11. E2->+E2
  12. E2->-E2
  13. E2->-E2

LALR(1) Parsing: LALR(k) or Look Ahead Left Right parsing is the most commonly used parsing technique. It is left-right, because it reads the symbols from left to right, and then analyzes them from right to left. It means that the right most symbols are matched for any possible non-terminal. If one exists, replacement is done. In practice, a stack can be maintained to contain the symbols, and then the non-terminal can be formed using the symbols present at the top of the stack. Notice above, that while resolving amguity, we have conciously put the leaf symbols on the right side to support this parsing technique. The following shows the parsing sequence for 2+3+4.
  1. 2: E2 <- E1
  2. +: E1 + <- E +
  3. 3: E + 3 <- E + E2 <- E + E1 <- E
  4. +: E +
  5. 4: E + 4 <- E + E2 < - E + E1 <- E

However, this technique will fail if we have 2+3*4. At step 4, we will end up having E * , which would be a dead end. Hence, before we apply the addition, we need to see wheather the next symbol is a * or /. Since we look one symbol ahead, its called LALR(1) parsing.

Note that in a real LALR parser, the symbols are processed with a DFA, and it makes sense to store the DFA states in the stack rather than the symbols themselves to save time. However, for simplicity, we will logically try to find a suitable non-terminal every time.

In the example below, the parse method implements this parsing technique. (Again a code tells a thousand words). However, I would point out that here the non-terminals would actually represent operators.

Semantic Analysis: Semantic Analysis is the part of compilation in which language sematics are processed, including variable scoping, class, method resolution etc. In our case, the only analysis would be to find the variables. But it would be easier for us to do that after intermediate code generation, so we will do that then.

Intermediate Code Generation: As I pointed out earlier, java bytecode is equivalent to postfix operators, hence intermediate code generation would simply be the post order traversal of the parse tree.

Optimization: We would remove all unary + and all couples of unary -.

Code Generation: Now it is very simple, as the symbols are already in postfix form, it will be simply taking one by one and converting to code.

The following is the code for the compiler for expression. It creates a class file that has a main method. If variables are created, their value must be provide during invocation of the generated class in the order of their declaration (variables are declared when they are used for the first time).

How to check errors: In case you make a mistake, it is very difficult to know the problem from the class verification error shown by the JVM. BCEL comes with a built-in class verifier called org.apache.bcel.verifier.Verifier. Below is a sample output.

bash-4.1$ java -cp /home/debasish/Download/bcel-5.2/bcel-5.2.jar:. org.apache.bcel.verifier.Verifier com.geekyarticles.bcel.ExpressionTest
JustIce by Enver Haase, (C) 2001-2002.

Now verifying: com.geekyarticles.bcel.ExpressionTest

Pass 1:
Passed verification.

Pass 2:
Passed verification.

Pass 3a, method number 0 ['public static void main(String[] args)']:
Passed verification.

Pass 3b, method number 0 ['public static void main(String[] args)']:
Passed verification.

Pass 2: Attribute '<LocalVariableTable: LocalVariable(start_pc = 0, length = 51, index = 0:String[] args)>' as an attribute of Code attribute '<CODE>' (method 'public static void main(String[] args)') will effectively be ignored and is only useful for debuggers and such.
Pass 2: SourceFile attribute 'SourceFile(ExpressionTest.exp)' has a funny name: remember not to confuse certain parsers working on javap's output. Also, this name ('ExpressionTest.exp') is considered an unqualified (simple) file name only.

To run this program, run java ExpressionCompiler "2+3.65*8-+adc_v* (-(+1))*(adf+$v3)+2.35 ". It will create a class com.geekyarticles.bcel.ExpressionTest. Now you can run java com.geekyarticles.bcel.ExpressionTest 1 2 3 to see the result.


Anonymous said...

Very nice post

Post a Comment