Expressions and Parsing - the Semantic Factor
Expression Parsing into Abstract Syntax Tree or Postfix Expression
The parsing of an expression creates the syntax tree for an expression, but the manner of translation for code synthesis suggests at another alternative parse method. The alternative means to parse an expression is to do the infix to postfix conversion in situ -- essentially the ubiquitous infix to postfix using a stack data structure. This would avoid generated an AST, and simplify the parse of a partition (a sequence of lexemes) to generate the syntax object or descriptor.
So the infix expression statement of:
x = y + z;
The parsing of an expression would be to convert infix to postfix, after handling the prefix and suffix that bound a sentence context. The postfix expression is:
x y z + =
Instead of parsing an expression recursively, the parse is linearly proportionate to the expression length.
The Complexity of Expression Postfix to Infix Conversion
The relatively simpler process of postfix to infix does have one snag--handling sub-expressions that are not overloadable, and are not explicitly part of the expression. Such operators are the access operators--such as array, attribute, and method access.
From the perspective of a string, the sub-expressions are sub-strings that must be handled as an atomic element of a single entity.
The expression statement of:
total = this.getCount + this`x ;
The expression is one of the following with the sub-expressions represented by one identifier:
total = SUBEXPR1 + SUBEXPR2 ;
The question is one of how to handle during the infix to postfix conversion the atomic sub-expressions without changing the meaning or intention of the expression.
Handling Sub-Expression Lexemes for Simpler Parse
The solution to the sub-expression problem is obvious--convert or translate the sub-expressions into something atomic to be handled like an identifier or literal in the infix to postfix conversion. In an AST parse, the sub-expression is a sub-tree, but it is not so simple what it is in the infix to postfix expression conversion. There is no AST created, so the sequence of lexemes cannot be parsed into a sub-expression tree. The difficulty is converting something that is hierarchical into something that is linear. (Note: XString is this exactly, turning XML from hierarchical data into a linear form of data.)
Instead of creating a sub-expression syntax tree, the principle is to take a series of lexemes that are created by a scanner or lexical analyzer and coalesce or by fusion simplify into a lexoid (a term for lex-prefix + oid-suffix). A lexoid is like an identifier or literal, except is not generated by a lexical analyzer, it is a contraction of lexemes.
Once a sequence or series of lexemes are coalesced into a lexoid, the expression can be converted from infix to postfix.
After the postfix expression is created, the lexoid can be expanded back into the original series of lexemes.
For the Mynx expression:
x = y`attr.doIt + z[0].fact(0);
Coalesce the sub-expressions into lexoids:
x = y`attr.doIt + z[0].fact(0);
The lexoids are formed from the sub-expressions that are a series of lexemes:
x = y`attr.doIt + z[0].fact(0);
x = [ Y ] [ Z ];
x = Y + Z;
The resulting expression using lexoids is then translated to postfix:
x Y Z + =
Converting the operator expression in postfix form to a method expression:
x Y Z + =
x [Y Z + ] =
[ x Y.add(Z) = ]
x.set(Y.add(Z))
The resulting method expression containing lexoids can then be expanded back to the form using lexemes:
x.set((y`attr.doIt).add(Z)
x.set((y`attr.doIt).add(z[0].fact(0))
This is tedious, the non-AST parsing is three steps:
- Coalesce lexemes into lexoids
- Convert the expression from operator to method expression
- Expand the lexoids into a series of lexemes
The benefit is that the expression is translated at the lexical level without creating an abstract syntax tree--so memory efficient.
Recursive Expressions for Translation
Often, the sub-expressions are expressions within the top-level expression. Such as:
x[y+1] = this.doIt(x*2+y) + q`x;
The sub-expressions are full expressions that need to be translated:
- y+1 - array sub-expression
- x*2+y - method parameter sub-expression
Essentially the sub-expressions are full expressions, and need to be parsed as expressions - converted into method expressions. But this is simply a recursive parse within the original expression parse.
The intriguing part is that when transformed into a lexoid, before substituting, the lexoids can be parsed -- so the conversion of a lexoid to a series of lexemes is also an expression parse that is recursively called. Illustrating the translation of the expression:
x[y+1] = this.doIt(x*2+y) + q`x; //original operator expression
LEX0 = LEX1 + LEX2; //coalesce lexemes into lexoids
LEX0 LEX1 LEX2 + = //convert infix to postfix form
LEX0 LEX1.add(LEX2) = //convert operator to method expression
LEX0.set(LEX1.add(LEX2)) //expand lexoids to lexemes -- but parse expression
LEX0 => x[y+1] => x[y.add(1)]
LEX1 => x*2+y => x.mul(2).add(y)
LEX2 => q`x => q`x //no translation -- access operator `
LEX0.set(LEX1.add(LEX2))
x[y.add(1)].set(LEX1.add(LEX2))
x[y.add(1)].set((x.mul(2).add(y)).add(LEX2))
x[y.add(1)].set((x.mul(2).add(y)).add(q`x))
In the parsing of the sub-expressions, should (!!?!) another expression be contained within the sub-expression (a sub-sub-expression), then again the conversion would process the expression, and before expanding the lexoids into lexemes another invocation to an expression parse.
The translation remains at the lexical or word (instead of sentence) level, and involves translating the original expression of N-lexemes into an expression of c*N (c is the expansion of operator to method call in added lexemes the single lexeme + becomes a sequence of four lexemes . add ( ) . Thus an expansion of 1:4 or a constant of 4. The constant c is a generalized constant for the expansion of the original number of tokens multiplicatively.
For a recursive expression containing other expressions it becomes c*N and then c*c*N or c*N because in Big-Oh the two constants c*c is yet another constant c. Thus the space and time are linear in relation to the expression size in terms of lexemes.
Currently, the Mynx compiler uses a parserExpression class to parse an expression consisting of a sequence or series of lexemes in a Sequence class. The result is a syntax object a descriptorExpression, which is in effect an abstract syntax tree. So essentially a conventional parse into a syntax tree. Later it might be interesting to alter the implementation to use the linear lexeme translation using lexoids in terms of time and space (i.e. performance of the compiler).
Sentential Parsing Algorithm as Prerequisite
An important requirement or pre-condition is to have an entire expression statement--the entire sentence bounded as a partition. This requires using the sentential parse algorithm instead of a conventional parse algorithm. The prefix and suffix of a sentence identify the sentence, and before a parse, the sentence is extracted from the sequence of lexemes returned by the scanner.
But in terms of properties, a sentential parse allows a sentence to be transformed but without needing to parse it--the parse is to re-arrange the lexemes for the same meaning but in different form. This feature is impossible using a LL or LR (what I term a regular parse algorithms) because both algorithms operator on a piece of the sentence or production being parsed, not the entire sentence as a sequence of lexemes--a partition.
Labels: mynx expression, mynx parse, mynx semantics

