Wednesday, November 28, 2007

Expressions and Parsing - the Semantic Factor

Mynx converts an operator expression into a method expression in order to do code synthesis after and during semantic analysis. But it must parse the expression, and create an abstract syntax tree (AST) from which the high-level language code is synthesized.

Expression Parsing into Abstract Syntax Tree or Postfix Expression



The parsing of an expression creates the syntax tree for an expression, but the manner of translation for code synthesis suggests at another alternative parse method. The alternative means to parse an expression is to do the infix to postfix conversion in situ -- essentially the ubiquitous infix to postfix using a stack data structure. This would avoid generated an AST, and simplify the parse of a partition (a sequence of lexemes) to generate the syntax object or descriptor.

So the infix expression statement of:

x = y + z;


The parsing of an expression would be to convert infix to postfix, after handling the prefix and suffix that bound a sentence context. The postfix expression is:

x y z + =


Instead of parsing an expression recursively, the parse is linearly proportionate to the expression length.

The Complexity of Expression Postfix to Infix Conversion



The relatively simpler process of postfix to infix does have one snag--handling sub-expressions that are not overloadable, and are not explicitly part of the expression. Such operators are the access operators--such as array, attribute, and method access.

From the perspective of a string, the sub-expressions are sub-strings that must be handled as an atomic element of a single entity.

The expression statement of:

total = this.getCount + this`x ;


The expression is one of the following with the sub-expressions represented by one identifier:

total = SUBEXPR1 + SUBEXPR2 ;


The question is one of how to handle during the infix to postfix conversion the atomic sub-expressions without changing the meaning or intention of the expression.

Handling Sub-Expression Lexemes for Simpler Parse



The solution to the sub-expression problem is obvious--convert or translate the sub-expressions into something atomic to be handled like an identifier or literal in the infix to postfix conversion. In an AST parse, the sub-expression is a sub-tree, but it is not so simple what it is in the infix to postfix expression conversion. There is no AST created, so the sequence of lexemes cannot be parsed into a sub-expression tree. The difficulty is converting something that is hierarchical into something that is linear. (Note: XString is this exactly, turning XML from hierarchical data into a linear form of data.)

Instead of creating a sub-expression syntax tree, the principle is to take a series of lexemes that are created by a scanner or lexical analyzer and coalesce or by fusion simplify into a lexoid (a term for lex-prefix + oid-suffix). A lexoid is like an identifier or literal, except is not generated by a lexical analyzer, it is a contraction of lexemes.

Once a sequence or series of lexemes are coalesced into a lexoid, the expression can be converted from infix to postfix.

After the postfix expression is created, the lexoid can be expanded back into the original series of lexemes.

For the Mynx expression:


x = y`attr.doIt + z[0].fact(0);


Coalesce the sub-expressions into lexoids:

x = y`attr.doIt + z[0].fact(0);


The lexoids are formed from the sub-expressions that are a series of lexemes:

x = y`attr.doIt + z[0].fact(0);
x = [ Y ] [ Z ];
x = Y + Z;


The resulting expression using lexoids is then translated to postfix:

x Y Z + =


Converting the operator expression in postfix form to a method expression:

x Y Z + =
x [Y Z + ] =
[ x Y.add(Z) = ]
x.set(Y.add(Z))


The resulting method expression containing lexoids can then be expanded back to the form using lexemes:


x.set((y`attr.doIt).add(Z)
x.set((y`attr.doIt).add(z[0].fact(0))


This is tedious, the non-AST parsing is three steps:

  1. Coalesce lexemes into lexoids

  2. Convert the expression from operator to method expression

  3. Expand the lexoids into a series of lexemes



The benefit is that the expression is translated at the lexical level without creating an abstract syntax tree--so memory efficient.

Recursive Expressions for Translation



Often, the sub-expressions are expressions within the top-level expression. Such as:

x[y+1] = this.doIt(x*2+y) + q`x;


The sub-expressions are full expressions that need to be translated:

  1. y+1 - array sub-expression

  2. x*2+y - method parameter sub-expression



Essentially the sub-expressions are full expressions, and need to be parsed as expressions - converted into method expressions. But this is simply a recursive parse within the original expression parse.

The intriguing part is that when transformed into a lexoid, before substituting, the lexoids can be parsed -- so the conversion of a lexoid to a series of lexemes is also an expression parse that is recursively called. Illustrating the translation of the expression:

x[y+1] = this.doIt(x*2+y) + q`x; //original operator expression
LEX0 = LEX1 + LEX2; //coalesce lexemes into lexoids
LEX0 LEX1 LEX2 + = //convert infix to postfix form
LEX0 LEX1.add(LEX2) = //convert operator to method expression
LEX0.set(LEX1.add(LEX2)) //expand lexoids to lexemes -- but parse expression

LEX0 => x[y+1] => x[y.add(1)]
LEX1 => x*2+y => x.mul(2).add(y)
LEX2 => q`x => q`x //no translation -- access operator `

LEX0.set(LEX1.add(LEX2))
x[y.add(1)].set(LEX1.add(LEX2))
x[y.add(1)].set((x.mul(2).add(y)).add(LEX2))
x[y.add(1)].set((x.mul(2).add(y)).add(q`x))


In the parsing of the sub-expressions, should (!!?!) another expression be contained within the sub-expression (a sub-sub-expression), then again the conversion would process the expression, and before expanding the lexoids into lexemes another invocation to an expression parse.

The translation remains at the lexical or word (instead of sentence) level, and involves translating the original expression of N-lexemes into an expression of c*N (c is the expansion of operator to method call in added lexemes the single lexeme + becomes a sequence of four lexemes . add ( ) . Thus an expansion of 1:4 or a constant of 4. The constant c is a generalized constant for the expansion of the original number of tokens multiplicatively.

For a recursive expression containing other expressions it becomes c*N and then c*c*N or c*N because in Big-Oh the two constants c*c is yet another constant c. Thus the space and time are linear in relation to the expression size in terms of lexemes.

Currently, the Mynx compiler uses a parserExpression class to parse an expression consisting of a sequence or series of lexemes in a Sequence class. The result is a syntax object a descriptorExpression, which is in effect an abstract syntax tree. So essentially a conventional parse into a syntax tree. Later it might be interesting to alter the implementation to use the linear lexeme translation using lexoids in terms of time and space (i.e. performance of the compiler).

Sentential Parsing Algorithm as Prerequisite



An important requirement or pre-condition is to have an entire expression statement--the entire sentence bounded as a partition. This requires using the sentential parse algorithm instead of a conventional parse algorithm. The prefix and suffix of a sentence identify the sentence, and before a parse, the sentence is extracted from the sequence of lexemes returned by the scanner.

But in terms of properties, a sentential parse allows a sentence to be transformed but without needing to parse it--the parse is to re-arrange the lexemes for the same meaning but in different form. This feature is impossible using a LL or LR (what I term a regular parse algorithms) because both algorithms operator on a piece of the sentence or production being parsed, not the entire sentence as a sequence of lexemes--a partition.

Labels: , ,

Sunday, November 18, 2007

Semantics of Operator Overloading and Expressions II

In my first blog entry I explained how the Mynx compiler translates operator expressions into complete method expressions. In effect, the compiler is resolving the operator overloads with the methods that are overloaded for a given type or class and the operator.

There are other use cases of expression statements such as access operators and non-overloadable operators. While the algorithm to translate from an operator expression to method expression does not change, the type of code synthesis does -- depending upon the underlying high-level language target.

Expressions with Access Operator: Array, Attribute, and Method



The access operators access or invoke access on a language element. The access operators invoke:

  1. Array - access element within an array using [ ] brackets.

  2. Attribute - access element within a class using ` tic

  3. Method - access invoke method of a class using . dot



Note the literal integer of 1 is translated into literal object LIT_INT_1 in the code synthesis so that literals are instance objects.

Array



An example of a Mynx expression with an array access operator is:

x = y[0] + 1;
x y[0] 1 + =
x y[0].add(1) =
x.set(y[0].add(1));
x.set(y[0].add(LIT_INT_1));


Attribute



An example of a Mynx expression with an attribute access operator is:


x = y + this`z;
x y this`z + =
x y.add(this.z_) =
x.set(y.add(this.z_));


Method



An example of a Mynx expression with an method access operator is:


x = this.fact(1) + this`y;
x this.fact(1) this`y + =
x this.fact(1).add(this.y_) =
x.set(this.fact(1).add(this.y_));


Mixed



An example of a Mynx expression with a mix of access operators is:


x = y[0] + this.fact(0) * this`z - 0;
x y[0] this.fact(0) this`z 0 - * + =
x y[0] this.fact(0) [ this`z 0 - ] * + =
x y[0] [ this.fact(0) this.z_.sub(0) * ] + =
x [ y[0] this.fact(0).mul(this.z_.sub(0)) + ] =
x [ y[0] this.fact(0).mul(this.z_.sub(0)) + ]=
x [ y[0].add(this.fact(0).mul(this.z_.sub(0))) = ]
x.set(y[0].add(this.fact(0).mul(this.z_.sub(0))));


Non-Overloadable Operator Expression to Method



Not all the operators in Mynx are overloadable either directly, such as + or -, or indirectly such as += or *=. The operators of in, as, is, or to are not overloadable. (Note: The operators use an English word instead of symbol to distinguish it from the overloadable operators--a language design decision.)

The three operators in more specificity are:

  1. as - creation assignment operator

  2. in - instance of check operator either type or reference

  3. is - reference assignment operator

  4. to - instance of cast conversion operator



The non-overloadable operators are closer to code synthesis in the high-level language.

For the assignment reference assignment operator is, consider the simple expression statement:

x is y;


The reference handle to the identifier x is assigned explicitly the reference handle to the identifier y--both x and y refer or point to the same thing. Converting to postfix expression form:

x y is


The type of x and y must be the same, and the is operator is translated into direct assignment in the HLL code synthesis. Thus the expression in HLL code synthesis is:

x = y;

Reference assignment in both C# and Java is the = operator.

A more complex expression statement is:

x is y to Int;


Translated to an equivalent postfix operator expression statement is:

x y Int to is


The translation from postfix operator expression to the HLL code synthesis expression is:

x [y Int to] is
x [( (Int) y ) ] is
[x ((Int) y) is]
x = ((Int) y) ;


The cast and assignment reference have been translated into the syntax form for the HLL code synthesis--interestingly enough both C# and Java (and C/C++) have the same syntax redux for code synthesis. The following Mynx source code fragment illustrates the as operator:

var Int x to null;

x as Int("37");


The as operator expression in postfix is:

x Int("37") as;


The translation from postfix operator expression is:

x = new Int().construct("37");


The new operator creates the instance, but by Mynx code synthesis uses an explicit constructor named construct.

A more complex operator expression involves the Mynx in operator in its two uses, as a type check and reference assignment check. Consider the Mynx operator expression in an if statement header:

if(x in Int && x in y)
null;
end if;


The operator expression is a Mynx Bool type--evaluates to true or false. But the expression must still be translated to postfix form, and then to the HLL code synthesis translation:

x in Int && x in y

The postfix form of the operator expression is:

x Int in x y in &&

The translation from an operator expression to the HLL synthesis is:

x Int in x y in &&
[ x Int in ] x y in &&
(x instanceof Int) [ x y in ] &&
(x instanceof Int) (x == y) &&
(x instanceof Int).logAnd( (x == y) )

The only overloaded operator in the expression is the logical and operator of &&.

For the HLL code synthesis, the if expression must return a logical value of true or false, but in the HLL synthesized language--Java or C#. Thus the translated expression is only partially complete, the general form from Mynx to the HLL programming language is:

if(EXPRESSION)

The HLL code synthesis translation is uses the Mynx Bool type class:

if(Bool.TRUE_ == EXPRESSION)


If the Mynx boolean expression returns a BOOL.TRUE, the == operator will then create a boolean true to the HLL. Using that information, the expression translated ultimately becomes:

if(Bool.TRUE_ == (x instanceof Int).logAnd( (x == y) ) )


The HLL code synthesis involves the HLL equivalent operators and the overloaded operators with the Mynx method.

Summarizing, the underlying code synthesis drives the HLL expression that is synthesized in the native language from a Mynx expression statement. For non-overloadable operators the synthesized expression is closer to the HLL programming language.

Labels: , , , , , , ,

Wednesday, November 14, 2007

Semantics of Operator Overloading and Expressions 1

During the external context semantic checks, operators for a type or class are associated with the method. The information is annotated to the syntax objects so that during code synthesis (code generation, but since I'm using a different approach I use different terminology) the data is available to create the high-level language code for an expression.

A top-level call to get the type and synthesize the code for an expression is passed to the child nodes of a node, and continues down the expression tree until it reaches an identifier, literal -- a atomic element of an expression.

As the code fragments as strings are then propogated upward to the parent node of an expression, the expression is synthesized from the sub-expressions.

The parse of an expression statement creates a syntax tree of interconnected syntax objects -- the classic abstract syntax tree (AST). During the external semantic checks operator associated with method identifier information is annotated, for code synthesis an expression of operators is translated into an expression of method calls.

The symbol table of declarations is accessed to determine the type of an identifier, and for an operator, the method associated with that operator for that class type--a semantic error if the operator is not overloaded.

Process of Operator Expression to Method Expression



Consider the following (infix) expression in Mynx:

x = y + z;


During the code synthesis, the infix expression is converted to the postfix expression on the fly. The (postfix) expression in Mynx:

x y z + =

Once in postfix form, the expression can be translated to a method expression.

The process involves going left-to-right and for an operator, if there are two sub-expressions (identifier, literal, or method expression) then the type of the left sub-expression and the operator determine the method call with the right sub-expression as the parameter.

In the postfix expression, the sub-expressions and operator are bracketed to indicate the sub-expression translation:

x [ y z + ] =


For x,y,z of type Int, and + is for the method "add", the translation is

x [ y.add(z) ] =

For the next sub-expression, the type Int, and = is for the method "set" the translation is:

[ x y.add(z) = ]


Note that the sub-expression
y.add(z)
is handled as an atomic unit, in this case as a sub-expression string.

The resulting translation is then one of:

x.set(y.add(z));


Thus the infix operator expression:
x = y + z;

is translated to a method expression of:
x.set(y.add(z));


The process requires type an external context semantic data annotated within the syntax objects representing an expression. In compiler parlance a syntax-directed translation.

Since an expression statement is a sentence, and the parse, semantics, and synthesis of code are centered around sentences, the process is encapsulated within the Mynx sentence object (in compiler terms a descriptor).

Processing Dual Assignment Operator Expression



The dual assignment operator expression contains both a + operator and = assignment operator. The dual assignment operator += does not support operator overloading directly--but indirectly the operators are each individually overloadable.

Hence the dual assignment operator expression:

x += y + z;


In postfix form is:

x y z + +=


But as the += dual assignment operator is not overloadable, so the infix expression needs to be rewritten with the individual overloadable operators:

x = (y + z) + x


Converting the rewritten expression to postfix is:

x x y z + + =


The processing of the expression is then:

x x [y z +] + =
x x [y.add(z)] + =
x [ x y.add(z) + ] =
x [ x.add(y.add(z)) ] =
[ x x.add(y.add(z)) = ]

x.set(x.add(y.add(z)));


Now the dual assign operation operators can be translated from an operator expression to a method expression.

Semantics are not a separate phase of a compiler, the semantics of operator overloading directly impacts the code synthesis (what is conventionally called code generation). The two phases of semantics and synthesis are grey areas of the compiler, not black-and-white separate stages. Unraveling one from the other for which phase is the more difficult part.

Labels: , , , , , ,

Website Spy Software