Semantic Checks in Mynx -- A Parse by any other Name...
I’ve had several folks ask questions or comment about implementing the semantic checks for Mynx, to the effect of... “Just implement the standard semantic checks and get the compiler done.” It reminds me of my first roommate in college, who asked why you couldn’t plug the cable TV into a computer monitor (I had my computer, but not a television, in 1992 -- which shows my priorities...) since “they’re basically the same.” An over-simplification, and in the case of the compiler questions/comments, they do not realize I’m using a new and different parse algorithm. So it is a new programming language, with its own semantics (such as multiple disjoint inheritance), plus the dynamics of a new parsing algorithm. Theory on two fronts.
The New Parsing Algorithm - Sentential or Sentence-Oriented Parse
How is this parsing algorithm different? The parse algorithm differs in two tasks:
- Identify and extract the discrete sentences of the grammar from the stream of lexemes produced by the scanner.
- Parse each sentence independently of the other sentences.
A regular LL or LR parser does both operations as part of a single parse. The new parsing algorithm does both, but in separate stages.
More specifically, the normal phases in a compiler text:
SOURCE => scanner => parser => semantics => generate => BINARY
become, with the new phase inserted:
SOURCE => scanner => partition => parser => semantics => generate => BINARY
The new parse algorithm requires a new compiler phase, the partitioner, which partitions the input stream of lexemes into discrete sentences; the parser then parses each sentence. Hence the name sentential parsing algorithm, or sentence-oriented parse algorithm.
Partitioning - Splitting the Input Lexemes into Discrete Sentences or Partitions
The partitioner identifies and extracts each sentence. Sentences are identified by a unique pair or 2-tuple of (prefix, suffix) lexemes. The prefix and suffix lexemes can be any legitimate lexeme in the grammar/language, so long as the combination of the two is unique.
In terms of automata, only a simple finite-state machine is required, with the prefix and suffix lexemes marking the states that enter and leave an accumulating state, which stores or concatenates each lexeme into a partition or sentence. The automaton power is equivalent to that used by lexical analysis -- but instead of symbols or characters forming a lexeme, lexemes form a partition or sentence.
An example in Mynx, the program Nil.mynx:
with mynx.core.*;
program Nil is
    null;
end program;
The sentences or partitions, with the identifying 2-tuple for each, are:
- (with, ;) = with sentence
- (program, is) = program-header sentence
- (null, ;) = null-statement sentence
- (end, program ;) = program-footer sentence
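To make the idea concrete, here is a minimal sketch of the partitioning finite-state machine, written in Java. The class and method names, the simplified lexeme granularity, and treating ';' as the program-footer suffix are mine for illustration -- this is not the actual Mynx partitioner.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// A minimal sketch of the partitioning idea: a sentence starts at a known
// prefix lexeme and is accumulated, lexeme by lexeme, until its matching
// suffix lexeme is seen -- a finite-state machine over lexemes rather than
// over characters.
public class PartitionSketch {

    // Hypothetical (prefix, suffix) pairs for the Nil.mynx example above.
    static String suffixFor(String prefix) {
        switch (prefix) {
            case "with":    return ";";   // with sentence
            case "program": return "is";  // program-header sentence
            case "null":    return ";";   // null-statement sentence
            case "end":     return ";";   // program-footer sentence (simplified)
            default:        return null;  // not a sentence prefix in this sketch
        }
    }

    static List<List<String>> partition(List<String> lexemes) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = null;      // null means: outside any sentence
        String suffix = null;
        for (String lexeme : lexemes) {
            if (current == null) {        // state: start a new sentence at its prefix
                suffix = suffixFor(lexeme);
                current = new ArrayList<>();
            }
            current.add(lexeme);          // state: accumulate lexemes into the partition
            if (lexeme.equals(suffix)) {  // suffix seen: the sentence is complete
                sentences.add(current);
                current = null;
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Lexeme granularity is simplified; "mynx.core.*" is treated as one lexeme.
        List<String> lexemes = Arrays.asList(
            "with", "mynx.core.*", ";",
            "program", "Nil", "is",
            "null", ";",
            "end", "program", ";");
        partition(lexemes).forEach(System.out::println);
    }
}

Each of the four partitions is extracted in a single left-to-right pass, with no lookahead and no backtracking -- the same automaton power the scanner uses to form lexemes from characters.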
Why divide, identify, and extract each sentence in the input source? What advantages does it provide?
- Sentences are extracted without lookahead or any backtracking.
- Sentences are discrete and independent of any before or after.
- Sentences are identified, so no ambiguity for the parse.
The discrete extraction of each sentence from the input also has potential for parallelizing the parse: a common source input is partitioned into independent sentences, each of which can be parsed on its own. I hope one day to explore this possible parallel parse.
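As a purely hypothetical illustration of that possibility (not part of the current implementation), independent partitions map naturally onto a parallel stream; the parse() below is only a stand-in that labels each sentence.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: since partitions are independent, each can be handled
// by its own worker. A real parser would build a syntax tree; this stand-in
// just labels the sentence it receives.
public class ParallelParseSketch {

    static String parse(List<String> sentence) {
        return "parsed: " + String.join(" ", sentence);
    }

    public static void main(String[] args) {
        List<List<String>> sentences = Arrays.asList(
            Arrays.asList("with", "mynx.core.*", ";"),
            Arrays.asList("program", "Nil", "is"),
            Arrays.asList("null", ";"),
            Arrays.asList("end", "program", ";"));

        // Independent sentences make a parallel map over the partitions natural.
        List<String> results = sentences.parallelStream()
                                        .map(ParallelParseSketch::parse)
                                        .collect(Collectors.toList());
        results.forEach(System.out::println);
    }
}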
Parsing - Analyzing and Constructing a Syntax Tree from a Sentence
The parser, as it operates on a discrete, finite sentence whose identity is known, focuses on verifying that the sentence is a correct syntactic construction in the language grammar. No lookahead, no backtracking, and no unknown input length.
A finite, discrete sentence also allows parsing bi-directionally, from left-to-right and right-to-left. Essentially, the parse proceeds from each end of the sentence, rather than strictly left-to-right with a leftmost or rightmost derivation.
It is also possible to determine quickly whether a particular lexeme is present. While a sentence is partitioned, its lexemes -- keywords, identifiers, literals, and operators -- are stored in a map or hashtable, allowing rapid determination of whether a particular syntactic feature is present in the sentence.
One useful property is that, when parsing, the identity of the sentence is already known, so there is no uncertainty about what is being parsed -- the final goal or result is known. Hence, between the start of the parse and the finish, what is expected in terms of the grammar/language is known, leading to more effective error correction and reporting.
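A minimal sketch of those properties -- a lexeme index built during partitioning, and matching a known sentence from both ends -- might look like this (the names are illustrative, not the actual Mynx classes).

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch: a partitioned sentence keeps an index of its lexemes
// for constant-time presence checks, and, being finite with a known identity,
// it can be verified from both ends at once.
public class SentenceSketch {

    final List<String> lexemes;   // the finite, already-extracted sentence
    final Set<String>  index;     // lexeme index built while partitioning

    SentenceSketch(List<String> lexemes) {
        this.lexemes = lexemes;
        this.index   = new HashSet<>(lexemes);
    }

    // Rapid check whether a syntactic feature (a particular lexeme) is present.
    boolean contains(String lexeme) {
        return index.contains(lexeme);
    }

    // Bi-directional check: match expected lexemes from the left end and from
    // the right end of the sentence.
    boolean matchesEnds(List<String> fromLeft, List<String> fromRight) {
        int n = lexemes.size();
        for (int i = 0; i < fromLeft.size(); i++) {
            if (i >= n || !lexemes.get(i).equals(fromLeft.get(i))) return false;
        }
        for (int i = 0; i < fromRight.size(); i++) {
            int j = n - 1 - i;
            int k = fromRight.size() - 1 - i;
            if (j < 0 || !lexemes.get(j).equals(fromRight.get(k))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        SentenceSketch header = new SentenceSketch(Arrays.asList("program", "Nil", "is"));
        System.out.println(header.contains("Nil"));                       // true
        System.out.println(header.matchesEnds(Arrays.asList("program"),   // from the left
                                              Arrays.asList("is")));      // from the right: true
    }
}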
In short, the sentential parse has the properties of:
- Parses bi-directionally, left-to-right and right-to-left, on a finite-length sentence - not a possibly unbounded input stream of lexemes.
- The parser can check in a finite amount of time whether a particular lexeme or syntactic feature is present -- no lookahead of undetermined length.
- Parses with the end goal known, giving better error correction/reporting than parsing toward an unknown goal.
Illustrative Features with Sentential Parse
One feature of the Mynx language is that the void return type for a class or program method can be implicit. This is not possible with the LL or LR parse algorithms.
The Mynx class Example.mynx demonstrates:
class Example is
    public doNothing is //implicit void return type
        null;
    end doNothing;
end class;
If this simple example is parsed by the JavaCC-generated (Mynx.jj) parser, which is LL(k), the parse will fail. Why? Because void is used in the lookahead to determine which production rule in the grammar to use for parsing. As Mynx uses the sentential parse, void is not in the prefix or suffix that identifies a sentence -- so it can be optional, as it is in Mynx. (The design rationale is that if you do not declare a return type, there is nothing to return from the method -- hence void, although it can be declared if desired.)
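A hypothetical sketch of why this is straightforward under the sentential parse (the 'returns' keyword and the header shape below are invented for illustration and are not the actual Mynx grammar): once the method-header sentence has been identified by its prefix and suffix, the return type is just an optional interior lexeme.

import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only -- not the actual Mynx grammar or code. Assume a
// method-header sentence shaped like:
//   <visibility> <name> [returns <type>] is
public class ImplicitVoidSketch {

    static String returnTypeOf(List<String> methodHeader) {
        int at = methodHeader.indexOf("returns");    // optional interior lexeme
        return (at < 0) ? "void"                     // nothing declared -- implicitly void
                        : methodHeader.get(at + 1);  // explicitly declared return type
    }

    public static void main(String[] args) {
        System.out.println(returnTypeOf(Arrays.asList("public", "doNothing", "is")));
        System.out.println(returnTypeOf(Arrays.asList("public", "answer", "returns", "integer", "is")));
    }
}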
Semantic Analysis with Sentential Parser and Partitioning
The scanner, partitioner, and parser, along with the lexeme, partition, and syntax classes, have been implemented -- so the sentential parse and partitioning functionality works. The semantic analysis, or semantic checks, has been more difficult -- the existing theory is for LL and LR parse algorithms.
The semantic checks are different because the parser is different: the checks start for each sentence after it is successfully parsed, so the semantic checks are oriented at the sentence level. The checks within each sentence are mutually exclusive, but more complex checks require information to be communicated and stored. The one is the semantic check, the other is the semantic context.
The original implementation used the visitor design pattern to centralize all checks in a single semantic class. The problem is that this approach does not fit discrete sentences: each sentence should perform its own checks and communicate information upward to the more complex semantic structures. Hence the software architecture is unworkable -- or workable, but only with workarounds. The visitor pattern centralizes functionality, but requires data to be accessed, queried, and communicated. It is useful for flexibility in adding functionality, but not for distributed, mutually exclusive functionality.
The semantic checks need to be oriented around a sentence, and then within higher organizing structures:
- sentence - basic atomic sentence
- statement - multi-sentence statement such as if, case, loop
- method - constructor, destructor, method
- unit - program, class
The structures are hierarchical, with sentences as the leaves of the tree and the top-level unit being a class or program.
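A minimal sketch of that organization (illustrative names, not the Mynx implementation): each level runs its own checks and passes its semantic context upward to the structure that contains it.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of sentence-oriented semantic checking: sentences are
// the leaves, and each composite level (statement, method, unit) merges the
// contexts of its parts before running the checks that need that context.
public class SemanticSketch {

    static class Context {                                  // semantic context passed upward
        final List<String> declaredNames = new ArrayList<>();
        void merge(Context child) { declaredNames.addAll(child.declaredNames); }
    }

    interface Checkable {                                   // every level checks itself...
        Context check(List<String> errors);                 // ...and reports its context
    }

    static class Sentence implements Checkable {            // leaf: one parsed sentence
        public Context check(List<String> errors) {
            Context ctx = new Context();
            // sentence-local checks go here (e.g. literal ranges, declared names)
            return ctx;
        }
    }

    static abstract class Composite implements Checkable {  // statement, method, unit
        final List<Checkable> parts = new ArrayList<>();
        public Context check(List<String> errors) {
            Context ctx = new Context();
            for (Checkable part : parts) {
                ctx.merge(part.check(errors));               // child context flows upward
            }
            // checks needing the combined context go here (e.g. duplicate names)
            return ctx;
        }
    }

    static class Statement extends Composite {}              // if, case, loop
    static class Method    extends Composite {}              // constructor, destructor, method
    static class Unit      extends Composite {}              // program, class

    public static void main(String[] args) {
        Unit program = new Unit();
        Method method = new Method();
        method.parts.add(new Sentence());
        program.parts.add(method);
        List<String> errors = new ArrayList<>();
        program.check(errors);                               // checks run at each level, bottom-up
        System.out.println("errors: " + errors);
    }
}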
The semantic analysis phase requires a fresh, sentence-oriented approach, which is not found in any existing compiler text. That is why the semantic checks for the Mynx language are more involved: they require new theory and new code -- pushing the envelope on two fronts.
The metaphor I use is that the sentential parsing algorithm is like an electric automobile and the LL/LR parsing algorithms are like an internal-combustion engine running on gasoline -- both do the same thing, but each has its own unique problems and complications. Regenerative braking and battery recharge time are very different from fuel economy and engine emissions.
In effect, Mynx is a new car using a new engine. The nuances of a new programming language are complicated and challenging enough (one feature can interact in many ways with another...), but then throw into the mix a new parsing algorithm that entails a new approach to semantic checks. Originally I anticipated that parsing would be the most difficult part, but now I realize the semantics are much more complex.
For the sentential parsing algorithm, I’d like one day to try to parse an existing language, to focus on the parse algorithm and on transforming a grammar into sentential form. I’ve worked out a simple grammar for a widely used example language -- Wirth’s PL/0 from “Algorithms + Data Structures = Programs” -- and have outlined a research monograph. But for now, I want to implement Mynx’s semantic checks, which require a fresh approach. Programming language design is a challenge in and of itself, but implementation is another, particularly when over-zealousness and curiosity lead to using a different approach -- my faux pas. Otto von Bismarck once warned about a war on two fronts for Germany, and I think it is applicable to any endeavour: do not try to push the bleeding edge from two directions -- it can be a bloody fiasco (pun intended).
Labels: LL(k), LR(k), mynx, parse, parser, partition, scanner, sentential parse

