Phases and Rules


Now that the EK9 grammar has stabilised, it's time to ensure that the EK9 code written by the developer actually makes sense. All of these phases use ANTLR Listeners/Visitors, but all delegate processing, rules and checks to functions.
These functions tend to be composed with even more functions. Each one is designed to be focused on one task (or a limited number of related tasks). This is in contrast to the first prototype, which used a more Object-Oriented solution that quickly became very complex. The function-based approach is much simpler and enables much more reuse through composition.
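As an illustrative sketch of this composition style (plain Java with invented rule names - not the actual EK9 compiler code), each rule is one small focused function and the listener simply applies them in turn:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

// Illustrative sketch only - the rules and names are invented.
// Each rule is one focused function; checks are composed, not inherited.
class RuleComposition {
  static final List<BiConsumer<String, List<String>>> RULES = List.of(
      (name, errors) -> {
        if (name.isEmpty())
          errors.add("symbol name must not be empty");
      },
      (name, errors) -> {
        if (!name.isEmpty() && Character.isUpperCase(name.charAt(0)))
          errors.add("symbol name should start lower case");
      });

  // A listener/visitor callback would delegate to something like this.
  static List<String> check(String symbolName) {
    var errors = new ArrayList<String>();
    RULES.forEach(rule -> rule.accept(symbolName, errors));
    return errors;
  }
}
```

Adding a new check is then just adding another small function to the composition, rather than extending a class hierarchy.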
Phases/Passes
With the EK9 grammar and language structure it is not possible to validate it all in a single 'pass'. This is because EK9 supports:
Forward use (i.e. reference before declaration)
Type inference
It does not include the concept of header files
Some rules can be checked very early and so:
Can emit errors very quickly, speeding up the development cycles
Other rules need more details and so checks can only occur in later phases
This means that EK9 source code can reference other functions, classes, etc. from other EK9 source files or even the same source file when they are declared after their use. Moreover, EK9 has a form of type inference within code blocks meaning that the compiler can work out the type of variable being used in many cases.
To accomplish this, the EK9 compiler visits the AST a number of times (in phases). Importantly some of these phases can process code structures concurrently on a per file basis. The EK9 compiler has been designed to use multi-core CPU's from the outset.
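The per-file concurrency can be pictured with a minimal Java sketch (the per-file 'work' here is just a stand-in for real parsing/visiting):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: one phase processing source files concurrently.
// The file names and the per-file work are illustrative stand-ins.
class ConcurrentPhase {
  static Map<String, Integer> runPhase(List<String> sourceFiles) {
    Map<String, Integer> results = new ConcurrentHashMap<>();
    // parallelStream spreads the independent per-file work across CPU cores
    sourceFiles.parallelStream()
        .forEach(file -> results.put(file, file.length()));
    return results;
  }
}
```

The key point is that each file's work is independent, so the phase scales with core count; only the shared result structures need to be thread-safe.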
Phase Zero
Accepts source files that are to be compiled and parses them concurrently. The parsing of source files is quite an expensive operation. This phase leverages Java ANTLR to parse and validate the overall syntactic structure of the EK9 source code. The errors at this level are ANTLR generated. This is only focussed on syntax and not semantics (i.e. the meaning of the code).
The main issues in parsing are:
Reading data from slow storage (yes, SSDs are slow)
Breaking that source data text into Lexemes
Building an Abstract Syntax Tree (AST) via the grammar
Given that most applications consist of hundreds if not thousands of source files (and that EK9 always works from source), it is essential that the above process is carried out quickly.
With the widespread availability of multi-core CPUs, increasing core counts and memory - concurrent file processing is viable.
However, this also means that the EK9 compiler will use a significant amount of memory. It is designed from the outset to use memory only; no file paging is employed at all. Big programs will require big memory and lots of CPUs. But really, modern development should at least try not to build huge monolithic applications where possible. They should be designed to be modular.
It is during this phase that an appropriate error listener is configured and associated with each source file. Subsequent phases will emit errors in compilation using this error listener object.
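The shape of that per-file error listener might be something like this (a hedged sketch; the names and API are illustrative, not the compiler's actual classes):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: each source file gets its own error listener; all later
// phases report compilation errors through it.
class SourceErrorListener {
  private final String sourceFile;
  private final List<String> errors = new ArrayList<>();

  SourceErrorListener(String sourceFile) {
    this.sourceFile = sourceFile;
  }

  // Called by the syntax (ANTLR) phase and later semantic phases alike.
  void semanticError(int line, String message) {
    errors.add(sourceFile + ":" + line + ": " + message);
  }

  List<String> getErrors() {
    return errors;
  }
}
```

Because the listener is bound to one file, errors from concurrently processed files never interleave.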
See code documentation for details of phase zero.
Phases 1-5 include layered and progressive semantic checks over and above enriching the internal Symbol definition and processing.
Phase One
Once each EK9 source file has been loaded and parsed it can be 'visited'. This means the AST can be traversed and Symbols defined by the EK9 developer can be identified and recorded. This is all done from data now held in memory.
This is in effect the first real 'pass' through the EK9 code; it is run in a concurrent manner. However, this is where things start to get complex.
Where multiple EK9 source files form part of the same module the Symbols must all be recorded in that same module. Moreover they must not clash with each other.
To accomplish this, the EK9 compiler protects modules with concurrency locks. This is designed to prevent multiple processing threads altering the internal state at the same time.
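A minimal sketch of that protection (here `synchronized` stands in for the module-level concurrency lock; the names are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: a module shared by many source files being processed on
// different threads; mutation of its symbol table is guarded.
class ModuleScope {
  private final Set<String> symbols = new HashSet<>();

  // Multiple threads (one per source file) may call this concurrently.
  synchronized boolean define(String symbolName) {
    // returns false when the symbol would clash with an existing one
    return symbols.add(symbolName);
  }
}
```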
There are actually multiple forms of Symbol recording:
Against an AST tree node
Within a module
Within some form of aggregate/function in a module
It is also during phase one that types are identified (where possible) on variables and properties. Typically this is only possible where literals are used; in general this incremental approach is the one the EK9 compiler takes. It builds up a more detailed picture of the types as each of the phases is processed.
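This earliest inference works just from the form of the literal. A simplified sketch (the patterns and type names are stand-ins, not the compiler's real rules):

```java
// Sketch: infer a type purely from a literal's textual form.
// Patterns and type names are simplified stand-ins.
class LiteralInference {
  static String inferType(String literal) {
    if (literal.matches("-?\\d+"))
      return "Integer";
    if (literal.matches("-?\\d+\\.\\d+"))
      return "Float";
    if (literal.startsWith("\"") && literal.endsWith("\""))
      return "String";
    return "unknown"; // later phases may still work the type out
  }
}
```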
See phase one details for some of the many functions that get called during this phase. These are all just basic checks over and above the ANTLR grammar.
In general the ANTLR grammar has been kept more flexible and simpler; this has enabled more rules and better error messages to be built into the EK9 compiler itself.
Phase Two
During Phase One all of the main aggregates and functions will have been defined in a very basic skeleton form; this phase starts to add more detail around those constructs, mainly focusing on type information.
The main purpose of this phase is to identify and resolve:
Explicit type use
Explicit generic type use
Simple ‘constructor based’ inferred type use, including construction of parameterised generic types
The processing in this phase also does a number of additional basic checks (semantic checks) now that Symbols and their relationships have at least been outlined.
This will include ensuring that types are open for extension and are of the same 'Genus'. This means it is not logical or allowed for a 'Class' to extend a 'Record' for example.
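As a trivial sketch of such a 'Genus' rule (the enum values and the rule itself are illustrative):

```java
// Sketch: a genus compatibility check, e.g. a CLASS must not
// extend a RECORD. Enum values are illustrative.
class GenusCheck {
  enum Genus { CLASS, RECORD, TRAIT, COMPONENT }

  static boolean canExtend(Genus subType, Genus superType) {
    return subType == superType;
  }
}
```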
It is also during this phase that functions are examined to see if they fall into one of the following categories:
Consumer
BiConsumer
Acceptor
BiAcceptor
Supplier
Provider
Function
BiFunction
UnaryOperator
Predicate
BiPredicate
Assessor
BiAssessor
As EK9 treats functions in a polymorphic manner it automatically makes those functions 'super types' of the above common patterns - where their arguments, return values and 'pure' nature match. The generic functions above are really just common patterns of function signatures.
If a function does ‘match’ the signature of one of the above generic functions, then its ‘super function’ (a bit like a ‘super class’ for aggregates) is set. It should also be noted that each of the generic functions listed above has a ‘super function’ of Any.
While it may seem strange to give ‘functions’ a sort of hierarchy - it enables ‘functions’ as well as ‘aggregates’ to be treated in a polymorphic manner (sub-typing).
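The matching idea can be sketched like this (a simplified Java stand-in covering only a few of the patterns above, with purity matching omitted; the model is invented, not the compiler's):

```java
// Sketch: classify a function signature against a few common
// generic patterns. The rules here are simplified stand-ins.
class FunctionClassifier {
  record Signature(int argumentCount, boolean returnsValue) {}

  static String superFunctionOf(Signature s) {
    if (s.argumentCount() == 0 && s.returnsValue()) return "Supplier";
    if (s.argumentCount() == 1 && !s.returnsValue()) return "Consumer";
    if (s.argumentCount() == 2 && !s.returnsValue()) return "BiConsumer";
    if (s.argumentCount() == 1 && s.returnsValue()) return "Function";
    if (s.argumentCount() == 2 && s.returnsValue()) return "BiFunction";
    return "Any"; // every generic function itself has a super of Any
  }
}
```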
See phase two details for all the range of operations that are carried out in this phase.
Phase Three
This is a key phase as it checks for resolution of all symbols. It also deals with processing and deducing inferred types as part of code block expressions.
There are many rules, checks (more semantic assertions) and processes in this phase, indeed when this phase is triggered and the AST is 'visited' each tree node is processed on both entry and exit.
The Listener is the hook in from the ANTLR infrastructure, but really does very little other than calling one or more of the functions that perform the key processing of the specific EK9 language construct.
This approach has been taken to ensure there is a clear separation of concerns and a single responsibility for each of the aspects of processing.
Any common or support functions are pulled out and made reusable in different phases.
During phase three many functions are called; most either populate/augment the Symbols identified or they process rules and emit compilation errors.
Phase Four
This phase checks each place where a real type (or indeed a conceptual type, in the case of generics using generics) has been employed to parameterise a generic type. When parameterising a generic type with one or more types, those types must support all the operators needed (i.e. used within the generic type). This matters because generic/template types are defined with the operators they use simply ‘assumed’.
This approach enables generic types to be created with the assumption that, when parameterised, those types will have the right operators. Clearly the EK9 developer may attempt to parameterise the generic type with a type that does not have an essential operator - in this case a compiler error will be emitted. The EK9 developer can then add the operator to their type.
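The essence of the check is a set difference (a sketch; the operator names shown are just examples of EK9 operators):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: every operator 'assumed' in a generic type's body must
// exist on each type used to parameterise it.
class OperatorCheck {
  static Set<String> missingOperators(Set<String> assumedByGeneric,
                                      Set<String> providedByType) {
    var missing = new HashSet<>(assumedByGeneric);
    missing.removeAll(providedByType);
    return missing; // non-empty means a compiler error should be emitted
  }
}
```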
Phase Five
Now all symbols have been identified and all references resolved. This phase is the PRE_IR_CHECKS phase. It is designed to be the last set of semantic checks that can be made with just the ANTLR AST and the symbols identified.
This is done because the generation of the ‘Intermediate Representation’ is quite costly (in time and memory), so any obvious issues that can be identified now can cause the compilation to fail as early as possible.
The typical checks in this phase are:
Variables being used before initialised
Return values not always being initialised
Safe access on Optional/Result methods such as get(), ok() and error()
Guard expressions and uninitialised return values (if/for/try/while/switch, etc)
These checks make the EK9 language quite opinionated in terms of what ‘good code’ and ‘bad code’ is. From experience I’ve found that many of the longer term bugs and defects have been caused by issues relating to these checks. Hence, I’ve added these checks in to stop me writing code that could cause errors.
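A flavour of the 'used before initialised' check, over a deliberately flattened statement model (a sketch; real compilers, EK9 included, do this as flow analysis over the AST):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: detect use of a variable before any assignment to it.
// The statement model is an illustrative stand-in for real flow analysis.
class PreIrCheck {
  record Statement(String assigns, String uses) {} // either part may be null

  static List<String> check(List<Statement> statements) {
    Set<String> initialised = new HashSet<>();
    var errors = new ArrayList<String>();
    for (var s : statements) {
      if (s.uses() != null && !initialised.contains(s.uses()))
        errors.add("'" + s.uses() + "' used before being initialised");
      if (s.assigns() != null)
        initialised.add(s.assigns());
    }
    return errors;
  }
}
```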
Phase Six
Resolving external libraries and built-in code for EK9 code that is marked as extern.
Resolution/linking of the built-in types that come as part of EK9 or in the future when there are other platform specific libraries/modules.
The EK9 compiler comes with lots of predefined and built-in types and functions. But in reality they are only defined in terms of an external interface - in other words they have no concrete implementation.
While this may seem strange, EK9 is designed to be able to have multiple ‘back-ends’ to produce different types of executable code (see later phases). For example, the EK9 compiler (while written initially in Java) could produce outputs of:
Java byte code for the EK9 applications developed
LLVM code and then several final binary outputs
Direct platform specific binary outputs (or even cross platform support)
Phase Seven
Focusses on the creation of an ‘Intermediate Representation’; this layer is the full abstraction away from the EK9 language and is much more general in nature. It does away with specific checks (as earlier compiler phases have ensured that the structures and semantics are coherent).
The ‘Intermediate Representation’ design is really important, it has to remove most (if not all) of the EK9 language specifics and move towards something much more general. It must also avoid becoming too ‘target architecture’ focussed as well. It must however reify specific information from the EK9 program (such as type information). This will be quite a balancing act to get right.
There is little difference at the IR level between a Component, Class, Trait, Text or Record - they are all just really an aggregate. Strangely you can also consider a Function or a Dynamic Function to be just an aggregate with one method. This works very well for the dynamic functions as they can actually capture data as properties (much like an aggregate with no accessor methods for those properties).
Even operators now just become methods on those aggregates. This whole approach enables the IR phase to just create various ‘flavours’ of aggregate - it will annotate/reify them with sufficient detail that they can be identified during code generation and runtime in a very specific way.
But this approach is an essential one to enable the ‘dispatcher’ and ‘function delegate’ approach to work as now everything is just an aggregate (i.e. it is an EK9 ‘Any’). So even instances of functions become ‘objects’ and can be passed around. But the dispatcher code will need to be able to access the ‘Any’ and find out its real type so that appropriate dispatcher methods can be called.
For example, within a module there will be constants; these will just be ‘instances’ of an ‘aggregate’ of a specific type (i.e. Float, Date, etc.). But so will named functions; they too will just be ‘instances’ of a ‘function’. A named function can therefore be considered just a constant, but of a specific ‘function’ type.
Phase Eight
All of the EK9 developer created generic/template types and their concrete (parameterised) forms can now be included in the ‘Intermediate Representation’. This may take several forms (unsure which approach to take at present). But it is quite possible that some form of simple aggregate (without the implementation) is created and a real concrete implementation is delegated to (but in a very general, non type-safe way), with casting to and from real concrete types.
The alternative would be to create real implementations from the generic type/template in a ‘cookie cutter’ fashion. This is likely to create a significant amount more code. This approach probably won’t be taken.
Phase Nine
This phase is designed to be a place-holder for Intermediate Representation Analysis and Optimisation. Initially this will not be implemented. It can be quite complex and time consuming to implement and really I’d like to move ahead with code generation (so I can finally see something coded in EK9 actually run).
Phase Ten
Code (Byte code, LLVM, binary) will be generated using the ‘Intermediate Representation’ and associated Symbol data.
At this point - my view on how this will be efficiently implemented is quite sketchy (TBH); how much caching can be done to avoid regeneration of code sections that are unchanged is not clear.
Phase Eleven
This is just a placeholder for optimisation of generated code, for some architectures this will be essential, but for others (like JVM based architectures) most of the optimisation will be done at runtime.
Phase Twelve
Finally, the last phase, this is just packaging. This could be into an executable either for the platform this compilation is running on or maybe for another platform if ‘cross-compilation’ is required. For Java byte code generation this phase would probably produce a ‘jar’ file.
Summary
While the first few phases above are quite detailed, you can see that the latter few are much more general. This is because most of the first phases (up to phase 5) have been implemented (I’m sure there will be bits missing that come to light as the later phases are developed).
The descriptions of phases 6-8 are quite detailed (as it has now become more obvious to me what will be needed). I’m still unsure how to implement the resolution/linking of types that have been defined via ‘extern’ interfaces. Clearly this does depend on the ‘target architecture’. So for Java, for example, I may just load the ‘EK9-Lang.jar’ (that I’ll need to code up). Then, using introspection, check that a type such as org.ek9.lang.String as defined in the compiler does have all the correct methods and signatures as implemented in the Java target architecture and packaged in ‘EK9-Lang.jar’.
This would then enable the compiler to generate the correct ASM (Java byte code) to make a call from developer EK9 code using org.ek9.lang.String to the Java implementation.
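The introspection part is standard Java reflection. A sketch (here `java.lang.String` merely stands in for a class such as org.ek9.lang.String loaded from the hypothetical EK9-Lang.jar):

```java
import java.lang.reflect.Method;
import java.util.Arrays;

// Sketch: confirm via introspection that an implementation class really
// provides a method the 'extern' interface promised.
class ExternCheck {
  static boolean hasMethod(Class<?> implementation, String name,
                           Class<?>... parameterTypes) {
    return Arrays.stream(implementation.getMethods())
        .anyMatch((Method m) -> m.getName().equals(name)
            && Arrays.equals(m.getParameterTypes(), parameterTypes));
  }
}
```

Any mismatch found this way could be reported at compile/link time, rather than surfacing as a runtime failure.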
For LLVM solutions, there would need to be some similar mechanism. Clearly this has been done for linking Python code to binary shared libraries. One mechanism to enable early resolution of symbols (before runtime), is to ensure that a binary shared library has a ‘known’ entry point.
Calling this ‘known’ entry point in the shared library would enable the library to respond with some form of data structure that states what functions, constants, types, etc. exist within it. So the ‘EK9-Lang.so’, let's say compiled for Linux, might respond with just a plain (but long) String that has basically the same types and structures as the built-in EK9 language interface as defined by the compiler. Indeed, maybe the built-in types currently hard-coded in the compiler would be removed and just the ‘EK9-Lang.jar’ used via the well-known call. It would provide the ‘interface’.
In the same way that the compiler can check and resolve methods against the built-in ‘extern’ interface definition, it can resolve the same types and calls at resolution/linking time against the interface supplied by the ‘EK9-Lang.so’. This does of course depend on the ‘EK9-Lang.jar’ or ‘EK9-Lang.so’ actually correctly reporting what it contains (if this is incorrect then errors will occur either at linking time for binaries, or at runtime for Java ‘jar’ combinations).
This latter approach does seem more scalable and would require less code in the compiler:
Locate library depending on target architecture
Make a call (using that target architecture) from the compiler to the ‘known’ entry point.
Use the ‘String’ response (containing the EK9 extern interface definition) in the compiler and parse it.
In effect, this extern interface definition is what the EK9 developer's code must use
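The steps above can be sketched end to end (everything here is invented for illustration: the entry point, the text format and the parsing are all assumptions, not a defined EK9 mechanism):

```java
import java.util.List;

// Sketch of the 'known entry point' idea: the library reports its own
// extern interface as plain text and the compiler parses it.
class ExternLinking {
  // Stand-in for the exported entry point in an EK9-Lang.so / EK9-Lang.jar
  static String knownEntryPoint() {
    return """
        type:String
        type:Integer
        function:assert""";
  }

  // Compiler side: turn the text response into construct entries.
  static List<String> constructsOf(String interfaceDefinition) {
    return interfaceDefinition.lines().map(String::strip).toList();
  }
}
```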
Assuming that the developer of the EK9-*.jar or EK9-*.so (or whatever was provided) did a good job, and all the constructs outlined in the ‘EK9 extern interface definition’ were correctly implemented, then at linking/runtime the calls would be resolved and work.
The above approach would enable 3rd parties, or EK9 developers that wanted to wrap existing binary or other code in this sort of interface, to make it available to be used in EK9 code.