Lexer - The first step to finding meaning in a raw file

Sawez Faisal

So, finally it's time to do some coding.

A lexer is nothing but a piece of software whose job is to take in an input, in our case the source code of the language, and return a series of predefined tokens corresponding to the characters in the source code.

These tokens are the smallest building blocks on which we will perform manipulation, so that the machine can make some sense out of the jargon of the raw English-like file and eventually convert it into machine code.

It is really important that a lexer is able to handle any sort of raw character it encounters and does not crash. If a character is illegal, we can handle it in later stages, but if this stage crashes, none of the later stages such as parsing and analysis can happen.
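One way to keep lexing past an illegal character is to record an error message and move on instead of throwing. Below is a hedged, self-contained sketch of that idea; the post's lexer uses a `displayError()` method for the same purpose, and the set of "illegal" characters here is purely illustrative.

```cpp
#include <string>
#include <vector>

// Collect errors for illegal characters instead of crashing, so later
// stages can still run. '@' and '$' are stand-in illegal characters.
std::vector<std::string> lexErrors(const std::string &source) {
  std::vector<std::string> errors;
  int line = 1;
  for (char ch : source) {
    if (ch == '\n') { line++; continue; }
    if (ch == '@' || ch == '$')
      errors.push_back("line " + std::to_string(line) +
                       ": unexpected character '" + std::string(1, ch) + "'");
  }
  return errors;
}
```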

I could have used a tool like lex or flex that does the job using regular expression matching, but I wanted to implement the matching from scratch; besides, it is not a difficult piece of code to write (my entire lexer is less than 300 lines of code).

Lexer Class

To implement the lexer we will create a Lexer class that takes in a string (our source file) and outputs a vector of tokens corresponding to the file.

We will categorise the raw file as a series of numbers, strings, keywords, identifiers, operators, etc.
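Before the class itself, here is a minimal sketch of what the `Token` type and its `TOKEN::TYPE` enum could look like. The names match how they are used later in the post, but the exact members and enum values are assumptions.

```cpp
#include <string>

// Illustrative token categories; the real enum would list every
// operator and keyword the language supports.
namespace TOKEN {
enum TYPE {
  Lpar, Rpar, Lbraces, Rbraces, Semicolon,     // single-character tokens
  Plus, PlusEqual, Minus, MinusEqual,          // one- or two-char operators
  Number, String, Identifier,                  // literals and names
  For, If, ElseIf,                             // keywords
  Eof                                          // end of file
};
}

struct Token {
  TOKEN::TYPE type;   // what kind of token this is
  std::string lexeme; // the raw text it was built from
  int line;           // line number, for error messages

  Token(TOKEN::TYPE type, std::string lexeme, int line)
      : type(type), lexeme(std::move(lexeme)), line(line) {}
};
```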

Below is a simple lexer class implementation; we will discuss it in detail.

#include <string>
#include <unordered_map>
#include <vector>

class Lexer{

    std::string source;

    // vector is a built-in data structure provided in C++ that
    // dynamically inserts elements into itself and changes its size
    // accordingly. It's like an array, but on steroids.
    std::vector<Token> tokens;

    public:

    Lexer(const std::string& source);
    std::unordered_map<std::string, TOKEN::TYPE> keywordsMap;

    std::vector<Token> lex(const std::string& source);

    void display();

    private:

    int start = 0;
    int current = 0;
    int line = 1;

    // the main scanning functions
    void scanTokens();
    void scanNumber();
    void scanString();
    void scanIdentifier();
    // helper methods
    bool endReached();
    char peek();
    char advance();
    bool isNumber(char c);
    bool isAlphabet(char c);
    bool isAlphanumeric(char c);
    std::string getString(std::string& source, int start, int current);
    void skipWhiteSpaces();
    bool match(char c);
    void displayError(std::string& buffer);
};

The class is designed such that it has one public entry function, called lex(), which acts as the entry point for all the other private functions. This ensures that only the necessary part of the class is accessible to the outside world (this property is called encapsulation).

Let us discuss the helper methods present in the class.

The main helper methods are :

advance() → moves forward if possible and returns the current character.

peek() → checks which character is at the next position, without consuming it.

skipWhiteSpaces() → whitespace provides us no meaning but is essential for the programmer's readability, so we skip over it.

match() → particularly useful for operators like += or -=, basically anywhere the meaning of the operator could completely change based on the next character.

displayError() → displays an error if we encounter some unwanted character.
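The helpers above can be sketched as follows. To keep the snippet self-contained they are written as free functions over an explicit state struct; in the real class they are methods using the `source`, `current` and `line` members directly.

```cpp
#include <cctype>
#include <string>

struct Cursor {
  std::string source;
  int current = 0;
  int line = 1;
};

bool endReached(const Cursor &c) {
  return c.current >= (int)c.source.size();
}

// Look at the next character without consuming it.
char peek(const Cursor &c) {
  return endReached(c) ? '\0' : c.source[c.current];
}

// Consume and return the current character.
char advance(Cursor &c) {
  return c.source[c.current++];
}

// Consume the next character only if it equals `expected`;
// used for two-character operators like += and -=.
bool match(Cursor &c, char expected) {
  if (endReached(c) || c.source[c.current] != expected) return false;
  c.current++;
  return true;
}

// Skip spaces, tabs and newlines, counting lines as we go.
void skipWhiteSpaces(Cursor &c) {
  while (!endReached(c) && std::isspace((unsigned char)peek(c))) {
    if (peek(c) == '\n') c.line++;
    advance(c);
  }
}
```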

The scanTokens() function, as the name suggests, scans tokens; scanNumber() and scanString() handle numbers and strings, and scanIdentifier() scans identifiers. At this point we want to know which identifiers are keywords: words like main, while and if are reserved by the language and can't be used by the programmer as names. For this we use a simple HashMap of (words) → (token types).

HashMaps

Hashmaps are data structures that allow us to map a key to a value for quick lookup from the hash table.

We will use C++'s built-in map, specifically std::unordered_map, as it is quite efficient with an average lookup time complexity of O(1).

 std::unordered_map<std::string, TOKEN::TYPE> keywordsMap;
 keywordsMap["for"] = TOKEN::For;
 keywordsMap["if"] = TOKEN::If;
 keywordsMap["elseif"] = TOKEN::ElseIf;
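With the map in place, resolving an identifier is one lookup: scan the whole alphanumeric run, then check whether the word is a reserved keyword. The sketch below shows the idea as a free function with an illustrative enum; the real scanIdentifier() method appends a Token to the vector instead of returning a kind.

```cpp
#include <cctype>
#include <string>
#include <unordered_map>

// Illustrative stand-in for TOKEN::TYPE.
enum class Kind { Identifier, For, If, ElseIf };

// Scan an identifier starting at `current` and classify it:
// anything not found in the keyword map is a plain identifier.
Kind classify(const std::string &source, int &current,
              const std::unordered_map<std::string, Kind> &keywordsMap) {
  int start = current;
  while (current < (int)source.size() &&
         (std::isalnum((unsigned char)source[current]) ||
          source[current] == '_'))
    current++;
  std::string word = source.substr(start, current - start);
  auto it = keywordsMap.find(word);
  return it == keywordsMap.end() ? Kind::Identifier : it->second;
}
```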

We also keep two pointers (index pointers, not memory pointers), start and current. The start pointer sits on the first character of the current lexeme while current moves forward and matches based on the type. We also keep a variable to store the line number; this is particularly useful when printing error messages to the user.

Let's go through the methods in brief:

std::vector<Token> Lexer::lex(const std::string &source) {
  // keep scanning and assigning tokens until the end of file is reached
  while (!endReached()) {
    start = current;
    scanTokens();
  }
  // the end of file token
  tokens.push_back(Token(TOKEN::Eof, "", line));
  return tokens;
}

Here we have our main lex function, which returns a vector of tokens. It repeatedly calls the scanTokens() function until the whole file is consumed.

void Lexer::scanTokens() {
  // skip whitespaces
  // advance in the file
  // do the matching
  // store tokens in the vector
}

The first set of tokens that I would like to handle is single-character tokens like (, {, +, -, *, etc.

Then we will implement two-character matching, like += or -= or // (comment).

At this stage we can implement scanNumber(), scanString() and scanIdentifier() as well.
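scanNumber() and scanString() can be sketched like this, again as free functions over the raw string so the snippet stands alone; the real methods append a Token instead of returning the lexeme. This sketch assumes numbers are plain digit runs and strings are double-quoted with no escape sequences.

```cpp
#include <cctype>
#include <string>

// Consume a run of digits and return it as the lexeme.
std::string scanNumber(const std::string &source, int &current) {
  int start = current;
  while (current < (int)source.size() &&
         std::isdigit((unsigned char)source[current]))
    current++;
  return source.substr(start, current - start);
}

// Consume a double-quoted string; `current` sits on the opening quote.
std::string scanString(const std::string &source, int &current) {
  int start = current;
  current++; // skip the opening quote
  while (current < (int)source.size() && source[current] != '"')
    current++;
  current++; // skip the closing quote
  // return the lexeme without the surrounding quotes
  return source.substr(start + 1, current - start - 2);
}
```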

void Lexer::scanTokens() {
  skipWhiteSpaces();
  char ch = advance();
  // match for numbers
  if (isNumber(ch)) {
    scanNumber();
    return;
  }
  // identifiers start with a letter or an underscore
  if (isAlphabet(ch) || ch == '_') {
    scanIdentifier();
    return;
  }
  // match all the single tokens
  switch (ch) {
  case '(':
    tokens.push_back(Token(TOKEN::Lpar, "(", line));
    break;
  case ')':
    tokens.push_back(Token(TOKEN::Rpar, ")", line));
    break;
  case '{':
    tokens.push_back(Token(TOKEN::Lbraces, "{", line));
    break;
  case '}':
    tokens.push_back(Token(TOKEN::Rbraces, "}", line));
    break;
  case ';':
    tokens.push_back(Token(TOKEN::Semicolon, ";", line));
    break;
  case '+': {
    if (match('=')) {
      tokens.push_back(Token(TOKEN::PlusEqual, "+=", line));
    } else {
      tokens.push_back(Token(TOKEN::Plus, "+", line));
    }
    break;
  }
  case '-': {
    if (match('=')) {
      tokens.push_back(Token(TOKEN::MinusEqual, "-=", line));
    } else {
      tokens.push_back(Token(TOKEN::Minus, "-", line));
    }
    break;
  }
  // ........

For operators like / (divide) we do not immediately emit a divide token on encountering a /; we look at the next character as well. If the next character is another /, it becomes a comment, and if the next character is an = sign, then the token becomes /= (divide-equal).
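The '/' case can be sketched like so, with `source`/`current` standing in for the class members and an illustrative enum for the three outcomes:

```cpp
#include <string>

enum class SlashKind { Divide, DivideEqual, Comment };

// Classify a '/' by peeking at the next character: '//' starts a
// comment (skipped, no token), '/=' is divide-equal, else plain divide.
SlashKind classifySlash(const std::string &source, int &current) {
  current++; // consume the '/'
  if (current < (int)source.size() && source[current] == '/') {
    // a comment: skip to the end of the line, emit no token
    while (current < (int)source.size() && source[current] != '\n')
      current++;
    return SlashKind::Comment;
  }
  if (current < (int)source.size() && source[current] == '=') {
    current++;
    return SlashKind::DivideEqual;
  }
  return SlashKind::Divide;
}
```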

That's pretty much it. Using these pieces you can implement a working lexer.

Once this piece of code has been written error-free, we should have a vector (list) of tokens.

These tokens by themselves do not provide any meaning to the machine. That's what our next step will be: to learn about and create a parser, which structures this code as a tree of tokens that can be used to make some sense out of the code and also perform error handling.

Thanks for reading, and feel free to ask questions or give feedback.


Written by

Sawez Faisal

New to the field and eager to learn how complex systems work smoothly. From building compilers to scalable systems, I’m solving problems as they come—whatever the domain—and sharing all the highs, lows, and lessons along the way!