You Don't Know Json-Parser

MakxMakx
6 min read

Introduction

Recently I have been taking up some projects and implementing them in Golang. So, this week I took up the challenge of learning the inner workings of a JSON parser and enforcing it in the Golang programming language. In this blog, I will first talk about the theory of a JSON parser and then about implementation details.

Working off a JSON Parser

A JSON parser mainly works in two steps

  • Lexical Analysis: In this step, the input is converted into an array of tokens.

  • Syntactic Analysis: In this step, the tokens are parsed to check whether they align with the rules of the specified language (JSON).

Lexical Analysis

Suppose we have a JSON content like this


  "key": "value",
  "key2": true
}

To effectively divide the JSON content into a list of tokens, here is the algorithm used

// does lexical analysis over the input json to create lex tokens
func Lexer(content []byte) []string {
    var isInsideString bool = false
    var prevByte byte
    var currBytes []byte
    var lexResult []string
    for _, byt := range content {
        if byt != DoubleQuoteByte || prevByte == BackSlashByte {
            // all other cases
            if isInsideString {
                currBytes = append(currBytes, byt)
            } else {
                var present bool = false
                for _, cs := range CheckSlice {
                    if cs == byt {
                        present = true
                        break
                    }
                }
                if present {
                    // if current char is any of { [ : , ] }
                    if len(currBytes) > 0 {
                        // for cases like `true,` -> to add the `true` token since for (true | null | false) -> there is no end of token character
                        lexResult = append(lexResult, string(currBytes))
                        currBytes = nil
                    }
                    // add the current slice
                    lexResult = append(lexResult, string(byt))
                } else {
                    // for cases like (true | false | null)
                    currBytes = append(currBytes, byt)
                }
            }
        } else {
            // double quote chars enclosing the string
            currBytes = append(currBytes, byt)
            if isInsideString {
        // if the double quote is at the end of string
                lexResult = append(lexResult, string(currBytes))
                currBytes = nil
            }
      // toggle end of string flag
            isInsideString = !isInsideString
        }
        prevByte = byt
    }
    return lexResult
}

The above algorithm outputs the following tokens (the individual tokens are enclosed within single quotes. Now as you can see there is still some cleaning that needs to be done on the generated tokens.

'{'
'
  "key"'
':'
' "value"'
','
'
  "key2"'
':'
' null
'
'}'

During the cleaning step, the algorithm just trims the extra spaces around the individual tokens. The outputs look like

'{'
'"key"'
':'
'"value"'
','
'"key2"'
':'
'true'
'}'

Since the tokens are generated, we are done with the lexical analysis part and ready to dive into the syntactic analysis part to see whether the token list can make up a valid JSON object or not.

Syntactic Analysis

Since the tokens can be a string, null, true, false, colon( : ), left curly brace( { ), right curly brace( } ), left square brace( [ ), right square brace( ] ), or a comma. So, we will apply the below algorithm to parse the token list.

token = tokenList[0]
tokenList = tokenList[1:]
if token == '{'
    then apply IsValidObject(tokenList)
else if token == '['
    then apply IsValidArray(tokenList)
else 
    apply IsValidString(tokenList) or IsValidNumber(tokenList) or IsValidBoolean(tokenList) or IsValidNull(tokenList)

Here is the Parser method in Golang

func Parser(lexToken []string) ([]string, bool) {
  var result bool = false
  if len(lexToken) < 1 {
    return lexToken, result
  }
  token := lexToken[0]
  if token == LeftCurlyBrace {
    lexToken = lexToken[1:]
    lexToken, result = IsValidObject(lexToken)
  } else if token == LeftSquareBrace {
    lexToken = lexToken[1:]
    lexToken, result = IsValidArray(lexToken)
  } else {
    result = IsValidString(token) || IsValidNumber(token) || IsValidBoolean(token) || IsValidNull(token)
    lexToken = lexToken[1:]
  }
    return lexToken, result
}

The methods mentioned above are described below one by one.

IsValidObject

As you can see a valid object is a series of <key, colon, value, comma> pairs. So, if we loop through the tokenList we can tell whether it is valid or not. A key point to note here is that the last pair does not have a comma. So here is the implementation for IsValidObject in Golang.

// function to validate weather `lexToken` contains a valid object
func IsValidObject(lexToken []string) ([]string, bool) {
    if len(lexToken) == 0 {
        return lexToken, false
    }
    if lexToken[0] == RightCurlyBrace {
    return lexToken[1:], true
    }
    for true {
        // for json key
        if len(lexToken) == 0 {
            return lexToken, false
        }
        json_key := lexToken[0]
        lexToken = lexToken[1:]
        if !IsValidString(json_key) {
            return lexToken, false
        }

        // for colon
        if len(lexToken) == 0 {
            return lexToken, false
        }
        json_colon := lexToken[0]
        lexToken = lexToken[1:]
        if json_colon != Colon {
            return lexToken, false
        }

        // for json value
        if len(lexToken) == 0 {
            return lexToken, false
        }
        json_value := lexToken[0]
        lexToken = lexToken[1:]
        if json_value == LeftCurlyBrace {
            var tempResult bool
            lexToken, tempResult = IsValidObject(lexToken)
            if !tempResult {
                return lexToken, false
            }
        } else if json_value == LeftSquareBrace {
            var tempResult bool
            lexToken, tempResult = IsValidArray(lexToken)
            if !tempResult {
                return lexToken, false
            }
        } else if !IsValidString(json_value) && !IsValidNumber(json_value) && !IsValidBoolean(json_value) && !IsValidNull(json_value) {
            return lexToken, false
        }

        if len(lexToken) == 0 {
            return lexToken, false
        }
        json_comma_or_curly_brace := lexToken[0]
        lexToken = lexToken[1:]
        if json_comma_or_curly_brace == RightCurlyBrace {
            break
        }
        if json_comma_or_curly_brace != Comma {
            return lexToken, false
        }
    }
    return lexToken, true
}

IsValidArray

An array in a JSON file is a series of objects or numbers or boolean values(true and false) or null. Hence implementing this function is quite straightforward. In other words, the array should contain a list of JSON values. Here is the implementation in Golang.

func IsValidArray(lexToken []string) ([]string, bool) {
  if len(lexToken) == 0 {
    return lexToken, false
  }
  token := lexToken[0]
  if token == RightSquareBrace {
    return lexToken[1:], true
  }

  for true {
    var tempResult bool
    lexToken, tempResult = Parser(lexToken)
    if tempResult == false {
      return lexToken, false
    }
    token := lexToken[0]
    if token == RightSquareBrace {
      return lexToken[1:], true
    } else if token != Comma {
      return lexToken[1:], false
    } 
    lexToken = lexToken[1:]
  }

  return lexToken, true
}

IsValidString

A string in JSON begins and ends with double quotation symbols. All Unicode characters are placed within. Although certain characters are needed to escape if the previous character is an escape character. These are the characters that can appear after an escape character.

  • Quotation mark "

  • Reverse Solidua \

  • Solidus /

  • Backspace b

  • Formfeed f

  • Linefeed n

  • Carriage Return r

  • Horizontal Tab t

Here is the code for recognizing a valid string.

func IsValidString(input string) bool {
    inputLen := len(input)
  if inputLen < 2 {
    return false
  }
    var prevChar rune
    if input[0] != DoubleQuoteByte || input[inputLen-1] != DoubleQuoteByte {
        return false
    }
    for _, ch := range input {
        if prevChar == BackSlashRune {
            var validAfterBackSlash bool = false
            for _, allowedChar := range AllowedCharsAfterEscapeChar {
                if allowedChar == ch {
                    validAfterBackSlash = true
                    break
                }
            }
            if !validAfterBackSlash {
                return false
            }
            prevChar = 0
        } else {
            prevChar = ch
        }
    }
    return true
}

IsValidNumber

For number validation, if a token can be converted to a valid int or float, then it is a valid JSON number value.

func IsValidNumber(input string) bool {
    _, atoiErr := strconv.Atoi(input)
    _, floatErr := strconv.ParseFloat(input, 64)
    if atoiErr != nil && floatErr != nil {
        return false
    }
    return true
}

IsValidBoolean

It is simple, it should be either true or false.

IsValidBoolean(input string) bool {
    return input == "true" || input == "false"
}

IsValidNull

It is also simple, it should be null.

func IsValidNull(input string) bool {
    return input == "null"
}
0
Subscribe to my newsletter

Read articles from Makx directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Makx
Makx

I’m a Computer Science undergraduate from India, passionate about programming 💻 and exploring the digital world. Besides coding, I enjoy digital media creation, Gym, and football ⚽.