You Don't Know Json-Parser
Introduction
Recently I have been taking up some projects and implementing them in Golang. So, this week I took up the challenge of learning the inner workings of a JSON parser and enforcing it in the Golang programming language. In this blog, I will first talk about the theory of a JSON parser and then about implementation details.
Working off a JSON Parser
A JSON parser mainly works in two steps
Lexical Analysis: In this step, the input is converted into an array of tokens.
Syntactic Analysis: In this step, the tokens are parsed to check whether they align with the rules of the specified language (JSON).
Lexical Analysis
Suppose we have a JSON content like this
"key": "value",
"key2": true
}
To effectively divide the JSON content into a list of tokens, here is the algorithm used
// does lexical analysis over the input json to create lex tokens
func Lexer(content []byte) []string {
var isInsideString bool = false
var prevByte byte
var currBytes []byte
var lexResult []string
for _, byt := range content {
if byt != DoubleQuoteByte || prevByte == BackSlashByte {
// all other cases
if isInsideString {
currBytes = append(currBytes, byt)
} else {
var present bool = false
for _, cs := range CheckSlice {
if cs == byt {
present = true
break
}
}
if present {
// if current char is any of { [ : , ] }
if len(currBytes) > 0 {
// for cases like `true,` -> to add the `true` token since for (true | null | false) -> there is no end of token character
lexResult = append(lexResult, string(currBytes))
currBytes = nil
}
// add the current slice
lexResult = append(lexResult, string(byt))
} else {
// for cases like (true | false | null)
currBytes = append(currBytes, byt)
}
}
} else {
// double quote chars enclosing the string
currBytes = append(currBytes, byt)
if isInsideString {
// if the double quote is at the end of string
lexResult = append(lexResult, string(currBytes))
currBytes = nil
}
// toggle end of string flag
isInsideString = !isInsideString
}
prevByte = byt
}
return lexResult
}
The above algorithm outputs the following tokens (the individual tokens are enclosed within single quotes. Now as you can see there is still some cleaning that needs to be done on the generated tokens.
'{'
'
"key"'
':'
' "value"'
','
'
"key2"'
':'
' null
'
'}'
During the cleaning step, the algorithm just trims the extra spaces around the individual tokens. The outputs look like
'{'
'"key"'
':'
'"value"'
','
'"key2"'
':'
'true'
'}'
Since the tokens are generated, we are done with the lexical analysis part and ready to dive into the syntactic analysis part to see whether the token list can make up a valid JSON object or not.
Syntactic Analysis
Since the tokens can be a string, null, true, false, colon( : ), left curly brace( { ), right curly brace( } ), left square brace( [ ), right square brace( ] ), or a comma. So, we will apply the below algorithm to parse the token list.
token = tokenList[0]
tokenList = tokenList[1:]
if token == '{'
then apply IsValidObject(tokenList)
else if token == '['
then apply IsValidArray(tokenList)
else
apply IsValidString(tokenList) or IsValidNumber(tokenList) or IsValidBoolean(tokenList) or IsValidNull(tokenList)
Here is the Parser method in Golang
func Parser(lexToken []string) ([]string, bool) {
var result bool = false
if len(lexToken) < 1 {
return lexToken, result
}
token := lexToken[0]
if token == LeftCurlyBrace {
lexToken = lexToken[1:]
lexToken, result = IsValidObject(lexToken)
} else if token == LeftSquareBrace {
lexToken = lexToken[1:]
lexToken, result = IsValidArray(lexToken)
} else {
result = IsValidString(token) || IsValidNumber(token) || IsValidBoolean(token) || IsValidNull(token)
lexToken = lexToken[1:]
}
return lexToken, result
}
The methods mentioned above are described below one by one.
IsValidObject
As you can see a valid object is a series of <key, colon, value, comma> pairs. So, if we loop through the tokenList we can tell whether it is valid or not. A key point to note here is that the last pair does not have a comma. So here is the implementation for IsValidObject in Golang.
// function to validate weather `lexToken` contains a valid object
func IsValidObject(lexToken []string) ([]string, bool) {
if len(lexToken) == 0 {
return lexToken, false
}
if lexToken[0] == RightCurlyBrace {
return lexToken[1:], true
}
for true {
// for json key
if len(lexToken) == 0 {
return lexToken, false
}
json_key := lexToken[0]
lexToken = lexToken[1:]
if !IsValidString(json_key) {
return lexToken, false
}
// for colon
if len(lexToken) == 0 {
return lexToken, false
}
json_colon := lexToken[0]
lexToken = lexToken[1:]
if json_colon != Colon {
return lexToken, false
}
// for json value
if len(lexToken) == 0 {
return lexToken, false
}
json_value := lexToken[0]
lexToken = lexToken[1:]
if json_value == LeftCurlyBrace {
var tempResult bool
lexToken, tempResult = IsValidObject(lexToken)
if !tempResult {
return lexToken, false
}
} else if json_value == LeftSquareBrace {
var tempResult bool
lexToken, tempResult = IsValidArray(lexToken)
if !tempResult {
return lexToken, false
}
} else if !IsValidString(json_value) && !IsValidNumber(json_value) && !IsValidBoolean(json_value) && !IsValidNull(json_value) {
return lexToken, false
}
if len(lexToken) == 0 {
return lexToken, false
}
json_comma_or_curly_brace := lexToken[0]
lexToken = lexToken[1:]
if json_comma_or_curly_brace == RightCurlyBrace {
break
}
if json_comma_or_curly_brace != Comma {
return lexToken, false
}
}
return lexToken, true
}
IsValidArray
An array in a JSON file is a series of objects or numbers or boolean values(true and false) or null. Hence implementing this function is quite straightforward. In other words, the array should contain a list of JSON values. Here is the implementation in Golang.
func IsValidArray(lexToken []string) ([]string, bool) {
if len(lexToken) == 0 {
return lexToken, false
}
token := lexToken[0]
if token == RightSquareBrace {
return lexToken[1:], true
}
for true {
var tempResult bool
lexToken, tempResult = Parser(lexToken)
if tempResult == false {
return lexToken, false
}
token := lexToken[0]
if token == RightSquareBrace {
return lexToken[1:], true
} else if token != Comma {
return lexToken[1:], false
}
lexToken = lexToken[1:]
}
return lexToken, true
}
IsValidString
A string in JSON begins and ends with double quotation symbols. All Unicode characters are placed within. Although certain characters are needed to escape if the previous character is an escape character. These are the characters that can appear after an escape character.
Quotation mark "
Reverse Solidua \
Solidus /
Backspace b
Formfeed f
Linefeed n
Carriage Return r
Horizontal Tab t
Here is the code for recognizing a valid string.
func IsValidString(input string) bool {
inputLen := len(input)
if inputLen < 2 {
return false
}
var prevChar rune
if input[0] != DoubleQuoteByte || input[inputLen-1] != DoubleQuoteByte {
return false
}
for _, ch := range input {
if prevChar == BackSlashRune {
var validAfterBackSlash bool = false
for _, allowedChar := range AllowedCharsAfterEscapeChar {
if allowedChar == ch {
validAfterBackSlash = true
break
}
}
if !validAfterBackSlash {
return false
}
prevChar = 0
} else {
prevChar = ch
}
}
return true
}
IsValidNumber
For number validation, if a token can be converted to a valid int or float, then it is a valid JSON number value.
func IsValidNumber(input string) bool {
_, atoiErr := strconv.Atoi(input)
_, floatErr := strconv.ParseFloat(input, 64)
if atoiErr != nil && floatErr != nil {
return false
}
return true
}
IsValidBoolean
It is simple, it should be either true or false.
IsValidBoolean(input string) bool {
return input == "true" || input == "false"
}
IsValidNull
It is also simple, it should be null.
func IsValidNull(input string) bool {
return input == "null"
}
Subscribe to my newsletter
Read articles from Makx directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by
Makx
Makx
I’m a Computer Science undergraduate from India, passionate about programming 💻 and exploring the digital world. Besides coding, I enjoy digital media creation, Gym, and football ⚽.