Build Your own JSON Parser (Day-2)
Welcome back! I hope you enjoyed the previous blog on this topic. This is a continuation, so let's dive in and discuss today's scope.
Scope
Today, we take a step forward and extend our Lexer as well as Parser to recognize key-value pairs. We focus on recognizing keys and string values. So, let's code our way through.
Lexer
Let's begin by updating our TokenType and add support for :, StringKey and StringValue. We also update our Token, which now supports a character array (string) and not just a single character.
enum TokenType{
StartObject,
EndObject,
StringKey,
StringValue,
KeyValueSeperator
};
typedef struct Token{
char ch[100];
enum TokenType t_type;
}Token;
Now we move on to the consume_token function and modify and extend it.
int consume_token(char token,Token* token_container,int i,char* json_string,int idx_tc){
Token t;
int index = i;
char tc[2];
switch (token)
{
case '{':
tc[0] = token;
tc[1] = '\0'; //explicit null terminator
strcpy(t.ch,tc);
t.t_type = StartObject;
break;
case '}':
tc[0] = token;
tc[1] = '\0';
strcpy(t.ch,tc);
t.t_type = EndObject;
break;
We keep the existing cases as they are. Next, we add support for " and the string inside it. Before we do that, let's define a helper function next_token, which returns the character at a given index of the JSON string.
char next_token(char* json_string,int i){
return json_string[i];
}
Now, let's get back to it.
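case '"': { //braces give the declarations below their own scope
char string[100] = ""; //start empty so the characters are appended correctly
size_t len = 0;
int idx = i+1;
char token_c = next_token(json_string,idx);
while(token_c!='"' && token_c!='\0'){
if(len<sizeof(string)-1){ //keep room for the null terminator
string[len++] = token_c;
string[len] = '\0';
}
++idx;
token_c = next_token(json_string,idx);
}
if(token_c=='"'){
++idx; //step past the closing "
}
token_c = next_token(json_string,idx); //character right after the string
strcpy(t.ch,string);
t.t_type = (token_c==':') ? StringKey : StringValue;
i = idx-1;
break;
}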
case ':':
tc[0] = token;
tc[1] = '\0';
strcpy(t.ch,tc);
t.t_type = KeyValueSeperator;
break;
We add two more cases, one for " and another for :. Let's walk through what happens when we encounter ". We collect every character that follows until we find the closing ". The resulting string can be either a key or a value. To decide which, we look at the character right after the closing ": if it is :, we label the string as StringKey; otherwise, we label it as StringValue. We set i=idx-1 so that the : is not skipped and gets its own token on the next call. The : case works like the brace cases: we store the character and label it as KeyValueSeperator.
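To make the key-or-value decision easier to see in isolation, here is a minimal sketch of the same idea as a standalone helper. The names scan_string, StringKind, KIND_KEY and KIND_VALUE are made up for this example and are not part of the article's lexer; the sketch also assumes there is no whitespace between the closing " and the : and does not handle escape sequences.

```c
#include <stddef.h>

/* Hypothetical names for this sketch; the article's lexer uses TokenType. */
enum StringKind { KIND_KEY, KIND_VALUE };

/*
 * Copies the characters between the opening quote at json[start] and the
 * closing quote into out (truncated to out_size-1 characters), reports
 * whether a ':' follows the closing quote, and returns the index of the
 * closing quote (or of the terminating '\0' if the string never closes).
 */
size_t scan_string(const char *json, size_t start, char *out, size_t out_size,
                   enum StringKind *kind)
{
    size_t idx = start + 1;   /* skip the opening quote */
    size_t len = 0;

    while (json[idx] != '"' && json[idx] != '\0') {
        if (len + 1 < out_size) {
            out[len++] = json[idx];
        }
        ++idx;
    }
    if (out_size > 0) {
        out[len] = '\0';
    }

    /* Peek one character past the closing quote without consuming it. */
    if (json[idx] == '"' && json[idx + 1] == ':') {
        *kind = KIND_KEY;
    } else {
        *kind = KIND_VALUE;
    }
    return idx;
}
```

Calling scan_string on {"key":"value"} with start set to the index of the first " fills out with key and reports KIND_KEY; calling it again from the third " yields value and KIND_VALUE. Returning the index of the closing quote plays the same role as i=idx-1 above: the caller resumes at the very next character, so the : is never skipped.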
We also make some changes in our lexer function. In the loop, we read the current character with next_token and hand it to consume_token. Because consume_token returns the index of the next character to consume, the for loop has no increment of its own. idx_tc is the index into token_container, and it increases after every iteration.
int idx_tc = 0;
for (int i=0;i<len_json_string;){
char token = next_token(json_string,i);
i = consume_token(token,token_container,i,json_string,idx_tc);
idx_tc++;
}
struct Response resp;
resp.token_container = token_container;
resp.length = idx_tc; //size of container
return resp;
}//end of function
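The pattern of letting the consuming function decide how far to advance can be sketched on its own. This is a simplified illustration, not the article's lexer: consume and count_tokens are hypothetical names, and the sketch only counts tokens instead of storing them.

```c
#include <stddef.h>

/*
 * Hypothetical helper: consumes one token that starts at json[i] and
 * returns the index of the next unread character. A quoted string is
 * consumed as a whole; every other character is a one-character token.
 */
size_t consume(const char *json, size_t i)
{
    if (json[i] == '"') {
        ++i;                                   /* skip the opening quote */
        while (json[i] != '"' && json[i] != '\0') {
            ++i;
        }
        if (json[i] == '"') {
            ++i;                               /* skip the closing quote */
        }
        return i;
    }
    return i + 1;
}

/* Counts the tokens in json; note that the loop header has no i++. */
size_t count_tokens(const char *json)
{
    size_t count = 0;
    for (size_t i = 0; json[i] != '\0'; ) {
        i = consume(json, i);                  /* jump to the next token */
        ++count;
    }
    return count;
}
```

For {"key":"value"} this counts five tokens: {, the key, :, the value and }. The important part is that a multi-character token moves i by more than one, which a plain i++ in the loop header could not do.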
Here comes the end of our Lexer for now. Let's move on to the parser.
Parser
Today, we are going to introduce some abstractions as well as some new functions. It's going to be easy but a long read. Let's start.
typedef enum ValueType{
STRING,
}ValueType;
We introduce a new enum called ValueType, which stores the type information of a value.
typedef struct Key{
char key[100];
}Key;
typedef struct Value{
ValueType val_type;
char value[100];
}Value;
typedef struct KeyValue{
Key Key;
Value Value;
}KeyValue;
typedef struct Object{
KeyValue arr[100];
size_t size;
}Object;
typedef struct ResponseKV{
Object obj;
}ResponseKV;
The code above introduces the following key structs: 1. Key, 2. Value, 3. KeyValue, 4. Object, 5. ResponseKV.
Key currently stores a string of up to 99 characters plus the null terminator. Value contains both a ValueType and a string value. KeyValue combines a Key and a Value. Object holds up to 100 KeyValue pairs (we can extend this if needed) along with the number of pairs actually in use. Finally, ResponseKV wraps the Object that our parser returns.
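To get a feel for how these structs fit together, here is a small sketch that fills an Object by hand. The helper object_add is hypothetical and not part of the article's parser; the struct definitions are repeated so the snippet stands alone, with the type field named val_type to match how parse_object uses it.

```c
#include <stddef.h>
#include <stdio.h>

typedef enum ValueType { STRING } ValueType;
typedef struct Key { char key[100]; } Key;
typedef struct Value { ValueType val_type; char value[100]; } Value;
typedef struct KeyValue { Key Key; Value Value; } KeyValue;
typedef struct Object { KeyValue arr[100]; size_t size; } Object;

/*
 * Hypothetical helper: appends one string pair to obj.
 * Returns 0 on success and -1 if all slots are already in use.
 */
int object_add(Object *obj, const char *key, const char *value)
{
    if (obj->size >= sizeof obj->arr / sizeof obj->arr[0]) {
        return -1;
    }
    KeyValue *kv = &obj->arr[obj->size];
    /* snprintf truncates instead of overflowing the fixed-size buffers. */
    snprintf(kv->Key.key, sizeof kv->Key.key, "%s", key);
    snprintf(kv->Value.value, sizeof kv->Value.value, "%s", value);
    kv->Value.val_type = STRING;
    obj->size++;
    return 0;
}
```

After object_add(&obj, "key", "value") on an Object whose size is 0, obj.size is 1 and the pair sits in obj.arr[0]. Keeping size next to arr is what later lets a lookup stop at the last used slot instead of scanning all 100.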
Now, we introduce a new function parse_object. All the key-value pairs live inside an object, so it acts as the top-level structure (we will cover arrays later, which can also act as a top-level structure). First, we move our code from the parser function to parse_object.
Object* parse_object(Token* token_container,int len,int* idx,struct Stack* stack){
KeyValue arr[100];
size_t idx_arr = 0;
Object *obj = (Object*)malloc(sizeof(Object));
int i = 0;
for(i=*idx;i<len;i++){
char* token = token_container[i].ch;
enum TokenType type_token = token_container[i].t_type;
if(strcmp(token,"{")==0){
push(stack,token_container[i].ch);
}
else if(strcmp(token,"}")==0){
if((isEmpty(stack)==1 || strncmp("{",peek(stack),1)!=0)){
printf("Parser Error : Invalid Syntax\nOperation Aborted\n");
free(stack);
exit(EXIT_FAILURE);
}
}
...
So, here is our new function. It takes a pointer to token_container, the length of token_container, a pointer to an idx integer (we'll explain why later), and our stack pointer. That's all for the parameters.
Now it's time for the body. We have a KeyValue array with a size of 100 (which we can extend if needed), its index idx_arr, and an Object that we create on the heap. The rest was discussed in the previous blog. For now, don't worry about why we are using *idx instead of 0.
else if(type_token==StringKey && isEmpty(stack)==0){
Key key;
strcpy(key.key,token);
arr[idx_arr].Key = key; //we don't increment idx_arr here, so the next
//value lands in the same slot as its key.
}
else if(type_token==StringValue){
Value val;
strcpy(val.value,token);
val.val_type = STRING;
arr[idx_arr].Value = val;
idx_arr++;
}
}
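So, when we encounter a StringKey, we check that the stack is not empty, which ensures the key appears after an opening { and therefore inside an object. When we encounter a StringValue, we store it in the same slot as the last key and only then increment idx_arr.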
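The pairing trick (write the key, stay on the slot, advance only after the value) can be sketched without the rest of the parser. The types Tok and Pair and the function collect_pairs are hypothetical stand-ins for the article's Token, KeyValue and parse_object, kept small for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins for the article's Token and KeyValue. */
enum TokKind { TOK_KEY, TOK_VALUE, TOK_OTHER };
struct Tok  { enum TokKind kind; const char *text; };
struct Pair { char key[100]; char value[100]; };

/* Fills pairs[] from toks[] and returns the number of completed pairs. */
size_t collect_pairs(const struct Tok *toks, size_t n_toks,
                     struct Pair *pairs, size_t max_pairs)
{
    size_t n = 0;
    for (size_t i = 0; i < n_toks && n < max_pairs; ++i) {
        if (toks[i].kind == TOK_KEY) {
            /* Store the key but stay on the same slot. */
            snprintf(pairs[n].key, sizeof pairs[n].key, "%s", toks[i].text);
        } else if (toks[i].kind == TOK_VALUE) {
            /* The value completes the pair, so now we move on. */
            snprintf(pairs[n].value, sizeof pairs[n].value, "%s", toks[i].text);
            ++n;
        }
    }
    return n;
}
```

Feeding it the five tokens of {"key":"value"} produces one pair. One consequence of this design is that a key without a following value never becomes a completed pair, because the counter only moves when a value arrives.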
if(i==len && !isEmpty(stack)){
printf("Parser Error : Invalid Syntax\nOperation Aborted\n");
free(stack);
exit(EXIT_FAILURE);
}
After the loop ends, we add a syntax check: if we have run through every token and the stack is still not empty, a { was never closed by a } in our JSON string.
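As a side note, the same balance check can be expressed with a plain depth counter instead of a stack. This is a swapped-in technique for illustration, not what parse_object does; braces_balanced is a hypothetical name, and the sketch works on the raw string and ignores escape sequences inside strings.

```c
#include <stddef.h>

/*
 * Returns 1 if every '{' outside a string is closed by a later '}',
 * and 0 otherwise. Uses a depth counter rather than a stack.
 */
int braces_balanced(const char *json)
{
    int depth = 0;
    int in_string = 0;

    for (size_t i = 0; json[i] != '\0'; ++i) {
        char c = json[i];
        if (c == '"') {
            in_string = !in_string;      /* no escape handling in this sketch */
        } else if (!in_string && c == '{') {
            ++depth;
        } else if (!in_string && c == '}') {
            if (depth == 0) {
                return 0;                /* '}' with no matching '{' */
            }
            --depth;
        }
    }
    return depth == 0;                   /* leftover depth: an unclosed '{' */
}
```

A counter is enough while { and } are the only brackets; once arrays with [ and ] join in, a stack becomes the natural choice because it remembers which kind of bracket is open.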
memcpy(obj->arr,arr,100*sizeof(KeyValue));
obj->size = idx_arr;
*idx = i;
return obj;
After the check, we copy our array elements into the Object's arr, set its size to idx_arr, and finally set *idx to i (this will make sense later) and return the Object pointer.
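ResponseKV parser(Token* token_container,int len){
int idx = 0; //parsing starts at the first token
//stack is the struct Stack* set up as in the previous blog
Object* obj = parse_object(token_container,len,&idx,stack);
ResponseKV res;
res.obj = *obj; //res is a struct, not a pointer, so we use '.'
free(obj); //the Object has been copied, so the heap copy can go
return res;
}
We receive the Object pointer, create a variable res of type ResponseKV, set res.obj to the value pointed to by the obj pointer, and free the heap allocation once it has been copied.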
We also introduce a lookup function to fetch the value of a given key. It takes a key string and a pointer to the response struct. It returns the value string if the key exists, and the string "Invalid Key" otherwise.
char* lookup(char* key,ResponseKV* self){
//arr and size live inside the Object wrapped by ResponseKV
for(size_t i=0;i<self->obj.size;++i){
if(strcmp(self->obj.arr[i].Key.key,key)==0){
return self->obj.arr[i].Value.value;
}
}
return "Invalid Key";
}
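One thing to keep in mind with the "Invalid Key" sentinel is that a JSON value could legitimately be the string Invalid Key, and the caller could not tell the two cases apart. A possible alternative, sketched here with hypothetical names (Pair, Pairs, find_value) rather than the article's structs, is to return NULL for a missing key.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the article's KeyValue and Object. */
struct Pair  { const char *key; const char *value; };
struct Pairs { const struct Pair *items; size_t size; };

/* Returns the value stored under key, or NULL if the key is not present. */
const char *find_value(const struct Pairs *pairs, const char *key)
{
    for (size_t i = 0; i < pairs->size; ++i) {
        if (strcmp(pairs->items[i].key, key) == 0) {
            return pairs->items[i].value;
        }
    }
    return NULL;   /* cannot be confused with any stored string */
}
```

The trade-off is that the caller has to check for NULL before printing the result, since passing NULL to printf with %s is undefined behavior.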
That's it for the Parser.
Main Function
#include <stdio.h>
#include <stdlib.h>
#include "lexer.h"
#include "parser.h"
int main(){
struct Response res_lexer;
res_lexer = lexer("{\"key\":\"value\"}"); //inner quotes must be escaped
ResponseKV obj = parser(res_lexer.token_container,res_lexer.length);
printf("%s\n",lookup("key",&obj));
free(res_lexer.token_container);
return 0;
}
Now, if we run this, it will output "value" and then terminate.
Wrapping Up
That's it from my side. In this article, we extended our Lexer and Parser to recognize key-value pairs in JSON strings. We updated the TokenType enum and the Token struct to support string keys and values, as well as the ':' separator. We made changes to the consume_token function to handle string tokens and separators correctly. In the Parser section, we introduced new structures for managing key-value pairs and implemented the parse_object function to parse the tokens into these structures. Finally, we provided a lookup function to retrieve values by their keys and demonstrated the full implementation in a main function.
Our toy parser currently only parses string values. It's up to you now to extend it to support other types and nested structures. I hope you make it your own.
See yaaa!!!!
Written by
Kaif Khan
CS enthusiast.