Cat Driven Development: Working with JSONB in PostgreSQL (#2)
Context
Following my Purrsuit of optimization method a few days ago I noticed a very slow api on New Relic at , it was our one of the slowest due as per the time taken by the database transcations in it, with an avg response time of just above 3.5s and the slowest response of just under 6.5s. Which is a bit high since it's a simple api being used for internal communication, I expected it to be under 1s.
In this and the next few articles i'll share with you the things I learnt and in the end how i applied it to improve the above mentioned API. Without further ado let's get into it.
Quickly going through the codebase I found both the queries trying to search jsonb columns and so it begins...
PS- I am a self taught developer so I feel like Alice in the wonderland when I start diving deep into some topics, feel free to skip the parts which might not be relevant for you.
What exactly is JSONB?
Postgres 9.4 introduced JSONB which stands for JSON BETTER. As the name suggests it is better than JSON in almost all ways. Before this the most used type was json and hstore. Hstore saves the data in key value pairs, but if that’s what you need it's fine.
JSONB and JSON both save data in the json format but there are some key diffrences in how they store data.
JSONB takes shortcuts for performance reasons:
- JSON data is parsed on input and stored in binary format, key orderings in dictionaries are not maintained, and neither are duplicate keys.
- Accessing individual elements in the JSONB field is fast as it doesn't require parsing the JSON text all the time. On output, JSON data is reconstructed and initial formatting is lost.
- JSONB usually takes more disk space to store than JSON.
When to use JSONB vs JSON
The short answer is in most of the cases you'll end up using JSONB and it will work just great but there is often an important performance penalty as well for unnecessarily JSONB-ing your data.
- If you are doing a lot of JSON manipulation inside PostgreSQL, such as sorting, slicing, splicing, etc., you should use JSONB for speed reasons.
- If you need indexed lookups for arbitrary key searches on JSON, then you should use JSONB.
- If you are doing neither of the above, you should probably use JSON.
- If you need to preserve key ordering, whitespace, and duplicate keys, you should use JSON
There are a few cases where you might need to actually debate between the two.
- For traditional data types, PostgreSQL stores statistics about the distribution of values in each column of each table. This is one of the hidden costs of JSONB: your data doesn’t have statistics, so the query planner is flying blind. So one should consider the pros and cons if you are planning to use JSONB with analytical queries.
- JSONB has a larger table footprint, hence in cases when you are proccessing logs just as an audit trail and dont need to query too often, JSON might be the winner.
Querying in a JSON(B) column
Intially I used to be very confused how these queries work and sort of trial and error-ed for a few months just like i used do with CSS only to realise i started hating CSS after a while just because i never understood the basics clearly. Not wanting the same fate for JSON/JSONB datatypes I went through few articles and i've been thanking myself eversince.
PostgreSQL allows you to query in both JSON and JSONB data in tables, although there are a few operators only for JSONB. I'll try to explain some common ways to do it, but if you want more information the documentation has many more examples and more precise explanations.
-> vs ->>
For most people confusion arises between -> and ->> operators. The difference is very simple, -> returns json(b) and ->> returns text. That's it!
Lets take an example to understand this and the other operators.
Say we have a table user with a metadata JSONB column that stores data something like this:
metadata => { "data1" : [1, 2, 3], "data2" : "some_data", "array" : [ {"hello"=>1}, {"hi"=>2} ], "nest" : {"nest1" : "yes"} }
For simple key search:
SELECT * FROM users WHERE metadata->>'data2' = 'some_data';
you can also use the containment operator(only for JSONB) @> to compare partial JSON strings like
SELECT * FROM users WHERE metadata @> '{"data2": "some_data"}';
For nested search:
Since -> returns a json(b) data, one can use it to traverse through the data.
SELECT * FROM users WHERE metadata->'nest'->>'nest1' = 'yes';
#> vs #>>
Same as the previous operator #> returns json(b) object while #>> returns the string for an array.
For searching within array:
Here we want the first index in the data1 array
SELECT * FROM users WHERE metadata #>> '{data1, 1}' = '2';
For searching within a value that's an array Here we want the value of the first obj in the array
SELECT * FROM users metadata #> '{array, 0}' ->> 'hello' = '1';
Checking for Null
This bit is also something used very often and a tad confusing. Let's say you want to check if a json(b) field has a certain element
SELECT * FROM users where metadata->'attribute' is not NULL;
this is the correct way. Using ->> will give null if it does not exist, or if it does exist and the value is null, e.g.
metadata = '{"attribute":null}'
A better and cleaner way to do it is using the ? operator. Please note this is only for JSONB types and works for top-level attributes only.
SELECT * FROM users WHERE metadata ? 'attribute';
In the next article we will see how indexes work, when to use them and cases when Postgres choses to ignore using them.
PS- We are hiring at Anarock Tech across multiple roles, apply here or get in touch with me.
Subscribe to my newsletter
Read articles from Dhaval Singh directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by