Storage and Retrieval Insights

Henlo frens

We discussed till LSM-tree in the last blog, had to break it down into 2 cause this chapter has too much information.

Many databases use B-trees, it has become the standard. LSM-tree and B-tree both keep the key value pair sorted, but both of they have a different design philosophy.

Log-structured indexes -> break down db into variable size segments, writes a segment sequentially

B-tree indexes -> break the db into fixed size blocks or pages, 4Kb in size, read/writes one page at a time, O(logn) depth

B-tree use a Write ahead log to handle crashes, any data that is to be written in the B-tree gets appended to this file and then gets written in the B-tree.

Comparison between LSM-tree and B-tree. (You can find this anywhere, not gonna explain here sorry :()

Optimisations in B-tree:

Copy-on-Write Scheme
Key Abbreviation
Disk Layout Optimisation
Additional pointers

So apart from primary indexes there are also secondary indexes. They complement primary indexes and enable efficient joins and queries. the difference between them is that in secondary indexes the key values can be non-unique.

Coming to full text search/fuzzy search - your key-value indexing cannot handle full text search as even if there is a simple mistake in the keyword it won't show you any results. Full-text search engines expand queries to include synonyms, ignore grammatical variations, and support proximity searches. Fuzzy search deals with misspelled words.

This is a blog about correcting spellings: https://norvig.com/spell-correct.html

Okay, now lets talk about OLTP (Online Transaction Processing) and OLAP (Online Analytics processing).

OLTP forms the logical unit for the business use case, but using the same db for running analytical queries can slow the db down, so someone came up with OLAP dbs, these dbs are specifically built for analytics.

Data warehouses are for analysts to run their inefficient queries and fetch useless stats to impress the upper management, kidding. Data is piped from OLTP through an ETL (Extract-Transform-Load) pipeline into data warehouses where analysts can run any number of queries. Data warehouses usually follow a star schema. Look up on google. I cannot spoon feed everything, duh.

There are something called Columnar Dbs, as the name suggests well they store the column values together instead of the rows, these dbs are efficient for analytical queries. See tables in warehouses can have hundreds of columns and millions of rows, now going through every row and having the whole row in memory doesn't make sense if the user needs only specific column values, so column-oriented storage comes handy here.

Column-oriented storage, compression, and sorting helps to make read queries faster and make sense in data warehouses, where the load consist on large read-only queries run by business teams.

That's it, the chapter also has a lot of nitty gritties but I won't document them here, I have broadly covered the important things.

DDIA - Chapter 3 - Storage and Retrieval - (Part 2)

Subscribe to my newsletter

Vivek Khatri

Vivek Khatri