Z Order in Delta Lake - Part 2

In my previous article on Z Order for Data Lake, we delved into the underlying algorithm of the Z-order curve and saw how elegantly it organizes data to optimize the overall search process, moving from the vertical efficiency of B-trees in traditional indexes to a more square-like efficiency.

But there are a few caveats that one should be aware of in this approach.

These square-like searches, by default, fail to deliver efficient results, especially for range searches: we can't just take some arbitrary region and expect an efficient scan that touches only the points within that search range.

Note: I used the following online converter for binary-to-decimal conversions and vice versa.

https://www.rapidtables.com/convert/number/binary-to-decimal.html

The Problem

Let's take a range search on two-dimensional data for the points (0,0) and (6,6).

Z order Curve

In the image above, the range spans from (0,0), marked in red, to (6,6), marked in green. The “bounding box” is marked in purple. This looks very efficient: all the required points lie within the purple span, with no spillover into additional regions, since the points carry continuously increasing Z values. This is the ideal outcome one would hope for from a range search when the underlying data spans terabytes to petabytes in size.

Let's now look at another range query. How about a range from (2,2) to (6,3)?

Z order

Our desired data is located in the brown bounding box in the image below.

Z order Curve

How do you think the search will progress in this case? Will it be as optimal as the range query for the points (0,0) and (6,6)?

Unfortunately, that isn't the case. Refer to the image below.

Z order

The search (marked with black lines) proceeds by touching many data points outside the desired search area (marked in brown). It is a simple range query, just like the previous one but with a different search range, yet it touches a whole lot of false points, making the query almost as inefficient as one over data that was never organized along a Z-order curve in the first place.
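
To put a number on this inefficiency, here is a minimal Python sketch (my own illustration, reusing the bit-interleaving scheme from part 1; the deinterleave helper is mine, not Delta Lake code) that walks every Z value from 12, the code of (2,2), to 45, the code of (6,3), and counts how many decoded points fall outside the query box:

```python
def deinterleave(z: int, bits: int = 3) -> tuple[int, int]:
    """Recover (x, y) from a Z value laid out as x1 y1 x2 y2 x3 y3."""
    x = y = 0
    for i in range(bits):
        x |= ((z >> (2 * i + 1)) & 1) << i  # x bits sit at the odd positions
        y |= ((z >> (2 * i)) & 1) << i      # y bits sit at the even positions
    return x, y

false_hits = [
    (x, y)
    for x, y in map(deinterleave, range(12, 46))  # scan z = 12 .. 45
    if not (2 <= x <= 6 and 2 <= y <= 3)          # outside the (2,2)-(6,3) box?
]
print(len(false_hits))  # 24 of the 34 visited points are false hits
```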

Is it possible to optimize this? Theoretically, the answer is yes.

The solution lies in a research paper published more than 43 years ago by the German researchers H. Tropf and H. Herzog. I will not go into the details of the paper; you can refer to it at the link below.

https://hermanntropf.de/media/multidimensionalrangequery.pdf

Back to our problem. The solution is to somehow limit the search to only the specific area defined by our range query.

Z order Curve

This way, we can skip the unnecessary regions the search would otherwise touch. One possible approach is for the algorithm to define new internal search ranges that override the original one and limit the search to the area above. This can be achieved by creating one or more squares of continuously increasing Z values, each with its own newly defined search range.

Solution

First, we define false steps: after how many consecutive false steps should we decide that the search has moved outside the desired search area?

Z order

In our case, for example, if we set the false-step value to 4, it means that 4 steps after the last valid value in the bounding box, we know the search is out of bounds.

Z order Curve

The scan starts at 12, and 15 is the last valid value in the bounding box before the search moves out of the boundary. Four steps beyond 15 is 19, which indicates that the search has moved beyond the relevant search area. Once we detect that, we proceed to define a new search range.
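
As a sketch of the idea (the function names and structure are illustrative, not taken from Delta Lake or the paper), the scan below walks the curve, resets a counter on every hit inside the box, and reports the Z value at which the fourth consecutive miss occurs:

```python
def deinterleave(z: int, bits: int = 3) -> tuple[int, int]:
    """Recover (x, y) from a Z value laid out as x1 y1 x2 y2 x3 y3."""
    x = y = 0
    for i in range(bits):
        x |= ((z >> (2 * i + 1)) & 1) << i
        y |= ((z >> (2 * i)) & 1) << i
    return x, y

FALSE_STEP_LIMIT = 4  # the false-step value chosen above

def detect_exit(z_min: int, z_max: int, box) -> int | None:
    """Return the Z value at which the false-step limit is reached."""
    (x_lo, y_lo), (x_hi, y_hi) = box
    misses = 0
    for z in range(z_min, z_max + 1):
        x, y = deinterleave(z)
        if x_lo <= x <= x_hi and y_lo <= y <= y_hi:
            misses = 0                    # back inside the search area
        else:
            misses += 1
            if misses == FALSE_STEP_LIMIT:
                return z                  # time to compute a new range
    return None

print(detect_exit(12, 45, ((2, 2), (6, 3))))  # 19: four steps beyond 15
```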

In the first step, we pick up the Z index values defined by our range query, 12 and 45. Their (x,y) coordinates are (2,2) and (6,3).

We then get their binary values.

(2) » 010 & (2) » 010

(6) » 110 & (3) » 011

12 » (2,2) » (010,010)

45 » (6,3) » (110,011)

Interleaving bits for 12 » (2,2) » (010,010) » 001100

Interleaving bits for 45 » (6,3) » (110,011) » 101101

The Z curve binary value for 12 is 001100 and for 45 is 101101.
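
As a quick sanity check, here is a small Python sketch of the interleaving step (an illustration of the scheme above, not Delta Lake's internal implementation):

```python
def interleave(x: int, y: int, bits: int = 3) -> int:
    """Interleave the bits of x and y into a Z value.

    Read from the most significant bit, the result is x1 y1 x2 y2 x3 y3,
    so each x bit lands just above its y counterpart.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)  # x bit i
        z |= ((y >> i) & 1) << (2 * i)      # y bit i
    return z

print(format(interleave(2, 2), "06b"), "=", interleave(2, 2))  # 001100 = 12
print(format(interleave(6, 3), "06b"), "=", interleave(6, 3))  # 101101 = 45
```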

In the next step, we determine the most significant bit at which the two values differ.

Note: if the differing bit is a y bit, it results in a horizontal split; if it is an x bit, the split is vertical.

We check the bits of 12 (001100) and 45 (101101), starting from the most significant bit.

Bit 1: 0 (in 12) vs. 1 (in 45) → different. The first significant difference is at the 1st bit.

Now, map the Z-order bits:

  1. z1 = x1

  2. z2 = y1

  3. z3 = x2

  4. z4 = y2

  5. z5 = x3

  6. z6 = y3

Since the first difference occurs at z1, it corresponds to x1, the most significant bit of the x coordinate, which indicates that the split over the bounding box is vertical: the x range is divided while the y range stays intact.
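
In code, locating the first differing bit and mapping it back to a dimension looks roughly like this (a sketch; the helper name and return convention are my own):

```python
def split_dimension(z_lo: int, z_hi: int, total_bits: int = 6) -> tuple[int, str]:
    """Find the most significant bit at which two Z values differ.

    With the layout z1 z2 z3 z4 z5 z6 = x1 y1 x2 y2 x3 y3, the odd
    z positions (z1, z3, z5) are x bits and the even ones are y bits.
    Returns the 1-based z position and the dimension it belongs to.
    """
    for pos in range(total_bits - 1, -1, -1):          # scan from the MSB down
        if ((z_lo >> pos) & 1) != ((z_hi >> pos) & 1):
            z_index = total_bits - pos                 # 1-based: z1 is the MSB
            return z_index, "x" if z_index % 2 == 1 else "y"
    raise ValueError("the Z values are identical")

print(split_dimension(0b001100, 0b101101))  # (1, 'x') -> z1, an x bit
```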

Now that we have figured out that the split will be vertical, we know that the upper x boundary can be inherited from the value of 45 and the lower x boundary from 12.

What we don't know are the two x values just below and above the division line. The two points we want to identify are called LitMax and BigMin.

The research paper referenced earlier defines LitMax and BigMin. LitMax's x value is formed from the common most significant bits of the two x values, followed by a 0 and then all 1s, while BigMin's x value is formed from the same common bits, followed by a 1 and then all 0s.

We take the x values of 12 and 45, as we know the split has to be vertical.

x of 12 in binary = 010

x of 45 in binary = 110

Next, we compare the most significant bits of 010 and 110.

The first bit differs:

0 in 010

1 in 110

As the most significant bits differ, there are no common bits at the most significant position, which gives us a LitMax x value of 011 and a BigMin x value of 100. This results in the LitMax point (011,010) and the BigMin point (100,011), where 010 is the y value of 12 and 011 is the y value of 45.

Interleaving LitMax point (011,010) results in 001110 and interleaving BigMin point (100,011) results in 100101.

Converting these binary numbers to decimal we get

001110 » 14 and 100101 » 37
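
The whole derivation can be condensed into a short sketch (following the construction described above; the helper names are mine, and a complete implementation of Tropf and Herzog's algorithm also has to track the range of the dimension that is not being split):

```python
def interleave(x: int, y: int, bits: int = 3) -> int:
    """z = x1 y1 x2 y2 x3 y3 for 3-bit coordinates."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)
        z |= ((y >> i) & 1) << (2 * i)
    return z

def litmax_bigmin_x(x_lo: int, x_hi: int, bits: int = 3) -> tuple[int, int]:
    """Split the x range at its most significant differing bit.

    LitMax x = common prefix, then 0, then all 1s (...0111);
    BigMin x = common prefix, then 1, then all 0s (...1000).
    """
    for pos in range(bits - 1, -1, -1):
        if ((x_lo >> pos) & 1) != ((x_hi >> pos) & 1):
            prefix = (x_hi >> (pos + 1)) << (pos + 1)  # common leading bits
            return prefix | ((1 << pos) - 1), prefix | (1 << pos)
    raise ValueError("the x values are identical")

lit_x, big_x = litmax_bigmin_x(0b010, 0b110)  # x of 12 and x of 45
print(lit_x, big_x)                           # 3 (011) and 4 (100)
print(interleave(lit_x, 0b010))               # LitMax point (011,010) -> 14
print(interleave(big_x, 0b011))               # BigMin point (100,011) -> 37
```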

Z order Curve

It means that the vertical split will fall between the Z values 14 and 37.

Z order

With the calculation done, the initial bounding box from (2,2) to (6,3) is now split into two bounding boxes: (2,2) to (3,3) and (4,3) to (6,3).

This now leads to two short scans, the first from (2,2) to (3,3) and the second from (4,3) to (6,3), instead of one large scan that unnecessarily touched regions of no interest. The first scan is marked in red and the second in green.

Z order Curve

Compare the new scan above with the previous one; it clearly looks a lot more efficient.
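
A quick count (again just an illustration, using the same decoding helper as before) confirms the gain for this example:

```python
def deinterleave(z: int, bits: int = 3) -> tuple[int, int]:
    x = y = 0
    for i in range(bits):
        x |= ((z >> (2 * i + 1)) & 1) << i
        y |= ((z >> (2 * i)) & 1) << i
    return x, y

def false_hits(z_lo: int, z_hi: int) -> int:
    """Count Z values in [z_lo, z_hi] decoding to points outside (2,2)-(6,3)."""
    return sum(
        not (2 <= x <= 6 and 2 <= y <= 3)
        for x, y in map(deinterleave, range(z_lo, z_hi + 1))
    )

print(false_hits(12, 45))                       # single scan: 24 false hits
print(false_hits(12, 15) + false_hits(37, 45))  # two scans: only 4
```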

Z order

Conclusion

This concludes the two-part series on Z-order and Z-order curves. We explored how Z-order elegantly converts multidimensional data into one-dimensional values while preserving locality, making it effective for faster range searches over multidimensional data.

The most striking aspect is that a beautifully crafted solution to today's problem sits in a research paper more than 40 years old, from a time when the fastest computers ran on only a few MBs of memory and modest CPU speeds.

I always hold huge respect for people who work in pure research. They often have solutions for future problems; it is just that technology has not kept pace with their findings. This solution is a testament to their genius: they anticipate problems and provide beautiful solutions without the latest technology to test them. The most famous example is Einstein, whose theory implied the existence of black holes decades before we had the technology to observe them.

That's all for now. Thanks for reading!


Written by

Sachin Nandanwar