Algorithms of building a Proximity Service to Discover Nearby Gems
In Part 1 of our series, "Building a Proximity Service to Discover Nearby Gems," we laid the groundwork by defining the functional and non-functional requirements, performing back-of-the-envelope estimations, and proposing a high-level design. Now, in Part 2, we'll delve into the core of our proximity service: the search algorithms that enable efficient two-dimensional searches. Specifically, we'll explore geospatial indexing methods, focusing on hash-based and tree-based approaches, and discuss their respective advantages and limitations.
Algorithms to Find Nearby Gems
Two Dimensional Search
The most intuitive but naive way to get nearby businesses is to draw a circle with a predefined radius and find all the businesses within the circle. This might look something like this:
SELECT business_id, latitude, longitude
FROM business
WHERE (latitude BETWEEN {:my_lat} - radius AND {:my_lat} + radius)
AND (longitude BETWEEN {:my_long} - radius AND {:my_long} + radius)
However, this approach is not efficient as it requires scanning the entire table. The problem is that we have two-dimensional data, and the dataset returned from each dimension could be huge. Additionally, performing an intersection operation on the two datasets is not efficient.
The issue with the previous approach is that the database index can only improve search speed in one dimension. So naturally, we need to think of a way to map the two dimensions into one dimension. Before diving into specific indexing methods, let's explore different types of indexing methods.
When it comes to two-dimensional search, there are primarily two categories of methods:
Hash-based Methods
Tree-based Methods
Hash-based Methods
Hash-based methods convert geographic coordinates (latitude and longitude) into hash values that can be efficiently indexed and queried. The most common hash-based methods are:
Even Grid
Geohash
Even Grid
One simple approach is to evenly divide the world into smaller grids. This way, one grid could have multiple gems, and each gem on the map belongs to a grid. This approach works to an extent but has a major issue: the distribution of businesses is not even.
Imagine dividing a city into a 10x10 grid, where each cell represents a 1 km x 1 km area. In a densely populated area, one grid cell might contain hundreds of businesses, while in a less populated area, a grid cell might contain only a few or none. This uneven distribution can lead to inefficiencies in searching and indexing.
Geohash
Geohash is a better option than an evenly divided grid. It works by reducing the two-dimensional data into one-dimensional data. It encodes latitude and longitude into a single string of letters and digits. The length of the Geohash string determines the precision of the grid cell.
Here is a table showing the relationship between Geohash length, grid size, and radius of search:
Geohash Length | Grid Width x Height | Radius of Search |
1 | 5,000km x 5,000km | ~2,500km |
2 | 1,250km x 625km | ~630km |
3 | 156km x 156km | ~78km |
4 | 39km x 19.5km | ~20km |
5 | 4.9km x 4.9km | ~2.4km |
6 | 1.2km x 0.61km | ~0.61km |
7 | 152m x 152m | ~76m |
8 | 38m x 19m | ~19m |
Boundary Issues in Geohash
While Geohash is a powerful method, it comes with its own set of boundary issues:
Close Locations with No Shared Prefix
Locations that are very close but have no shared prefix at all can cause inefficiencies. This can occur when locations are near the edges of Geohash cells, resulting in different Geohash prefixes even for nearby points. Here is a more accurate example:
| Location | Latitude | Longitude | Geohash | | --- | --- | --- | --- | | Location A | 0.0001 | -0.0001 | 7zzzz | | Location B | -0.0001 | 0.0001 | bpbpb |
In this case, Location A and Location B are very close to each other but have completely different Geohash prefixes (
7zzzz
andbpbpb
) because they are on opposite sides of the equator and prime meridian.In SQL, to find nearby locations, you would need to query multiple Geohash prefixes:
SELECT * FROM businesses WHERE geohash LIKE '7zzzz%' OR geohash LIKE 'bpbpb%';
Long Shared Prefix with Different Boundaries
Another issue arises when locations have a long shared prefix but fall on different grid boundaries. This typically occurs when:
Points are along the diagonal of a Geohash cell: As you move diagonally across a Geohash cell, you cross into different cells even though the points are nearby.
Precision level is not sufficient: Lower precision levels mean larger cells, which can cause nearby points to fall into different cells.
Location | Latitude | Longitude | Geohash |
Location C | 40.7128 | -74.0060 | dr5r7 |
Location D (Nearby) | 40.7129 | -74.0061 | dr5r8 |
These locations are close but require searching across multiple Geohash cells. A common solution is the fetch businesses not only within current grid but also from its neighbours. The geohashes of neighbours can be calculated in constant time.
Not Enough Neighbors
Geohash cells near the edges of the search area might not have enough neighboring cells to provide accurate results, requiring additional logic to fetch and merge data from adjacent cells.
Tree-based Methods
Tree-based methods organize spatial data into hierarchical structures that can efficiently manage varying data densities. The most common tree-based methods are:
Quadtree
Google S2
Quadtree
A Quadtree is a tree data structure in which each internal node has exactly four children. Quadtrees are used to partition a two-dimensional space by recursively subdividing it into four quadrants or regions. They are particularly useful for dynamic datasets where the density of points varies significantly.
How Quadtrees Work
Initialization:
Start with a square region that represents the entire dataset.
Each node in the quadtree represents a rectangular region of space.
Subdivision:
If a node contains more points than a certain threshold, it is subdivided into four child nodes, each representing one quadrant of the original region.
This process is recursive, with each child node being further subdivided if necessary.
Insertion:
To insert a point, start at the root node.
Determine which child node (quadrant) the point belongs to and move to that child node.
Repeat the process until reaching a leaf node.
If the leaf node exceeds the threshold after the insertion, subdivide the leaf node.
Pseudo Code for Building Quadtree
public void buildQuadTree(TreeNode node){
if(countNumberofBusinessInCurrentGrid(node) > 100) {
node.subdivide();
for(TreeNode child: node.getChildren()){
buildQuadTree(child);
}
}
}
Operational Considerations
When implementing and using a Quadtree, there are several operational considerations to keep in mind:
Threshold for Subdivision:
- The choice of the threshold (the maximum number of points a node can hold before it is subdivided) affects the performance and efficiency of the Quadtree. A low threshold results in deeper trees and more subdivisions, which can increase memory usage and the time required to traverse the tree. A higher threshold may lead to fewer subdivisions but can result in large, less precise nodes.
Balancing:
- As data is dynamically inserted or removed, the Quadtree can become unbalanced, leading to inefficiencies. Periodically rebalancing the tree can help maintain optimal performance. Rebalancing involves collapsing and redistributing points to achieve a more balanced structure.
Memory Usage:
- Each subdivision increases memory usage as new nodes are created. Efficient memory management techniques, such as using object pooling or limiting the depth of the tree, can help manage memory usage.
Performance:
- The performance of operations like insertion, deletion, and querying depends on the depth and balance of the tree. Optimizations like lazy subdivision (delaying subdivision until absolutely necessary) and using efficient boundary checks can improve performance.
Concurrency:
- In multi-threaded environments, concurrent access to the Quadtree can lead to race conditions. Implementing thread-safe mechanisms, such as locks or concurrent data structures, ensures the integrity of the Quadtree during concurrent operations.
Dynamic Data:
- For applications with dynamic data, where points are frequently added and removed, it’s important to handle updates efficiently. This can involve lazy deletion (marking nodes for deletion and cleaning up periodically) and incremental rebalancing.
Example
Consider a Quadtree managing a dataset of businesses in a city. Initially, the entire city is one region. As businesses are added, the regions with high densities of businesses are subdivided to create more precise, smaller regions. This allows efficient querying of businesses in specific areas, making the Quadtree highly suitable for dynamic and unevenly distributed datasets.
By keeping these operational considerations in mind, you can ensure that the Quadtree implementation remains efficient, scalable, and capable of handling the dynamic nature of real-world data.
Google S2
Google S2 is a powerful geometric library designed for spatial indexing and querying. It provides a way to map the surface of a sphere (such as the Earth) onto a one-dimensional index. This is achieved using a space-filling curve, specifically the Hilbert curve. The Hilbert curve is a space-filling curve that traverses a two-dimensional grid in a manner that preserves locality. Its a way to create a continuous path that covers every point in a square grid with minimal movement.
This means that points that are close together on the surface of the sphere will also be close together in the one-dimensional index, making spatial queries efficient.
How Google S2 Works
Mapping the Sphere to a Cube:
- The Earth's surface is first mapped onto the faces of a cube. This involves projecting the spherical coordinates (latitude and longitude) onto the six faces of a cube. Each face of the cube can then be treated as a two-dimensional plane.
Using the Hilbert Curve:
- On each face of the cube, a Hilbert curve is used to map the two-dimensional plane into a one-dimensional sequence. The Hilbert curve is a space-filling curve that ensures that points that are close together in two dimensions remain close together in the one-dimensional sequence.
Generating S2 Cells:
- The S2 library divides the cube faces into a hierarchy of cells. Each cell can be subdivided into four child cells, similar to a quadtree, creating a hierarchical structure. Each cell is uniquely identified by an S2 cell ID, which encodes the cell's position in the hierarchy and on the cube face.
Indexing and Querying:
- Points on the Earth's surface are indexed using their S2 cell IDs. Spatial queries, such as finding points within a certain radius or within a polygon, can be efficiently executed by leveraging the hierarchical nature of the S2 cells and the spatial locality preserved by the Hilbert curve.
S2 Cell Structure
S2 Cell ID: A unique identifier for each cell, encoding its position and level in the hierarchy.
Hierarchy: S2 cells can be subdivided recursively into smaller cells, creating a multi-level hierarchy that allows for varying levels of spatial precision.
Advantages of Google S2
Geofencing:
Efficiency: S2 is highly efficient for geofencing, where you need to determine whether a point lies within a specified geographic boundary. The hierarchical nature of S2 cells allows for quick inclusion checks.
Precision: The ability to subdivide cells to varying levels of precision means you can have very fine-grained control over the geofence boundaries.
Spatial Locality: The use of the Hilbert curve ensures that geographically close points remain close in the one-dimensional index, making range queries and geofencing operations faster.
Scalability:
S2 can handle datasets ranging from small to very large scales, thanks to its hierarchical structure and efficient indexing.
It supports dynamic datasets where points are frequently added, removed, or updated.
Flexibility:
S2 supports various spatial operations, including containment, intersection, and proximity searches.
It can be used for applications beyond geofencing, such as spatial clustering, heat maps, and route planning.
Operational Considerations for S2
Cell Precision:
- The precision of S2 cells can be adjusted to balance between accuracy and performance. Higher precision provides more accurate spatial indexing but increases the complexity and memory usage.
Indexing Strategy:
- Efficient indexing strategies, such as precomputing S2 cell IDs for common geofences or using spatial indexes, can significantly improve query performance.
Concurrency:
- In multi-threaded environments, ensure thread-safe operations on S2 cells and polygons, especially when updating geofences or handling large-scale data queries.
Memory Management:
- Efficient memory management techniques, such as caching frequently accessed S2 cells or using memory pools, can help manage the memory footprint of the S2 data structures.
Example of Geofencing with Google S2
Consider a geofencing application for a delivery service. The geofence defines the delivery area, and you need to check whether a customer's location falls within this area.
Define the Geofence:
- Create an S2Polygon representing the delivery area using the boundary vertices.
Check Customer Location:
Convert the customer's latitude and longitude to an S2 cell ID.
Use the S2Geofence class to check if the customer's cell ID is within the geofence polygon.
By leveraging the power of Google S2, you can build efficient and scalable geofencing solutions, making it an excellent choice for applications requiring precise and fast spatial operations.
Geohash vs. Quadtree: Usage by Companies and Comparison
Companies Using Geohash and Quadtree
Company | Geohash | Quadtree |
Uber | Yes | No |
MongoDB (Geospatial Queries) | Yes | No |
Elasticsearch | Yes | No |
Google Maps | No | Yes |
Microsoft Bing Maps | No | Yes |
Mapbox | Yes | No |
OpenStreetMap | Yes | No |
Geohash vs. Quadtree
Feature/Aspect | Geohash | Quadtree |
Concept | Encodes latitude and longitude into a single string, converting 2D data into 1D. | Tree structure where each node is subdivided into four quadrants. |
Precision Control | Controlled by the length of the Geohash string. | Controlled by the depth of the tree and the threshold for subdivision. |
Spatial Locality | Preserves spatial locality reasonably well, but can suffer from boundary issues. | Preserves spatial locality well; close points are likely to be in the same or adjacent nodes. |
Boundary Issues | Locations close to each other but in different Geohash cells can be problematic; requires querying multiple cells. | Boundaries are naturally handled through recursive subdivisions, reducing the need for complex querying logic. |
Data Distribution | Not well-suited for highly uneven data distributions; some cells may be densely populated, others sparse. | Handles varying data densities efficiently; nodes are subdivided as needed. |
Efficiency for Fixed-radius Searches | Efficient for fixed-radius searches due to hierarchical structure. | Efficient, particularly for spatial queries involving containment and proximity. |
K-Nearest Neighbors | Less efficient; requires merging results from multiple Geohash cells. | More efficient; hierarchical structure allows efficient nearest-neighbor searches. |
Implementation Complexity | Relatively simple to implement and use. | More complex to implement due to recursive nature and tree management. |
Common Use Cases | Geospatial queries, fixed-radius searches, indexing locations in databases. | Spatial indexing in mapping services, dynamic datasets with varying densities, complex spatial queries. |
Examples of Usage | Uber, MongoDB, Elasticsearch, Mapbox, OpenStreetMap. | Google Maps, Microsoft Bing Maps. |
Conclusion
In this part of our series, we've explored the essential algorithms for two-dimensional search, focusing on hash-based methods like Geohash and tree-based methods like Quadtree. Understanding these methods allows us to choose the most appropriate algorithm based on the specific requirements of our proximity service. In the next and last part of the series, we will dive deep into designing the service. Stay tuned for the implementation details and practical insights on building a robust proximity service.
Subscribe to my newsletter
Read articles from Souvik Mukherjee directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by
Souvik Mukherjee
Souvik Mukherjee
Hello! I'm Souvik Mukherjee, a passionate Cloud and Devops Engineer with a strong commitment to building scalable and robust backend infrastructure. My background is rich in experience, encompassing the design and implementation of complex backend architectures, with a special focus on serverless and microservices stack.