Algorithms of building a Proximity Service to Discover Nearby Gems

Souvik MukherjeeSouvik Mukherjee
13 min read

In Part 1 of our series, "Building a Proximity Service to Discover Nearby Gems," we laid the groundwork by defining the functional and non-functional requirements, performing back-of-the-envelope estimations, and proposing a high-level design. Now, in Part 2, we'll delve into the core of our proximity service: the search algorithms that enable efficient two-dimensional searches. Specifically, we'll explore geospatial indexing methods, focusing on hash-based and tree-based approaches, and discuss their respective advantages and limitations.

Algorithms to Find Nearby Gems

The most intuitive but naive way to get nearby businesses is to draw a circle with a predefined radius and find all the businesses within the circle. This might look something like this:

SELECT business_id, latitude, longitude
FROM business
WHERE (latitude BETWEEN {:my_lat} - radius AND {:my_lat} + radius)
AND (longitude BETWEEN {:my_long} - radius AND {:my_long} + radius)

However, this approach is not efficient as it requires scanning the entire table. The problem is that we have two-dimensional data, and the dataset returned from each dimension could be huge. Additionally, performing an intersection operation on the two datasets is not efficient.

The issue with the previous approach is that the database index can only improve search speed in one dimension. So naturally, we need to think of a way to map the two dimensions into one dimension. Before diving into specific indexing methods, let's explore different types of indexing methods.

When it comes to two-dimensional search, there are primarily two categories of methods:

  1. Hash-based Methods

  2. Tree-based Methods

Hash-based Methods

Hash-based methods convert geographic coordinates (latitude and longitude) into hash values that can be efficiently indexed and queried. The most common hash-based methods are:

  • Even Grid

  • Geohash

Even Grid

One simple approach is to evenly divide the world into smaller grids. This way, one grid could have multiple gems, and each gem on the map belongs to a grid. This approach works to an extent but has a major issue: the distribution of businesses is not even.

Imagine dividing a city into a 10x10 grid, where each cell represents a 1 km x 1 km area. In a densely populated area, one grid cell might contain hundreds of businesses, while in a less populated area, a grid cell might contain only a few or none. This uneven distribution can lead to inefficiencies in searching and indexing.

Geohash

Geohash is a better option than an evenly divided grid. It works by reducing the two-dimensional data into one-dimensional data. It encodes latitude and longitude into a single string of letters and digits. The length of the Geohash string determines the precision of the grid cell.

Here is a table showing the relationship between Geohash length, grid size, and radius of search:

Geohash LengthGrid Width x HeightRadius of Search
15,000km x 5,000km~2,500km
21,250km x 625km~630km
3156km x 156km~78km
439km x 19.5km~20km
54.9km x 4.9km~2.4km
61.2km x 0.61km~0.61km
7152m x 152m~76m
838m x 19m~19m

Boundary Issues in Geohash

While Geohash is a powerful method, it comes with its own set of boundary issues:

  1. Close Locations with No Shared Prefix

    Locations that are very close but have no shared prefix at all can cause inefficiencies. This can occur when locations are near the edges of Geohash cells, resulting in different Geohash prefixes even for nearby points. Here is a more accurate example:

    | Location | Latitude | Longitude | Geohash | | --- | --- | --- | --- | | Location A | 0.0001 | -0.0001 | 7zzzz | | Location B | -0.0001 | 0.0001 | bpbpb |

    In this case, Location A and Location B are very close to each other but have completely different Geohash prefixes (7zzzz and bpbpb) because they are on opposite sides of the equator and prime meridian.

    In SQL, to find nearby locations, you would need to query multiple Geohash prefixes:

     SELECT * FROM businesses WHERE geohash LIKE '7zzzz%' OR geohash LIKE 'bpbpb%';
    
  2. Long Shared Prefix with Different Boundaries

    Another issue arises when locations have a long shared prefix but fall on different grid boundaries. This typically occurs when:

    • Points are along the diagonal of a Geohash cell: As you move diagonally across a Geohash cell, you cross into different cells even though the points are nearby.

    • Precision level is not sufficient: Lower precision levels mean larger cells, which can cause nearby points to fall into different cells.

LocationLatitudeLongitudeGeohash
Location C40.7128-74.0060dr5r7
Location D (Nearby)40.7129-74.0061dr5r8

These locations are close but require searching across multiple Geohash cells. A common solution is the fetch businesses not only within current grid but also from its neighbours. The geohashes of neighbours can be calculated in constant time.

  1. Not Enough Neighbors

    Geohash cells near the edges of the search area might not have enough neighboring cells to provide accurate results, requiring additional logic to fetch and merge data from adjacent cells.

Tree-based Methods

Tree-based methods organize spatial data into hierarchical structures that can efficiently manage varying data densities. The most common tree-based methods are:

  • Quadtree

  • Google S2

Quadtree

A Quadtree is a tree data structure in which each internal node has exactly four children. Quadtrees are used to partition a two-dimensional space by recursively subdividing it into four quadrants or regions. They are particularly useful for dynamic datasets where the density of points varies significantly.

20.2. The PR Quadtree — OpenDSA Data Structures and Algorithms Modules  Collection

How Quadtrees Work
  1. Initialization:

    • Start with a square region that represents the entire dataset.

    • Each node in the quadtree represents a rectangular region of space.

  2. Subdivision:

    • If a node contains more points than a certain threshold, it is subdivided into four child nodes, each representing one quadrant of the original region.

    • This process is recursive, with each child node being further subdivided if necessary.

  3. Insertion:

    • To insert a point, start at the root node.

    • Determine which child node (quadrant) the point belongs to and move to that child node.

    • Repeat the process until reaching a leaf node.

    • If the leaf node exceeds the threshold after the insertion, subdivide the leaf node.

Pseudo Code for Building Quadtree
public void buildQuadTree(TreeNode node){
    if(countNumberofBusinessInCurrentGrid(node) > 100) {
        node.subdivide();
        for(TreeNode child: node.getChildren()){
            buildQuadTree(child);
        }
    }
}
Operational Considerations

When implementing and using a Quadtree, there are several operational considerations to keep in mind:

  1. Threshold for Subdivision:

    • The choice of the threshold (the maximum number of points a node can hold before it is subdivided) affects the performance and efficiency of the Quadtree. A low threshold results in deeper trees and more subdivisions, which can increase memory usage and the time required to traverse the tree. A higher threshold may lead to fewer subdivisions but can result in large, less precise nodes.
  2. Balancing:

    • As data is dynamically inserted or removed, the Quadtree can become unbalanced, leading to inefficiencies. Periodically rebalancing the tree can help maintain optimal performance. Rebalancing involves collapsing and redistributing points to achieve a more balanced structure.
  3. Memory Usage:

    • Each subdivision increases memory usage as new nodes are created. Efficient memory management techniques, such as using object pooling or limiting the depth of the tree, can help manage memory usage.
  4. Performance:

    • The performance of operations like insertion, deletion, and querying depends on the depth and balance of the tree. Optimizations like lazy subdivision (delaying subdivision until absolutely necessary) and using efficient boundary checks can improve performance.
  5. Concurrency:

    • In multi-threaded environments, concurrent access to the Quadtree can lead to race conditions. Implementing thread-safe mechanisms, such as locks or concurrent data structures, ensures the integrity of the Quadtree during concurrent operations.
  6. Dynamic Data:

    • For applications with dynamic data, where points are frequently added and removed, it’s important to handle updates efficiently. This can involve lazy deletion (marking nodes for deletion and cleaning up periodically) and incremental rebalancing.

Example

Consider a Quadtree managing a dataset of businesses in a city. Initially, the entire city is one region. As businesses are added, the regions with high densities of businesses are subdivided to create more precise, smaller regions. This allows efficient querying of businesses in specific areas, making the Quadtree highly suitable for dynamic and unevenly distributed datasets.

By keeping these operational considerations in mind, you can ensure that the Quadtree implementation remains efficient, scalable, and capable of handling the dynamic nature of real-world data.

Google S2

Google S2 is a powerful geometric library designed for spatial indexing and querying. It provides a way to map the surface of a sphere (such as the Earth) onto a one-dimensional index. This is achieved using a space-filling curve, specifically the Hilbert curve. The Hilbert curve is a space-filling curve that traverses a two-dimensional grid in a manner that preserves locality. Its a way to create a continuous path that covers every point in a square grid with minimal movement.

This means that points that are close together on the surface of the sphere will also be close together in the one-dimensional index, making spatial queries efficient.

How Google S2 Works
  1. Mapping the Sphere to a Cube:

    • The Earth's surface is first mapped onto the faces of a cube. This involves projecting the spherical coordinates (latitude and longitude) onto the six faces of a cube. Each face of the cube can then be treated as a two-dimensional plane.
  2. Using the Hilbert Curve:

    • On each face of the cube, a Hilbert curve is used to map the two-dimensional plane into a one-dimensional sequence. The Hilbert curve is a space-filling curve that ensures that points that are close together in two dimensions remain close together in the one-dimensional sequence.
  3. Generating S2 Cells:

    • The S2 library divides the cube faces into a hierarchy of cells. Each cell can be subdivided into four child cells, similar to a quadtree, creating a hierarchical structure. Each cell is uniquely identified by an S2 cell ID, which encodes the cell's position in the hierarchy and on the cube face.
  4. Indexing and Querying:

    • Points on the Earth's surface are indexed using their S2 cell IDs. Spatial queries, such as finding points within a certain radius or within a polygon, can be efficiently executed by leveraging the hierarchical nature of the S2 cells and the spatial locality preserved by the Hilbert curve.
S2 Cell Structure
  • S2 Cell ID: A unique identifier for each cell, encoding its position and level in the hierarchy.

  • Hierarchy: S2 cells can be subdivided recursively into smaller cells, creating a multi-level hierarchy that allows for varying levels of spatial precision.

Advantages of Google S2
  1. Geofencing:

    • Efficiency: S2 is highly efficient for geofencing, where you need to determine whether a point lies within a specified geographic boundary. The hierarchical nature of S2 cells allows for quick inclusion checks.

    • Precision: The ability to subdivide cells to varying levels of precision means you can have very fine-grained control over the geofence boundaries.

    • Spatial Locality: The use of the Hilbert curve ensures that geographically close points remain close in the one-dimensional index, making range queries and geofencing operations faster.

  2. Scalability:

    • S2 can handle datasets ranging from small to very large scales, thanks to its hierarchical structure and efficient indexing.

    • It supports dynamic datasets where points are frequently added, removed, or updated.

  3. Flexibility:

    • S2 supports various spatial operations, including containment, intersection, and proximity searches.

    • It can be used for applications beyond geofencing, such as spatial clustering, heat maps, and route planning.

Operational Considerations for S2
  1. Cell Precision:

    • The precision of S2 cells can be adjusted to balance between accuracy and performance. Higher precision provides more accurate spatial indexing but increases the complexity and memory usage.
  2. Indexing Strategy:

    • Efficient indexing strategies, such as precomputing S2 cell IDs for common geofences or using spatial indexes, can significantly improve query performance.
  3. Concurrency:

    • In multi-threaded environments, ensure thread-safe operations on S2 cells and polygons, especially when updating geofences or handling large-scale data queries.
  4. Memory Management:

    • Efficient memory management techniques, such as caching frequently accessed S2 cells or using memory pools, can help manage the memory footprint of the S2 data structures.
Example of Geofencing with Google S2

Consider a geofencing application for a delivery service. The geofence defines the delivery area, and you need to check whether a customer's location falls within this area.

  1. Define the Geofence:

    • Create an S2Polygon representing the delivery area using the boundary vertices.
  2. Check Customer Location:

    • Convert the customer's latitude and longitude to an S2 cell ID.

    • Use the S2Geofence class to check if the customer's cell ID is within the geofence polygon.

By leveraging the power of Google S2, you can build efficient and scalable geofencing solutions, making it an excellent choice for applications requiring precise and fast spatial operations.

Geohash vs. Quadtree: Usage by Companies and Comparison

Companies Using Geohash and Quadtree

CompanyGeohashQuadtree
UberYesNo
MongoDB (Geospatial Queries)YesNo
ElasticsearchYesNo
Google MapsNoYes
Microsoft Bing MapsNoYes
MapboxYesNo
OpenStreetMapYesNo

Geohash vs. Quadtree

Feature/AspectGeohashQuadtree
ConceptEncodes latitude and longitude into a single string, converting 2D data into 1D.Tree structure where each node is subdivided into four quadrants.
Precision ControlControlled by the length of the Geohash string.Controlled by the depth of the tree and the threshold for subdivision.
Spatial LocalityPreserves spatial locality reasonably well, but can suffer from boundary issues.Preserves spatial locality well; close points are likely to be in the same or adjacent nodes.
Boundary IssuesLocations close to each other but in different Geohash cells can be problematic; requires querying multiple cells.Boundaries are naturally handled through recursive subdivisions, reducing the need for complex querying logic.
Data DistributionNot well-suited for highly uneven data distributions; some cells may be densely populated, others sparse.Handles varying data densities efficiently; nodes are subdivided as needed.
Efficiency for Fixed-radius SearchesEfficient for fixed-radius searches due to hierarchical structure.Efficient, particularly for spatial queries involving containment and proximity.
K-Nearest NeighborsLess efficient; requires merging results from multiple Geohash cells.More efficient; hierarchical structure allows efficient nearest-neighbor searches.
Implementation ComplexityRelatively simple to implement and use.More complex to implement due to recursive nature and tree management.
Common Use CasesGeospatial queries, fixed-radius searches, indexing locations in databases.Spatial indexing in mapping services, dynamic datasets with varying densities, complex spatial queries.
Examples of UsageUber, MongoDB, Elasticsearch, Mapbox, OpenStreetMap.Google Maps, Microsoft Bing Maps.

Conclusion

In this part of our series, we've explored the essential algorithms for two-dimensional search, focusing on hash-based methods like Geohash and tree-based methods like Quadtree. Understanding these methods allows us to choose the most appropriate algorithm based on the specific requirements of our proximity service. In the next and last part of the series, we will dive deep into designing the service. Stay tuned for the implementation details and practical insights on building a robust proximity service.

0
Subscribe to my newsletter

Read articles from Souvik Mukherjee directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Souvik Mukherjee
Souvik Mukherjee

Hello! I'm Souvik Mukherjee, a passionate Cloud and Devops Engineer with a strong commitment to building scalable and robust backend infrastructure. My background is rich in experience, encompassing the design and implementation of complex backend architectures, with a special focus on serverless and microservices stack.