b-Square: geospatial scalable filtering

At HUB Ocean, we work with massive volumes of geospatial marine data, with some datasets containing billions of rows. To ensure fast queries, we developed b-Square — a simple but effective mechanism to index geometries in tabular formats like Parquet.

How It Works

We store geometries using WKT or WKB, and during ingestion, each record is expanded with three extra columns:

x: center longitude

y: center latitude

q: maximum deviation from the center (in degrees), forming a bounding square

This gives each geometry an approximate bounding box:

x - q ... x + q

y - q ... y + q

Why It Matters

Users often run spatial queries like:

WHERE ST_Intersects(geometry, ?)

To make these fast, we internally translate them into bounding-box comparisons using the x, y, and q columns before doing exact geometry checks:

WHERE x + q >= $x0 AND x - q <= $x1 AND y + q >= $y0 AND y - q <= $y1

This significantly reduces the work required to evaluate user queries by applying a three-stage filtering process:

1. File-Level Pruning

Parquet datasets include an index file that lists all data files and their column-wise min/max statistics, including x, y, and q.

We evaluate the query bounding box against each file's min/max. If a file cannot possibly match the query, we skip it entirely.

PyArrow can't handle this type of compound spatial filtering, but we use our internal query AST to evaluate expressions against file-level stats.

2. Row-Level Filtering

If a file passes pruning, we convert the AST into a PyArrow filter and evaluate it row by row — without decoding the geometries.

This cuts down candidates significantly and avoids expensive geometry parsing unless necessary.

3. Final Geometry Filtering

The remaining rows are parsed and evaluated using the actual geometry via operations like st_intersects.

Why Use a Square Instead of a Bounding Box?

Bounding squares simplify indexing. Using just one value (q) avoids adding extra partition columns, which can reduce pruning efficiency. We can also bucket q values for performance gains.

Caveats: Negation Is Hard

Negated queries like:

WHERE NOT ST_Intersects(geometry, ?)

...can’t safely skip files just because their bounding square might intersect the filter. That would risk false negatives.

In short: if there's any chance that a file contains a matching geometry, we must include it. Bounding-square filters are great for positive matches, but negations break their guarantees unless handled conservatively.

What’s Next

Improve how dynamic indexing works
Optimize handling of q by clustering high values together, and partition the rest by x and y

b-Square: Efficient Geospatial Indexing for Tabular Data