Spatial Joins 1: Local Spatial Joins

A common type of spatial query involves relating one table of geometric objects (e.g., a table population_centers with columns population, latitude, longitude) with another such table (e.g., a table counties with columns county_name, boundary_wkt), such as calculating for each county the population sum of all population centers contained within it. These kinds of calculations are called spatial joins. While doing it for a single row each from population_centers and counties is manageable, doing it efficiently for two large tables is challenging. In this post, we’ll talk about the machinery that Presto has built to make these queries blazingly fast.

Prologue: Point in Polygon

How do you test if a point is inside a polygon? A classic algorithm is the winding-number test which tests if a ray from the point to x = +infinity intersects the polygon an even or odd number of times. For example, the ray from a point inside a circle would intersect a circle once, while a ray from a point outside would intersect either 0 or 2 times. The even-odd rule holds for more complex polygons as well [1].

We’ll skip the details of the algorithm, but the runtime complexity is important. Without a preprocessing step, we need to check if each side of the polygon intersects the ray. Since the number of sides is equal to the number of vertices, the algorithm runs in O(V) time where V is the number of vertices in the polygon.

Take 1: Double Loop

A correct but very inefficient way of calculating a spatial join is a nested for loop, checking each row of population_centers if it is contained in each row of county. The algorithm will look something like this:

def spatial_join(population_centers, counties):
  results = {}
  for pop in population_centers:
    for county in counties:
      if not county.boundary.contains(pop.latitude, pop.longitude):
        continue
      if county.county_name not in results:
        results[county.county_name] = 0.0
      results[county.county_name] += p.population
  return results

If P is the number of points, and C is the number of counties, then this algorithm will run in O(P * C * V) time, where V is the maximum number of vertices of any polygon. This is very expensive! Detailed polygons can easily contain millions of vertices, some polygon sets (e.g., neighborhoods) can have millions of entries, and some geospatial datasets have billions of points.

Below are the polygons for all 3108 counties in the continental United States. They are comprised of almost 70k vertices.

Running this on some test data takes 652.1 seconds to check ~1.5 million points against 3481 county polygons (some counties have multiple polygons).

Take 2: Envelopes

As humans, often we can quickly look at a point and see it is outside a polygon. For example, consider this point and the polygon of Beaverhead County:

It is far outside the polygon, so we don’t have to check each side of the polygon. In fact, the polygon can be arbitrarily complex: as long as the point is far away from the polygon, we can discard it as a possibility quickly. We can teach the computer to do this by using an envelope, which is an axis-oriented rectangle specified by minimum and maximum x and y values:

The envelope can be calculated in O(V) time, and can be done almost for free when you are deserializing a geometry. Checking if a point is in an envelope is O(1):

envelope.contains(point) ==
                envelope.min_x <= point.x <= envelope.max_x
                and envelope.min_y <= point.y <= envelope.max_y

We can modify our algorithm above to take advantage of this fact:

def spatial_join(population_centers, counties):
  results = {}
  for pop in population_centers:
    for county in counties:
      if not county.envelope.contains(pop.latitude, pop.longitude):
        # Bail quickly!
        continue
      if not county.boundary.contains(pop.latitude, pop.longitude):
        continue
      if county.county_name not in results:
        results[county.county_name] = 0.0
      results[county.county_name] += p.population
  return results

While this does not change our worst-case runtime, it drastically reduces the average runtime (since it removes the dependence on V for most checks). Using county envelopes for our test data reduces the time to 13.8 seconds, an almost 50x improvement!

Using an envelope is so cheap and effective that almost all geometric libraries do an envelope pre-check before any relation check (containment, intersection, etc).

Take 3: Hierarchical Envelopes

Using envelopes makes the check for a given polygon cheaper on average, but we still need to check each polygon. We can do better.

As humans, if we have polygons for each county, and a point in Massachusetts, we know immediately that the point won’t be in any county in California, Ohio, or Florida. A computer version of this is to have a super-envelope for each of the states: for each state, find the maximum and minimum x and y. Since each county already has its envelope, the super-envelope can just be the envelope of the envelopes. For our point in Massachusetts, we can first check the envelope for each state. If only the Massachusetts envelope contains the point, we can then do an envelope check for each county in Massachusetts. Only for those counties that pass their envelope check do we need to do the full containment check. This reduces our total envelope checks from 3108 (counties in the continental US) to 50 (states) plus 14 (counties in Massachusetts).

Adding a check for states improves our performance to 3.4 seconds, about 4x better than using county envelopes alone (and ~200x better than the brute force calculation).

While this is a great improvement, it leads to more questions. Since Massachussetts is in the far northeast of the USA, so why do we have to check each western and southern state? Why not also have envelopes for southern, western, central, eastern, and northern USA? Why stop at three levels? Maybe we can even make envelopes for regions in Massachussetts. What about other data sets without concepts like states and countries? Is there a programmatic way to generate the groups and levels?

R-trees provide an answer for these questions.

Take 4: R-trees

Given a set of geometries, R-trees (Rectangle Trees) provide a programmatic way to construct an efficient set of hierarchical envelopes. R-trees start with an envelope for each polygon in the data set. It groups sets of neighboring polygons, constructing an envelope for each group. The original polygons are leaf nodes in the tree, and the groups are their parent nodes. The maximum group size depends on a parameter called the branching factor. The grouping algorithm then recurses, making parent groups of child groups, constructing an envelope for each, and so on until there is only one node, which encompasses all of our original geometries.

In the case of our counties, first we group them into groups of 9, calculating the bounding box:

Then we group each of these level 1 boxes into groups of 9, calculating their bounding box:

Finally, we repeat this again to create the level 3 boxes:

The final node contains the bounding box for the whole continental USA.

In the example above, if we make an R-tree of the counties of the USA, we’d first check if the point is in the envelope of the root node. If not, the point can’t possibly be in any county, and we’re done. If it is contained, we can iterate through that node’s children, recursing into any whose envelope contains the point. Eventually we will have a (perhaps empty) subset of leaf nodes whose envelopes contain the point, and we can check those counties for proper containment.

While the worst-case time complexity is actually worse than a linear scan through the geometries’ envelopes, average complexity is O(log C + M), where C is again the number of counties and M is the number of matching counties (ie, counties whose envelopes contain the point). Then the time complexity for all P points is O(P * log C + M * V), where M is total number of point-envelope matches, and V is the maximum number of vertices per polygon. This is a significant improvement when C is large.

Using an R-tree on the counties in our test data reduces the calculation to just 1.3s, which is ~2.5x better than the state envelopes, and about 500x faster than the brute force calculation!

R-trees can also help many other spatial queries about proximity. For example, if you want to check which polygons are within a certain distance of a point, instead of querying the R-tree with a point, you can expand the point to an envelope of the appropriate radius, and query the R-tree for envelope-envelope intersections, instead of point-envelope containment.

Local Spatial Joins

When Presto executes a query like

SELECT county_name, SUM(population) AS total_population
FROM population_centers
JOIN counties
ON ST_Contains(ST_GeometryFromWkt(boundary_wkt), ST_Point(longitude, latitude))
GROUP BY county_name

it will create an R-tree of the counties boundary geometries, and stream the rows from population_center. For each row, it will query the R-tree for counties whose envelopes contain the row’s point, and for those candidates it will do a proper containment check against the boundary geometries. It will then emit a row county_name, population for each match, to be later aggregated over county_name.

This procedure works on a single machine, but how do we parallelize spatial joins? We’ll examine that in a separate post on distributed spatial joins.

Acknowledgements

These visualizations were done in collaboration with Jason Sundram in Facebook Boston’s World AI team. The starting point for our visualizations was Mike Bostock‘s D3 US map. For the R-tree visualizations, I used Vladimir Agafonkin‘s RBush, using colors from Carto. Spatial joins in Presto were primarily implemented by Maria Basmanova.

Code for the performance profiling can be found on GitHub.

[1] There are many edge cases — such as a ray that’s tangent to a polygon — as well as important concepts of validity and simplicity in polygons, as well as more robust algorithms, that we are omitting for brevity.