
Spatial Joins

Spatial polars can perform a spatial join, which joins two dataframes based on a geometric predicate.

Not lazy

Spatial joins are currently implemented only for DataFrames; they are not yet available for LazyFrames.

Thanks to Natural Earth

This example reads zipped data from Natural Earth. Big thanks to them for putting this data out there for us to use!

Made with Natural Earth. Free vector and raster map data @ naturalearthdata.com.

To demonstrate how we can join data together from two dataframes spatially using spatial polars, we'll join some lake polygons with some administrative boundaries to see which lakes are in which countries.

Spatial Join
import polars as pl

from spatial_polars import scan_spatial

lake_df = (
    scan_spatial("https://naciscdn.org/naturalearth/110m/physical/ne_110m_lakes.zip")
    .select("name", "geometry")
    .collect(engine="streaming")
)  # (1)!
print(f"There are {len(lake_df)} rows in lake_df")

boundary_df = (
    scan_spatial(
        "https://naciscdn.org/naturalearth/110m/cultural/ne_110m_admin_0_countries.zip"
    )
    .select("SOVEREIGNT", "geometry")
    .collect(engine="streaming")
)  # (2)!

lake_boundary_df = (
    lake_df.spatial.join(  # (3)!
        other=boundary_df,  # (4)!
        how="inner",  # (5)!
        predicate="intersects",  # (6)!
        on="geometry",  # (7)!
        suffix="_boundary",  # (8)!
    )
    .select(
        pl.col("name"),  # (9)!
        pl.col("SOVEREIGNT"),
        pl.col("geometry"),
        pl.col("geometry_boundary"),
    )
    .sort("name")  # (10)!
)
print(lake_boundary_df)
  1. Reading the lakes with only the lake's name and geometry
  2. Reading the boundaries with only the country's name (SOVEREIGNT) and geometry
  3. We start the spatial join from lake_df, which becomes the left dataframe
  4. Join the lakes to boundary_df, which becomes the right dataframe
  5. We'll use an inner join so as to only return rows for lakes that actually intersect a boundary. If a lake does not intersect a boundary polygon we won't have a row for that lake in our output dataframe. Likewise, if a boundary doesn't intersect a lake, the resulting dataframe won't have a row for that boundary.

    Note

    This could have been left off, because how='inner' is the default

  6. Use the 'intersects' spatial predicate so if any part of the lake shares any space with the boundary we'll join the lake to the boundary. Since we've specified an inner join, if a lake intersects more than one boundary, we'll get more than one row for the lake since it's joined to more than one boundary.

    Note

    This could have been left off, because predicate='intersects' is the default

  7. Since the geometry struct is named 'geometry' in both of our dataframes, we can use the on parameter. If the geometry column had a different name in each dataframe, we could use the left_on and right_on parameters instead.

    Note

    This could have been left off, because on='geometry' is the default

  8. Since we're joining dataframes that share a column name (geometry), a suffix must be applied to any column of the right dataframe whose name also exists in the left dataframe, because we can't have two columns with the same name. We'll use "_boundary" as the suffix to make it clear that the right frame's geometry came from the boundaries dataframe.

  9. Selecting the columns to make the lake name and SOVEREIGNT columns show up before the lake and boundary geometry columns.
  10. Sort by the lake name just to make the results look nice in our output dataframe.
There are 24 rows in lake_df
shape: (36, 4)
┌──────────────────┬──────────────────────────┬─────────────────────────────────┬─────────────────────────────────┐
│ name             ┆ SOVEREIGNT               ┆ geometry                        ┆ geometry_boundary               │
│ ---              ┆ ---                      ┆ ---                             ┆ ---                             │
│ str              ┆ str                      ┆ struct[2]                       ┆ struct[2]                       │
╞══════════════════╪══════════════════════════╪═════════════════════════════════╪═════════════════════════════════╡
│ Cedar Lake       ┆ Canada                   ┆ {b"\x01\x03\x00\x00\x00\x01\x0… ┆ {b"\x01\x06\x00\x00\x00\x1e\x0… │
│ Great Bear Lake  ┆ Canada                   ┆ {b"\x01\x03\x00\x00\x00\x01\x0… ┆ {b"\x01\x06\x00\x00\x00\x1e\x0… │
│ Great Salt Lake  ┆ United States of America ┆ {b"\x01\x03\x00\x00\x00\x01\x0… ┆ {b"\x01\x06\x00\x00\x00\x0a\x0… │
│ Great Slave Lake ┆ Canada                   ┆ {b"\x01\x03\x00\x00\x00\x01\x0… ┆ {b"\x01\x06\x00\x00\x00\x1e\x0… │
│ Lago Titicaca    ┆ Bolivia                  ┆ {b"\x01\x03\x00\x00\x00\x01\x0… ┆ {b"\x01\x03\x00\x00\x00\x01\x0… │
│ …                ┆ …                        ┆ …                               ┆ …                               │
│ Lake Victoria    ┆ Kenya                    ┆ {b"\x01\x03\x00\x00\x00\x01\x0… ┆ {b"\x01\x03\x00\x00\x00\x01\x0… │
│ Lake Victoria    ┆ Uganda                   ┆ {b"\x01\x03\x00\x00\x00\x01\x0… ┆ {b"\x01\x03\x00\x00\x00\x01\x0… │
│ Lake Winnipeg    ┆ Canada                   ┆ {b"\x01\x03\x00\x00\x00\x01\x0… ┆ {b"\x01\x06\x00\x00\x00\x1e\x0… │
│ Reindeer Lake    ┆ Canada                   ┆ {b"\x01\x03\x00\x00\x00\x01\x0… ┆ {b"\x01\x06\x00\x00\x00\x1e\x0… │
│ Vänern           ┆ Sweden                   ┆ {b"\x01\x03\x00\x00\x00\x01\x0… ┆ {b"\x01\x03\x00\x00\x00\x01\x0… │
└──────────────────┴──────────────────────────┴─────────────────────────────────┴─────────────────────────────────┘

Of the original 24 lakes and 177 boundaries, we now have 36 rows, because a few lakes cross borders and were joined to more than one boundary. Lake Victoria is one of these lakes: it intersects both Kenya and Uganda.
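The row count follows from inner-join semantics: the result holds one row per intersecting (lake, boundary) pair, not one row per lake. The sketch below reproduces that behaviour in plain Python, using axis-aligned rectangles as hypothetical stand-ins for the real polygons. The names `Box`, `intersects`, and `inner_spatial_join`, and all of the coordinates, are invented for illustration and are not part of spatial polars.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle, a toy stand-in for a polygon."""

    xmin: float
    ymin: float
    xmax: float
    ymax: float


def intersects(a: Box, b: Box) -> bool:
    """True if the rectangles share any space, touching edges included."""
    return (
        a.xmin <= b.xmax
        and b.xmin <= a.xmax
        and a.ymin <= b.ymax
        and b.ymin <= a.ymax
    )


def inner_spatial_join(left: dict, right: dict) -> list:
    """Return one (left_name, right_name) row per intersecting pair."""
    return [
        (left_name, right_name)
        for left_name, left_box in left.items()
        for right_name, right_box in right.items()
        if intersects(left_box, right_box)
    ]


# Hypothetical data: the coordinates are made up, only the layout matters.
lakes = {
    "border_lake": Box(4, 4, 6, 6),  # straddles the border at x = 5
    "second_border_lake": Box(4, 8, 6, 9),  # also straddles the border
    "inland_lake": Box(1, 1, 2, 2),  # entirely inside country_a
    "offshore_lake": Box(20, 20, 21, 21),  # intersects no country
}
countries = {
    "country_a": Box(0, 0, 5, 10),
    "country_b": Box(5, 0, 10, 10),
}

rows = inner_spatial_join(lakes, countries)
for row in rows:
    print(row)
print(f"{len(lakes)} lakes produced {len(rows)} joined rows")
```

Each lake on the border contributes two rows, the inland lake contributes one, and the lake that intersects nothing is dropped, so four lakes yield five rows.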

The resulting dataframe has one row for each lake/boundary pair that intersects. The name and geometry columns come from the lakes dataframe, and the SOVEREIGNT and geometry_boundary columns come from the boundary dataframe.

Many GIS don't allow multiple geometries on a single row, but spatial polars has no issue with this, and it is actually something that we can use to our advantage.
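One possible use of having both geometries on the same row is to compute, row by row, something that needs both shapes, such as how much of a lake falls inside each boundary. This is an assumption about usage rather than something the example above demonstrates, and the sketch below does not use spatial polars at all: it models each joined row as a plain dictionary holding two hypothetical rectangles, and `Box`, `area`, `overlap`, and the coordinates are invented for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle, a toy stand-in for a polygon."""

    xmin: float
    ymin: float
    xmax: float
    ymax: float


def area(b: Box) -> float:
    """Area of the rectangle; an inverted (empty) box has area 0."""
    return max(0.0, b.xmax - b.xmin) * max(0.0, b.ymax - b.ymin)


def overlap(a: Box, b: Box) -> Box:
    """Rectangle shared by a and b (inverted if they do not overlap)."""
    return Box(
        max(a.xmin, b.xmin),
        max(a.ymin, b.ymin),
        min(a.xmax, b.xmax),
        min(a.ymax, b.ymax),
    )


# Hypothetical joined rows: each row carries the lake geometry and the
# geometry of the boundary it was joined to, mirroring the two geometry
# columns in the joined dataframe.
lake = Box(4, 4, 8, 6)
joined_rows = [
    {"name": "border_lake", "country": "country_a",
     "geometry": lake, "geometry_boundary": Box(0, 0, 5, 10)},
    {"name": "border_lake", "country": "country_b",
     "geometry": lake, "geometry_boundary": Box(5, 0, 10, 10)},
]

# Because both shapes sit on the same row, the share of the lake inside
# each boundary can be computed without any further lookup.
for row in joined_rows:
    shared = overlap(row["geometry"], row["geometry_boundary"])
    row["share_of_lake"] = area(shared) / area(row["geometry"])
    print(row["name"], row["country"], row["share_of_lake"])
```

With real polygons the same row-wise idea would rely on a geometry library's intersection and area operations instead of these rectangle helpers.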