Support for geo data #91

Open
Robinlovelace opened this issue Jan 20, 2024 · 7 comments
Comments

@Robinlovelace

This looks great! One feature request I have in mind is support for spatial data, either via new functions/functionality or via documentation if it already works out of the box. See this by @cboettig for inspiration: https://github.com/cboettig/duckdbfs#spatial-data

Another potential source of inspiration is sf's support for tidy operations. It's great how summarise() and other functions 'just work' with tidy verbs: https://r-spatial.github.io/sf/reference/tidyverse.html

@krlmlr
Member

krlmlr commented Jun 30, 2024

Thanks for raising this, Robin. Integration with the duckdb spatial extension would be a really cool feature, but also a lot of work.

Do we need to figure out how to translate sf data frames into something that the duckdb spatial extension understands, and vice versa?

Adding support for functions is then "only" a matter of diligence: https://github.com/duckdblabs/duckplyr/pull/179/files#diff-a202cfba76540d6822868ac7755edd4945b6344057d78e0092f4836e33c0d4eaR11 .

@Robinlovelace
Author

> Do we need to figure out how to translate sf data frames into something that the duckdb spatial extension understands, and vice versa?

I imagine so, and given that everything other than the geometry column is already sorted, it's just the geometry that needs converting (safe to assume just 1 geometry column in 99% of use cases I think).

@Robinlovelace
Author

Seems like DuckDB -> sf has been implemented here: https://github.com/cboettig/duckdbfs/blob/main/R/to_sf.R

Not sure how hard the other way would be, let alone how to make it fast.

@cboettig

The duckdb -> sf conversion there is mostly solid, but could be a bit better. Currently there are a couple of different ways in which geospatial data is stored in duckdb:

  • If duckdb reads in a vector format file (shapefile, geodatabase, anything BUT geoparquet), it parses with gdal and converts to duckdb's internal geometry. This is the use case that the above handles. (Though I think the column name for the geometry is inherited from the file, e.g. it might not be called geometry, so really we need to handle this.)

  • If duckdb reads in geoparquet, it does not use the gdal parser (because duckdb's native parquet parser is so much faster!). However, this also means (at least currently) that the column is read in as a binary blob and not the native geometry, so we need an extra coercion. I've been meaning to add this, though it might eventually be solved upstream; see "distinguish between a column in WKB and a column in native 'geometry' format?" (duckdb/duckdb_spatial#299, comment).
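
To make the "binary blob vs. native geometry" distinction concrete, here is a minimal sketch in Python (standard library only, chosen so it runs without any spatial packages). It assumes the blob column holds plain 2D WKB, as the linked upstream issue suggests. The helper name `wkb_geometry_type` is hypothetical and not part of duckdb, duckdbfs, or duckplyr; it only shows that a WKB blob carries its own byte-order flag and geometry type code, which is what a coercion step would read first.

```python
import struct

# Geometry type codes for plain 2D WKB (OGC Simple Features).
WKB_TYPES = {
    1: "POINT",
    2: "LINESTRING",
    3: "POLYGON",
    4: "MULTIPOINT",
    5: "MULTILINESTRING",
    6: "MULTIPOLYGON",
    7: "GEOMETRYCOLLECTION",
}


def wkb_geometry_type(blob: bytes) -> str:
    """Return the geometry type name stored in a WKB header.

    Hypothetical helper: handles plain 2D WKB only. ISO Z/M type
    codes and EWKB flag bits are not handled and raise ValueError.
    """
    if len(blob) < 5:
        raise ValueError("too short to be WKB")
    byte_order = blob[0]  # 0 = big-endian, 1 = little-endian
    if byte_order not in (0, 1):
        raise ValueError("invalid WKB byte-order flag")
    endian = "<" if byte_order == 1 else ">"
    (code,) = struct.unpack_from(endian + "I", blob, 1)
    if code not in WKB_TYPES:
        raise ValueError(f"unsupported WKB type code: {code}")
    return WKB_TYPES[code]


# A little-endian WKB point at (1.0, 2.0): 1 + 4 + 8 + 8 = 21 bytes.
point_blob = struct.pack("<BIdd", 1, 1, 1.0, 2.0)
print(len(point_blob), wkb_geometry_type(point_blob))
```

A check like this only says that a blob looks like WKB; the actual conversion to duckdb's internal geometry would still happen inside duckdb.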

Re sf -> duckdb, I don't think this is much of an issue, though there are various ways to do it depending on precisely what you mean by "to duckdb". Specifically, I think the best thing to do is simply to have sf write out a geoparquet file to disk (this assumes sf is built with a recent gdal that has arrow support, of course!). Since this use case presumably means the data is small enough to fit in RAM, writing out as, say, a geodatabase is probably just as good (maybe better, given the geoparquet issue noted above), and then have duckdb read that in. It is possible to write to duckdb's native database format with DBI instead (i.e. with the WKB-binary column), but then you'd need the extra coercion once in duckdb to turn it into duckdb's internal spatial type, and I don't see the use for that. (For most users I think it's actually better to pretend that duckdb's native database doesn't exist and work directly against flat files.)

Sorry, long story short, I think duckdbfs should handle both cases (simply noting that sf should serialize to disk in any standard spatial format), modulo this edge case about geoparquet.

@mdsumner

mdsumner commented Aug 15, 2024

I would use {wk} for the sf<->wkb<->blob conversion, it supports a wide range of other conversions already (not terra::vect sadly).

Should BLOB type be already supported?

I see

## wget https://data.source.coop/fused/overture/2024-02-15-alpha-0/theme=admins/type=administrativeBoundary/0.parquet
duckplyr::duckplyr_df_from_parquet("0.parquet")
Error: rel_to_altrep: Unknown column type for altrep: BLOB

This would otherwise look like this

arrow::open_dataset("0.parquet") |>
  dplyr::select(geometry) |>
  dplyr::collect() |>
  dplyr::mutate(geometry = wk::wkb(geometry))
# A tibble: 2,587 × 1
   geometry
   <wk_wkb>
 1 <LINESTRING (-175.3083 -21.12098, -175.3094 -21.12427, -175.3098 -21.12571, …
 2 <LINESTRING (-175.2667 -21.14462, -175.2673 -21.14619, -175.2681 -21.14822, …
 3 <LINESTRING (-175.2686 -21.12686, -175.2684 -21.12997, -175.2692 -21.13471, …
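
As a rough illustration of what sits behind that `<wk_wkb>` print-out, the sketch below (Python, standard library only) round-trips a 2D LINESTRING through WKB bytes and renders it as WKT-style text. The function names are hypothetical and this is not how {wk} is implemented; it assumes plain little- or big-endian 2D WKB with no Z/M dimensions or SRID flags. The two coordinates are taken from the first row of the output above.

```python
import struct


def encode_wkb_linestring(coords):
    """Encode (x, y) pairs as a little-endian 2D WKB LINESTRING (type 2)."""
    out = struct.pack("<BII", 1, 2, len(coords))
    for x, y in coords:
        out += struct.pack("<dd", x, y)
    return out


def decode_wkb_linestring(blob):
    """Decode a plain 2D WKB LINESTRING into a list of (x, y) tuples."""
    endian = "<" if blob[0] == 1 else ">"
    geom_type, n_points = struct.unpack_from(endian + "II", blob, 1)
    if geom_type != 2:
        raise ValueError(f"not a 2D LINESTRING (type code {geom_type})")
    # Header is 9 bytes (1 + 4 + 4); each point is two 8-byte doubles.
    return [
        struct.unpack_from(endian + "dd", blob, 9 + 16 * i)
        for i in range(n_points)
    ]


def to_wkt(coords):
    """Render decoded coordinates as WKT-style text."""
    return "LINESTRING (" + ", ".join(f"{x} {y}" for x, y in coords) + ")"


coords = [(-175.3083, -21.12098), (-175.3094, -21.12427)]
blob = encode_wkb_linestring(coords)
print(to_wkt(decode_wkb_linestring(blob)))
```

The point of the sketch is that the blob is self-describing, which is why returning it as-is (or as a {blob} vector) loses nothing: a downstream package can parse it later.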

I don't think any of the spatial stuff belongs here, unless an import of wk is welcome ... I suggest returning the binary as-is, or as {blob}. sf itself has st_as_sf(), which handles the more generic basis provided by wk. {RODBC}, fwiw, did support this kind of geometry read way back in the 2000s and worked well with various backends, but that long predated data frame and blob vector support.

For general read via GDAL, I would look at the vector support in {gdalraster} and (we can do it!) work on a lazy vctrs form for the OGR pointer type, alternatively GDAL can provide geos pointers directly to {geos}. sf doesn't have any capacity for these lazy or alternative/intermediate forms for the geometry from general sources so I don't think it's a good thing to focus on always (it's well supported by conversions already).

@cboettig

cboettig commented Aug 15, 2024

It seems the goal for duckplyr for spatial should be to expose the spatial abilities of duckdb directly to the R user.

The ibis project in Python seems like a natural analogue here. As you probably know already, ibis is essentially a dbplyr for Python. When using the duckdb backend engine, it supports many, though not yet all, of the spatial abilities in duckdb[spatial], returning a geopandas data frame if the user calls to_pandas() (which is essentially the ibis analogue of collect()), e.g. https://ibis-project.org/posts/ibis-duckdb-geospatial-dev-guru/

@Robinlovelace
Author

Cool stuff, keeping a beady eye on this conversation, thanks for keeping it rolling forward.
