Emit geoarrow.wkb Arrow metadata for geometry columns #340

@jatorre

Summary

Databricks geometry columns arrive as EWKT strings (SRID=4326;POINT(55.4 25.2)) in Arrow string fields. The Arrow field metadata already labels them (Spark:DataType:SqlName: GEOMETRY(4326)), but the driver doesn't convert them to a standard geospatial Arrow format.
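To make the input format concrete, here is a minimal sketch of splitting an EWKT value into its SRID and plain WKT body. The function name `splitEWKT` is hypothetical and not part of the driver; it only assumes the `SRID=<n>;<WKT>` shape shown above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// splitEWKT separates an EWKT string such as "SRID=4326;POINT(55.4 25.2)"
// into its SRID and plain WKT body. A string without an SRID prefix is
// returned as-is (trimmed) with srid 0.
// Hypothetical helper for illustration; not taken from the driver.
func splitEWKT(ewkt string) (srid int, wkt string, err error) {
	s := strings.TrimSpace(ewkt)
	if len(s) < 5 || !strings.EqualFold(s[:5], "SRID=") {
		return 0, s, nil
	}
	head, body, found := strings.Cut(s[5:], ";")
	if !found {
		return 0, "", fmt.Errorf("EWKT %q: missing ';' after SRID", ewkt)
	}
	srid, err = strconv.Atoi(strings.TrimSpace(head))
	if err != nil {
		return 0, "", fmt.Errorf("EWKT %q: bad SRID: %w", ewkt, err)
	}
	return srid, strings.TrimSpace(body), nil
}
```

Calling `splitEWKT("SRID=4326;POINT(55.4 25.2)")` yields SRID 4326 and the body `POINT(55.4 25.2)`.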

Request

Have the driver emit geoarrow.wkb Arrow extension metadata on geometry columns, converting EWKT→WKB in the IPC reader. This would allow consumers like DuckDB's adbc_scanner to map geometry to native GEOMETRY automatically — no ST_AsBinary() on the Databricks side or ST_GeomFromWKB() on the client side.

The Redshift ADBC driver already does this — its geometry columns arrive with ARROW:extension:name: geoarrow.wkb metadata, and DuckDB maps them to native GEOMETRY with zero conversion needed.
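As a rough illustration of the metadata side of this request, the sketch below works on a plain `map[string]string` rather than the Arrow Go metadata type, so it stays self-contained. The helper names are hypothetical. Treating the Databricks SRID as an EPSG code and writing it as the `crs` string in `ARROW:extension:metadata` is an assumption on my part, not something the driver or the Redshift driver is confirmed to do.

```go
package main

import (
	"encoding/json"
	"strconv"
	"strings"
)

// The Spark key is quoted from this issue; the two ARROW keys are the
// standard Arrow field-metadata keys for extension types.
const (
	sparkSQLNameKey = "Spark:DataType:SqlName"
	extNameKey      = "ARROW:extension:name"
	extMetadataKey  = "ARROW:extension:metadata"
)

// geometrySRID reports whether the field metadata marks a geometry column.
// For a value like "GEOMETRY(4326)" it also returns the SRID; a non-numeric
// parameter leaves the SRID at 0. Hypothetical helper.
func geometrySRID(md map[string]string) (srid int, ok bool) {
	name := strings.ToUpper(strings.TrimSpace(md[sparkSQLNameKey]))
	if !strings.HasPrefix(name, "GEOMETRY(") || !strings.HasSuffix(name, ")") {
		return 0, false
	}
	if n, err := strconv.Atoi(name[len("GEOMETRY(") : len(name)-1]); err == nil {
		srid = n
	}
	return srid, true
}

// geoArrowWKBMetadata returns a copy of md with the geoarrow.wkb extension
// keys added. The input map is not modified. Hypothetical helper.
func geoArrowWKBMetadata(md map[string]string, srid int) (map[string]string, error) {
	out := make(map[string]string, len(md)+2)
	for k, v := range md {
		out[k] = v
	}
	ext := map[string]string{}
	if srid > 0 {
		// Assumption: the SRID is an EPSG code.
		ext["crs"] = "EPSG:" + strconv.Itoa(srid)
	}
	raw, err := json.Marshal(ext)
	if err != nil {
		return nil, err
	}
	out[extNameKey] = "geoarrow.wkb"
	out[extMetadataKey] = string(raw)
	return out, nil
}
```

Keeping the original Spark keys alongside the extension keys means consumers that already rely on `Spark:DataType:SqlName` would keep working.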

Proof of concept

I built a patch in ipc_reader_adapter.go that:

  1. Detects geometry columns via Spark:DataType:SqlName metadata
  2. Converts EWKT→WKB per row using go-geom (WKT parse + WKB marshal)
  3. Replaces String arrays with Binary arrays + ARROW:extension:name: geoarrow.wkb
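To show what step 2 does per row, here is a standard-library-only sketch of the conversion for the simplest case, a 2D `POINT`. It is not the patch itself: the actual proof of concept uses go-geom and handles all geometry types, while `pointWKTToWKB` below is a hypothetical stand-in that rejects everything else. It expects the WKT body with the `SRID=...;` prefix already stripped, since ISO WKB carries no SRID.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
	"strconv"
	"strings"
)

// pointWKTToWKB converts a 2D "POINT(x y)" WKT string to little-endian WKB:
// one byte-order byte (1 = little-endian), a uint32 geometry type
// (1 = Point), then x and y as float64. Hypothetical helper; POINT only.
func pointWKTToWKB(wkt string) ([]byte, error) {
	s := strings.TrimSpace(wkt)
	if len(s) < 5 || !strings.EqualFold(s[:5], "POINT") {
		return nil, fmt.Errorf("not a POINT: %q", wkt)
	}
	body := strings.TrimSpace(s[5:])
	if !strings.HasPrefix(body, "(") || !strings.HasSuffix(body, ")") {
		return nil, fmt.Errorf("unsupported or malformed POINT: %q", wkt)
	}
	fields := strings.Fields(body[1 : len(body)-1])
	if len(fields) != 2 {
		return nil, fmt.Errorf("expected 2 coordinates in %q, got %d", wkt, len(fields))
	}
	x, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return nil, fmt.Errorf("bad x in %q: %w", wkt, err)
	}
	y, err := strconv.ParseFloat(fields[1], 64)
	if err != nil {
		return nil, fmt.Errorf("bad y in %q: %w", wkt, err)
	}

	buf := make([]byte, 21) // 1 + 4 + 8 + 8 bytes
	buf[0] = 1              // byte order: little-endian
	binary.LittleEndian.PutUint32(buf[1:5], 1) // geometry type: Point
	binary.LittleEndian.PutUint64(buf[5:13], math.Float64bits(x))
	binary.LittleEndian.PutUint64(buf[13:21], math.Float64bits(y))
	return buf, nil
}
```

In a real reader the resulting byte slices would be appended to an Arrow Binary builder in place of the original String values; that part depends on the Arrow Go API and is left out here.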

It works — DuckDB sees native GEOMETRY, GeoParquet output includes geo metadata with WKB encoding, bbox, geometry_types.

However, the per-row WKT parsing in Go is ~25% slower for points and much slower for complex polygons compared to just using ST_AsBinary() server-side. The ideal solution would be for the driver (or databricks-sql-go) to emit WKB directly from the server, avoiding WKT string serialization entirely.

Current workaround

-- Databricks side: explicit binary conversion
SELECT *, ST_AsBinary(geom) as geom_wkb FROM table
-- DuckDB side: explicit geometry conversion
SELECT ST_GeomFromWKB(geom_wkb) AS geom FROM adbc_scan(conn, 'SELECT *, ST_AsBinary(geom) AS geom_wkb FROM table')

Desired behavior

-- Just SELECT * — geometry arrives as native GEOMETRY via geoarrow.wkb
SELECT * FROM adbc_scan(conn, 'SELECT * FROM table')
