Add tests and fix corner cases for Parquet/GeoArrow extension type conversion#10065
Add tests and fix corner cases for Parquet/GeoArrow extension type conversion#10065paleolimbot wants to merge 5 commits into
Conversation
| // Geometry with default CRS (defaults to OGC:CRS84 per Parquet spec) | ||
| (LogicalType::geometry(None), r#"{"crs":"OGC:CRS84"}"#), | ||
| // Geometry with srid:0 should result in an unset (omitted) CRS | ||
| (LogicalType::geometry(Some("srid:0".to_string())), r#"{}"#), | ||
| // Geometry with custom CRSes (authority:code and partial projjson) | ||
| ( | ||
| LogicalType::geometry(Some("EPSG:4267".to_string())), | ||
| r#"{"crs":"EPSG:4267"}"#, | ||
| ), | ||
| ( | ||
| LogicalType::geometry(Some( | ||
| r#"{"id":{"authority":"EPSG","code":4326}}"#.to_string(), | ||
| )), | ||
| r#"{"crs":{"id":{"authority":"EPSG","code":4326}}}"#, | ||
| ), | ||
| // Geography with default CRS (default OGC:CRS84, spherical edges) | ||
| ( | ||
| LogicalType::geography(None, None), | ||
| r#"{"crs":"OGC:CRS84","edges":"spherical"}"#, | ||
| ), | ||
| // Geography with explicit edges | ||
| ( | ||
| LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::SPHERICAL)), | ||
| r#"{"crs":"OGC:CRS84","edges":"spherical"}"#, | ||
| ), | ||
| ( | ||
| LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::KARNEY)), | ||
| r#"{"crs":"OGC:CRS84","edges":"karney"}"#, | ||
| ), | ||
| ( | ||
| LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::VINCENTY)), | ||
| r#"{"crs":"OGC:CRS84","edges":"vincenty"}"#, | ||
| ), | ||
| ( | ||
| LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::ANDOYER)), | ||
| r#"{"crs":"OGC:CRS84","edges":"andoyer"}"#, | ||
| ), | ||
| ( | ||
| LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::THOMAS)), | ||
| r#"{"crs":"OGC:CRS84","edges":"thomas"}"#, | ||
| ), | ||
| // Geometry with srid:0 should result in an unset (omitted) CRS | ||
| // and spherical edges | ||
| ( | ||
| LogicalType::geography(Some("srid:0".to_string()), None), | ||
| r#"{"edges":"spherical"}"#, | ||
| ), | ||
| // Geography with custom CRSes (authority:code and partial projjson) | ||
| ( | ||
| LogicalType::geography(Some("EPSG:4267".to_string()), None), | ||
| r#"{"crs":"EPSG:4267","edges":"spherical"}"#, | ||
| ), | ||
| ( | ||
| LogicalType::geography( | ||
| Some(r#"{"id":{"authority":"EPSG","code":4326}}"#.to_string()), | ||
| None, | ||
| ), | ||
| r#"{"crs":{"id":{"authority":"EPSG","code":4326}},"edges":"spherical"}"#, | ||
| ), |
There was a problem hiding this comment.
These are the test cases for reading Parquet files. For GeoArrow implementation to correctly recognize the intent of the CRS field in the Parquet file, this is the GeoArrow extension metadata that must be produced in each of these cases.
There were a few problems with the existing implementation:
LogicalType::geometry(None)(i.e., Parquet default CRS) was read as as{}(which in GeoArrow land means "I don't know what the CRS is)LogicalType::geography(Some(...), None)(i.e. Parquet default edge algorithm) was read as{"crs":"..."}(i.e., no edges in the GeoArrow metadata, which consumers would recognize as GEOMETRY and not GEOGRAPHY)LogicalType::geography(Some(...), Some(EdgeInterpolationAlgorithm::SPHERICAL))(i.e. an actual edge algorithm) was read as{"crs":"...","algorithm":"spherical"}. The"algorithm"key isn't valid in GeoArrow and this would be rejected by consumers or ignored and read as GEOMETRY (not geography).
| // Geometry with no CRS should be GEOMETRY(srid:0) | ||
| (r#"{}"#, LogicalType::geometry(Some("srid:0".to_string()))), | ||
| // Geometry with string CRS | ||
| ( | ||
| r#"{"crs":"EPSG:4267"}"#, | ||
| LogicalType::geometry(Some("\"EPSG:4267\"".to_string())), | ||
| ), | ||
| // Geometry with PROJJSON CRS | ||
| ( | ||
| r#"{"crs":{"id":{"authority":"EPSG","code":3857}}}"#, | ||
| LogicalType::geometry(Some( | ||
| r#"{"id":{"authority":"EPSG","code":3857}}"#.to_string(), | ||
| )), | ||
| ), | ||
| // Geometry with lon/lat CRSes (canonically removed because lon/lat is the | ||
| // default Parquet CRS) | ||
| (r#"{"crs":"OGC:CRS84"}"#, LogicalType::geometry(None)), | ||
| (r#"{"crs":"EPSG:4326"}"#, LogicalType::geometry(None)), | ||
| ( | ||
| r#"{"crs":{"id":{"authority":"EPSG","code":4326}}}"#, | ||
| LogicalType::geometry(None), | ||
| ), | ||
| ( | ||
| r#"{"crs":{"id":{"authority":"EPSG","code":"4326"}}}"#, | ||
| LogicalType::geometry(None), | ||
| ), | ||
| ( | ||
| r#"{"crs":{"id":{"authority":"OGC","code":"CRS84"}}}"#, | ||
| LogicalType::geometry(None), | ||
| ), | ||
| // Geography with no CRS, spherical edges | ||
| ( | ||
| r#"{"edges":"spherical"}"#, | ||
| LogicalType::geography(Some("srid:0".to_string()), None), | ||
| ), | ||
| // Geography with OGC:CRS84 and spherical edges | ||
| ( | ||
| r#"{"crs":"OGC:CRS84","edges":"spherical"}"#, | ||
| LogicalType::geography(None, None), | ||
| ), | ||
| // Geography with different edge algorithms | ||
| ( | ||
| r#"{"crs":"OGC:CRS84","edges":"karney"}"#, | ||
| LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::KARNEY)), | ||
| ), | ||
| ( | ||
| r#"{"crs":"OGC:CRS84","edges":"vincenty"}"#, | ||
| LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::VINCENTY)), | ||
| ), | ||
| ( | ||
| r#"{"crs":"OGC:CRS84","edges":"andoyer"}"#, | ||
| LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::ANDOYER)), | ||
| ), | ||
| ( | ||
| r#"{"crs":"OGC:CRS84","edges":"thomas"}"#, | ||
| LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::THOMAS)), | ||
| ), | ||
| // Geography with custom CRS and edges | ||
| ( | ||
| r#"{"crs":"EPSG:4267","edges":"karney"}"#, | ||
| LogicalType::geography( | ||
| Some("\"EPSG:4267\"".to_string()), | ||
| Some(EdgeInterpolationAlgorithm::KARNEY), | ||
| ), | ||
| ), | ||
| // Geography with PROJJSON CRS | ||
| ( | ||
| r#"{"crs":{"id":{"authority":"EPSG","code":4267}},"edges":"spherical"}"#, | ||
| LogicalType::geography( | ||
| Some(r#"{"id":{"authority":"EPSG","code":4267}}"#.to_string()), | ||
| None, | ||
| ), | ||
| ), |
There was a problem hiding this comment.
These are the tests for writing. The main issue with writing was that "edges":"spherical", which GeoArrow uses to communicate GEOGRAPHY, was not actually writing a geography type because the implementation expected "algorithm". An unset CRS (geoarrow metadata {}) was also written as a Parquet default CRS which is technically incorrect but also not common.
|
cc @BlakeOrth |
|
Thanks for the ping, I'll make some time to review this today. |
BlakeOrth
left a comment
There was a problem hiding this comment.
The changes here look good to me, I didn't find anything meaningful that would require any follow-up. Thanks for taking care of this!
alamb
left a comment
There was a problem hiding this comment.
Thank you @paleolimbot This all looks reasonable to me, but I am not enough of an expert to understand what the downstream implications of this are
@kylebarron I wonder if you might have some time to help review this PR and lend your specialized geoarrow knowledge ?
@paleolimbot do you have any sense
- Is this a "breaking change" (as in could we put it out in a minor version or should we wait for the next major version)?
- DO you have any suggestions on how to review this PR? Like other implementations I can cross reference with or links to the spec, etc?
| // Geometry with string CRS | ||
| ( | ||
| r#"{"crs":"EPSG:4267"}"#, | ||
| LogicalType::geometry(Some("\"EPSG:4267\"".to_string())), |
There was a problem hiding this comment.
Is it right to have embedded quotes here? It seems strange (maybe it is because it is a JSON string and thus needs to be a valid JSON value)?
LogicalType::geometry(Some("EPSG:4267".to_string())),The same thing can be seen below too
|
Sorry, I haven't been following the latest updates in the Parquet native geometry type spec 🥲 |
kylebarron
left a comment
There was a problem hiding this comment.
Overall it looks good but I didn't take a super close look
Which issue does this PR close?
Rationale for this change
There were several issues with conversion identified when I tried to integrate this in SedonaDB and that came to light when the spec was recently clarified.
I am sorry for missing these changes when I reviewed the initial implementation.
What changes are included in this PR?
Nonefor geometry or geography is now converted to a GeoArrow CRS of"OGC:CRS84"(the named value for the default CRS in the Parquet spec)"srid:0"is now converted to a GeoArrow "omitted" CRS. This was recently clarified in the Parquet spec (srid:0 is a named example in the list of allowed values)."srid:0"None. This logic was included in the previous implementation but was reversed (Parquet CRSes that looked like lonlat were omitted when written to GeoArrow, which is not correct)."algorithm"and was serializing it to JSON. The GeoArrow spec uses the"edges"key. This led to invalid metadata being generated which was either rejected or incorrectly interpreted by consumers.Are these changes tested?
Yes. I added high-level end-to-end LogicalType <-> extension metadata tests, since that is what matters (there were a few lower level tests that I updated as well).