Default empty array() return type to ARRAY<VARCHAR>#5421
Conversation
PPL's `array()` no-arg form delegates to Calcite's `SqlLibraryOperators.ARRAY` return-type inference, which returns ARRAY<NULL> for an empty operand list and ARRAY<UNKNOWN> when all operands are typeless nulls. Both markers are fine for the v2 engine — `ArrayImplementor.internalCast` only consumes the element type when there are elements to cast, so an empty result Object list flows straight through to ExprCollectionValue regardless of declared element type. The analytics-engine route is stricter. When isthmus walks a RexCall like `mvjoin(array(), '-')`, it reaches its first operand's type and feeds it to `io.substrait.isthmus.TypeConverter.toSubstrait`, which throws `UnsupportedOperationException: Unable to convert the type UNKNOWN`. Substrait has no on-wire encoding for NULL/UNKNOWN element types, so the planner can't serialize the call at all. Two PPL ITs hit this directly: * `CalciteArrayFunctionIT.testMvjoinWithEmptyArray` * `CalciteArrayFunctionIT.testMvdedupWithEmptyArray` Substituting VARCHAR when the inferred element type is NULL or UNKNOWN gives the call a substrait-serializable type without affecting any value computation: the result list is empty either way. # Test plan * Unit tests: `:core:test --tests "*ArrayFunction*"` — passes locally (no existing tests asserted on the empty-array element type). * IT: `CalciteArrayFunctionIT` force-routed through the analytics-engine path via opensearch-project/OpenSearch#21554's plugin set — testMvjoinWithEmptyArray and testMvdedupWithEmptyArray now pass (were UNKNOWN type errors); pass-rate moved 26/60 → 28/60. Companion to opensearch-project/OpenSearch#21554. Signed-off-by: Kai Huang <ahkcs@amazon.com>
PR Reviewer Guide 🔍(Review updated until commit 22d0cce)Here are some key observations to aid the review process:
|
dai-chen
left a comment
There was a problem hiding this comment.
Can we add unit test in ArrayFunctionImplTest?
| RelDataType originalType = | ||
| SqlLibraryOperators.ARRAY.getReturnTypeInference().inferReturnType(sqlOperatorBinding); | ||
| RelDataType innerType = originalType.getComponentType(); | ||
| // For empty `array()` Calcite infers element type as NULL, which downstream |
There was a problem hiding this comment.
np: If possible, can we make the comment more concise. I think it's okay to leave most context/details in PR description.
There was a problem hiding this comment.
Trimmed in 22d0cce — comment now just points to the PR description.
| if (innerType == null || isUnknownLikeType(innerType.getSqlTypeName())) { | ||
| innerType = typeFactory.createSqlType(SqlTypeName.VARCHAR); | ||
| } |
There was a problem hiding this comment.
The idea is CAST(ARRAY[] to VARCHAR[])? Add an UT to cover this assumption.
Question, Is it limitation of Substrait, becuase it does not supoort UNKNOW?
There was a problem hiding this comment.
Refer PostgreSQL, it does not allow construct empty ARRAY[].
https://www.postgresql.org/docs/current/sql-expressions.html#SQL-SYNTAX-ARRAY-CONSTRUCTORS
You can construct an empty array, but since it's impossible to have an array with no type, you must explicitly cast your empty array to the desired type. For example:
SELECT ARRAY[]::integer[];
array
-------
{}
(1 row)
But in PPL, we allow create emtpy array(). So which means we implict covert to varchar[] in future, right?
There was a problem hiding this comment.
Yes — semantically equivalent to CAST(ARRAY[] AS VARCHAR[]). The fallback path adds an empty-typed-array case Calcite/Postgres handle via explicit cast, but PPL surface allows bare array() so we do it implicitly.
Yes, Substrait limitation — Substrait's Type proto has no encoding for UNKNOWN (or Calcite's NULL SqlTypeName as an element type). When isthmus' TypeConverter.toSubstrait walks an ARRAY<UNKNOWN> it can't produce a valid wire type and throws UnsupportedOperationException: Unable to convert the type UNKNOWN. The Calcite-engine local executor is more lenient because it never serializes the type — it just iterates the empty List<Object> and never reads the element type.
UT added in aa82704 — testReturnTypeForEmptyCallIsVarcharArray and testReturnTypeForAllNullOperandsIsVarcharArray cover both fallback paths (empty operand list and typeless-NULL operand). testReturnTypeForIntegerOperandPreservesType is the regression guard that confirms concrete element types pass through unchanged.
There was a problem hiding this comment.
Right — that's the trade-off. Postgres requires the explicit cast (SELECT ARRAY[]::integer[]); PPL doesn't have that surface today, so any caller writing bare array() would otherwise hit an analytics-engine-side error. The implicit VARCHAR default keeps the bare form working, and concrete-element calls (array(1, 2), array('a')) keep their original element type — testReturnTypeForIntegerOperandPreservesType in aa82704 is the regression guard for that.
Long-term we could expose a PPL cast(array() as <T>[]) syntax that mirrors Postgres and remove the implicit default, but that's a separate language-surface change.
Cover the four shapes that exercise the return-type inference path introduced in 666dc0e: * array() — 0 operands, fallback fires → ARRAY<VARCHAR> * array(NULL) — typeless-null operand, fallback fires → ARRAY<VARCHAR> * array(1) — INTEGER operand, fallback does NOT fire → ARRAY<INTEGER> * array('a', 'b') — VARCHAR operands, fallback does NOT fire → ARRAY<VARCHAR> The third case is the regression guard requested by review — confirms concrete element types pass through unchanged and the fallback is scoped strictly to the {@code NULL}/{@code UNKNOWN} markers. The harness uses Calcite's {@link ExplicitOperatorBinding} bound to {@link SqlLibraryOperators#ARRAY} so the inference's internal {@code SqlLibraryOperators.ARRAY.getReturnTypeInference().inferReturnType(...)} call resolves the same operator the production code delegates to — mocking {@code SqlOperatorBinding} directly hits NPEs deep inside Calcite's least-restrictive-type computation. Addresses review feedback on opensearch-project#5421. Signed-off-by: Kai Huang <ahkcs@amazon.com>
|
Persistent review updated to latest commit aa82704 |
The "why" lives in the PR description; the inline comment now points there instead of duplicating it. Addresses dai-chen's review note on Signed-off-by: Kai Huang <ahkcs@amazon.com> opensearch-project#5421.
|
Pushed aa82704 (UTs) and 22d0cce (trim inline comment). Replies to the open questions: @dai-chen — "with this change, we actually define VARCHAR[] array(), right?" Yes, but only for the bare/empty form. Concrete-element calls keep the inferred element type:
@penghuo — "CAST(ARRAY[] to VARCHAR[])? Substrait limitation?" Yes to both, replied inline. @penghuo — "Postgres requires Yes — PPL surface allows bare |
|
Persistent review updated to latest commit 22d0cce |
9124664
into
opensearch-project:feature/mustang-ppl-integration
Description
PPL's
array()no-arg form currently returnsARRAY<NULL>(andARRAY<UNKNOWN>when all operands are typeless nulls). The Calcite engine's in-process executor tolerates either:ArrayImplementor.internalCastonly consumes the element type when there are elements to cast, so an empty list flows straight through toExprCollectionValueregardless of the declared element type.The analytics-engine route is stricter. When isthmus walks a RexCall like
mvjoin(array(), '-'), it eventually feeds the first operand's type toio.substrait.isthmus.TypeConverter.toSubstrait, which throwsUnsupportedOperationException: Unable to convert the type UNKNOWN. Substrait has no on-wire encoding for NULL/UNKNOWN element types, so the planner can't serialize the call at all. Two ITs hit this directly:CalciteArrayFunctionIT.testMvjoinWithEmptyArrayCalciteArrayFunctionIT.testMvdedupWithEmptyArraySubstituting
VARCHARwhen the inferred element type isNULLorUNKNOWNgives the call a substrait-serializable type without affecting any value computation — the result list is empty either way.Test plan
:core:test --tests "*ArrayFunction*"— passes (no existing test asserted on the empty-array element type).CalciteArrayFunctionITforce-routed through the analytics-engine path via [Analytics Backend / DataFusion] Onboard PPL array constructor and 8 multivalue (mv) functions to analytics-engine route OpenSearch#21554's plugin set — pass-rate 26/60 → 28/60, withtestMvjoinWithEmptyArrayandtestMvdedupWithEmptyArraynewly green.Companion to opensearch-project/OpenSearch#21554.