Skip to content

Default empty array() return type to ARRAY<VARCHAR>#5421

Merged
ahkcs merged 3 commits intoopensearch-project:feature/mustang-ppl-integrationfrom
ahkcs:feature/array-empty-default-varchar
May 8, 2026
Merged

Default empty array() return type to ARRAY<VARCHAR>#5421
ahkcs merged 3 commits intoopensearch-project:feature/mustang-ppl-integrationfrom
ahkcs:feature/array-empty-default-varchar

Conversation

@ahkcs
Copy link
Copy Markdown
Collaborator

@ahkcs ahkcs commented May 8, 2026

Description

PPL's array() no-arg form currently returns ARRAY<NULL> (and ARRAY<UNKNOWN> when all operands are typeless nulls). The Calcite engine's in-process executor tolerates either: ArrayImplementor.internalCast only consumes the element type when there are elements to cast, so an empty list flows straight through to ExprCollectionValue regardless of the declared element type.

The analytics-engine route is stricter. When isthmus walks a RexCall like mvjoin(array(), '-'), it eventually feeds the first operand's type to io.substrait.isthmus.TypeConverter.toSubstrait, which throws UnsupportedOperationException: Unable to convert the type UNKNOWN. Substrait has no on-wire encoding for NULL/UNKNOWN element types, so the planner can't serialize the call at all. Two ITs hit this directly:

  • CalciteArrayFunctionIT.testMvjoinWithEmptyArray
  • CalciteArrayFunctionIT.testMvdedupWithEmptyArray

Substituting VARCHAR when the inferred element type is NULL or UNKNOWN gives the call a substrait-serializable type without affecting any value computation — the result list is empty either way.

Test plan

Companion to opensearch-project/OpenSearch#21554.

PPL's `array()` no-arg form delegates to Calcite's `SqlLibraryOperators.ARRAY`
return-type inference, which returns ARRAY<NULL> for an empty operand list and
ARRAY<UNKNOWN> when all operands are typeless nulls. Both markers are fine for
the v2 engine — `ArrayImplementor.internalCast` only consumes the element type
when there are elements to cast, so an empty result Object list flows straight
through to ExprCollectionValue regardless of declared element type.

The analytics-engine route is stricter. When isthmus walks a RexCall like
`mvjoin(array(), '-')`, it reaches its first operand's type and feeds it to
`io.substrait.isthmus.TypeConverter.toSubstrait`, which throws
`UnsupportedOperationException: Unable to convert the type UNKNOWN`. Substrait
has no on-wire encoding for NULL/UNKNOWN element types, so the planner can't
serialize the call at all. Two PPL ITs hit this directly:

  * `CalciteArrayFunctionIT.testMvjoinWithEmptyArray`
  * `CalciteArrayFunctionIT.testMvdedupWithEmptyArray`

Substituting VARCHAR when the inferred element type is NULL or UNKNOWN gives
the call a substrait-serializable type without affecting any value
computation: the result list is empty either way.

# Test plan

* Unit tests: `:core:test --tests "*ArrayFunction*"` — passes locally (no
  existing tests asserted on the empty-array element type).
* IT: `CalciteArrayFunctionIT` force-routed through the analytics-engine path
  via opensearch-project/OpenSearch#21554's plugin set —
  testMvjoinWithEmptyArray and testMvdedupWithEmptyArray now pass (were UNKNOWN
  type errors); pass-rate moved 26/60 → 28/60.

Companion to opensearch-project/OpenSearch#21554.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

PR Reviewer Guide 🔍

(Review updated until commit 22d0cce)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

Copy link
Copy Markdown
Collaborator

@dai-chen dai-chen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add unit test in ArrayFunctionImplTest?

RelDataType originalType =
SqlLibraryOperators.ARRAY.getReturnTypeInference().inferReturnType(sqlOperatorBinding);
RelDataType innerType = originalType.getComponentType();
// For empty `array()` Calcite infers element type as NULL, which downstream
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np: If possible, can we make the comment more concise. I think it's okay to leave most context/details in PR description.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trimmed in 22d0cce — comment now just points to the PR description.

Comment on lines +61 to +63
if (innerType == null || isUnknownLikeType(innerType.getSqlTypeName())) {
innerType = typeFactory.createSqlType(SqlTypeName.VARCHAR);
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is CAST(ARRAY[] to VARCHAR[])? Add an UT to cover this assumption.
Question, Is it limitation of Substrait, becuase it does not supoort UNKNOW?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refer PostgreSQL, it does not allow construct empty ARRAY[].

https://www.postgresql.org/docs/current/sql-expressions.html#SQL-SYNTAX-ARRAY-CONSTRUCTORS

You can construct an empty array, but since it's impossible to have an array with no type, you must explicitly cast your empty array to the desired type. For example:

SELECT ARRAY[]::integer[];
 array
-------
 {}
(1 row)

But in PPL, we allow create emtpy array(). So which means we implict covert to varchar[] in future, right?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes — semantically equivalent to CAST(ARRAY[] AS VARCHAR[]). The fallback path adds an empty-typed-array case Calcite/Postgres handle via explicit cast, but PPL surface allows bare array() so we do it implicitly.

Yes, Substrait limitation — Substrait's Type proto has no encoding for UNKNOWN (or Calcite's NULL SqlTypeName as an element type). When isthmus' TypeConverter.toSubstrait walks an ARRAY<UNKNOWN> it can't produce a valid wire type and throws UnsupportedOperationException: Unable to convert the type UNKNOWN. The Calcite-engine local executor is more lenient because it never serializes the type — it just iterates the empty List<Object> and never reads the element type.

UT added in aa82704testReturnTypeForEmptyCallIsVarcharArray and testReturnTypeForAllNullOperandsIsVarcharArray cover both fallback paths (empty operand list and typeless-NULL operand). testReturnTypeForIntegerOperandPreservesType is the regression guard that confirms concrete element types pass through unchanged.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right — that's the trade-off. Postgres requires the explicit cast (SELECT ARRAY[]::integer[]); PPL doesn't have that surface today, so any caller writing bare array() would otherwise hit an analytics-engine-side error. The implicit VARCHAR default keeps the bare form working, and concrete-element calls (array(1, 2), array('a')) keep their original element type — testReturnTypeForIntegerOperandPreservesType in aa82704 is the regression guard for that.

Long-term we could expose a PPL cast(array() as <T>[]) syntax that mirrors Postgres and remove the implicit default, but that's a separate language-surface change.

Cover the four shapes that exercise the return-type inference path
introduced in 666dc0e:

  * array()                  — 0 operands, fallback fires → ARRAY<VARCHAR>
  * array(NULL)              — typeless-null operand, fallback fires → ARRAY<VARCHAR>
  * array(1)                 — INTEGER operand, fallback does NOT fire → ARRAY<INTEGER>
  * array('a', 'b')          — VARCHAR operands, fallback does NOT fire → ARRAY<VARCHAR>

The third case is the regression guard requested by review — confirms
concrete element types pass through unchanged and the fallback is scoped
strictly to the {@code NULL}/{@code UNKNOWN} markers.

The harness uses Calcite's {@link ExplicitOperatorBinding} bound to
{@link SqlLibraryOperators#ARRAY} so the inference's internal
{@code SqlLibraryOperators.ARRAY.getReturnTypeInference().inferReturnType(...)}
call resolves the same operator the production code delegates to —
mocking {@code SqlOperatorBinding} directly hits NPEs deep inside
Calcite's least-restrictive-type computation.

Addresses review feedback on opensearch-project#5421.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

Persistent review updated to latest commit aa82704

The "why" lives in the PR description; the inline comment now points
there instead of duplicating it. Addresses dai-chen's review note on

Signed-off-by: Kai Huang <ahkcs@amazon.com>
opensearch-project#5421.
@ahkcs
Copy link
Copy Markdown
Collaborator Author

ahkcs commented May 8, 2026

Pushed aa82704 (UTs) and 22d0cce (trim inline comment).

Replies to the open questions:

@dai-chen — "with this change, we actually define VARCHAR[] array(), right?"

Yes, but only for the bare/empty form. Concrete-element calls keep the inferred element type:

  • array()ARRAY<VARCHAR> (was ARRAY<NULL>)
  • array(NULL)ARRAY<VARCHAR> (was ARRAY<UNKNOWN>)
  • array(1)ARRAY<INTEGER> (unchanged)
  • array('a', 'b')ARRAY<VARCHAR> (unchanged)
  • array(1, 1.5)ARRAY<DOUBLE> (unchanged — leastRestrictive widening from SqlLibraryOperators.ARRAY still wins)

testReturnTypeForIntegerOperandPreservesType in aa82704 is the regression guard for the third case.

@penghuo — "CAST(ARRAY[] to VARCHAR[])? Substrait limitation?"

Yes to both, replied inline.

@penghuo — "Postgres requires ARRAY[]::integer[], PPL implicit VARCHAR[]?"

Yes — PPL surface allows bare array() and we silently default to VARCHAR[]. Replied inline with the rationale and a follow-up note about adding explicit cast(array() as <T>[]) syntax later.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

Persistent review updated to latest commit 22d0cce

@ahkcs ahkcs requested review from dai-chen and penghuo May 8, 2026 18:32
@ahkcs ahkcs added the enhancement New feature or request label May 8, 2026
@ahkcs ahkcs merged commit 9124664 into opensearch-project:feature/mustang-ppl-integration May 8, 2026
35 of 38 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants