Bridge PPL_REX_MAX_MATCH_LIMIT into UnifiedQueryContext on the unified query path#5418
Conversation
PR Reviewer Guide 🔍(Review updated until commit aa0ae82)Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Latest suggestions up to aa0ae82 Explore these optional code suggestions:
Previous suggestionsSuggestions up to commit c655e4a
Suggestions up to commit 6d0dc2c
Suggestions up to commit e46eaa0
Suggestions up to commit 62d617d
Suggestions up to commit 0f1c8ce
|
95521c9 to
0f1c8ce
Compare
|
Persistent review updated to latest commit 0f1c8ce |
…dge-only
Onboards the PPL `rex` command's `mode=sed` surface — the part that lowers to
standard Calcite library operators and bridges through Substrait to DataFusion's
native UDFs. Three sed sub-variants covered:
* `rex field=f mode=sed "s/old/new/"` (no flags) → SqlLibraryOperators.REGEXP_REPLACE_3
(already mapped via the PPL `replace` onboarding from opensearch-project#21527 — no-op here).
* `rex field=f mode=sed "s/old/new/g"` / `/i` / `/gi` → SqlLibraryOperators.REGEXP_REPLACE_PG_4
(4-arg with flags string). New bridge in this PR. DataFusion's regexp_replace
natively accepts 4-arg `(str, pat, repl, flags)` per its substrait UDF binding.
* `rex field=f mode=sed "y/from/to/"` (transliteration) → SqlLibraryOperators.TRANSLATE3.
New bridge in this PR. Resolves to DataFusion's `translate` UDF
(datafusion-functions/src/unicode/translate.rs).
## Why an adapter extension is necessary
The 4-arg `REGEXP_REPLACE_PG_4` carries the same Java-regex syntax baggage as the
3-arg form: `\Q…\E` quoted-literal blocks (Rust regex rejects them) and bare `$N`
backreferences in the replacement (Rust's identifier-greedy parser
mis-resolves them). RegexpReplaceAdapter, introduced for the 3-arg form in
opensearch-project#21527, is extended here to recognize 3 OR 4 operands. Pattern is at position 1
and replacement at position 2 in both signatures — the rewrite logic doesn't
change. Operands beyond position 2 (the flags string in the 4-arg form) pass
through verbatim. Two new RegexpReplaceAdapterTests cover the 4-arg path.
`TRANSLATE3` doesn't need an adapter — its arguments are character classes, not
regex syntax.
## Out of scope (deferred to Part 2)
* Rex extract mode (`rex field=f "(?<g>...)"`) — uses the SQL plugin's custom
Java UDFs `REX_EXTRACT`, `REX_EXTRACT_MULTI`, `REX_OFFSET`, which have no
native DataFusion equivalent. Slated for a follow-up PR that adds Rust-side
UDF implementations, similar to the convert_tz precedent (opensearch-project#21476).
* Sed with occurrence flag (`s/.../.../<N>`) — emits 5-arg
`REGEXP_REPLACE_5`, which DataFusion's native `regexp_replace` does not
support (max 4 args). Also Part 2.
## Test results
* `RegexpReplaceAdapterTests` — 21/21 (19 from opensearch-project#21527 + 2 new for the 4-arg path).
* `RexCommandIT` (new self-contained QA IT, calcs dataset) — 9/9. Covers all sed
sub-variants: literal (no flags), `/g` global, `/i` case-insensitive, `/gi`
combined, backreferences via `$N`, transliteration `y/from/to/` and
no-match passthrough.
* `./gradlew check -p sandbox -Dsandbox.enabled=true` — green.
## Companion PR
The unified-path NPE caused by a missing PPL_REX_MAX_MATCH_LIMIT default is fixed
in opensearch-project/sql#5418 — required for any rex query (sed or extract) to
reach the planner via /_analytics/ppl. This PR's Test results assume opensearch-project#5418 is
applied. Pre-fix: every query NPEs in `AstBuilder.visitRexCommand`. Post-fix:
9/9 RexCommandIT pass.
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
…dge-only
Onboards the PPL `rex` command's `mode=sed` surface — the part that lowers to
standard Calcite library operators and bridges through Substrait to DataFusion's
native UDFs. Three sed sub-variants covered:
* `rex field=f mode=sed "s/old/new/"` (no flags) → SqlLibraryOperators.REGEXP_REPLACE_3
(already mapped via the PPL `replace` onboarding from opensearch-project#21527 — no-op here).
* `rex field=f mode=sed "s/old/new/g"` / `/i` / `/gi` → SqlLibraryOperators.REGEXP_REPLACE_PG_4
(4-arg with flags string). New bridge in this PR. DataFusion's regexp_replace
natively accepts 4-arg `(str, pat, repl, flags)` per its substrait UDF binding.
* `rex field=f mode=sed "y/from/to/"` (transliteration) → SqlLibraryOperators.TRANSLATE3.
New bridge in this PR. Resolves to DataFusion's `translate` UDF
(datafusion-functions/src/unicode/translate.rs).
## Why an adapter extension is necessary
The 4-arg `REGEXP_REPLACE_PG_4` carries the same Java-regex syntax baggage as the
3-arg form: `\Q…\E` quoted-literal blocks (Rust regex rejects them) and bare `$N`
backreferences in the replacement (Rust's identifier-greedy parser
mis-resolves them). RegexpReplaceAdapter, introduced for the 3-arg form in
opensearch-project#21527, is extended here to recognize 3 OR 4 operands. Pattern is at position 1
and replacement at position 2 in both signatures — the rewrite logic doesn't
change. Operands beyond position 2 (the flags string in the 4-arg form) pass
through verbatim. Two new RegexpReplaceAdapterTests cover the 4-arg path.
`TRANSLATE3` doesn't need an adapter — its arguments are character classes, not
regex syntax.
## Out of scope (deferred to Part 2)
* Rex extract mode (`rex field=f "(?<g>...)"`) — uses the SQL plugin's custom
Java UDFs `REX_EXTRACT`, `REX_EXTRACT_MULTI`, `REX_OFFSET`, which have no
native DataFusion equivalent. Slated for a follow-up PR that adds Rust-side
UDF implementations, similar to the convert_tz precedent (opensearch-project#21476).
* Sed with occurrence flag (`s/.../.../<N>`) — emits 5-arg
`REGEXP_REPLACE_5`, which DataFusion's native `regexp_replace` does not
support (max 4 args). Also Part 2.
## Test results
* `RegexpReplaceAdapterTests` — 21/21 (19 from opensearch-project#21527 + 2 new for the 4-arg path).
* `RexCommandIT` (new self-contained QA IT, calcs dataset) — 9/9. Covers all sed
sub-variants: literal (no flags), `/g` global, `/i` case-insensitive, `/gi`
combined, backreferences via `$N`, transliteration `y/from/to/` and
no-match passthrough.
* `./gradlew check -p sandbox -Dsandbox.enabled=true` — green.
## Companion PR
The unified-path NPE caused by a missing PPL_REX_MAX_MATCH_LIMIT default is fixed
in opensearch-project/sql#5418 — required for any rex query (sed or extract) to
reach the planner via /_analytics/ppl. This PR's Test results assume opensearch-project#5418 is
applied. Pre-fix: every query NPEs in `AstBuilder.visitRexCommand`. Post-fix:
9/9 RexCommandIT pass.
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
|
Persistent review updated to latest commit 62d617d |
|
Persistent review updated to latest commit e46eaa0 |
The PPL `rex` command's AstBuilder reads `Settings.Key.PPL_REX_MAX_MATCH_LIMIT`
unconditionally and unboxes the result to `int`:
int maxMatchLimit =
(settings != null) ? settings.getSettingValue(...) : 10;
The `(settings != null)` guard only protects against the Settings instance
being null — not against `getSettingValue` returning null for a key that the
caller never registered. On the unified query path, `UnifiedQueryContext`
builds its `Settings` map with only a small whitelist of keys (e.g.
`QUERY_SIZE_LIMIT`, `CALCITE_ENGINE_ENABLED` per opensearch-project#5413). For any unregistered
key, `getSettingValue` returns null, and the auto-unbox NPEs the planner
before any operator-level capability check runs. Every `rex` query through
`/_analytics/ppl` (the analytics-engine route's REST front-end) hits this
NPE today.
Default `PPL_REX_MAX_MATCH_LIMIT=10` in `buildSettings()` so unified-path
behavior matches the cluster-side default registered by
`OpenSearchSettings.PPL_REX_MAX_MATCH_LIMIT_SETTING` — making the v2 path and
the analytics-engine path agree on the same fallback value when neither has
an explicit cluster override. Mirrors the precedent Kai introduced for
`CALCITE_ENGINE_ENABLED` in opensearch-project#5413.
Companion to the OpenSearch core PR onboarding PPL `rex mode=sed` to the
analytics-engine route via DataFusion (Part 1 — sed-mode bridge only;
extract-mode Rust UDFs deferred to Part 2).
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
e46eaa0 to
6d0dc2c
Compare
|
Persistent review updated to latest commit 6d0dc2c |
6d0dc2c to
c655e4a
Compare
|
Persistent review updated to latest commit c655e4a |
… setting() API
The previous commit defaulted `PPL_REX_MAX_MATCH_LIMIT=10` in
`UnifiedQueryContext.Builder.settings` to fix the NPE in
`AstBuilder.visitRexCommand` on the unified path. The default is correct, but
it doesn't respect mid-run cluster overrides — every key in the static map
returns its hardcoded value regardless of `_cluster/settings` updates. This
breaks `CalciteRexCommandIT.testRexMaxMatchConfigurableLimit`, which sets the
cluster-side limit to 5 and asserts `max_match=0` caps at 5; on the unified
path it stayed at 10.
Rather than introducing a new `Settings`-typed Builder API, the REST handler
reads the live cluster value itself and routes it through the existing
`Builder.setting(String, Object)` method — keeping `UnifiedQueryContext`
decoupled from any specific `Settings` implementation:
RestUnifiedQueryAction.applyClusterOverrides(builder)
└── pluginSettings.getSettingValue(PPL_REX_MAX_MATCH_LIMIT)
└── builder.setting("plugins.ppl.rex.max_match.limit", value)
`RestUnifiedQueryAction` gains a `pluginSettings` field (the same
`OpenSearchSettings` instance bound in the Guice module). Both
construction sites — `SQLPlugin.createSqlAnalyticsRouter` and
`TransportPPLQueryAction.<init>` — pass it through.
`RestUnifiedQueryActionTest` updated to pass a `mock(Settings.class)` for the
new constructor parameter.
## Why scoped to PPL_REX_MAX_MATCH_LIMIT only
The same architectural gap exists for every key in the static defaults map
(`QUERY_SIZE_LIMIT`, `PPL_SUBSEARCH_MAXOUT`, `PPL_JOIN_SUBSEARCH_MAXOUT`,
`CALCITE_ENGINE_ENABLED`). For three of those, the static defaults are fine
in practice (no test overrides them mid-run; `head N` covers
`QUERY_SIZE_LIMIT` per-query). `CALCITE_ENGINE_ENABLED` is intentionally
pinned to `true` for the unified path. So this PR widens only the one key
that demonstrably needs it; widening the snapshot to the rest is a future
scope decision tied to whichever new IT first depends on it.
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Pins two behaviors the previous commits introduced:
* `testContextCreationWithDefaults` now asserts that
`PPL_REX_MAX_MATCH_LIMIT` resolves to its static default of 10 — the
fallback value `AstBuilder.visitRexCommand` reads when no cluster-side
override is present.
* `testContextCreationWithCustomConfig` now asserts that
`setting("plugins.ppl.rex.max_match.limit", 5)` reaches
`getSettingValue(PPL_REX_MAX_MATCH_LIMIT)` — the path the REST handler
uses to forward an operator-configured cluster value into the
unified-path settings map.
Folds the two assertions into the existing default / custom-config tests
rather than adding new test methods, per review feedback.
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
c655e4a to
aa0ae82
Compare
|
Persistent review updated to latest commit aa0ae82 |
…dge-only
Onboards the PPL `rex` command's `mode=sed` surface — the part that lowers to
standard Calcite library operators and bridges through Substrait to DataFusion's
native UDFs. Three sed sub-variants covered:
* `rex field=f mode=sed "s/old/new/"` (no flags) → SqlLibraryOperators.REGEXP_REPLACE_3
(already mapped via the PPL `replace` onboarding from opensearch-project#21527 — no-op here).
* `rex field=f mode=sed "s/old/new/g"` / `/i` / `/gi` → SqlLibraryOperators.REGEXP_REPLACE_PG_4
(4-arg with flags string). New bridge in this PR. DataFusion's regexp_replace
natively accepts 4-arg `(str, pat, repl, flags)` per its substrait UDF binding.
* `rex field=f mode=sed "y/from/to/"` (transliteration) → SqlLibraryOperators.TRANSLATE3.
New bridge in this PR. Resolves to DataFusion's `translate` UDF
(datafusion-functions/src/unicode/translate.rs).
The 4-arg `REGEXP_REPLACE_PG_4` carries the same Java-regex syntax baggage as the
3-arg form: `\Q…\E` quoted-literal blocks (Rust regex rejects them) and bare `$N`
backreferences in the replacement (Rust's identifier-greedy parser
mis-resolves them). RegexpReplaceAdapter, introduced for the 3-arg form in
and replacement at position 2 in both signatures — the rewrite logic doesn't
change. Operands beyond position 2 (the flags string in the 4-arg form) pass
through verbatim. Two new RegexpReplaceAdapterTests cover the 4-arg path.
`TRANSLATE3` doesn't need an adapter — its arguments are character classes, not
regex syntax.
* Rex extract mode (`rex field=f "(?<g>...)"`) — uses the SQL plugin's custom
Java UDFs `REX_EXTRACT`, `REX_EXTRACT_MULTI`, `REX_OFFSET`, which have no
native DataFusion equivalent. Slated for a follow-up PR that adds Rust-side
UDF implementations, similar to the convert_tz precedent (opensearch-project#21476).
* Sed with occurrence flag (`s/.../.../<N>`) — emits 5-arg
`REGEXP_REPLACE_5`, which DataFusion's native `regexp_replace` does not
support (max 4 args). Also Part 2.
* `RegexpReplaceAdapterTests` — 21/21 (19 from opensearch-project#21527 + 2 new for the 4-arg path).
* `RexCommandIT` (new self-contained QA IT, calcs dataset) — 9/9. Covers all sed
sub-variants: literal (no flags), `/g` global, `/i` case-insensitive, `/gi`
combined, backreferences via `$N`, transliteration `y/from/to/` and
no-match passthrough.
* `./gradlew check -p sandbox -Dsandbox.enabled=true` — green.
The unified-path NPE caused by a missing PPL_REX_MAX_MATCH_LIMIT default is fixed
in opensearch-project/sql#5418 — required for any rex query (sed or extract) to
reach the planner via /_analytics/ppl. This PR's Test results assume opensearch-project#5418 is
applied. Pre-fix: every query NPEs in `AstBuilder.visitRexCommand`. Post-fix:
9/9 RexCommandIT pass.
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
d481e0b
into
opensearch-project:feature/mustang-ppl-integration
…dge-only
Onboards the PPL `rex` command's `mode=sed` surface — the part that lowers to
standard Calcite library operators and bridges through Substrait to DataFusion's
native UDFs. Three sed sub-variants covered:
* `rex field=f mode=sed "s/old/new/"` (no flags) → SqlLibraryOperators.REGEXP_REPLACE_3
(already mapped via the PPL `replace` onboarding from opensearch-project#21527 — no-op here).
* `rex field=f mode=sed "s/old/new/g"` / `/i` / `/gi` → SqlLibraryOperators.REGEXP_REPLACE_PG_4
(4-arg with flags string). New bridge in this PR. DataFusion's regexp_replace
natively accepts 4-arg `(str, pat, repl, flags)` per its substrait UDF binding.
* `rex field=f mode=sed "y/from/to/"` (transliteration) → SqlLibraryOperators.TRANSLATE3.
New bridge in this PR. Resolves to DataFusion's `translate` UDF
(datafusion-functions/src/unicode/translate.rs).
The 4-arg `REGEXP_REPLACE_PG_4` carries the same Java-regex syntax baggage as the
3-arg form: `\Q…\E` quoted-literal blocks (Rust regex rejects them) and bare `$N`
backreferences in the replacement (Rust's identifier-greedy parser
mis-resolves them). RegexpReplaceAdapter, introduced for the 3-arg form in
and replacement at position 2 in both signatures — the rewrite logic doesn't
change. Operands beyond position 2 (the flags string in the 4-arg form) pass
through verbatim. Two new RegexpReplaceAdapterTests cover the 4-arg path.
`TRANSLATE3` doesn't need an adapter — its arguments are character classes, not
regex syntax.
* Rex extract mode (`rex field=f "(?<g>...)"`) — uses the SQL plugin's custom
Java UDFs `REX_EXTRACT`, `REX_EXTRACT_MULTI`, `REX_OFFSET`, which have no
native DataFusion equivalent. Slated for a follow-up PR that adds Rust-side
UDF implementations, similar to the convert_tz precedent (opensearch-project#21476).
* Sed with occurrence flag (`s/.../.../<N>`) — emits 5-arg
`REGEXP_REPLACE_5`, which DataFusion's native `regexp_replace` does not
support (max 4 args). Also Part 2.
* `RegexpReplaceAdapterTests` — 21/21 (19 from opensearch-project#21527 + 2 new for the 4-arg path).
* `RexCommandIT` (new self-contained QA IT, calcs dataset) — 9/9. Covers all sed
sub-variants: literal (no flags), `/g` global, `/i` case-insensitive, `/gi`
combined, backreferences via `$N`, transliteration `y/from/to/` and
no-match passthrough.
* `./gradlew check -p sandbox -Dsandbox.enabled=true` — green.
The unified-path NPE caused by a missing PPL_REX_MAX_MATCH_LIMIT default is fixed
in opensearch-project/sql#5418 — required for any rex query (sed or extract) to
reach the planner via /_analytics/ppl. This PR's Test results assume opensearch-project#5418 is
applied. Pre-fix: every query NPEs in `AstBuilder.visitRexCommand`. Post-fix:
9/9 RexCommandIT pass.
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
…dge-only
Onboards the PPL `rex` command's `mode=sed` surface — the part that lowers to
standard Calcite library operators and bridges through Substrait to DataFusion's
native UDFs. Three sed sub-variants covered:
* `rex field=f mode=sed "s/old/new/"` (no flags) → SqlLibraryOperators.REGEXP_REPLACE_3
(already mapped via the PPL `replace` onboarding from opensearch-project#21527 — no-op here).
* `rex field=f mode=sed "s/old/new/g"` / `/i` / `/gi` → SqlLibraryOperators.REGEXP_REPLACE_PG_4
(4-arg with flags string). New bridge in this PR. DataFusion's regexp_replace
natively accepts 4-arg `(str, pat, repl, flags)` per its substrait UDF binding.
* `rex field=f mode=sed "y/from/to/"` (transliteration) → SqlLibraryOperators.TRANSLATE3.
New bridge in this PR. Resolves to DataFusion's `translate` UDF
(datafusion-functions/src/unicode/translate.rs).
The 4-arg `REGEXP_REPLACE_PG_4` carries the same Java-regex syntax baggage as the
3-arg form: `\Q…\E` quoted-literal blocks (Rust regex rejects them) and bare `$N`
backreferences in the replacement (Rust's identifier-greedy parser
mis-resolves them). RegexpReplaceAdapter, introduced for the 3-arg form in
and replacement at position 2 in both signatures — the rewrite logic doesn't
change. Operands beyond position 2 (the flags string in the 4-arg form) pass
through verbatim. Two new RegexpReplaceAdapterTests cover the 4-arg path.
`TRANSLATE3` doesn't need an adapter — its arguments are character classes, not
regex syntax.
* Rex extract mode (`rex field=f "(?<g>...)"`) — uses the SQL plugin's custom
Java UDFs `REX_EXTRACT`, `REX_EXTRACT_MULTI`, `REX_OFFSET`, which have no
native DataFusion equivalent. Slated for a follow-up PR that adds Rust-side
UDF implementations, similar to the convert_tz precedent (opensearch-project#21476).
* Sed with occurrence flag (`s/.../.../<N>`) — emits 5-arg
`REGEXP_REPLACE_5`, which DataFusion's native `regexp_replace` does not
support (max 4 args). Also Part 2.
* `RegexpReplaceAdapterTests` — 21/21 (19 from opensearch-project#21527 + 2 new for the 4-arg path).
* `RexCommandIT` (new self-contained QA IT, calcs dataset) — 9/9. Covers all sed
sub-variants: literal (no flags), `/g` global, `/i` case-insensitive, `/gi`
combined, backreferences via `$N`, transliteration `y/from/to/` and
no-match passthrough.
* `./gradlew check -p sandbox -Dsandbox.enabled=true` — green.
The unified-path NPE caused by a missing PPL_REX_MAX_MATCH_LIMIT default is fixed
in opensearch-project/sql#5418 — required for any rex query (sed or extract) to
reach the planner via /_analytics/ppl. This PR's Test results assume opensearch-project#5418 is
applied. Pre-fix: every query NPEs in `AstBuilder.visitRexCommand`. Post-fix:
9/9 RexCommandIT pass.
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
…dge-only
Onboards the PPL `rex` command's `mode=sed` surface — the part that lowers to
standard Calcite library operators and bridges through Substrait to DataFusion's
native UDFs. Three sed sub-variants covered:
* `rex field=f mode=sed "s/old/new/"` (no flags) → SqlLibraryOperators.REGEXP_REPLACE_3
(already mapped via the PPL `replace` onboarding from opensearch-project#21527 — no-op here).
* `rex field=f mode=sed "s/old/new/g"` / `/i` / `/gi` → SqlLibraryOperators.REGEXP_REPLACE_PG_4
(4-arg with flags string). New bridge in this PR. DataFusion's regexp_replace
natively accepts 4-arg `(str, pat, repl, flags)` per its substrait UDF binding.
* `rex field=f mode=sed "y/from/to/"` (transliteration) → SqlLibraryOperators.TRANSLATE3.
New bridge in this PR. Resolves to DataFusion's `translate` UDF
(datafusion-functions/src/unicode/translate.rs).
The 4-arg `REGEXP_REPLACE_PG_4` carries the same Java-regex syntax baggage as the
3-arg form: `\Q…\E` quoted-literal blocks (Rust regex rejects them) and bare `$N`
backreferences in the replacement (Rust's identifier-greedy parser
mis-resolves them). RegexpReplaceAdapter, introduced for the 3-arg form in
and replacement at position 2 in both signatures — the rewrite logic doesn't
change. Operands beyond position 2 (the flags string in the 4-arg form) pass
through verbatim. Two new RegexpReplaceAdapterTests cover the 4-arg path.
`TRANSLATE3` doesn't need an adapter — its arguments are character classes, not
regex syntax.
* Rex extract mode (`rex field=f "(?<g>...)"`) — uses the SQL plugin's custom
Java UDFs `REX_EXTRACT`, `REX_EXTRACT_MULTI`, `REX_OFFSET`, which have no
native DataFusion equivalent. Slated for a follow-up PR that adds Rust-side
UDF implementations, similar to the convert_tz precedent (opensearch-project#21476).
* Sed with occurrence flag (`s/.../.../<N>`) — emits 5-arg
`REGEXP_REPLACE_5`, which DataFusion's native `regexp_replace` does not
support (max 4 args). Also Part 2.
* `RegexpReplaceAdapterTests` — 21/21 (19 from opensearch-project#21527 + 2 new for the 4-arg path).
* `RexCommandIT` (new self-contained QA IT, calcs dataset) — 9/9. Covers all sed
sub-variants: literal (no flags), `/g` global, `/i` case-insensitive, `/gi`
combined, backreferences via `$N`, transliteration `y/from/to/` and
no-match passthrough.
* `./gradlew check -p sandbox -Dsandbox.enabled=true` — green.
The unified-path NPE caused by a missing PPL_REX_MAX_MATCH_LIMIT default is fixed
in opensearch-project/sql#5418 — required for any rex query (sed or extract) to
reach the planner via /_analytics/ppl. This PR's Test results assume opensearch-project#5418 is
applied. Pre-fix: every query NPEs in `AstBuilder.visitRexCommand`. Post-fix:
9/9 RexCommandIT pass.
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
…dge-only
Onboards the PPL `rex` command's `mode=sed` surface — the part that lowers to
standard Calcite library operators and bridges through Substrait to DataFusion's
native UDFs. Three sed sub-variants covered:
* `rex field=f mode=sed "s/old/new/"` (no flags) → SqlLibraryOperators.REGEXP_REPLACE_3
(already mapped via the PPL `replace` onboarding from opensearch-project#21527 — no-op here).
* `rex field=f mode=sed "s/old/new/g"` / `/i` / `/gi` → SqlLibraryOperators.REGEXP_REPLACE_PG_4
(4-arg with flags string). New bridge in this PR. DataFusion's regexp_replace
natively accepts 4-arg `(str, pat, repl, flags)` per its substrait UDF binding.
* `rex field=f mode=sed "y/from/to/"` (transliteration) → SqlLibraryOperators.TRANSLATE3.
New bridge in this PR. Resolves to DataFusion's `translate` UDF
(datafusion-functions/src/unicode/translate.rs).
The 4-arg `REGEXP_REPLACE_PG_4` carries the same Java-regex syntax baggage as the
3-arg form: `\Q…\E` quoted-literal blocks (Rust regex rejects them) and bare `$N`
backreferences in the replacement (Rust's identifier-greedy parser
mis-resolves them). RegexpReplaceAdapter, introduced for the 3-arg form in
and replacement at position 2 in both signatures — the rewrite logic doesn't
change. Operands beyond position 2 (the flags string in the 4-arg form) pass
through verbatim. Two new RegexpReplaceAdapterTests cover the 4-arg path.
`TRANSLATE3` doesn't need an adapter — its arguments are character classes, not
regex syntax.
* Rex extract mode (`rex field=f "(?<g>...)"`) — uses the SQL plugin's custom
Java UDFs `REX_EXTRACT`, `REX_EXTRACT_MULTI`, `REX_OFFSET`, which have no
native DataFusion equivalent. Slated for a follow-up PR that adds Rust-side
UDF implementations, similar to the convert_tz precedent (opensearch-project#21476).
* Sed with occurrence flag (`s/.../.../<N>`) — emits 5-arg
`REGEXP_REPLACE_5`, which DataFusion's native `regexp_replace` does not
support (max 4 args). Also Part 2.
* `RegexpReplaceAdapterTests` — 21/21 (19 from opensearch-project#21527 + 2 new for the 4-arg path).
* `RexCommandIT` (new self-contained QA IT, calcs dataset) — 9/9. Covers all sed
sub-variants: literal (no flags), `/g` global, `/i` case-insensitive, `/gi`
combined, backreferences via `$N`, transliteration `y/from/to/` and
no-match passthrough.
* `./gradlew check -p sandbox -Dsandbox.enabled=true` — green.
The unified-path NPE caused by a missing PPL_REX_MAX_MATCH_LIMIT default is fixed
in opensearch-project/sql#5418 — required for any rex query (sed or extract) to
reach the planner via /_analytics/ppl. This PR's Test results assume opensearch-project#5418 is
applied. Pre-fix: every query NPEs in `AstBuilder.visitRexCommand`. Post-fix:
9/9 RexCommandIT pass.
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Description
The PPL
rexcommand'sAstBuilderreadsSettings.Key.PPL_REX_MAX_MATCH_LIMITunconditionally and unboxes the result toint. On the unified query path,UnifiedQueryContextbuilds itsSettingsmap with only a small whitelist of planning-required keys; for any unregistered key,getSettingValuereturns null and the auto-unbox NPEs the planner before any operator-level capability check runs. Everyrexquery through/_analytics/pplhits this NPE today.This PR ships two changes that together let the unified path execute
rexcorrectly:1. Default
PPL_REX_MAX_MATCH_LIMIT=10inUnifiedQueryContextAdds the key to the static settings map so
AstBuilder.visitRexCommandno longer NPEs. The value mirrors the cluster-side default of10registered byOpenSearchSettings.PPL_REX_MAX_MATCH_LIMIT_SETTING, so unified-path behavior matches v2-path behavior when neither has an explicit cluster override. Mirrors the precedent Kai introduced forCALCITE_ENGINE_ENABLEDin #5413.2. Bridge live cluster settings for
PPL_REX_MAX_MATCH_LIMITonlyWithout this, every key in the static map resolves to its hardcoded default and
_cluster/settingsupdates are invisible to the unified path.CalciteRexCommandIT.testRexMaxMatchConfigurableLimitexercises this: it sets the cluster-side limit to 5 and asserts thatmax_match=0caps at 5, but on the unified path the static10keeps winning.Adds a
Builder.liveSettings(Settings)hook so the REST handler can inject the cluster's liveOpenSearchSettingsinstance. Atbuild()time the Builder snapshots the live value ofPPL_REX_MAX_MATCH_LIMITinto the static map, overriding the hardcoded default when the operator has set a cluster value. Snapshot-at-build matches the per-HTTP-request lifecycle ofUnifiedQueryContextand avoids per-call lookup overhead.RestUnifiedQueryActiongains apluginSettingsfield (the sameOpenSearchSettingsinstance bound in the Guice module) and forwards it to the Builder in bothbuildContextandbuildParsingContext. Both construction sites —SQLPlugin.createSqlAnalyticsRouterandTransportPPLQueryAction.<init>— are updated.Why scoped to
PPL_REX_MAX_MATCH_LIMITonlyThe same architectural gap exists for every key in the static map (
QUERY_SIZE_LIMIT,PPL_SUBSEARCH_MAXOUT,PPL_JOIN_SUBSEARCH_MAXOUT,CALCITE_ENGINE_ENABLED). For three of those, the static defaults are fine in practice (no test overrides them mid-run;head NcoversQUERY_SIZE_LIMITper-query).CALCITE_ENGINE_ENABLEDis intentionally pinned totruefor the unified path — a cluster override toggling it off would defeat the point of routing here. So this PR widens only the one key that demonstrably needs it; widening the snapshot to the rest is a future scope decision tied to whichever new IT first depends on it.Companion PR
opensearch-project/OpenSearch#21550 — onboards PPL
rexto DataFusion via the analytics-engine path. Without this PR's fixes, every rex query through/_analytics/pplNPEs at parse time and never reaches the planner.Test results
CalciteRexCommandITthrough the analytics-engine route (every PPL query forced through/_analytics/pplviatests.analytics.force_routing=true):AstBuilder.visitRexCommand.testRexMaxMatchConfigurableLimitfails withexpected:<5> but was:<10>.testRexMaxMatch*variants honor the cluster setting.Signed-off-by: Jialiang Liang jiallian@amazon.com