Support StringDecode for GBK encoding by thirtiseven · Pull Request #14545 · NVIDIA/spark-rapids

thirtiseven · 2026-04-03T10:16:31Z

Description

This PR adds support for StringDecode on GBK encoding.

performance testing:

=== GPU (spark-rapids) ===
Warmup (3 runs)...
Benchmark (5 runs)...
  GPU     min=1048 ms  median=1068 ms  avg=1075 ms  max=1101 ms

=== CPU (Spark Java) ===
Warmup (3 runs)...
Benchmark (5 runs)...
  CPU     min=13526 ms  median=13791 ms  avg=13766 ms  max=13984 ms

=== Speedup: 12.9x (CPU median / GPU median) ===

bench_gbk_decode.scala.zip

Checklists

This PR has added documentation for new or modified features or behaviors.
This PR has added new tests or modified existing tests to cover new code paths.
(Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)
Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.

Register GPU override for Spark's StringDecode expression in GpuOverrides.scala, supporting GBK charset decoding on GPU via the new CharsetDecode JNI API in spark-rapids-jni. - GpuStringDecode: GpuUnaryExpression that calls CharsetDecode.decode() - Charset must be a literal; only GBK is supported (others fall back to CPU) - Charset resolved at plan time, only binary column is runtime input Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven · 2026-04-03T10:17:00Z

@greptile review

greptile-apps · 2026-04-03T10:21:28Z

Greptile Summary

This PR adds GPU-accelerated decode(bin, charset) for GBK encoding (~13× speedup over CPU). Pre-Spark 4.0 versions use a new BinaryExprMeta[StringDecode] shim, while Spark 4.0+ routes through a new StaticInvokeMeta in objectsExpressions.scala since Spark replaces StringDecode with a StaticInvoke during analysis. Non-GBK charsets correctly fall back to CPU via willNotWorkOnGpu.

Confidence Score: 5/5

Safe to merge; implementation is correct and consistent with existing patterns

All remaining findings are P2. The charsetName null concern is a defensive coding preference — the framework guarantees convertToGpuImpl is never called after willNotWorkOnGpu. The prior thread's argument bounds-check concern has been addressed with the arguments.size < 2 guard at line 59 of objectsExpressions.scala.

No files require special attention

Important Files Changed

Filename	Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/objectsExpressions.scala	New StaticInvokeMeta for Spark 4.0+ StringDecode via StaticInvoke; charsetName stays null if willNotWorkOnGpu fires, though the framework prevents convertToGpuImpl from running in that case
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala	GpuStringDecode case class correctly delegates to CharsetDecode.decode JNI for GBK; NullIntolerantShim handles null input propagation
sql-plugin/src/main/spark321/scala/com/nvidia/spark/rapids/shims/StringDecodeShims.scala	Pre-Spark 4.0 StringDecode shimmed as BinaryExprMeta with GBK-only GPU support; shim-json annotation covers all 321–358 versions including Databricks variants
sql-plugin/src/main/spark400/scala/com/nvidia/spark/rapids/shims/StringDecodeShims.scala	Spark 4.0+ path correctly returns an empty map; StringDecode is handled via StaticInvokeMeta instead due to Spark's RuntimeReplaceable rewrite
integration_tests/src/main/python/string_test.py	Adds basic (fixed byte sequences, nulls, invalid bytes) and random GBK decode tests; random test uses the framework's global datagen seed for reproducibility
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala	Wires StringDecodeShims.exprs into the expressions map alongside existing shim registrations

Sequence Diagram

sequenceDiagram
    participant Spark as Spark Catalyst
    participant ShimMeta as StringDecodeShims/StaticInvokeMeta
    participant GpuExpr as GpuStringDecode
    participant JNI as CharsetDecode (JNI)

    Spark->>ShimMeta: tagExprForGpu()
    ShimMeta->>ShimMeta: extractLit(charset) == GBK?
    alt GBK literal
        ShimMeta->>ShimMeta: charsetName = "GBK"
        ShimMeta->>GpuExpr: convertToGpu() → GpuStringDecode(bin, "GBK")
        GpuExpr->>JNI: CharsetDecode.decode(column, GBK)
        JNI-->>GpuExpr: UTF-8 ColumnVector
    else non-GBK or non-literal
        ShimMeta->>Spark: willNotWorkOnGpu → CPU fallback
    end

_{Greploops — Automatically fix all review issues by running /greploops in Claude Code. It iterates: fix, push, re-review, repeat until 5/5 confidence.
Use the Greptile plugin for Claude Code to query reviews, search comments, and manage custom context directly from your terminal.}

_{Reviews (2): Last reviewed commit: "Address review comments on StaticInvokeM..." | Re-trigger Greptile}

sql-plugin/src/main/scala/com/nvidia/spark/rapids/objectsExpressions.scala

- Initialize charsetName to null explicitly - Add bounds check on arguments before accessing index 1 - Add trailing newline Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

thirtiseven · 2026-04-07T09:31:53Z

@greptile review

thirtiseven and others added 2 commits April 2, 2026 20:51

shims and docs

5ff8d57

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

greptile-apps bot reviewed Apr 3, 2026

View reviewed changes

Address review comments on StaticInvokeMeta

4a9a61d

- Initialize charsetName to null explicitly - Add bounds check on arguments before accessing index 1 - Add trailing newline Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

thirtiseven mentioned this pull request Apr 9, 2026

Add GBK charset decoding kernel for StringDecode NVIDIA/spark-rapids-jni#4431

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support StringDecode for GBK encoding#14545

Support StringDecode for GBK encoding#14545
thirtiseven wants to merge 3 commits intoNVIDIA:mainfrom
thirtiseven:feature/gbk-string-decode-clean

thirtiseven commented Apr 3, 2026

Uh oh!

thirtiseven commented Apr 3, 2026

Uh oh!

greptile-apps bot commented Apr 3, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

thirtiseven commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

thirtiseven commented Apr 3, 2026

Description

Checklists

Uh oh!

thirtiseven commented Apr 3, 2026

Uh oh!

greptile-apps bot commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Uh oh!

thirtiseven commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

greptile-apps bot commented Apr 3, 2026 •

edited

Loading