Skip to content

Support StringDecode for GBK encoding#14545

Draft
thirtiseven wants to merge 3 commits intoNVIDIA:mainfrom
thirtiseven:feature/gbk-string-decode-clean
Draft

Support StringDecode for GBK encoding#14545
thirtiseven wants to merge 3 commits intoNVIDIA:mainfrom
thirtiseven:feature/gbk-string-decode-clean

Conversation

@thirtiseven
Copy link
Copy Markdown
Collaborator

Fixes #14542

Description

This PR adds support for StringDecode on GBK encoding.

Depends on NVIDIA/spark-rapids-jni#4431

performance testing:

=== GPU (spark-rapids) ===
Warmup (3 runs)...
Benchmark (5 runs)...
  GPU     min=1048 ms  median=1068 ms  avg=1075 ms  max=1101 ms

=== CPU (Spark Java) ===
Warmup (3 runs)...
Benchmark (5 runs)...
  CPU     min=13526 ms  median=13791 ms  avg=13766 ms  max=13984 ms

=== Speedup: 12.9x (CPU median / GPU median) ===

bench_gbk_decode.scala.zip

Checklists

  • This PR has added documentation for new or modified features or behaviors.
  • This PR has added new tests or modified existing tests to cover new code paths.
    (Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)
  • Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.

thirtiseven and others added 2 commits April 2, 2026 20:51
Register GPU override for Spark's StringDecode expression in
GpuOverrides.scala, supporting GBK charset decoding on GPU via
the new CharsetDecode JNI API in spark-rapids-jni.

- GpuStringDecode: GpuUnaryExpression that calls CharsetDecode.decode()
- Charset must be a literal; only GBK is supported (others fall back to CPU)
- Charset resolved at plan time, only binary column is runtime input

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven
Copy link
Copy Markdown
Collaborator Author

@greptile review

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 3, 2026

Greptile Summary

This PR adds GPU-accelerated decode(bin, charset) for GBK encoding (~13× speedup over CPU). Pre-Spark 4.0 versions use a new BinaryExprMeta[StringDecode] shim, while Spark 4.0+ routes through a new StaticInvokeMeta in objectsExpressions.scala since Spark replaces StringDecode with a StaticInvoke during analysis. Non-GBK charsets correctly fall back to CPU via willNotWorkOnGpu.

Confidence Score: 5/5

Safe to merge; implementation is correct and consistent with existing patterns

All remaining findings are P2. The charsetName null concern is a defensive coding preference — the framework guarantees convertToGpuImpl is never called after willNotWorkOnGpu. The prior thread's argument bounds-check concern has been addressed with the arguments.size < 2 guard at line 59 of objectsExpressions.scala.

No files require special attention

Important Files Changed

Filename Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/objectsExpressions.scala New StaticInvokeMeta for Spark 4.0+ StringDecode via StaticInvoke; charsetName stays null if willNotWorkOnGpu fires, though the framework prevents convertToGpuImpl from running in that case
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala GpuStringDecode case class correctly delegates to CharsetDecode.decode JNI for GBK; NullIntolerantShim handles null input propagation
sql-plugin/src/main/spark321/scala/com/nvidia/spark/rapids/shims/StringDecodeShims.scala Pre-Spark 4.0 StringDecode shimmed as BinaryExprMeta with GBK-only GPU support; shim-json annotation covers all 321–358 versions including Databricks variants
sql-plugin/src/main/spark400/scala/com/nvidia/spark/rapids/shims/StringDecodeShims.scala Spark 4.0+ path correctly returns an empty map; StringDecode is handled via StaticInvokeMeta instead due to Spark's RuntimeReplaceable rewrite
integration_tests/src/main/python/string_test.py Adds basic (fixed byte sequences, nulls, invalid bytes) and random GBK decode tests; random test uses the framework's global datagen seed for reproducibility
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala Wires StringDecodeShims.exprs into the expressions map alongside existing shim registrations

Sequence Diagram

sequenceDiagram
    participant Spark as Spark Catalyst
    participant ShimMeta as StringDecodeShims/StaticInvokeMeta
    participant GpuExpr as GpuStringDecode
    participant JNI as CharsetDecode (JNI)

    Spark->>ShimMeta: tagExprForGpu()
    ShimMeta->>ShimMeta: extractLit(charset) == GBK?
    alt GBK literal
        ShimMeta->>ShimMeta: charsetName = "GBK"
        ShimMeta->>GpuExpr: convertToGpu() → GpuStringDecode(bin, "GBK")
        GpuExpr->>JNI: CharsetDecode.decode(column, GBK)
        JNI-->>GpuExpr: UTF-8 ColumnVector
    else non-GBK or non-literal
        ShimMeta->>Spark: willNotWorkOnGpu → CPU fallback
    end
Loading

Greploops — Automatically fix all review issues by running /greploops in Claude Code. It iterates: fix, push, re-review, repeat until 5/5 confidence.
Use the Greptile plugin for Claude Code to query reviews, search comments, and manage custom context directly from your terminal.

Reviews (2): Last reviewed commit: "Address review comments on StaticInvokeM..." | Re-trigger Greptile

- Initialize charsetName to null explicitly
- Add bounds check on arguments before accessing index 1
- Add trailing newline

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@thirtiseven
Copy link
Copy Markdown
Collaborator Author

@greptile review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEA] Support StringDecode for GBK encoding

2 participants