Support StringDecode for GBK encoding #14545
Register GPU override for Spark's StringDecode expression in GpuOverrides.scala, supporting GBK charset decoding on GPU via the new CharsetDecode JNI API in spark-rapids-jni.

- GpuStringDecode: GpuUnaryExpression that calls CharsetDecode.decode()
- Charset must be a literal; only GBK is supported (others fall back to CPU)
- Charset resolved at plan time, only binary column is runtime input

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
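The plan-time decision described in the commit message above can be sketched roughly as follows. This is a self-contained illustration, not the plugin's code: `Expr`, `Lit`, `ColumnRef`, `TagResult`, and `StringDecodeTagging` are hypothetical stand-ins for the Catalyst and spark-rapids types, and the case-insensitive charset match is an assumption.

```scala
// Hypothetical, simplified stand-ins for Catalyst expressions (not the real classes).
sealed trait Expr
final case class Lit(value: Any) extends Expr
final case class ColumnRef(name: String) extends Expr

// Outcome of tagging: either run on the GPU with a resolved charset, or fall back.
sealed trait TagResult
final case class RunOnGpu(charsetName: String) extends TagResult
final case class FallBackToCpu(reason: String) extends TagResult

object StringDecodeTagging {
  // Only GBK is supported on the GPU, per the commit message.
  private val supported = Set("GBK")

  // The charset is resolved once at plan time; only the binary column is a runtime input.
  def tag(charset: Expr): TagResult = charset match {
    case Lit(s: String) if supported.contains(s.toUpperCase(java.util.Locale.ROOT)) =>
      // Assumption: charset names are matched case-insensitively.
      RunOnGpu(s.toUpperCase(java.util.Locale.ROOT))
    case Lit(null) =>
      FallBackToCpu("charset literal is null")
    case Lit(other) =>
      FallBackToCpu(s"charset $other is not supported on GPU")
    case _ =>
      FallBackToCpu("charset must be a literal")
  }
}
```

Under these assumptions, `StringDecodeTagging.tag(Lit("GBK"))` yields `RunOnGpu("GBK")`, while a `ColumnRef` charset or a `Lit("UTF-8")` yields a `FallBackToCpu` with the corresponding reason.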
@greptile review
Greptile Summary
This PR adds GPU-accelerated
Confidence Score: 5/5
Safe to merge; implementation is correct and consistent with existing patterns.
All remaining findings are P2. The charsetName null concern is a defensive coding preference: the framework guarantees convertToGpuImpl is never called after willNotWorkOnGpu. The prior thread's argument bounds-check concern has been addressed with the arguments.size < 2 guard at line 59 of objectsExpressions.scala.
No files require special attention.
Important Files Changed
Sequence Diagram
sequenceDiagram
    participant Spark as Spark Catalyst
    participant ShimMeta as StringDecodeShims/StaticInvokeMeta
    participant GpuExpr as GpuStringDecode
    participant JNI as CharsetDecode (JNI)
    Spark->>ShimMeta: tagExprForGpu()
    ShimMeta->>ShimMeta: extractLit(charset) == GBK?
    alt GBK literal
        ShimMeta->>ShimMeta: charsetName = "GBK"
        ShimMeta->>GpuExpr: convertToGpu() → GpuStringDecode(bin, "GBK")
        GpuExpr->>JNI: CharsetDecode.decode(column, GBK)
        JNI-->>GpuExpr: UTF-8 ColumnVector
    else non-GBK or non-literal
        ShimMeta->>Spark: willNotWorkOnGpu → CPU fallback
    end
Reviews (2): Last reviewed commit: "Address review comments on StaticInvokeM..."
sql-plugin/src/main/scala/com/nvidia/spark/rapids/objectsExpressions.scala
- Initialize charsetName to null explicitly
- Add bounds check on arguments before accessing index 1
- Add trailing newline

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
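The bounds check mentioned above (the review summary refers to it as an `arguments.size < 2` guard) might look like the following minimal sketch. `CharsetArg` and `charsetArg` are hypothetical names, and treating index 1 as the charset argument is an assumption drawn from the commit message, not a copy of the code in objectsExpressions.scala.

```scala
object CharsetArg {
  // Return the argument at index 1 only when it exists, so callers never
  // index past the end of a short argument list.
  def charsetArg[A](arguments: Seq[A]): Option[A] =
    if (arguments.size < 2) None else Some(arguments(1))
}
```

A caller would treat `None` as a reason to fall back to CPU rather than throwing. `arguments.lift(1)` is an equivalent, more compact way to write the same guard.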
@greptile review
Fixes #14542
Description
This PR adds GPU support for StringDecode with the GBK charset.
Depends on NVIDIA/spark-rapids-jni#4431
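For reference, decoding GBK bytes into a string on the JVM can be sketched with the standard `java.nio.charset` API. This is only a CPU-side illustration of the operation being accelerated; `GbkReference` is a hypothetical helper, and the handling of malformed input here is the JDK default, which may not match either Spark's CPU path or the GPU kernel.

```scala
import java.nio.charset.Charset

object GbkReference {
  // GBK ships with standard JDK distributions; forName throws if it is unavailable.
  private val gbk: Charset = Charset.forName("GBK")

  // Decode GBK-encoded bytes into a JVM String (malformed bytes are replaced by the JDK).
  def decode(bytes: Array[Byte]): String = new String(bytes, gbk)

  // Encode a String as GBK bytes, useful for building test input.
  def encode(s: String): Array[Byte] = s.getBytes(gbk)
}
```

For example, the bytes `D6 D0 CE C4` are the GBK encoding of the two characters U+4E2D U+6587, so `GbkReference.decode` on those four bytes returns that two-character string.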
Performance testing:
bench_gbk_decode.scala.zip
Checklists
(Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)