Skip to content
Draft
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions docs/additional-functionality/advanced_configs.md
Original file line number Diff line number Diff line change
Expand Up @@ -171,6 +171,7 @@ Name | Description | Default Value | Applicable at
<a name="sql.window.range.int.enabled"></a>spark.rapids.sql.window.range.int.enabled|When the order-by column of a range based window is int type and the range boundary calculated for a value has overflow, CPU and GPU will get the different results. When set to false disables the range window acceleration for the int type order-by column|true|Runtime
<a name="sql.window.range.long.enabled"></a>spark.rapids.sql.window.range.long.enabled|When the order-by column of a range based window is long type and the range boundary calculated for a value has overflow, CPU and GPU will get the different results. When set to false disables the range window acceleration for the long type order-by column|true|Runtime
<a name="sql.window.range.short.enabled"></a>spark.rapids.sql.window.range.short.enabled|When the order-by column of a range based window is short type and the range boundary calculated for a value has overflow, CPU and GPU will get the different results. When set to false disables the range window acceleration for the short type order-by column|false|Runtime

## Supported GPU Operators and Fine Tuning
_The RAPIDS Accelerator for Apache Spark_ can be configured to enable or disable specific
GPU accelerated expressions. Enabled expressions are candidates for GPU execution. If the
Expand Down Expand Up @@ -385,6 +386,7 @@ Name | SQL Function(s) | Description | Default Value | Notes
<a name="sql.expression.Sqrt"></a>spark.rapids.sql.expression.Sqrt|`sqrt`|Square root|true|None|
<a name="sql.expression.Stack"></a>spark.rapids.sql.expression.Stack|`stack`|Separates expr1, ..., exprk into n rows.|true|None|
<a name="sql.expression.StartsWith"></a>spark.rapids.sql.expression.StartsWith| |Starts with|true|None|
<a name="sql.expression.StringDecode"></a>spark.rapids.sql.expression.StringDecode| |Decodes binary data from a charset to a UTF-8 string|true|None|
<a name="sql.expression.StringInstr"></a>spark.rapids.sql.expression.StringInstr|`instr`|Instr string operator|true|None|
<a name="sql.expression.StringLPad"></a>spark.rapids.sql.expression.StringLPad| |Pad a string on the left|true|None|
<a name="sql.expression.StringLocate"></a>spark.rapids.sql.expression.StringLocate|`locate`, `position`|Substring search operator|true|None|
Expand Down
78 changes: 76 additions & 2 deletions docs/supported_ops.md
Original file line number Diff line number Diff line change
Expand Up @@ -8267,7 +8267,7 @@ are limited.
<td> </td>
<td> </td>
<td> </td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types BINARY, CALENDAR, UDT, DAYTIME, YEARMONTH</em></td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types CALENDAR, UDT, DAYTIME, YEARMONTH</em></td>
<td> </td>
<td> </td>
<td> </td>
Expand All @@ -8290,7 +8290,7 @@ are limited.
<td> </td>
<td> </td>
<td> </td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types BINARY, CALENDAR, UDT, DAYTIME, YEARMONTH</em></td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types CALENDAR, UDT, DAYTIME, YEARMONTH</em></td>
<td> </td>
<td> </td>
<td> </td>
Expand Down Expand Up @@ -16773,6 +16773,80 @@ are limited.
<th>YEARMONTH</th>
</tr>
<tr>
<td rowSpan="3">StringDecode</td>
<td rowSpan="3"> </td>
<td rowSpan="3">Decodes binary data from a charset to a UTF-8 string</td>
<td rowSpan="3">None</td>
<td rowSpan="3">project</td>
<td>bin</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>S</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>charset</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td><em>PS<br/>Literal value only</em></td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>result</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>S</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td rowSpan="3">StringInstr</td>
<td rowSpan="3">`instr`</td>
<td rowSpan="3">Instr string operator</td>
Expand Down
22 changes: 22 additions & 0 deletions integration_tests/src/main/python/string_test.py
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,28 @@

_regexp_conf = { 'spark.rapids.sql.regexp.enabled': 'true' }

def test_string_decode_gbk_basic():
"""Test GBK charset decoding with known Chinese characters, ASCII, mixed content, and nulls."""
data = [
(bytearray(b'\xc4\xe3\xba\xc3'),), # 你好
(bytearray(b'\xca\xc0\xbd\xe7'),), # 世界
(bytearray(b'Hello'),), # Pure ASCII
(bytearray(b'Hi\xc4\xe3\xba\xc3World'),), # Mixed ASCII + Chinese
(bytearray(b''),), # Empty
(None,), # Null
(bytearray(b'\xff\xff'),), # Invalid bytes
(bytearray(b'\x81'),), # Truncated lead byte
]
assert_gpu_and_cpu_are_equal_collect(
lambda spark: spark.createDataFrame(data, "bin binary").selectExpr(
"decode(bin, 'GBK')"))

def test_string_decode_gbk_random():
"""Test GBK decoding with random binary data to exercise error handling paths."""
assert_gpu_and_cpu_are_equal_collect(
lambda spark: unary_op_df(spark, BinaryGen(min_length=0, max_length=50), length=2048)
.selectExpr("decode(a, 'GBK')"))

def mk_str_gen(pattern):
return StringGen(pattern).with_special_case('').with_special_pattern('.{0,10}')

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -4184,7 +4184,8 @@ object GpuOverrides extends Logging {
val expressions: Map[Class[_ <: Expression], ExprRule[_ <: Expression]] =
commonExpressions ++ TimeStamp.getExprs ++ GpuHiveOverrides.exprs ++
ZOrderRules.exprs ++ DecimalArithmeticOverrides.exprs ++
BloomFilterShims.exprs ++ InSubqueryShims.exprs ++ RaiseErrorShim.exprs ++
BloomFilterShims.exprs ++ StringDecodeShims.exprs ++
InSubqueryShims.exprs ++ RaiseErrorShim.exprs ++
ExternalSource.exprRules ++ SparkShimImpl.getExprs

def wrapScan[INPUT <: Scan](
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -16,25 +16,65 @@

package com.nvidia.spark.rapids

import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.expressions.objects.StaticInvoke
import org.apache.spark.sql.rapids.ExternalSource
import org.apache.spark.sql.rapids.{ExternalSource, GpuStringDecode}

/**
* Meta class for overriding StaticInvoke expressions.
* <br/>
* When writing to partitioned table, iceberg needs to compute the partition values based on the
* partition spec using [[StaticInvoke]] expression.
* Handles two cases:
* - Iceberg partition value computation via StaticInvoke
* - Spark 4.0+ StringDecode (RuntimeReplaceable replaced by StaticInvoke)
*/
class StaticInvokeMeta(expr: StaticInvoke,
conf: RapidsConf,
parent: Option[RapidsMeta[_, _, _]],
rule: DataFromReplacementRule) extends ExprMeta[StaticInvoke](expr, conf, parent, rule) {

private val isStringDecode: Boolean = {
expr.staticObject.getName == "org.apache.spark.sql.catalyst.expressions.StringDecode" &&
expr.functionName == "decode"
}

private var charsetName: String = _

override val childExprs: Seq[BaseExprMeta[_]] = if (isStringDecode) {
// StringDecode StaticInvoke: decode(bin, charset, legacyCharsets, legacyErrorAction)
// Only wrap the first child (bin) as the GPU expression input
expr.arguments.take(1).map(GpuOverrides.wrapExpr(_, conf, Some(this)))
} else {
Seq.empty
}

override def tagExprForGpu(): Unit = {
ExternalSource.tagForGpu(expr, this)
if (isStringDecode) {
tagStringDecode()
} else {
ExternalSource.tagForGpu(expr, this)
}
}

private def tagStringDecode(): Unit = {
// charset is the second argument, must be a foldable string literal
val charsetExpr = expr.arguments(1)
GpuOverrides.extractLit(charsetExpr).map(_.value) match {
case Some(cs: org.apache.spark.unsafe.types.UTF8String) if cs != null =>
charsetName = cs.toString.toUpperCase
if (charsetName != "GBK") {
willNotWorkOnGpu(s"only GBK charset is supported on GPU, got: $charsetName")
}
case _ =>
willNotWorkOnGpu("charset must be a string literal for GPU StringDecode")
}
}

override def convertToGpuImpl(): GpuExpression = {
ExternalSource.convertToGpu(expr, this)
if (isStringDecode) {
val bin = childExprs.head.convertToGpu().asInstanceOf[Expression]
GpuStringDecode(bin, charsetName)
} else {
ExternalSource.convertToGpu(expr, this)
}
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,7 @@ import com.nvidia.spark.rapids.Arm._
import com.nvidia.spark.rapids.RapidsPluginImplicits._
import com.nvidia.spark.rapids.jni.{Arithmetic, RoundMode}
import com.nvidia.spark.rapids.jni.CastStrings
import com.nvidia.spark.rapids.jni.CharsetDecode
import com.nvidia.spark.rapids.jni.GpuSubstringIndexUtils
import com.nvidia.spark.rapids.jni.NumberConverter
import com.nvidia.spark.rapids.jni.RegexRewriteUtils
Expand Down Expand Up @@ -2516,3 +2517,22 @@ case class GpuFormatNumber(x: Expression, d: Expression)
}
}
}

case class GpuStringDecode(bin: Expression, charsetName: String)
extends GpuUnaryExpression with ImplicitCastInputTypes with NullIntolerantShim {

override def child: Expression = bin

override def dataType: DataType = StringType

override def inputTypes: Seq[AbstractDataType] = Seq(BinaryType)

override def doColumnar(input: GpuColumnVector): ColumnVector = {
val charsetId = charsetName match {
case "GBK" => CharsetDecode.GBK
case other =>
throw new UnsupportedOperationException(s"Unsupported charset on GPU: $other")
}
CharsetDecode.decode(input.getBase, charsetId)
}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,74 @@
/*
* Copyright (c) 2026, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

/*** spark-rapids-shim-json-lines
{"spark": "321"}
{"spark": "330"}
{"spark": "330db"}
{"spark": "331"}
{"spark": "332"}
{"spark": "332db"}
{"spark": "333"}
{"spark": "334"}
{"spark": "340"}
{"spark": "341"}
{"spark": "341db"}
{"spark": "342"}
{"spark": "343"}
{"spark": "344"}
{"spark": "350"}
{"spark": "350db143"}
{"spark": "351"}
{"spark": "352"}
{"spark": "353"}
{"spark": "354"}
{"spark": "355"}
{"spark": "356"}
{"spark": "357"}
{"spark": "358"}
spark-rapids-shim-json-lines ***/
package com.nvidia.spark.rapids.shims

import com.nvidia.spark.rapids._

import org.apache.spark.sql.catalyst.expressions.{Expression, StringDecode}
import org.apache.spark.sql.rapids.GpuStringDecode

object StringDecodeShims {
val exprs: Map[Class[_ <: Expression], ExprRule[_ <: Expression]] = Seq(
GpuOverrides.expr[StringDecode](
"Decodes binary data from a charset to a UTF-8 string",
ExprChecks.binaryProject(TypeSig.STRING, TypeSig.STRING,
("bin", TypeSig.BINARY, TypeSig.BINARY),
("charset", TypeSig.lit(TypeEnum.STRING), TypeSig.STRING)),
(a, conf, p, r) => new BinaryExprMeta[StringDecode](a, conf, p, r) {
private var charsetName: String = _
override def tagExprForGpu(): Unit = {
GpuOverrides.extractStringLit(a.charset) match {
case Some(cs) if cs != null =>
charsetName = cs.toUpperCase
if (charsetName != "GBK") {
willNotWorkOnGpu(s"only GBK charset is supported on GPU, got: $charsetName")
}
case _ =>
willNotWorkOnGpu("charset must be a string literal for GPU StringDecode")
}
}
override def convertToGpu(lhs: Expression, rhs: Expression): GpuExpression =
GpuStringDecode(lhs, charsetName)
})
).map(r => (r.getClassFor.asSubclass(classOf[Expression]), r)).toMap
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
/*
* Copyright (c) 2026, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

/*** spark-rapids-shim-json-lines
{"spark": "400"}
{"spark": "400db173"}
{"spark": "401"}
{"spark": "402"}
{"spark": "411"}
spark-rapids-shim-json-lines ***/
package com.nvidia.spark.rapids.shims

import com.nvidia.spark.rapids.ExprRule

import org.apache.spark.sql.catalyst.expressions.Expression

// Spark 4.0+ changed StringDecode to RuntimeReplaceable, which is replaced by
// StaticInvoke(StringDecode.class, "decode", ...) during analysis.
// GPU acceleration is handled in StaticInvokeMeta (objectsExpressions.scala).
object StringDecodeShims {
val exprs: Map[Class[_ <: Expression], ExprRule[_ <: Expression]] = Map.empty
}
1 change: 1 addition & 0 deletions tools/generated_files/330/operatorsScore.csv
Original file line number Diff line number Diff line change
Expand Up @@ -263,6 +263,7 @@ StartsWith,4
StaticInvoke,4
StddevPop,4
StddevSamp,4
StringDecode,4
StringInstr,4
StringLPad,4
StringLocate,4
Expand Down
3 changes: 3 additions & 0 deletions tools/generated_files/330/supportedExprs.csv
Original file line number Diff line number Diff line change
Expand Up @@ -570,6 +570,9 @@ Stack,S,`stack`,None,project,result,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,PS
StartsWith,S, ,None,project,src,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
StartsWith,S, ,None,project,search,NA,NA,NA,NA,NA,NA,NA,NA,NA,PS,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
StartsWith,S, ,None,project,result,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
StringDecode,S, ,None,project,bin,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA
StringDecode,S, ,None,project,charset,NA,NA,NA,NA,NA,NA,NA,NA,NA,PS,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
StringDecode,S, ,None,project,result,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
StringInstr,S,`instr`,None,project,str,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
StringInstr,S,`instr`,None,project,substr,NA,NA,NA,NA,NA,NA,NA,NA,NA,PS,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
StringInstr,S,`instr`,None,project,result,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
Expand Down
1 change: 1 addition & 0 deletions tools/generated_files/331/operatorsScore.csv
Original file line number Diff line number Diff line change
Expand Up @@ -264,6 +264,7 @@ StartsWith,4
StaticInvoke,4
StddevPop,4
StddevSamp,4
StringDecode,4
StringInstr,4
StringLPad,4
StringLocate,4
Expand Down
Loading
Loading