2 changes: 2 additions & 0 deletions docs/additional-functionality/advanced_configs.md
@@ -171,6 +171,7 @@ Name | Description | Default Value | Applicable at
<a name="sql.window.range.int.enabled"></a>spark.rapids.sql.window.range.int.enabled|When the order-by column of a range based window is int type and the range boundary calculated for a value has overflow, CPU and GPU will get the different results. When set to false disables the range window acceleration for the int type order-by column|true|Runtime
<a name="sql.window.range.long.enabled"></a>spark.rapids.sql.window.range.long.enabled|When the order-by column of a range based window is long type and the range boundary calculated for a value has overflow, CPU and GPU will get the different results. When set to false disables the range window acceleration for the long type order-by column|true|Runtime
<a name="sql.window.range.short.enabled"></a>spark.rapids.sql.window.range.short.enabled|When the order-by column of a range based window is short type and the range boundary calculated for a value has overflow, CPU and GPU will get the different results. When set to false disables the range window acceleration for the short type order-by column|false|Runtime

## Supported GPU Operators and Fine Tuning
_The RAPIDS Accelerator for Apache Spark_ can be configured to enable or disable specific
GPU accelerated expressions. Enabled expressions are candidates for GPU execution. If the
@@ -385,6 +386,7 @@ Name | SQL Function(s) | Description | Default Value | Notes
<a name="sql.expression.Sqrt"></a>spark.rapids.sql.expression.Sqrt|`sqrt`|Square root|true|None|
<a name="sql.expression.Stack"></a>spark.rapids.sql.expression.Stack|`stack`|Separates expr1, ..., exprk into n rows.|true|None|
<a name="sql.expression.StartsWith"></a>spark.rapids.sql.expression.StartsWith| |Starts with|true|None|
<a name="sql.expression.StringDecode"></a>spark.rapids.sql.expression.StringDecode| |Decodes binary data from a charset to a UTF-8 string|true|None|
<a name="sql.expression.StringInstr"></a>spark.rapids.sql.expression.StringInstr|`instr`|Instr string operator|true|None|
<a name="sql.expression.StringLPad"></a>spark.rapids.sql.expression.StringLPad| |Pad a string on the left|true|None|
<a name="sql.expression.StringLocate"></a>spark.rapids.sql.expression.StringLocate|`locate`, `position`|Substring search operator|true|None|
Expand Down
78 changes: 76 additions & 2 deletions docs/supported_ops.md
@@ -8267,7 +8267,7 @@ are limited.
<td> </td>
<td> </td>
<td> </td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types BINARY, CALENDAR, UDT, DAYTIME, YEARMONTH</em></td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types CALENDAR, UDT, DAYTIME, YEARMONTH</em></td>
<td> </td>
<td> </td>
<td> </td>
@@ -8290,7 +8290,7 @@ are limited.
<td> </td>
<td> </td>
<td> </td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types BINARY, CALENDAR, UDT, DAYTIME, YEARMONTH</em></td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types CALENDAR, UDT, DAYTIME, YEARMONTH</em></td>
<td> </td>
<td> </td>
<td> </td>
@@ -16773,6 +16773,80 @@ are limited.
<th>YEARMONTH</th>
</tr>
<tr>
<td rowSpan="3">StringDecode</td>
<td rowSpan="3"> </td>
<td rowSpan="3">Decodes binary data from a charset to a UTF-8 string</td>
<td rowSpan="3">None</td>
<td rowSpan="3">project</td>
<td>bin</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>S</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>charset</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td><em>PS<br/>Literal value only</em></td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>result</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>S</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td rowSpan="3">StringInstr</td>
<td rowSpan="3">`instr`</td>
<td rowSpan="3">Instr string operator</td>
22 changes: 22 additions & 0 deletions integration_tests/src/main/python/string_test.py
@@ -28,6 +28,28 @@

_regexp_conf = { 'spark.rapids.sql.regexp.enabled': 'true' }

def test_string_decode_gbk_basic():
"""Test GBK charset decoding with known Chinese characters, ASCII, mixed content, and nulls."""
data = [
        (bytearray(b'\xc4\xe3\xba\xc3'),),           # 你好 ("hello")
        (bytearray(b'\xca\xc0\xbd\xe7'),),           # 世界 ("world")
(bytearray(b'Hello'),), # Pure ASCII
(bytearray(b'Hi\xc4\xe3\xba\xc3World'),), # Mixed ASCII + Chinese
(bytearray(b''),), # Empty
(None,), # Null
(bytearray(b'\xff\xff'),), # Invalid bytes
(bytearray(b'\x81'),), # Truncated lead byte
]
assert_gpu_and_cpu_are_equal_collect(
lambda spark: spark.createDataFrame(data, "bin binary").selectExpr(
"decode(bin, 'GBK')"))

def test_string_decode_gbk_random():
"""Test GBK decoding with random binary data to exercise error handling paths."""
assert_gpu_and_cpu_are_equal_collect(
lambda spark: unary_op_df(spark, BinaryGen(min_length=0, max_length=50), length=2048)
.selectExpr("decode(a, 'GBK')"))
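
The GBK byte sequences used in the tests above can be sanity-checked against Python's built-in `gbk` codec. This is an illustrative CPU-side reference only, independent of the plugin; it is an assumption that Spark's CPU `decode()` (a `java.nio` charset decoder with replacement on error) behaves like Python's `errors='replace'` for malformed input.

```python
# Reference check of the GBK test vectors using Python's stdlib codec.
valid_cases = {
    b'\xc4\xe3\xba\xc3': '\u4f60\u597d',            # 你好 ("hello")
    b'\xca\xc0\xbd\xe7': '\u4e16\u754c',            # 世界 ("world")
    b'Hello': 'Hello',                              # ASCII passes through
    b'Hi\xc4\xe3\xba\xc3World': 'Hi\u4f60\u597dWorld',
    b'': '',
}

for raw, expected in valid_cases.items():
    assert raw.decode('gbk') == expected

# Malformed sequences decode to U+FFFD replacement characters rather
# than raising, mirroring the "invalid bytes" and "truncated lead byte"
# rows in the test data (assumed to match the JVM's REPLACE action).
assert '\ufffd' in b'\xff\xff'.decode('gbk', errors='replace')
assert '\ufffd' in b'\x81'.decode('gbk', errors='replace')
```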

def mk_str_gen(pattern):
return StringGen(pattern).with_special_case('').with_special_pattern('.{0,10}')

@@ -4184,7 +4184,8 @@ object GpuOverrides extends Logging {
val expressions: Map[Class[_ <: Expression], ExprRule[_ <: Expression]] =
commonExpressions ++ TimeStamp.getExprs ++ GpuHiveOverrides.exprs ++
ZOrderRules.exprs ++ DecimalArithmeticOverrides.exprs ++
BloomFilterShims.exprs ++ InSubqueryShims.exprs ++ RaiseErrorShim.exprs ++
BloomFilterShims.exprs ++ StringDecodeShims.exprs ++
InSubqueryShims.exprs ++ RaiseErrorShim.exprs ++
ExternalSource.exprRules ++ SparkShimImpl.getExprs

def wrapScan[INPUT <: Scan](
@@ -16,25 +16,69 @@

package com.nvidia.spark.rapids

import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.expressions.objects.StaticInvoke
import org.apache.spark.sql.rapids.ExternalSource
import org.apache.spark.sql.rapids.{ExternalSource, GpuStringDecode}

/**
* Meta class for overriding StaticInvoke expressions.
* <br/>
* When writing to partitioned table, iceberg needs to compute the partition values based on the
* partition spec using [[StaticInvoke]] expression.
* Handles two cases:
* - Iceberg partition value computation via StaticInvoke
* - Spark 4.0+ StringDecode (RuntimeReplaceable replaced by StaticInvoke)
*/
class StaticInvokeMeta(expr: StaticInvoke,
conf: RapidsConf,
parent: Option[RapidsMeta[_, _, _]],
rule: DataFromReplacementRule) extends ExprMeta[StaticInvoke](expr, conf, parent, rule) {

private val isStringDecode: Boolean = {
expr.staticObject.getName == "org.apache.spark.sql.catalyst.expressions.StringDecode" &&
expr.functionName == "decode"
}

private var charsetName: String = null

override val childExprs: Seq[BaseExprMeta[_]] = if (isStringDecode) {
// StringDecode StaticInvoke: decode(bin, charset, legacyCharsets, legacyErrorAction)
// Only wrap the first child (bin) as the GPU expression input
expr.arguments.take(1).map(GpuOverrides.wrapExpr(_, conf, Some(this)))
} else {
Seq.empty
}

override def tagExprForGpu(): Unit = {
ExternalSource.tagForGpu(expr, this)
if (isStringDecode) {
tagStringDecode()
} else {
ExternalSource.tagForGpu(expr, this)
}
}

private def tagStringDecode(): Unit = {
if (expr.arguments.size < 2) {
willNotWorkOnGpu("StringDecode StaticInvoke has unexpected argument count")
return
}
// charset is the second argument, must be a foldable string literal
val charsetExpr = expr.arguments(1)
GpuOverrides.extractLit(charsetExpr).map(_.value) match {
case Some(cs: org.apache.spark.unsafe.types.UTF8String) if cs != null =>
charsetName = cs.toString.toUpperCase
if (charsetName != "GBK") {
willNotWorkOnGpu(s"only GBK charset is supported on GPU, got: $charsetName")
}
case _ =>
willNotWorkOnGpu("charset must be a string literal for GPU StringDecode")
}
}

override def convertToGpuImpl(): GpuExpression = {
ExternalSource.convertToGpu(expr, this)
if (isStringDecode) {
val bin = childExprs.head.convertToGpu().asInstanceOf[Expression]
GpuStringDecode(bin, charsetName)
} else {
ExternalSource.convertToGpu(expr, this)
}
}
}
@@ -30,6 +30,7 @@ import com.nvidia.spark.rapids.Arm._
import com.nvidia.spark.rapids.RapidsPluginImplicits._
import com.nvidia.spark.rapids.jni.{Arithmetic, RoundMode}
import com.nvidia.spark.rapids.jni.CastStrings
import com.nvidia.spark.rapids.jni.CharsetDecode
import com.nvidia.spark.rapids.jni.GpuSubstringIndexUtils
import com.nvidia.spark.rapids.jni.NumberConverter
import com.nvidia.spark.rapids.jni.RegexRewriteUtils
@@ -2516,3 +2517,22 @@ case class GpuFormatNumber(x: Expression, d: Expression)
}
}
}

case class GpuStringDecode(bin: Expression, charsetName: String)
extends GpuUnaryExpression with ImplicitCastInputTypes with NullIntolerantShim {

override def child: Expression = bin

override def dataType: DataType = StringType

override def inputTypes: Seq[AbstractDataType] = Seq(BinaryType)

override def doColumnar(input: GpuColumnVector): ColumnVector = {
val charsetId = charsetName match {
case "GBK" => CharsetDecode.GBK
case other =>
throw new UnsupportedOperationException(s"Unsupported charset on GPU: $other")
}
CharsetDecode.decode(input.getBase, charsetId)
}
}
@@ -0,0 +1,74 @@
/*
* Copyright (c) 2026, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

/*** spark-rapids-shim-json-lines
{"spark": "321"}
{"spark": "330"}
{"spark": "330db"}
{"spark": "331"}
{"spark": "332"}
{"spark": "332db"}
{"spark": "333"}
{"spark": "334"}
{"spark": "340"}
{"spark": "341"}
{"spark": "341db"}
{"spark": "342"}
{"spark": "343"}
{"spark": "344"}
{"spark": "350"}
{"spark": "350db143"}
{"spark": "351"}
{"spark": "352"}
{"spark": "353"}
{"spark": "354"}
{"spark": "355"}
{"spark": "356"}
{"spark": "357"}
{"spark": "358"}
spark-rapids-shim-json-lines ***/
package com.nvidia.spark.rapids.shims

import com.nvidia.spark.rapids._

import org.apache.spark.sql.catalyst.expressions.{Expression, StringDecode}
import org.apache.spark.sql.rapids.GpuStringDecode

object StringDecodeShims {
val exprs: Map[Class[_ <: Expression], ExprRule[_ <: Expression]] = Seq(
GpuOverrides.expr[StringDecode](
"Decodes binary data from a charset to a UTF-8 string",
ExprChecks.binaryProject(TypeSig.STRING, TypeSig.STRING,
("bin", TypeSig.BINARY, TypeSig.BINARY),
("charset", TypeSig.lit(TypeEnum.STRING), TypeSig.STRING)),
(a, conf, p, r) => new BinaryExprMeta[StringDecode](a, conf, p, r) {
private var charsetName: String = _
override def tagExprForGpu(): Unit = {
GpuOverrides.extractStringLit(a.charset) match {
case Some(cs) if cs != null =>
charsetName = cs.toUpperCase
if (charsetName != "GBK") {
willNotWorkOnGpu(s"only GBK charset is supported on GPU, got: $charsetName")
}
case _ =>
willNotWorkOnGpu("charset must be a string literal for GPU StringDecode")
}
}
override def convertToGpu(lhs: Expression, rhs: Expression): GpuExpression =
GpuStringDecode(lhs, charsetName)
})
).map(r => (r.getClassFor.asSubclass(classOf[Expression]), r)).toMap
}
@@ -0,0 +1,35 @@
/*
* Copyright (c) 2026, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

/*** spark-rapids-shim-json-lines
{"spark": "400"}
{"spark": "400db173"}
{"spark": "401"}
{"spark": "402"}
{"spark": "411"}
spark-rapids-shim-json-lines ***/
package com.nvidia.spark.rapids.shims

import com.nvidia.spark.rapids.ExprRule

import org.apache.spark.sql.catalyst.expressions.Expression

// Spark 4.0+ changed StringDecode to RuntimeReplaceable, which is replaced by
// StaticInvoke(StringDecode.class, "decode", ...) during analysis.
// GPU acceleration is handled in StaticInvokeMeta (objectsExpressions.scala).
object StringDecodeShims {
val exprs: Map[Class[_ <: Expression], ExprRule[_ <: Expression]] = Map.empty
}
1 change: 1 addition & 0 deletions tools/generated_files/330/operatorsScore.csv
@@ -263,6 +263,7 @@ StartsWith,4
StaticInvoke,4
StddevPop,4
StddevSamp,4
StringDecode,4
StringInstr,4
StringLPad,4
StringLocate,4
3 changes: 3 additions & 0 deletions tools/generated_files/330/supportedExprs.csv
@@ -570,6 +570,9 @@ Stack,S,`stack`,None,project,result,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,PS
StartsWith,S, ,None,project,src,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
StartsWith,S, ,None,project,search,NA,NA,NA,NA,NA,NA,NA,NA,NA,PS,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
StartsWith,S, ,None,project,result,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
StringDecode,S, ,None,project,bin,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA
StringDecode,S, ,None,project,charset,NA,NA,NA,NA,NA,NA,NA,NA,NA,PS,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
StringDecode,S, ,None,project,result,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
StringInstr,S,`instr`,None,project,str,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
StringInstr,S,`instr`,None,project,substr,NA,NA,NA,NA,NA,NA,NA,NA,NA,PS,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
StringInstr,S,`instr`,None,project,result,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
1 change: 1 addition & 0 deletions tools/generated_files/331/operatorsScore.csv
@@ -264,6 +264,7 @@ StartsWith,4
StaticInvoke,4
StddevPop,4
StddevSamp,4
StringDecode,4
StringInstr,4
StringLPad,4
StringLocate,4