Add GBK charset decoding kernel for StringDecode #4431

thirtiseven wants to merge 5 commits into NVIDIA:main
Conversation
Add a GPU kernel that decodes GBK-encoded binary data to UTF-8 strings, matching Java's GBK CharsetDecoder behavior for use with Spark's StringDecode expression.

- Lookup table generated from Java's GBK CharsetDecoder (REPORT mode), with generator script included for reproducibility
- Two-pass kernel using cudf::strings::detail::make_strings_children
- Invalid byte sequences replaced with U+FFFD
- Second byte 0x7F excluded from pair consumption (matches Java)
- Second byte 0xFF consumed as pair but mapped to U+FFFD (matches Java)
- Java unit test with full 24066 byte-pair end-to-end verification

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Greptile Summary

This PR adds a GPU kernel that decodes GBK-encoded binary data to UTF-8 strings for Spark's StringDecode expression.

Confidence Score: 5/5. Safe to merge; all prior blocking concerns have been resolved, one minor P2 defensive suggestion remains. The only open finding is a missing static_assert to guard the table size, which is a P2 defensive suggestion. All P0/P1 concerns from prior review rounds have been addressed (Java-side validation added, 0x80 JDK behavior verified, REPORT/REPLACE mode difference documented and tested end-to-end). The two-pass kernel logic, sliced-column handling, null propagation, and edge-case byte sequences are all correctly implemented and backed by a comprehensive test suite including full 24066-pair verification.

src/main/cpp/src/charset_decode.cu — consider adding the static_assert for table size
| Filename | Overview |
|---|---|
| src/main/cpp/src/charset_decode.cu | New GBK→UTF-8 two-pass kernel using make_strings_children; logic is correct for all lead/second-byte edge cases, but lacks a static_assert to guard table size |
| src/main/cpp/src/charset_decode.hpp | New header with charset_type enum and decode_charset declaration; clean and minimal |
| src/main/cpp/src/CharsetDecodeJni.cpp | JNI bridge follows project conventions; input validated on Java side, uses default stream and device resource ref |
| src/main/cpp/scripts/GenerateGbkTable.java | Table generator uses REPORT mode with exception catch → 0xFFFD; iterates second bytes 0x40–0xFE, producing 24066 entries matching GBK_TABLE_SIZE |
| src/main/java/com/nvidia/spark/rapids/jni/CharsetDecode.java | Java API with charset constant and validation guard before JNI call; addresses prior review comment |
| src/test/java/com/nvidia/spark/rapids/jni/CharsetDecodeTest.java | Thorough test suite: ASCII, Chinese, mixed, empty, null rows, invalid bytes, truncated sequence, byte 0x80, all 24066 double-byte pairs, and sliced column regression test |
| src/main/cpp/CMakeLists.txt | charset_decode.cu and CharsetDecodeJni.cpp correctly added to the build targets |
Sequence Diagram

```mermaid
sequenceDiagram
    participant Java as CharsetDecode.java
    participant JNI as CharsetDecodeJni.cpp
    participant CPU as decode_charset()
    participant Init as get_gbk_table()
    participant GPU as gbk_decode_fn kernel
    Java->>Java: validate charset != GBK → throw
    Java->>JNI: decodeNative(nativeView, charset)
    JNI->>CPU: decode_charset(input, GBK, stream, mr)
    CPU->>Init: get_gbk_table(stream)
    Init-->>Init: call_once: memcpy host→device + stream.synchronize()
    Init-->>CPU: device pointer to 24066 uint16_t entries
    CPU->>GPU: make_strings_children pass 1 (d_chars=null): compute output sizes
    GPU-->>CPU: d_sizes filled
    CPU->>GPU: make_strings_children pass 2 (d_chars set): write UTF-8 bytes
    GPU-->>CPU: new_offsets + new_chars buffers
    CPU-->>JNI: strings column (UTF-8, null mask copied)
    JNI-->>Java: ColumnVector handle
```
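The size-then-write flow in passes 1 and 2 of the diagram can be illustrated in plain Java terms. This is a rough host-side analogy of the make_strings_children pattern, not the actual kernel; decodeRow here is a stand-in for gbk_decode_fn that borrows the JDK's GBK decoder:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TwoPassSketch {
  // Stand-in for one row's decode: GBK bytes -> UTF-8 bytes via the JDK.
  static byte[] decodeRow(byte[] gbk) {
    return new String(gbk, Charset.forName("GBK")).getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    byte[][] rows = {
      "hello".getBytes(StandardCharsets.US_ASCII),
      new byte[]{(byte) 0xC4, (byte) 0xE3}  // GBK pair for one Chinese character
    };

    // Pass 1: compute per-row output sizes, then prefix-sum them into offsets.
    int[] offsets = new int[rows.length + 1];
    for (int i = 0; i < rows.length; i++) {
      offsets[i + 1] = offsets[i] + decodeRow(rows[i]).length;
    }

    // Pass 2: write each row's UTF-8 bytes at its offset in one flat chars buffer.
    byte[] chars = new byte[offsets[rows.length]];
    for (int i = 0; i < rows.length; i++) {
      byte[] utf8 = decodeRow(rows[i]);
      System.arraycopy(utf8, 0, chars, offsets[i], utf8.length);
    }

    // "hello" stays 5 bytes; the double-byte GBK pair becomes 3 UTF-8 bytes.
    System.out.println(offsets[1] + "," + offsets[2]);  // prints 5,8
  }
}
```

On the GPU the two passes run the same row function twice; the first pass only fills d_sizes, and the offsets column is produced by a device prefix sum between the passes.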
Reviews (4): Last reviewed commit: "address comments"
```cpp
if (byte <= 0x7F) {
  // ASCII: single byte, maps directly to UTF-8
  if (out_ptr) { *out_ptr++ = static_cast<char>(byte); }
  output_size += 1;
  i += 1;
} else if (byte >= GBK_FIRST_BYTE_MIN && byte <= GBK_FIRST_BYTE_MAX && (i + 1) < len) {
```
Byte 0x80 not mapped to U+20AC (€) — mismatch with Java GBK
Java's "GBK" charset in both Oracle JDK and OpenJDK is implemented as CP936/Windows-936, which maps the single byte 0x80 to U+20AC (Euro sign, €). The current kernel treats 0x80 as an invalid lead byte (it falls outside [GBK_FIRST_BYTE_MIN, GBK_FIRST_BYTE_MAX] = [0x81, 0xFE]) and emits U+FFFD instead.
This causes a behavioral divergence from Spark's CPU-side StringDecode("GBK") on any JVM where GBK maps 0x80 → € (i.e., practically all modern JVMs). The testByte0x80 unit test acknowledges this JDK dependency — the test itself will fail on any JVM that maps 0x80 → € because the GPU always returns FFFD.
To match Java's GBK decoder, byte 0x80 should be decoded as U+20AC:
```cpp
if (byte == 0x80) {
  // CP936/Windows-936 extension: 0x80 → U+20AC (EURO SIGN)
  int utf8_len = codepoint_to_utf8(0x20AC, out_ptr);
  if (out_ptr) { out_ptr += utf8_len; }
  output_size += utf8_len;
  i += 1;
} else if (byte <= 0x7F) {
  // ASCII ...
```
We verified this on the actual build JDK (OpenJDK 1.8.0_482): new String(new byte[]{(byte)0x80}, "GBK") returns U+FFFD, not U+20AC. We also tried adding a 0x80 → U+20AC special case in the kernel — the integration test (GPU vs CPU comparison with random binary data) failed immediately because the CPU side produces FFFD.
The testByte0x80 test uses decodeGbkJava() which dynamically queries the JDK behavior as ground truth, so it adapts to whichever JDK is in use. The claim that "practically all modern JVMs" map 0x80 to € does not hold for this JDK.
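The JDK dependence described above can be probed directly. This is a small standalone sketch, not project code; default String decoding uses REPLACE mode, the same mode as the test's decodeGbkJava ground truth, and both documented outcomes decode standalone 0x80 to exactly one char:

```java
import java.nio.charset.Charset;

public class Probe0x80 {
  public static void main(String[] args) {
    // new String(..., charset) decodes with REPLACE mode by default.
    String s = new String(new byte[]{(byte) 0x80}, Charset.forName("GBK"));
    char c = s.charAt(0);
    // Either U+20AC (CP936-style GBK mapping) or U+FFFD
    // (e.g. OpenJDK 1.8.0_482, per the reply above).
    System.out.println(String.format("U+%04X", (int) c));
  }
}
```

Running this on the build JDK is a quick way to confirm which behavior the CPU side of the integration test will exhibit.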
```cpp
static std::once_flag gbk_table_init_flag;
static std::unique_ptr<rmm::device_uvector<uint16_t>> gbk_table_device;

uint16_t const* get_gbk_table(rmm::cuda_stream_view stream)
{
  // synchronize() ensures the memcpy completes before any kernel on any stream
  // can use the table pointer returned below.
  std::call_once(gbk_table_init_flag, [stream]() {
    gbk_table_device = std::make_unique<rmm::device_uvector<uint16_t>>(GBK_TABLE_SIZE, stream);
    cudf::detail::cuda_memcpy_async(
      cudf::device_span<uint16_t>(gbk_table_device->data(), GBK_TABLE_SIZE),
      cudf::host_span<uint16_t const>(gbk_to_unicode_host, GBK_TABLE_SIZE),
      stream);
    stream.synchronize();
  });
  return gbk_table_device->data();
}
```
Static device table singleton — dangling pointer risk on CUDA/RMM re-init
gbk_table_device and gbk_table_init_flag are file-scope statics. std::once_flag is never reset. If CUDA or RMM is torn down and re-initialized within the same process (a common pattern in unit-test suites and JVM-embedded workloads), the sequence is:

1. First run: the table is allocated with the stream from the first call.
2. The RMM allocator is reset / the CUDA context is destroyed (test teardown). gbk_table_device's internal device pointer is now dangling.
3. The next call to decode_gbk: call_once sees the flag already set, skips initialization, and returns the dangling pointer.
4. The GPU kernel reads from freed device memory → silent corruption or crash.

Additionally, at process exit, the static destructor of gbk_table_device issues a cudaFree after CUDA is shut down, which can surface as a CUDA error in test logs.
Consider managing this table through the existing cudf resource lifecycle, or at minimum validating the device pointer on each call:

```cpp
// Option: use a function-local static that ties lifetime to RMM resource
// Or check before returning:
CUDF_EXPECTS(gbk_table_device != nullptr, "GBK table not initialized");
```
Acknowledged. This is a known pattern in the project — other static singletons like the bloom filter hash seed table (bloom_filter.cu) follow the same approach. In spark-rapids-jni, the CUDA context and RMM allocator are not torn down and re-initialized within an executor process lifetime.
If this becomes a project-wide concern, it should be addressed holistically rather than in this PR alone.
```java
public static ColumnVector decode(ColumnView cv, int charset) {
  return new ColumnVector(decodeNative(cv.getNativeView(), charset));
}
```
Raw int charset parameter — no validation before JNI call
decode(ColumnView cv, int charset) accepts any arbitrary integer. In CharsetDecodeJni.cpp, this is directly cast to charset_type via static_cast<spark_rapids_jni::charset_type>(charset). An invalid value (e.g., 99) will reach the switch statement and hit default: CUDF_FAIL(...), but only after an unchecked cast from an integer to an enum.
Consider validating the input before passing to native code, or using an enum on the Java side:

```java
public static ColumnVector decode(ColumnView cv, int charset) {
  if (charset != GBK) {
    throw new IllegalArgumentException("Unsupported charset: " + charset);
  }
  return new ColumnVector(decodeNative(cv.getNativeView(), charset));
}
```

Alternatively, exposing an enum-typed overload would make misuse a compile-time error.
Fair point. The C++ side already catches invalid values via CUDF_FAIL in the default switch branch, so there is no undefined behavior. The only caller is the GPU plugin which always passes CharsetDecode.GBK. Will add a Java-side validation as defense in depth.
```java
    .onMalformedInput(CodingErrorAction.REPORT)
    .onUnmappableCharacter(CodingErrorAction.REPORT);

System.out.println("// GBK to Unicode lookup table.");
System.out.println("// Generated by GenerateGbkTable.java using Java's GBK CharsetDecoder.");
System.out.println("// Index: (first_byte - 0x81) * 191 + (second_byte - 0x40)");
System.out.println("// Entries: 126 * 191 = 24066");

for (int first = 0x81; first <= 0xFE; first++) {
  StringBuilder sb = new StringBuilder();
  for (int second = 0x40; second <= 0xFE; second++) {
    byte[] bytes = {(byte) first, (byte) second};
    int cp;
    try {
      decoder.reset();
      CharBuffer cb = decoder.decode(ByteBuffer.wrap(bytes));
      cp = (cb.length() == 1) ? cb.charAt(0) : 0xFFFD;
    } catch (Exception e) {
      cp = 0xFFFD;
    }
    if (second > 0x40) {
```
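The index formula from the generated header can be sanity-checked with a quick arithmetic sketch. The pair 0xB0 0xA1 (GBK for '啊') is used as an assumed example; the values below are computed here, not taken from the PR:

```java
public class GbkIndex {
  public static void main(String[] args) {
    int GBK_TABLE_SIZE = 126 * 191;   // 24066 entries, matching the generator output
    int first = 0xB0, second = 0xA1;  // GBK double-byte pair for '啊'
    // Index: (first_byte - 0x81) * 191 + (second_byte - 0x40)
    int index = (first - 0x81) * 191 + (second - 0x40);
    System.out.println(index + " of " + GBK_TABLE_SIZE);  // prints 9074 of 24066
  }
}
```

Every pair with first in [0x81, 0xFE] and second in [0x40, 0xFE] maps into [0, 24066), which is why the second-byte 0xFF case needs the separate handling discussed below.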
Generator uses REPORT mode; test ground truth uses REPLACE mode

The generator uses CodingErrorAction.REPORT (throws MalformedInputException/UnmappableCharacterException, caught and mapped to 0xFFFD), while CharsetDecodeTest.decodeGbkJava uses CodingErrorAction.REPLACE. For standard GBK pairs these produce identical tables, but the generator does not handle standalone 0x80 at all (it only iterates first from 0x81 to 0xFE). This is fine for the double-byte table itself, but it documents a discrepancy with the test's ground truth function worth noting:

- Generator: 0x80 is not iterated → not in the table.
- Java ground truth (decodeGbkJava): 0x80 is decoded with REPLACE mode → returns € on most JVMs.

The generator comment should note that 0x80 → U+20AC needs separate handling in the kernel, since it is outside the generated table range.
Good observation on the mode difference. We investigated this thoroughly — the behavioral difference between REPORT and REPLACE modes is limited to:

- Second byte 0x7F: REPLACE mode splits (lead, 0x7F) into two characters (FFFD + DEL), while REPORT mode on the isolated 2-byte buffer throws. The kernel excludes 0x7F from valid second bytes (second != 0x7F) to match REPLACE behavior.
- Second byte 0xFF: REPLACE mode consumes (lead, 0xFF) as a pair (advance 2, emit FFFD), while the table range only covers 0x40-0xFE. The kernel handles this with a special case (second > GBK_SECOND_BYTE_MAX → UNICODE_REPLACEMENT, advance 2).

We validated this by running a Java simulation of the kernel logic against Java REPLACE-mode streaming decode on 100k random byte sequences with zero mismatches.

Regarding 0x80: as noted in the reply above, this JDK (OpenJDK 1.8.0) returns FFFD for standalone 0x80, not €. The test dynamically adapts to the JDK behavior.
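The two REPLACE-mode edge cases described in this thread can be reproduced with a plain JDK snippet. This is a sketch assuming the behavior stated above (new String decodes with REPLACE mode by default; 0x81 is an assumed example lead byte):

```java
import java.nio.charset.Charset;

public class GbkEdgeCases {
  public static void main(String[] args) {
    Charset gbk = Charset.forName("GBK");

    // (lead, 0x7F): the lead byte alone is malformed -> U+FFFD,
    // then 0x7F is re-decoded on its own as DEL. Two output chars.
    String a = new String(new byte[]{(byte) 0x81, 0x7F}, gbk);

    // (lead, 0xFF): consumed together as one malformed pair -> a single U+FFFD.
    String b = new String(new byte[]{(byte) 0x81, (byte) 0xFF}, gbk);

    System.out.println(a.length() + " " + b.length());
  }
}
```

Per this thread the decoded results are "\uFFFD\u007F" and "\uFFFD" respectively; as with the 0x80 case, the exact behavior is a property of the JDK's GBK decoder.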
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
res-life left a comment
P1: This looks incorrect for sliced LIST<UINT8> inputs. lists_column_view::offsets() returns the backing offsets column, and passing offsets.data<cudf::size_type>() into gbk_decode_fn does not apply the parent list offset. For a sliced binary column, idx == 0 will therefore use the unsliced row-0 boundaries rather than the first row in the slice, so StringDecode(GBK) can silently return wrong strings. Please use a slice-aware offset accessor (for example offsets_begin(), or otherwise account for input.offset()) and add a regression test that decodes a sliced binary column.
Could we test the perf and give some numbers?

I added some e2e perf numbers in NVIDIA/spark-rapids#14545
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
nice catch! updated.