Add a protocol buffer decode kernel by thirtiseven · Pull Request #4107 · NVIDIA/spark-rapids-jni

thirtiseven · 2025-12-23T08:57:47Z

This PR adds a protocol buffer decoder covering a large subset of proto2 features, to support the Spark expression from_protobuf.

I consider the code ready, but it is too large to review in one pass, so I'm splitting it into smaller parts:

Part 0: #4373

PR Split Plan

This PR is being split into a linear chain of four focused PRs to make review more manageable. Each PR is independently compilable and testable, and each subsequent PR is purely additive to the prior one (mostly inserting new code sections rather than modifying existing logic):

PR 1 — Framework + API + stub decode (~2,500 lines, ~16 tests): Establishes the full public API surface (protobuf.hpp, Protobuf.java, ProtobufSchemaDescriptor.java), the JNI bridge (ProtobufJni.cpp), the shared infrastructure header (protobuf_common.cuh), and a stub decode_protobuf_to_struct that validates the schema and returns a correctly-shaped STRUCT column with all-null children. The shared header is included in full so that no follow-up PR needs to modify it. Review focus: schema validation logic (both Java and C++ sides), JNI memory safety, API contract, and correct null-column type construction for nested/repeated schemas.
PR 2 — Scalar type extraction (~3,700 lines, ~55 tests): Adds the core decode pipeline — scan_all_fields_kernel for single-pass field location recording, batched 2D extraction kernels grouped by type (varint, fixed32/64, zigzag), two-phase string/bytes construction, default value substitution, and required-field checks. Covers all scalar protobuf types (int32, int64, uint32, uint64, sint32/64, fixed32/64, sfixed32/64, float, double, bool, string, bytes). The change is a pure insertion into protobuf.cu between the field classification and the assembly section — no existing code is modified.
PR 3 — Repeated fields + nested messages (~4,400 lines, ~25 tests): Adds the three-phase repeated-field pipeline (count → prefix-sum offsets → scan occurrences → build LIST columns) and recursive nested-message decoding (up to 10 levels deep). Includes repeated-in-nested and repeated-in-repeated support. Again a pure insertion into the orchestrator, between the scalar section and the final assembly.
PR 4 — Enum-as-string + PERMISSIVE mode (~1,900 lines, ~40 tests): Adds enum value validation, varint-to-UTF8-name conversion with GPU-parallel lookup tables, and PERMISSIVE mode null propagation (invalid enum or malformed rows nullify the entire struct row and propagate nulls to all descendants). This is the only PR that modifies a small amount of existing code (adding d_row_force_null tracking to the setup section and null propagation after the assembly).

Benchmarks (8 NVBench cases, ~1,400 lines) will be submitted separately or appended to PR 4.

Summary

This PR adds a GPU-accelerated protobuf decoder that converts LIST<INT8/UINT8> columns (one serialized protobuf message per row) into nested cuDF STRUCT columns via JNI. This is the native kernel layer that powers from_protobuf() GPU acceleration in the spark-rapids plugin.

The implementation spans ~6,500 lines of new CUDA/C++/Java code and ~3,900 lines of tests, organized into a clean four-file C++ architecture plus a validated Java schema API.

Key capabilities

All scalar protobuf types: int32, int64, uint32, uint64, sint32/sint64 (zigzag), fixed32/sfixed32/fixed64/sfixed64, float, double, bool, string, bytes
Nested messages: up to 10 levels deep, recursive decode
Repeated fields: both packed and unpacked encoding, auto-detected per-row
Repeated messages: ArrayType(StructType) — repeated nested messages with arbitrary child fields
Repeated-in-nested: repeated fields inside nested messages, repeated fields inside repeated messages
Enum-as-string: varint → validated enum → UTF-8 string name conversion, with lookup tables for GPU-parallel name resolution
Default values: per-field defaults for all scalar types and strings
Required field validation: proto2-style required field checks
PERMISSIVE / FAILFAST modes: configurable error handling — permissive mode nullifies malformed rows instead of throwing; invalid enum values nullify the entire struct row and propagate nulls to all descendants
Schema projection ready: the flattened schema representation supports decoding arbitrary subsets of fields

Performance characteristics

Multi-pass algorithm optimized for GPU occupancy: many simple kernels are preferred over one complex kernel
O(1) field-number lookup tables for scan/count kernels (up to FIELD_LOOKUP_TABLE_MAX = 4096)
Batched scalar extraction groups fields by type to minimize kernel launches
Two-phase string construction (compute lengths → prefix sum → copy) avoids pre-allocation guessing
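The two-phase string construction listed above can be illustrated on the host. The following is a minimal C++ sketch, not the PR's code: `field_location`, `string_column`, and `build_strings` are hypothetical names, and the loops stand in for what would be separate GPU kernels plus a device prefix sum.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical record of where one string payload lives in the input buffer.
struct field_location {
  std::size_t offset;  // byte offset of the payload
  std::size_t length;  // payload length in bytes (0 if the field is absent)
};

// Hypothetical output: offsets with num_rows + 1 entries plus concatenated bytes.
struct string_column {
  std::vector<std::int32_t> offsets;
  std::vector<char> chars;
};

inline string_column build_strings(std::vector<std::uint8_t> const& data,
                                   std::vector<field_location> const& locs)
{
  // Phase 1: per-row lengths, with one trailing 0 so the scan yields num_rows + 1 offsets.
  std::vector<std::int32_t> lengths(locs.size() + 1, 0);
  for (std::size_t i = 0; i < locs.size(); ++i) {
    assert(locs[i].offset + locs[i].length <= data.size());
    lengths[i] = static_cast<std::int32_t>(locs[i].length);
  }

  // Phase 2: an exclusive prefix sum turns lengths into offsets.
  string_column out;
  out.offsets.resize(lengths.size());
  std::exclusive_scan(lengths.begin(), lengths.end(), out.offsets.begin(), 0);

  // Phase 3: allocate exactly the required bytes, then copy each payload into its slot.
  out.chars.resize(static_cast<std::size_t>(out.offsets.back()));
  for (std::size_t i = 0; i < locs.size(); ++i) {
    std::copy_n(data.begin() + static_cast<std::ptrdiff_t>(locs[i].offset),
                locs[i].length,
                out.chars.begin() + out.offsets[i]);
  }
  return out;
}
```

Because the last offset equals the total byte count, the character buffer is sized once and never over-allocated, which is the point of avoiding pre-allocation guessing.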

Architecture

File structure

src/main/cpp/src/
├── protobuf.hpp            (279 lines)  Public API: types, context, validation
├── protobuf_common.cuh    (1823 lines)  Shared types, device helpers, template kernels
├── protobuf_kernels.cu    (1307 lines)  Non-template CUDA kernels
├── protobuf_builders.cu   (1719 lines)  Column builder functions
├── protobuf.cu            (1196 lines)  Entry point: decode_protobuf_to_struct
└── ProtobufJni.cpp         (278 lines)  JNI bridge

src/main/java/.../jni/
├── Protobuf.java            (116 lines)  Java public API
└── ProtobufSchemaDescriptor.java (319 lines)  Immutable schema with validation

src/test/java/.../jni/
├── ProtobufTest.java       (3565 lines)  107 decode tests
└── ProtobufSchemaDescriptorTest.java (338 lines)  13 schema validation tests

src/main/cpp/benchmarks/
└── protobuf_decode.cu      (1322 lines)  8 NVBench benchmarks

Dependency graph

Protobuf.java ──► ProtobufSchemaDescriptor.java
     │
     │ JNI
     ▼
ProtobufJni.cpp ──► protobuf.hpp
                         │
                         ▼
                    protobuf.cu (entry point, orchestration)
                         │
                         │ #include
                         ▼
                    protobuf_common.cuh (shared foundation)
                    ▲            ▲
                    │            │
          protobuf_kernels.cu  protobuf_builders.cu

Multi-pass decode algorithm

The decoder processes each batch of messages through multiple GPU passes:

Count pass (count_repeated_fields_kernel): One thread per row. Scans message bytes to count repeated field occurrences and record nested message locations. Handles both packed and unpacked repeated encoding.
Offset pass (thrust::exclusive_scan): Prefix sum on repeated counts to compute output array offsets.
Scan pass (scan_all_fields_kernel + scan_all_repeated_occurrences_kernel): Records exact byte locations (offset + length) for every target field. Last-one-wins semantics for duplicate scalar fields.
Extract pass (type-specific kernels): Parallel data extraction using pre-computed locations. Batched 2D kernel launches group fields by type (varint, fixed32, fixed64, zigzag, etc.) to minimize launch overhead.
Build pass (recursive column builders): Assembles cuDF columns bottom-up. Nested structs and repeated messages are processed recursively up to MAX_NESTED_STRUCT_DECODE_DEPTH = 10.
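To make the scan pass concrete, here is a hedged host-side C++ sketch of what one thread might do for one row: walk the message tag by tag, skip or record each field, and let later occurrences overwrite earlier ones. All names (`scan_message`, `field_location`, `kMaxFieldNumber`) are illustrative assumptions; the varint reader is deliberately simplified and groups are not handled.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Wire types from the protobuf encoding (groups omitted in this sketch).
enum wire_type : int { WT_VARINT = 0, WT_64BIT = 1, WT_LEN = 2, WT_32BIT = 5 };

constexpr std::uint64_t kMaxFieldNumber = (1u << 29) - 1;

// Hypothetical location record: offset < 0 means "field not present".
struct field_location {
  std::int32_t offset = -1;
  std::int32_t length = 0;
};

// Simplified varint reader: bounds-checked, at most 10 bytes, no 10th-byte validation.
inline bool read_varint(std::uint8_t const* cur, std::uint8_t const* end,
                        std::uint64_t& value, int& bytes)
{
  value = 0;
  for (bytes = 0; bytes < 10 && cur + bytes < end; ++bytes) {
    std::uint8_t const b = cur[bytes];
    value |= static_cast<std::uint64_t>(b & 0x7F) << (7 * bytes);
    if ((b & 0x80) == 0) { ++bytes; return true; }
  }
  return false;  // truncated, or longer than 10 bytes
}

// Scans one message and records the payload location of each target field number.
// Returns false on malformed input. Later occurrences overwrite earlier ones.
inline bool scan_message(std::vector<std::uint8_t> const& msg,
                         std::vector<std::uint32_t> const& targets,
                         std::vector<field_location>& out)
{
  out.assign(targets.size(), field_location{});
  std::uint8_t const* const begin = msg.data();
  std::uint8_t const* const end   = begin + msg.size();
  std::uint8_t const* cur         = begin;
  while (cur < end) {
    std::uint64_t tag = 0;
    int n             = 0;
    if (!read_varint(cur, end, tag, n)) return false;
    cur += n;
    std::uint64_t const field_number = tag >> 3;
    int const wt                     = static_cast<int>(tag & 0x7);
    if (field_number == 0 || field_number > kMaxFieldNumber) return false;

    std::uint8_t const* payload = cur;
    std::uint64_t payload_len   = 0;
    switch (wt) {
      case WT_VARINT: {
        std::uint64_t ignored = 0;
        if (!read_varint(cur, end, ignored, n)) return false;
        payload_len = static_cast<std::uint64_t>(n);
        break;
      }
      case WT_64BIT: payload_len = 8; break;
      case WT_32BIT: payload_len = 4; break;
      case WT_LEN: {
        if (!read_varint(cur, end, payload_len, n)) return false;
        payload = cur + n;  // payload starts after the length prefix
        break;
      }
      default: return false;  // unsupported wire type in this sketch
    }
    if (payload_len > static_cast<std::uint64_t>(end - payload)) return false;
    cur = payload + payload_len;

    // Last-one-wins: a later occurrence simply overwrites the recorded location.
    for (std::size_t i = 0; i < targets.size(); ++i) {
      if (targets[i] == field_number) {
        out[i] = {static_cast<std::int32_t>(payload - begin),
                  static_cast<std::int32_t>(payload_len)};
      }
    }
  }
  return true;
}
```

The extract pass would then read values from the recorded offsets in parallel, without re-parsing the message.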

Flattened schema representation

The protobuf schema is represented as parallel arrays passed through JNI. Fields are ordered in pre-order traversal (parent before children):

| Array | Description |
| --- | --- |
| `fieldNumbers[]` | Protobuf field numbers |
| `parentIndices[]` | Parent index in flat array (-1 for top-level) |
| `depthLevels[]` | Nesting depth (0 for top-level) |
| `wireTypes[]` | Expected protobuf wire type (0=varint, 1=64bit, 2=len, 5=32bit) |
| `outputTypeIds[]` | cuDF type IDs for output columns |
| `encodings[]` | Encoding (0=default, 1=fixed, 2=zigzag, 3=enum_string) |
| `isRepeated[]` | Whether field is repeated (output becomes LIST) |
| `isRequired[]` | Whether field is required (proto2) |
| `hasDefaultValue[]` | Whether a default value exists |
| `defaultInts/Floats/Bools/Strings[]` | Default values per field |
| `enumValidValues[][]` | Sorted valid enum values per field (for binary search) |
| `enumNames[][][]` | Enum name UTF-8 bytes per field (for enum-as-string) |

Example for message Outer { int32 a = 1; Inner b = 2; } message Inner { int32 x = 1; string y = 2; }:

Index 0: a  (parentIdx=-1, depth=0, wireType=VARINT, type=INT32)
Index 1: b  (parentIdx=-1, depth=0, wireType=LEN,    type=STRUCT)
Index 2: x  (parentIdx=1,  depth=1, wireType=VARINT, type=INT32)
Index 3: y  (parentIdx=1,  depth=1, wireType=LEN,    type=STRING)
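The pre-order flattening shown in the example can be sketched as a short recursive walk. This is an illustrative C++ sketch only: `field_node`, `flat_schema`, and `flatten` are hypothetical names, and only three of the parallel arrays are modeled.

```cpp
#include <string>
#include <vector>

// Hypothetical tree form of a message schema; not the PR's actual type.
struct field_node {
  std::string name;
  int field_number;
  std::vector<field_node> children;  // non-empty for nested messages
};

// Parallel arrays mirroring fieldNumbers[], parentIndices[] and depthLevels[].
struct flat_schema {
  std::vector<std::string> names;
  std::vector<int> field_numbers;
  std::vector<int> parent_indices;
  std::vector<int> depth_levels;
};

// Appends all fields at one level; each field's descendants are emitted
// before its next sibling, and always after the field itself.
inline void flatten(std::vector<field_node> const& fields, int parent, int depth,
                    flat_schema& out)
{
  for (auto const& f : fields) {
    int const index = static_cast<int>(out.names.size());
    out.names.push_back(f.name);
    out.field_numbers.push_back(f.field_number);
    out.parent_indices.push_back(parent);
    out.depth_levels.push_back(depth);
    flatten(f.children, index, depth + 1, out);
  }
}
```

Calling `flatten(top_level_fields, -1, 0, schema)` yields flat arrays in which every parent index is smaller than the indices of its children, so a single forward pass can resolve parents before their descendants.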

Test coverage

107 JUnit tests in ProtobufTest.java + 13 tests in ProtobufSchemaDescriptorTest.java, organized by feature:

| Category | Tests | What is covered |
| --- | --- | --- |
| Basic scalar types | 3 | INT32/64, FLOAT32/64, BOOL, STRING end-to-end |
| Varint & zigzag | 9 | Max values, zero, over-encoded zero, 10th-byte validation, zigzag min/max/negative |
| Wire format errors | 14 | Malformed varint, truncated fields (varint/string/fixed32/fixed64), partial data, wrong wire type |
| Unknown field skip | 4 | Skip varint, fixed32, fixed64, length-delimited unknowns |
| Last-one-wins | 2 | Duplicate field handling for scalars and strings |
| Float/double specials | 2 | NaN, +Inf, -Inf |
| Schema projection | 2 | Partial field decode, decode-none |
| Required fields | 12 | Present/missing in permissive/failfast, multi-row, nested required, absent parent skip |
| Default values | 13 | All scalar types, strings, empty strings, mixed defaults, multi-row |
| Repeated fields | 10 | Unpacked int32, packed double, uint32/64, packed-in-nested, packed-in-repeated-message |
| Nested messages | 6 | 1-level, 3-level deep, repeated-inside-nested, repeated-in-repeated |
| Enum (as INT32) | 8 | Valid, zero, unknown, negative, multiple fields, missing |
| Enum-as-string | 18 | Valid/unknown/mixed, permissive null propagation, repeated enum, nested repeated enum, sibling field visibility |
| FAILFAST mode | 13 | All error types throw, valid data does not throw |
| Packed edge cases | 4 | Misaligned packed fixed32/64, large repeated, mixed packed/unpacked |
| Deep nesting | 6 | 9-level, 10-level, zero-length nested, empty packed, large field numbers |
| Schema descriptor | 13 | Repeated+default reject, struct/list default reject, enum metadata, duplicate fields, encoding compat, depth limit, serialization roundtrip |
| Performance | 1 | Multi-field batched extraction correctness |

Benchmarks

8 NVBench benchmarks in protobuf_decode.cu:

| Benchmark | What it stresses |
| --- | --- |
| `BM_protobuf_flat_scalars` | Top-level scalar extraction throughput |
| `BM_protobuf_nested` | Nested message recursive decode |
| `BM_protobuf_repeated` | Top-level repeated field count/scan/extract |
| `BM_protobuf_wide_repeated_message` | Wide repeated struct (many children) |
| `BM_protobuf_repeated_child_lists` | Repeated-in-repeated (nested LIST) |
| `BM_protobuf_repeated_child_string_count_scan` | Nested repeated string count+scan isolation |
| `BM_protobuf_repeated_child_string_build` | Nested repeated string build isolation |
| `BM_protobuf_many_repeated` | Many independent repeated fields |

Review Guide

This PR is large (~12,000 lines total) but has a well-defined layered architecture. This guide provides a recommended reading order, key areas to focus on per file, and a mental model for understanding the code.

Key review areas by priority

P0: Correctness-critical

scan_all_fields_kernel (protobuf_kernels.cu): This is the most important kernel. One thread per row, scans all bytes of each message, records field locations. Key things to verify:
- Last-one-wins semantics for duplicate fields
- Correct wire type dispatch and field skipping
- Bounds checking at every byte read (cur < msg_end)
- Error flag setting uses atomicCAS (no races)
- Permissive mode: row_has_invalid_data is set on parse errors so the row can be nullified
count_repeated_fields_kernel (protobuf_kernels.cu): Counts repeated field occurrences. Must correctly distinguish packed vs unpacked encoding. Packed detection: wire_type == WT_LEN but expected_wire_type != WT_LEN.
build_nested_struct_column (protobuf_builders.cu): Recursive builder for nested messages. Verify:
- Depth limit enforcement (MAX_NESTED_STRUCT_DECODE_DEPTH)
- Correct parent-child location derivation
- is_repeated children inside nested messages get proper LIST wrapping
- 0-row / empty-child edge cases create correctly typed columns
PERMISSIVE mode null propagation (protobuf.cu, lines ~1163-1190): Invalid enum values must nullify the entire struct row AND propagate to all descendants via apply_parent_mask_to_row_aligned_column + propagate_nulls_to_descendants.
JNI memory safety (ProtobufJni.cpp): Every GetObjectArrayElement / GetByteArrayElements / GetIntArrayElements must have a matching DeleteLocalRef / ReleaseXxxArrayElements. Verify no leaks in the enum_names triple-nested loop.
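The "error flag uses atomicCAS" item above describes a first-error-wins pattern. A hedged host-side analogue in C++ uses `std::atomic` compare-exchange; in a CUDA kernel the equivalent call would be `atomicCAS(error_flag, 0, code)`. The function name and the numeric codes used with it are hypothetical.

```cpp
#include <atomic>

constexpr int ERR_NONE = 0;  // hypothetical "no error" value

// Records `code` only if no error has been recorded yet.
// Returns true if this call was the one that set the flag.
inline bool set_error_once(std::atomic<int>& error_flag, int code)
{
  int expected = ERR_NONE;
  // Succeeds only while the flag still holds ERR_NONE, so concurrent
  // writers cannot overwrite the first recorded error.
  return error_flag.compare_exchange_strong(expected, code);
}
```

Compared with a plain store, this keeps the reported error deterministic with respect to "first writer wins" and avoids a data race when several threads fail at once.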

P1: Robustness

validate_decode_context (protobuf.hpp): Validates all schema invariants before any GPU work. Check completeness — duplicate field numbers under same parent, encoding compatibility, enum metadata non-empty for ENC_ENUM_STRING.
ProtobufSchemaDescriptor.validate (Java): Mirrors C++ validation. Defensive copies in constructor, re-validation on deserialization.
Varint parsing (protobuf_common.cuh read_varint): 10th byte must only use lowest bit. Truncated/malformed varints must return false.
Wire type handling (skip_field, get_wire_type_size): Verify WT_SGROUP uses iterative handling with depth cap of 32 (not recursive). WT_EGROUP is rejected as standalone.
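The varint rule above (the 10th byte may only use its lowest bit) follows from the encoding: nine bytes carry 63 payload bits, so the 10th can contribute only bit 63. A minimal C++ sketch of such a strict reader follows; `read_varint_strict` is an illustrative name, not the signature of the PR's `read_varint`.

```cpp
#include <cstdint>

// Decodes one varint from [cur, end). Returns false if the input is truncated,
// runs past 10 bytes, or sets bits in the 10th byte that do not fit in 64 bits.
inline bool read_varint_strict(std::uint8_t const* cur, std::uint8_t const* end,
                               std::uint64_t& value, int& bytes_read)
{
  value = 0;
  for (int i = 0; i < 10; ++i) {
    if (cur + i >= end) return false;  // truncated before a terminating byte
    std::uint8_t const b = cur[i];
    // 10th byte: only bit 0 is allowed (it becomes bit 63); a set continuation
    // bit or any higher payload bit would overflow 64 bits.
    if (i == 9 && (b & 0xFE) != 0) return false;
    value |= static_cast<std::uint64_t>(b & 0x7F) << (7 * i);
    if ((b & 0x80) == 0) {
      bytes_read = i + 1;
      return true;
    }
  }
  return false;
}
```

Note that this sketch accepts over-encoded values such as `0x80 0x00` for zero; whether the PR's reader does the same is not stated here.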

P2: Performance

Batched scalar extraction (protobuf.cu, lines ~400-470): Fields are grouped into 12 categories by type and extracted with 2D kernel launches. Verify grouping logic covers all type+encoding combinations.
Field lookup tables (build_field_lookup_table, build_index_lookup_table in protobuf_common.cuh): O(1) field_number → index mapping when max field number ≤ FIELD_LOOKUP_TABLE_MAX. Falls back to linear scan otherwise.
String two-phase construction: extract_lengths_kernel → make_offsets_child_column → copy_varlen_data_kernel. Verify no off-by-one in offset calculations.
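The field lookup table described above trades a small dense array for O(1) field-number resolution, with a linear scan as fallback. A hedged C++ sketch follows; `kLookupTableMax` mirrors the stated FIELD_LOOKUP_TABLE_MAX value of 4096, while the function names and signatures are assumptions rather than the PR's `build_field_lookup_table`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kLookupTableMax = 4096;  // mirrors FIELD_LOOKUP_TABLE_MAX

// Builds table[field_number] = schema index (or -1 if unused).
// Returns an empty table when the largest field number exceeds the cap.
inline std::vector<int> build_lookup(std::vector<int> const& field_numbers)
{
  int max_fn = 0;
  for (int fn : field_numbers) {
    assert(fn > 0);  // protobuf field numbers start at 1
    max_fn = std::max(max_fn, fn);
  }
  if (field_numbers.empty() || max_fn > kLookupTableMax) return {};
  std::vector<int> table(static_cast<std::size_t>(max_fn) + 1, -1);
  for (std::size_t i = 0; i < field_numbers.size(); ++i)
    table[static_cast<std::size_t>(field_numbers[i])] = static_cast<int>(i);
  return table;
}

// O(1) lookup when a table exists, linear scan otherwise.
inline int find_index(std::vector<int> const& table,
                      std::vector<int> const& field_numbers, int fn)
{
  if (!table.empty())
    return (fn >= 0 && fn < static_cast<int>(table.size())) ? table[static_cast<std::size_t>(fn)] : -1;
  for (std::size_t i = 0; i < field_numbers.size(); ++i)
    if (field_numbers[i] == fn) return static_cast<int>(i);
  return -1;
}
```

The cap bounds the table's memory footprint; schemas with sparse, very large field numbers pay the linear-scan cost instead.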

Things to watch for

Evaluation order: Never call uvector.size() after uvector.release() in the same expression. The code caches sizes before releasing.
Offsets column size: LIST columns require offsets with exactly num_rows + 1 elements.
Packed repeated in nested: count_repeated_in_nested_kernel and scan_repeated_in_nested_kernel handle both packed and unpacked within nested message boundaries.
Empty struct children: When building 0-row struct columns, repeated children must still be wrapped in empty LIST columns to maintain correct schema.
cudf::logic_error for data errors: The code uses cudf::logic_error for wire-format errors in strict mode. This is semantically imprecise (it conventionally signals API misuse), but functionally correct.

Mapping review to test coverage

| If you're reviewing... | Verify these test categories pass |
| --- | --- |
| `scan_all_fields_kernel` | Basic scalar, varint/zigzag, wire format errors, unknown field skip, last-one-wins |
| `count/scan_repeated` | Repeated fields, packed edge cases |
| `build_nested_struct_column` | Nested messages, deep nesting |
| `validate_enum + enum-string` | Enum (INT32), enum-as-string, PERMISSIVE mode |
| `check_required_fields_kernel` | Required fields |
| Default value handling | Default values |
| JNI bridge | All tests (they all go through JNI) |
| Schema validation | ProtobufSchemaDescriptorTest |

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

Copilot

Pull request overview

This PR adds a GPU-accelerated protocol buffer decoder with intentionally limited features, focusing on simple scalar field types. The implementation provides a JNI interface for decoding binary protobuf messages into cuDF STRUCT columns.

Key changes:

Implements GPU kernels for decoding protobuf varint, fixed32/64, and length-delimited (string) fields
Adds JNI bindings between Java and CUDA implementation
Provides basic test coverage for INT64 and STRING field types

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 14 comments.


| File | Description |
| --- | --- |
| src/main/java/com/nvidia/spark/rapids/jni/ProtobufSimple.java | Java API providing `decodeToStruct()` method with parameter validation |
| src/test/java/com/nvidia/spark/rapids/jni/ProtobufSimpleTest.java | Basic test case covering varint (INT64) and string decoding with missing fields and null messages |
| src/main/cpp/src/protobuf_simple.hpp | C++ API declaration with documentation of supported types |
| src/main/cpp/src/protobuf_simple.cu | CUDA implementation with three specialized kernels for varint, fixed-width, and string extraction |
| src/main/cpp/src/ProtobufSimpleJni.cpp | JNI bridge translating Java arrays to C++ vectors and invoking decode logic |
| src/main/cpp/CMakeLists.txt | Build configuration adding new source files to compilation targets |


Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven · 2025-12-23T09:22:25Z

@greptile full review

greptile-apps · 2025-12-23T09:25:51Z

Greptile Summary

This PR completes the core protobuf decode pipeline — scan, count, extract, and build passes — that was stubbed in Part 0. After several review cycles all previously identified correctness bugs (WT_VARINT error propagation in scan_repeated_message_children_kernel, validate_enum_values_kernel missing from the repeated-enum path, fast-path depth asymmetries, ERR_OVERFLOW/ERR_SCHEMA_TOO_LARGE confusion, JNI local-ref leaks, and others) appear to be properly addressed in the current code. The remaining observations are P2 quality/style points: a per-element binary-search in the PERMISSIVE-mode null-propagation helper and an undocumented behavioral asymmetry between scalar and repeated-field error handling in PERMISSIVE mode.

Confidence Score: 5/5

This PR is safe to merge; all previously identified P0/P1 bugs from earlier review rounds have been addressed in the current code

All remaining findings are P2 quality/style observations (a per-element binary search in the PERMISSIVE null-propagation helper, and an undocumented behavioral asymmetry between scalar and repeated-field error handling). No new correctness bugs were found after thorough review of the scan kernels, builder functions, and orchestration logic.

protobuf.cu (propagate_list_nulls_to_descendants performance and PERMISSIVE-mode error handling asymmetry), protobuf_kernels.cu (general complexity warrants continued attention)

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/main/cpp/src/protobuf/protobuf_kernels.cu | Adds scan_all_fields_kernel, count/scan repeated-field kernels, scan_nested_message_fields_kernel, scan_repeated_message_children_kernel (WT_VARINT error check now present), check_required_fields_kernel, and enum-string kernels; all previously flagged fast-path depth checks and ERR_SCHEMA_TOO_LARGE guard are correctly in place |
| src/main/cpp/src/protobuf/protobuf_builders.cu | Adds repeated scalar/string/enum-string/struct column builders; validate_enum_values_kernel is now correctly called in build_repeated_enum_string_column; null propagation and row-force-null plumbing looks correct across nested and repeated paths |
| src/main/cpp/src/protobuf/protobuf.cu | Orchestration entry point wiring scan→count→extract→build passes; null-propagation helpers are correctly structured; ERR_REPEATED_COUNT_MISMATCH always throws (even in PERMISSIVE mode) which is consistent but asymmetric with scalar error handling |
| src/main/cpp/src/protobuf/protobuf_kernels.cuh | Adds host-wrapper declarations for all new kernels and minor refactor in build_repeated_scalar_column; changes are clean and consistent |
| src/main/cpp/src/protobuf/protobuf_host_helpers.hpp | Adds get_output_type_id helper template and updates make_empty_struct_column_with_schema to use it; straightforward and correct |
| src/main/cpp/benchmarks/protobuf_decode.cu | Adds 8 NVBench benchmarks with correct wire-format helpers; encode_tag correctly casts to uint64_t before shift; checked_size_to_i32 guards against oversized benchmark inputs |
| src/main/cpp/benchmarks/CMakeLists.txt | Adds PROTOBUF_DECODE_BENCH target; three-line addition, correct |
| src/test/java/com/nvidia/spark/rapids/jni/ProtobufTest.java | Adds ~3,400 lines of new JUnit tests covering all scalar types, wire-format errors, varint/zigzag, repeated fields, nested messages, enum-as-string, PERMISSIVE/FAILFAST modes, default values, and required fields; no disabled tests |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["decode_protobuf_to_struct\n(protobuf.cu)"] --> B[validate_decode_context]
    B --> C{num_rows == 0?}
    C -- Yes --> D[Return empty STRUCT]
    C -- No --> E[Classify fields:\nscalar / nested / repeated]
    E --> F["count_repeated_fields_kernel\n(one thread/row, counts occurrences\n+ records nested locations)"]
    F --> G["thrust::exclusive_scan\n(compute repeated offsets)"] 
    G --> H["scan_all_fields_kernel\n(one thread/row, records scalar\nfield locations — last-one-wins)"]
    H --> I["scan_all_repeated_occurrences_kernel\n(one thread/row, records all\nrepeated occurrences)"] 
    I --> J[Batched scalar extraction\nGrouped by type: varint / fixed32/64 /\nzigzag / string / bytes / enum-string]
    J --> K["check_required_fields_kernel\n(FAILFAST: throw on missing)"] 
    K --> L["build_nested_struct_column\n(recursive, depth-limited to 10)"]
    L --> M["build_repeated_struct_column /\nbuild_repeated_scalar_column /\nbuild_repeated_string_column"]
    M --> N{PERMISSIVE mode?}
    N -- Yes --> O["valid_if(d_row_force_null)\n→ struct null mask"]
    O --> P["apply_parent_mask + propagate_nulls\n(top-level → children → grandchildren)"]
    P --> Q[Return STRUCT column]
    N -- No --> Q

Reviews (55): Last reviewed commit: "style"

greptile-apps

Additional Comments (6)

src/main/cpp/src/protobuf_simple.cu, lines 89-91

logic: potential overflow: len64 can be up to 2^64-1, but casting to int on line 90 can overflow if len64 > INT_MAX

src/main/cpp/src/protobuf_simple.cu, lines 323-324

logic: potential overflow: len64 can be larger than INT_MAX, but casting to int on line 324 will overflow

src/main/cpp/src/protobuf_simple.cu, lines 375-376

logic: race condition: multiple threads write to *error_flag without atomics, causing undefined behavior when multiple threads encounter errors simultaneously

In the kernels, use atomicOr(error_flag, 1) instead of *error_flag = 1

src/main/cpp/src/protobuf_simple.cu, lines 398-407

logic: protobuf uses zigzag encoding for signed integers (sint32/sint64), but varint decoding here treats them as unsigned, so decoding negative values will produce incorrect results. Are you only supporting unsigned int32/int64, or should zigzag decoding be implemented for signed types?

src/main/cpp/src/protobuf_simple.cu, line 240

syntax: type punning through reinterpret_cast of incompatible pointer types is undefined behavior in C++

src/main/cpp/src/protobuf_simple.cu, line 248

syntax: type punning through reinterpret_cast of incompatible pointer types is undefined behavior in C++
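Two of the review points above (the `len64` to `int` cast and the `reinterpret_cast` type punning) have well-known portable remedies. The following C++ sketch shows one possible shape for each; `checked_length`, `load_le32`, and `decode_float` are hypothetical helper names, not functions from the PR.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Validates a 64-bit length before narrowing it to int: it must fit in the
// remaining bytes of the message and in the int range.
inline bool checked_length(std::uint64_t len64, std::size_t remaining, int& out)
{
  if (len64 > remaining || len64 > static_cast<std::uint64_t>(INT_MAX)) return false;
  out = static_cast<int>(len64);
  return true;
}

// Assembles a little-endian fixed32 payload byte by byte, so neither the
// pointer's alignment nor the host's endianness matters.
inline std::uint32_t load_le32(std::uint8_t const* p)
{
  return static_cast<std::uint32_t>(p[0]) | (static_cast<std::uint32_t>(p[1]) << 8) |
         (static_cast<std::uint32_t>(p[2]) << 16) | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Reinterprets the 32 bits as a float via memcpy instead of reinterpret_cast,
// which avoids the strict-aliasing undefined behavior.
inline float decode_float(std::uint8_t const* p)
{
  static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
  std::uint32_t const bits = load_le32(p);
  float out;
  std::memcpy(&out, &bits, sizeof(out));
  return out;
}
```

Compilers typically lower the `memcpy` to a single register move, so the well-defined version costs nothing at runtime.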

6 files reviewed, 6 comments

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven · 2025-12-25T03:43:10Z

@greptile full review

greptile-apps

Additional Comments (2)

src/main/cpp/src/protobuf_simple.cu, line 186

logic: zigzag decode uses signed right shift which is correct for signed types, but applied to unsigned v

src/main/cpp/src/ProtobufSimpleJni.cpp, line 55

logic: encodings is constructed from n_type_scales but then out_types is also constructed using n_type_scales[i] as the scale parameter, which would be wrong for non-decimal types where this represents encoding
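For the zigzag comment above, one way to sidestep the signed-versus-unsigned shift question is to do all arithmetic on unsigned values and convert only at the end. A minimal C++ sketch, with hypothetical function names:

```cpp
#include <cstdint>

// Zigzag maps 0, 1, 2, 3, ... to 0, -1, 1, -2, ...
// (v >> 1) is a logical shift on an unsigned value; (0 - (v & 1)) is all ones
// for odd inputs and zero for even inputs, so the XOR flips every bit exactly
// when the encoded value was negative.
inline std::int64_t zigzag_decode64(std::uint64_t v)
{
  return static_cast<std::int64_t>((v >> 1) ^ (std::uint64_t{0} - (v & 1)));
}

inline std::int32_t zigzag_decode32(std::uint32_t v)
{
  return static_cast<std::int32_t>((v >> 1) ^ (std::uint32_t{0} - (v & 1)));
}
```

The final unsigned-to-signed conversion is modular on two's-complement targets, which C++20 guarantees and which CUDA device code follows in practice.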

6 files reviewed, 2 comments

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 13 comments.


Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.


Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

…ven/spark-rapids-jni into protocol_buffer_jni_dev

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven · 2026-01-12T07:33:48Z

@greptileai full review

greptile-apps

4 files reviewed, 4 comments

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

nvauto · 2026-01-19T02:11:57Z

NOTE: release/26.02 has been created from main. Please retarget your PR to release/26.02 if it should be included in the release.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

res-life · 2026-03-16T08:39:06Z

src/main/cpp/src/protobuf.hpp

+namespace spark_rapids_jni {
+
+// Encoding constants
+constexpr int ENC_DEFAULT      = 0;


Create an enum to improve readability like:

enum class type_id : int32_t

res-life · 2026-03-16T09:51:44Z

src/main/cpp/src/protobuf.hpp

+constexpr int MAX_FIELD_NUMBER = (1 << 29) - 1;
+
+// Wire type constants
+constexpr int WT_VARINT = 0;


create an enum.

res-life · 2026-03-16T10:00:34Z

src/main/cpp/src/protobuf/protobuf.cu

+                                             cudf::size_type num_rows,
+                                             rmm::cuda_stream_view stream,
+                                             rmm::device_async_resource_ref mr)
+{


Add a fast-path check to avoid unnecessary GPU work:
if (parent_null_count == 0) return;

res-life · 2026-03-16T10:02:08Z

src/main/cpp/src/protobuf/protobuf.cu

+      stream);
+    col.set_null_count(child_view.size() - valid_count);
+  } else {
+    auto child_mask = cudf::detail::copy_bitmask(parent_mask_ptr, 0, num_rows, stream, mr);


The if path considers view.offset, but the else path does not. Is this intended?

Yes, that's intended. I added a CUDF_EXPECTS to enforce it.

Explanation from AI:

apply_parent_mask_to_row_aligned_column takes an owning cudf::column&, not a sliced column_view. For an owning column, column::view() / mutable_view() constructs a view with offset = 0, and child() likewise returns the owning child column rather than a sliced child view.

So in the current protobuf decoder path both branches are effectively operating on offset-0 columns.

The nullable branch still passes child_view.offset() because inplace_bitmask_and is a generic view-oriented API; in this call site that value is always 0. I added the CUDF_EXPECTS(child_view.offset() == 0, ...) in the non-nullable branch to make that invariant explicit and to prevent future misuse if this helper is ever reused on non-owning/sliced inputs.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

revans2 · 2026-03-17T14:16:18Z

How does the performance compare to the CPU? It is great that we have this all working. I agree that it is very large, but unless it is faster than running it on the CPU, then there is not much of a reason to do it. I am especially concerned about large protobuffers and the memory access pattern. A thread per entry is potentially very problematic, as we found for large JSON strings in other work that is similar to this. How does validation work? I didn't see anything in there about detecting malformed protobufs and returning an error.

thirtiseven · 2026-03-18T02:46:13Z

Thanks for the review @revans2

> How does the performance compare to the CPU? It is great that we have this all working. I agree that it is very large, but unless it is faster than running it on the CPU, then there is not much of a reason to do it.

Performance looks positive so far. We're seeing a 3-5x speedup over the CPU on a real customer schema (a nested message with ~200 fields and repeated structs, ~500 output columns). The customer has also confirmed cost savings in their end-to-end testing (using different schemas).

I'll add more detailed benchmark numbers to the description of the plugin PR NVIDIA/spark-rapids#14354 soon.

> I am especially concerned about large protobuffers and the memory access pattern. A thread per entry is potentially very problematic, as we found for large JSON strings in other work that is similar to this.

Yes, it could be problematic for large protobufs, but the situation is generally better than it was with getJsonObject.

Protocol buffer messages are typically smaller and more like records than tables (unlike JSON), and they are similar in size, at least for this customer.
Protobuf is a binary TLV format, so there is less branching and less heavy string processing in the Scan and Count passes; each is basically an O(n) scan. The Extract and Build passes do not use the thread-per-entry pattern.

I will add some large-protobuf cases to the performance tests to see how bad it is.

> How does validation work? I didn't see anything in there about detecting malformed protobufs and returning an error.

Yes, there are many validations in the code. We have 12 distinct error codes, which cover all Protobuf wire-format failure modes:

    switch (code) {
      case ERR_BOUNDS: return "Protobuf decode error: message data out of bounds";
      case ERR_VARINT: return "Protobuf decode error: invalid or truncated varint";
      case ERR_FIELD_NUMBER: return "Protobuf decode error: invalid field number";
      case ERR_WIRE_TYPE: return "Protobuf decode error: unexpected wire type";
      case ERR_OVERFLOW: return "Protobuf decode error: length-delimited field overflows message";
      case ERR_FIELD_SIZE: return "Protobuf decode error: invalid field size";
      case ERR_SKIP: return "Protobuf decode error: unable to skip unknown field";
      case ERR_FIXED_LEN:
        return "Protobuf decode error: invalid fixed-width or packed field length";
      case ERR_REQUIRED: return "Protobuf decode error: missing required field";
      case ERR_SCHEMA_TOO_LARGE:
        return "Protobuf decode error: schema exceeds maximum supported repeated fields per kernel "
               "(128)";
      case ERR_MISSING_ENUM_META:
        return "Protobuf decode error: missing or mismatched enum metadata for enum-as-string "
               "field";
      case ERR_REPEATED_COUNT_MISMATCH:
        return "Protobuf decode error: repeated-field count/scan mismatch";
      default: return "Protobuf decode error: unknown error";

Two error modes are supported, matching Spark's from_protobuf semantics: FAILFAST (fail_on_errors=true) and PERMISSIVE (fail_on_errors=false).

There are related Java unit tests: testWrongWireType, testSkipUnknownVarintField, testFieldNumberZeroInvalid, etc.

revans2 · 2026-03-18T11:45:44Z

That is really great work. Thanks for the detailed analysis.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

# Conflicts: # thirdparty/cudf

Resolve conflicts from part0 (PR NVIDIA#4373) merge. Take main's reviewed conventions: detail namespace, snake_case, exec_policy_nosync, cub::DeviceMemcpy, pinned vectors, no-brace style. Keep dev branch's full implementation code. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

maybe_check_required_fields, propagate_invalid_enum_flags_to_rows, and validate_enum_and_propagate_rows no longer take a mr parameter (changed in part0 review). Update all callers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove braces from single-statement if bodies to match the convention established during part0 review. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

copy_varlen_data_kernel was removed in part0 review in favor of cub::DeviceMemcpy::Batched. Update the two remaining call sites in protobuf_builders.cu (repeated msg child varlen and repeated string column builders). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

These were only needed by the removed copy_varlen_data_kernel. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

ai draft

4aa7dbb

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven requested a review from Copilot December 23, 2025 08:57

Copilot started reviewing on behalf of thirtiseven December 23, 2025 08:58

Copilot AI reviewed Dec 23, 2025

thirtiseven mentioned this pull request Dec 23, 2025

Add partial support for from_protobuf NVIDIA/spark-rapids#14062

Closed

3 tasks

style

a319d10

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

greptile-apps bot reviewed Dec 23, 2025

thirtiseven added 2 commits December 25, 2025 10:58

address comments

458b583

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

address comments

5c1bbf4

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven requested a review from Copilot December 25, 2025 03:43

Copilot started reviewing on behalf of thirtiseven December 25, 2025 03:43

greptile-apps bot reviewed Dec 25, 2025

Copilot AI reviewed Dec 25, 2025

address comments

0445daa

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven requested a review from Copilot December 25, 2025 06:56

Copilot started reviewing on behalf of thirtiseven December 25, 2025 06:57

Copilot AI reviewed Dec 25, 2025

thirtiseven added 6 commits December 25, 2025 15:39

address comments

8ddbf96

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

copyrights

c9eea59

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

Merge branch 'main' into protocol_buffer_jni_dev

f27fd6e

update, added more tests

617471d

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

Merge branch 'protocol_buffer_jni_dev' of https://github.com/thirtiseven/spark-rapids-jni into protocol_buffer_jni_dev

bdfac06

style

3701cdf

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

greptile-apps bot reviewed Jan 12, 2026

address comments

1bf7660

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven added 2 commits March 16, 2026 11:53

address comments

5ef84e9

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

address comments

4817d06

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven mentioned this pull request Mar 16, 2026

Support from_protobuf expression NVIDIA/spark-rapids#14354

Draft

3 tasks

address comments

bdd2e40

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

res-life reviewed Mar 16, 2026

address human comments

6101a4b

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven added 4 commits March 19, 2026 10:12

port back refactor from pr 0

0213433

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

style

6bcac2f

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

address comments from part0

ff83290

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

nghia style self-check

35b1cb8

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven mentioned this pull request Mar 20, 2026

from_protobuf kernel (part0): framework, API, and schema validation #4373

Merged

thirtiseven and others added 13 commits March 20, 2026 21:16

backport suggestions

82bf02a

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

address comments

fe73ada

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

copyright

d01a25b

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

fix

54223ca

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

apply comments suggestions

70a214b

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

apply refactor on headers

6e00831

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

Merge remote-tracking branch 'origin/main' into protocol_buffer_jni_dev

1754032

# Conflicts: # thirdparty/cudf

Apply part0 brace style to dev-only code

2e91499

Remove braces from single-statement if bodies to match the convention established during part0 review. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove unused threads/blocks variables

33e30f7

These were only needed by the removed copy_varlen_data_kernel. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

style

7508a80

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

| Order | File | Focus | Time estimate |
|---|---|---|---|
| 1 | `protobuf.hpp` | Data structures, API contract, validation logic | 15 min |
| 2 | `ProtobufSchemaDescriptor.java` | Java-side schema validation (mirrors C++ validation) | 15 min |
| 3 | `Protobuf.java` | Public Java API, PERMISSIVE mode semantics | 5 min |
| 4 | `ProtobufJni.cpp` | JNI bridge: array conversion, local ref management | 15 min |
| 5 | `protobuf_common.cuh` §1: types | Lines 54–165: `field_location`, `device_nested_field_descriptor`, etc. | 10 min |
| 6 | `protobuf_common.cuh` §2: device helpers | Lines 167–400: `read_varint`, `skip_field`, `get_field_data_location`, `decode_tag`, `lookup_field` | 20 min |
| 7 | `protobuf_common.cuh` §3: LocationProviders | Lines ~400–650: `TopLevelLocationProvider`, `NestedLocationProvider`, etc. These abstract how extraction kernels compute byte offsets | 15 min |
| 8 | `protobuf_common.cuh` §4: template kernels | `extract_varint_kernel`, `extract_fixed_kernel`, `extract_lengths_kernel`, `copy_varlen_data_kernel`, batched variants | 20 min |
| 9 | `protobuf_common.cuh` §5: template host functions | `extract_typed_column`, `build_repeated_scalar_column`, `extract_and_build_string_or_bytes_column`, `validate_enum_and_propagate_rows` | 20 min |
| 10 | `protobuf_kernels.cu` §1: scan | `scan_all_fields_kernel`: the core single-pass field scanner | 20 min |
| 11 | `protobuf_kernels.cu` §2: count/scan repeated | `count_repeated_fields_kernel`, `scan_all_repeated_occurrences_kernel`, shared `__device__` helpers | 20 min |
| 12 | `protobuf_kernels.cu` §3: nested | `scan_nested_message_fields_kernel`, `scan_repeated_message_children_kernel`, compute kernels | 20 min |
| 13 | `protobuf_kernels.cu` §4: validation | `check_required_fields_kernel`, `validate_enum_values_kernel`, enum-string kernels | 10 min |
| 14 | `protobuf_builders.cu` §1: utilities | `make_null_column`, `make_empty_column_safe`, `make_null_list_column_with_child` | 10 min |
| 15 | `protobuf_builders.cu` §2: enum-string | `make_enum_string_lookup_tables`, `build_enum_string_column`, `build_repeated_enum_string_column` | 15 min |
| 16 | `protobuf_builders.cu` §3: nested struct | `build_nested_struct_column`: most complex builder, recursive depth handling | 25 min |
| 17 | `protobuf_builders.cu` §4: repeated struct | `build_repeated_struct_column`, `build_repeated_child_list_column`: repeated-in-repeated | 20 min |
| 18 | `protobuf.cu` | Entry point `decode_protobuf_to_struct`: orchestration, batched scalar extraction, PERMISSIVE null propagation | 30 min |
| 19 | `ProtobufTest.java` | Tests: skim by category, deep-read the tricky ones (enum-as-string, nested repeated) | 30 min |
| 20 | `protobuf_decode.cu` | Benchmarks: helper encoders and benchmark configurations | 15 min |
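As background for the device helpers in step 6, here is a host-side sketch of what a bounds-checked varint reader and a zigzag decoder typically look like. It follows the public protobuf wire format and is not copied from `protobuf_common.cuh`; the names `read_varint_sketch` and `zigzag_decode64` and the "return 0 on failure" convention are assumptions made for this example.

```cpp
#include <cstddef>
#include <cstdint>

// Reads one base-128 varint from [cur, end).
// Returns the number of bytes consumed, or 0 if the varint is truncated
// or longer than the 10 bytes a 64-bit value can occupy.
inline std::size_t read_varint_sketch(std::uint8_t const* cur,
                                      std::uint8_t const* end,
                                      std::uint64_t& out)
{
  out = 0;
  for (std::size_t i = 0; i < 10 && cur + i < end; ++i) {
    std::uint8_t const b = cur[i];
    // Each byte contributes its low 7 bits, least significant group first.
    out |= static_cast<std::uint64_t>(b & 0x7F) << (7 * i);
    // A clear high bit marks the final byte of the varint.
    if ((b & 0x80) == 0) return i + 1;
  }
  return 0;
}

// Zigzag decoding used by sint32/sint64: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...
inline std::int64_t zigzag_decode64(std::uint64_t n)
{
  return static_cast<std::int64_t>((n >> 1) ^ (~(n & 1) + 1));
}
```

A failed read in a sketch like this would correspond to the "invalid or truncated varint" error listed earlier.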
