from_protobuf kernel (part0): framework, API, and schema validation by thirtiseven · Pull Request #4373 · NVIDIA/spark-rapids-jni

thirtiseven · 2026-03-16T10:15:04Z

Add GPU protobuf decoder: framework, API, and schema validation [1/4]

Summary

First PR in a four-part series adding a GPU-accelerated protobuf decoder to spark-rapids-jni. This PR establishes the public API, schema validation, JNI bridge, shared CUDA infrastructure, and a stub decode_protobuf_to_struct entry point.

The decoder converts LIST<INT8/UINT8> columns (one serialized protobuf message per row) into nested cuDF STRUCT columns. The stub in this PR returns correctly-typed all-null columns for each schema field; actual data extraction is added in follow-up PRs.

What is included

C++ public API (protobuf.hpp): nested_field_descriptor, ProtobufDecodeContext, typed proto_encoding / proto_wire_type enums, validate_decode_context() with comprehensive schema invariant checks (field numbers, parent-child topology, depth, wire type / encoding / output type compatibility, repeated && required rejection, sorted enum_valid_values).
Java schema API (Protobuf.java + ProtobufSchemaDescriptor.java): immutable schema descriptor with defensive deep copies, full validation mirroring C++, Serializable with re-validation on deserialization. Public decodeToStruct() method with PERMISSIVE / FAILFAST mode support.
JNI bridge (ProtobufJni.cpp): converts 15 Java arrays to C++ ProtobufDecodeContext with null checks, field number range validation, wire type validation, and proper DeleteLocalRef for all JNI local references including triple-nested enum_names.
Shared CUDA header (protobuf_common.cuh): types, device helpers (read_varint, skip_field, decode_tag), LocationProvider structs, template extraction kernels, and forward declarations. Included in full so that follow-up PRs do not need to modify this header — eliminating merge conflicts across the PR chain.
Stub decode (protobuf.cu): validates context, handles empty-schema and zero-row edge cases with correct nested type construction, propagates input null mask to output struct, and assembles a STRUCT with all-null children via recursive make_null_column_with_schema (respects nested STRUCT children and repeated LIST wrapping).
Column utilities (protobuf_builders.cu): make_null_column (all types), make_empty_column_safe, make_null_list_column_with_child, make_empty_list_column.

Test coverage (26 tests)

Schema validation (13 tests): repeated+default rejection, repeated+required rejection, struct/list default rejection, enum metadata requirements, duplicate field numbers, child-parent constraints, encoding compatibility, depth limit, serialization roundtrip.
Output shape and null semantics (13 tests): empty schema, single/multi-field schemas, multiple rows, null input row propagation (verified with isNull assertions), all-null input, zero-row with flat/nested/repeated schemas (including grandchild type verification), repeated scalar LIST wrapping, input validation.

Follow-up PRs

This PR — Framework + API + stub decode
Scalar extraction — scan_all_fields_kernel, batched extraction, all scalar types, defaults, required fields
Repeated + nested — count/scan/build pipeline, recursive nested decode up to depth 10
Enum-as-string + PERMISSIVE — enum validation, varint-to-UTF8 conversion, null propagation

Each follow-up inserts decode logic into the column_map section of protobuf.cu without modifying existing code. protobuf_common.cuh is frozen after this PR.

Review guide

protobuf.hpp — Start here. The API contract: nested_field_descriptor, ProtobufDecodeContext, validate_decode_context().
ProtobufSchemaDescriptor.java — Java-side validation mirror. Check defensive copies and deserialization re-validation.
ProtobufJni.cpp — Focus on local reference cleanup in the enum_names triple-nested loop.
protobuf.cu — Stub decode. Verify: empty-schema returns num_rows with 0 children; zero-row builds correct nested types; null mask is propagated from input; make_null_column_with_schema recursively builds STRUCT children and LIST wrapping.
protobuf_common.cuh — For this PR, focus on types and make_empty_struct_column_with_schema / find_child_field_indices. The rest (device helpers, template kernels, LocationProviders) is infrastructure for follow-up PRs.
Tests — ProtobufSchemaDescriptorTest for validation, ProtobufTest for output shape and null semantics.

greptile-apps · 2026-03-16T10:22:13Z

Greptile Summary

This PR establishes the full framework for a GPU-accelerated protobuf decoder: public C++ API (protobuf.hpp), an immutable Java schema descriptor with deep-copy and re-validation on deserialization, a JNI bridge, shared CUDA infrastructure headers, and a stub decode_protobuf_to_struct that returns correctly-typed all-null columns. All structural issues raised in prior rounds (zero-row repeated scalar shape, nested STRUCT child construction, repeated && required and sorted-enum-values validation) have been addressed and are covered by the expanded test suite.

Confidence Score: 5/5

Safe to merge; all prior P0/P1 issues are resolved and the one remaining finding is a minor P2 resource-safety note in JNI helpers.

The major structural bugs from earlier review rounds (wrong column type for zero-row repeated scalars, zero-child STRUCT assembly, missing repeated+required and enum sort checks) are all fixed and tested. The only new finding is a low-probability resource leak in jni_byte_array_to_vector / jni_int_array_to_vector if make_host_vector throws under OOM — a P2 hardening item that does not block merge.

src/main/cpp/src/ProtobufJni.cpp — minor RAII hardening for JNI array element release.

Important Files Changed

Filename	Overview
src/main/cpp/src/ProtobufJni.cpp	JNI bridge: null checks, array-size validation, and local-ref cleanup are correct. Minor resource-leak risk in jni_byte_array_to_vector / jni_int_array_to_vector if make_host_vector throws.
src/main/cpp/src/protobuf/protobuf.cu	Stub decode: zero-row repeated scalar/STRUCT, non-zero-row nested STRUCT children, repeated+required, and enum sorted-order checks all addressed. make_null_column_with_schema correctly recurses.
src/main/cpp/src/protobuf/protobuf.hpp	Public C++ API: clean enum/struct definitions, MAX_FIELD_NUMBER/MAX_NESTING_DEPTH constants, and forward declarations. No issues found.
src/main/cpp/src/protobuf/protobuf_builders.cu	Builder utilities: make_null_column, make_empty_column_safe, make_null_list_column_with_child, make_empty_list_column all look correct. make_lists_column signature confirmed not to accept stream/mr.
src/main/java/com/nvidia/spark/rapids/jni/ProtobufSchemaDescriptor.java	Immutable schema descriptor with defensive deep copies, full validation mirroring C++, and re-validation on deserialization. All invariants (repeated+required, sorted enum values, etc.) correctly enforced.
src/main/java/com/nvidia/spark/rapids/jni/Protobuf.java	Clean public Java API with null checks and correct native method signature matching the JNI implementation.
src/main/cpp/src/protobuf/protobuf_kernels.cuh	CUDA kernel infrastructure: LocationProviders, extract_varint_kernel, and related templates. Kernel code is forward-looking infrastructure for follow-up PRs.
src/main/cpp/src/protobuf/protobuf_types.cuh	Device-side type definitions including field_location, field_descriptor, repeated_field_info, and device_nested_field_descriptor. Clean and complete.
src/test/java/com/nvidia/spark/rapids/jni/ProtobufTest.java	13 output-shape and null-semantic tests including zero-row repeated scalar (testZeroRowRepeatedScalarShape) and nested struct child assertions (testNestedMessageOutputShape). Previous coverage gaps addressed.
src/test/java/com/nvidia/spark/rapids/jni/ProtobufSchemaDescriptorTest.java	13 schema validation tests covering repeated+default rejection, repeated+required rejection, enum metadata requirements, duplicate field numbers, and serialization roundtrip.
src/main/cpp/src/protobuf/protobuf_host_helpers.hpp	Host-side helpers: build_lookup_table, find_child_field_indices, make_empty_struct_column_with_schema forward declarations. Clean template design.
src/main/cpp/src/protobuf/protobuf_device_helpers.cuh	Device helper functions (read_varint, skip_field, decode_tag, set_error_once). Infrastructure for follow-up PRs; current PR only validates the headers compile.
src/main/cpp/src/protobuf/protobuf_kernels.cu	Kernel translation unit — primarily a compilation unit for protobuf_kernels.cuh infrastructure used by follow-up PRs.
src/main/cpp/CMakeLists.txt	Adds protobuf source files to the build. Minimal diff, no issues.

Sequence Diagram

sequenceDiagram
    participant Java as Protobuf.java
    participant JNI as ProtobufJni.cpp
    participant Validate as validate_decode_context()
    participant Stub as decode_protobuf_to_struct()
    participant Builders as protobuf_builders.cu

    Java->>JNI: decodeToStruct(binaryInputView, 15 arrays...)
    JNI->>JNI: null-check all inputs
    JNI->>JNI: validate array lengths match num_fields
    JNI->>JNI: build nested_field_descriptor[] + context
    JNI->>Stub: decode_protobuf_to_struct(input, context, stream, mr)
    Stub->>Validate: validate_decode_context(context)
    Validate-->>Stub: OK / throws invalid_argument
    alt num_fields == 0
        Stub->>Stub: return empty STRUCT<> with input null mask
    else num_rows == 0
        Stub->>Builders: make_empty_struct/list/column_safe per top-level field
        Builders-->>Stub: typed empty columns
        Stub->>Stub: return STRUCT<...> with 0 rows
    else normal path (stub)
        Stub->>Builders: make_null_column_with_schema per top-level field
        Builders->>Builders: recurse into STRUCT children / wrap repeated in LIST
        Builders-->>Stub: all-null columns with correct type hierarchy
        Stub->>Stub: propagate input null mask → return STRUCT<...>
    end
    Stub-->>JNI: unique_ptr<cudf::column>
    JNI-->>Java: jlong handle → ColumnVector

_{Reviews (22): Last reviewed commit: "Apply suggestions from code review" | Re-trigger Greptile}

src/main/cpp/src/protobuf.cu

src/main/cpp/src/protobuf.hpp

src/test/java/com/nvidia/spark/rapids/jni/ProtobufTest.java

src/main/cpp/src/protobuf.cu

src/test/java/com/nvidia/spark/rapids/jni/ProtobufTest.java

src/main/cpp/src/ProtobufJni.cpp

src/main/cpp/src/protobuf/protobuf_builders.cu

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

src/main/cpp/src/protobuf_common.cuh

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

ttnghia

Can you put the new files (except *Jni.cpp file) into a separate folder please? Such as src/protobuf/*. This way can organize code by modules and provide a better structural view of the project. We should do the same to reorganize other modules soon.

ttnghia · 2026-03-17T17:46:28Z

src/main/cpp/src/protobuf.hpp

+namespace spark_rapids_jni {
+
+// Encoding constants
+enum class proto_encoding : int {


For better encapsulation, please wrap all the protobuf code into a sub namespace namespace protobuf. By doing so, we can easier identify the module of code. For example: protobuf::proto_encoding etc. It will make the code less confusing and cleaner.

ttnghia · 2026-03-17T17:48:41Z

src/main/cpp/src/protobuf.hpp

+std::unique_ptr<cudf::column> decode_protobuf_to_struct(cudf::column_view const& binary_input,
+                                                        ProtobufDecodeContext const& context,
+                                                        rmm::cuda_stream_view stream);


Lack of memory resources parameter.

ttnghia · 2026-03-17T17:50:29Z

src/main/cpp/src/protobuf.hpp

+  }
+}
+
+inline void validate_decode_context(ProtobufDecodeContext const& context)


Do not inline function in public header. Put them in a source file, only leave the declaration here.

Moved to protobuf.cu

ttnghia · 2026-03-17T17:51:10Z

src/main/cpp/src/protobuf.hpp

+    throw std::invalid_argument(std::string("protobuf decode context: ") + name +
+                                " size mismatch with schema (" + std::to_string(actual) + " vs " +
+                                std::to_string(num_fields) + ")");


Use CUDF_EXPECTS(,..., std::invalid_argument) for error check;

src/main/cpp/src/protobuf.hpp

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

src/main/cpp/src/protobuf/protobuf.cu

src/main/cpp/src/protobuf/protobuf_kernels.cu

ttnghia · 2026-03-20T21:39:10Z

src/main/cpp/src/protobuf/protobuf.hpp

+
+struct ProtobufFieldMetaView {
+  nested_field_descriptor const& schema;
+  cudf::data_type const& output_type;


Dangling reference: This is binding to a temporary at protobuf.cu:225.

Suggested change

cudf::data_type const& output_type;

cudf::data_type const output_type;

src/main/cpp/src/protobuf/protobuf.hpp

src/main/cpp/src/protobuf/protobuf_kernels.cuh

ttnghia · 2026-03-20T21:45:47Z

src/main/cpp/src/protobuf/protobuf_host_helpers.hpp

+ * given an array of schema indices and the schema itself.
+ * Returns an empty vector if the max field number exceeds the threshold.
+ */
+inline std::vector<int> build_index_lookup_table(nested_field_descriptor const* schema,


build_index_lookup_table and build_field_lookup_table have identical algorithm bodies differing only in field-number accessor.

Consider extract their content into a shared function.

Nice catch done.

ttnghia · 2026-03-20T21:46:57Z

src/main/cpp/src/protobuf/protobuf_types.cuh

+constexpr int ERR_REPEATED_COUNT_MISMATCH = 12;
+
+// Maximum supported nesting depth for recursive struct decoding.
+constexpr int MAX_NESTED_STRUCT_DECODE_DEPTH = 10;


This seems duplicate of MAX_NESTING_DEPTH=10 in protobuf.hpp, right?

Oh, nice catch!

src/main/cpp/CMakeLists.txt

src/main/cpp/src/protobuf/protobuf.cu

src/main/cpp/src/protobuf/protobuf_device_helpers.cuh

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

ttnghia

We are very close.

ttnghia · 2026-03-23T18:12:01Z

src/main/cpp/src/ProtobufJni.cpp

+    // Convert default string values
+    std::vector<std::vector<uint8_t>> default_string_values;
+    default_string_values.reserve(num_fields);
+    for (int i = 0; i < num_fields; ++i) {
+      jbyteArray byte_arr = static_cast<jbyteArray>(env->GetObjectArrayElement(default_strings, i));
+      if (env->ExceptionCheck()) { return 0; }
+      if (byte_arr == nullptr) {
+        default_string_values.emplace_back();
+      } else {
+        jsize len    = env->GetArrayLength(byte_arr);
+        jbyte* bytes = env->GetByteArrayElements(byte_arr, nullptr);
+        if (bytes == nullptr) {
+          env->DeleteLocalRef(byte_arr);
+          return 0;
+        }
+        default_string_values.emplace_back(reinterpret_cast<uint8_t*>(bytes),
+                                           reinterpret_cast<uint8_t*>(bytes) + len);
+        env->ReleaseByteArrayElements(byte_arr, bytes, JNI_ABORT);
+        env->DeleteLocalRef(byte_arr);
+      }
+    }
+
+    // Convert enum valid values
+    std::vector<std::vector<int32_t>> enum_values;
+    enum_values.reserve(num_fields);
+    for (int i = 0; i < num_fields; ++i) {
+      jintArray int_arr = static_cast<jintArray>(env->GetObjectArrayElement(enum_valid_values, i));
+      if (env->ExceptionCheck()) { return 0; }
+      if (int_arr == nullptr) {
+        enum_values.emplace_back();
+      } else {
+        jsize len  = env->GetArrayLength(int_arr);
+        jint* ints = env->GetIntArrayElements(int_arr, nullptr);
+        if (ints == nullptr) {
+          env->DeleteLocalRef(int_arr);
+          return 0;
+        }
+        enum_values.emplace_back(ints, ints + len);
+        env->ReleaseIntArrayElements(int_arr, ints, JNI_ABORT);
+        env->DeleteLocalRef(int_arr);
+      }
+    }


Nit: This byte-array-to-vector conversion pattern is repeated almost exactly for default_string_values and enum_values. Can we extract common function for both?

ttnghia · 2026-03-23T18:20:18Z

src/main/cpp/src/protobuf/protobuf_host_helpers.hpp

+#include <thrust/fill.h>
+#include <thrust/for_each.h>
+#include <thrust/iterator/counting_iterator.h>
+#include <thrust/remove.h>
+#include <thrust/scan.h>
+#include <thrust/sort.h>
+#include <thrust/transform.h>
+#include <thrust/unique.h>


This seems still a CUDA header with lot of device code and kernel, thus I don't see what's difference from the protobuf_device_helpers.cuh?

Good point. I moved those device code and kernels to protobuf_kernels.

ttnghia · 2026-03-23T18:24:29Z

src/main/cpp/src/protobuf/protobuf_host_helpers.hpp

+    CUDF_CUDA_TRY(cudaMemcpyAsync(
+      d_default.data(), default_bytes.data(), def_len, cudaMemcpyHostToDevice, stream.value()));


Use memcpy from pageable memory is known to have limitations. Can we pass default_bytes as a pinned memory vector to this function instead?

It looks to be a quite large refactor but I think it's done.

ttnghia · 2026-03-23T18:25:01Z

src/main/cpp/src/protobuf/protobuf_kernels.cu

+
+  auto const blocks = static_cast<int>((num_items + THREADS_PER_BLOCK - 1u) / THREADS_PER_BLOCK);
+  rmm::device_uvector<int32_t> d_valid_enums(valid_enums.size(), stream, mr);
+  CUDF_CUDA_TRY(cudaMemcpyAsync(d_valid_enums.data(),


Copy from pageable memory. Can we pass valid_enums as pinned memory vector instead?

Yes. I did it along with the default_bytes change.

src/main/cpp/src/protobuf/protobuf_kernels.cu

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

ttnghia · 2026-04-01T17:43:04Z

src/main/cpp/src/protobuf/protobuf.cu

+
+namespace spark_rapids_jni::protobuf {
+
+using namespace detail;


Why is this needed?

Changed to the namespace detail { way.

ttnghia · 2026-04-01T17:45:41Z

src/main/cpp/src/protobuf/protobuf.cu

+  CUDF_EXPECTS(context.default_ints.size() == num_fields,
+               "protobuf decode context: default_ints size mismatch with schema (" +
+                 std::to_string(context.default_ints.size()) + " vs " + std::to_string(num_fields) +
+                 ")",
+               std::invalid_argument);
+  CUDF_EXPECTS(context.default_floats.size() == num_fields,
+               "protobuf decode context: default_floats size mismatch with schema (" +
+                 std::to_string(context.default_floats.size()) + " vs " +
+                 std::to_string(num_fields) + ")",
+               std::invalid_argument);
+  CUDF_EXPECTS(context.default_bools.size() == num_fields,
+               "protobuf decode context: default_bools size mismatch with schema (" +
+                 std::to_string(context.default_bools.size()) + " vs " +
+                 std::to_string(num_fields) + ")",
+               std::invalid_argument);
+  CUDF_EXPECTS(context.default_strings.size() == num_fields,
+               "protobuf decode context: default_strings size mismatch with schema (" +
+                 std::to_string(context.default_strings.size()) + " vs " +
+                 std::to_string(num_fields) + ")",
+               std::invalid_argument);
+  CUDF_EXPECTS(context.enum_valid_values.size() == num_fields,
+               "protobuf decode context: enum_valid_values size mismatch with schema (" +
+                 std::to_string(context.enum_valid_values.size()) + " vs " +
+                 std::to_string(num_fields) + ")",
+               std::invalid_argument);
+  CUDF_EXPECTS(context.enum_names.size() == num_fields,
+               "protobuf decode context: enum_names size mismatch with schema (" +
+                 std::to_string(context.enum_names.size()) + " vs " + std::to_string(num_fields) +
+                 ")",
+               std::invalid_argument);


These calls would be expensive, as the strings are always concatenated to form the error messages unconditionally. The better approach, which only constructs the error message if failed, is:

if (cond) { CUDF_FAIL(<err_msg>); }

Thanks, done.

ttnghia · 2026-04-01T17:47:03Z

src/main/cpp/src/protobuf/protobuf.cu

+                 ")",
+               std::invalid_argument);
+
+  std::set<std::pair<int, int>> seen_field_numbers;


Use unsorted hash table is better, since each insertion to the sorted table is more expensive.

Suggested change

std::set<std::pair<int, int>> seen_field_numbers;

#include <unordered_set>

std::unordered_set<std::pair<int, int>> seen_field_numbers;

Done. Now using std::unordered_set<uint64_t> seen_field_numbers; because pair do not have a default hash function.

I used to think set is basically faster than unordered_set for small size of data because of the big constant in O(1) and possible collision for hash table. It turns out I was wrong. Thanks!

ttnghia · 2026-04-01T17:55:46Z

src/main/cpp/src/protobuf/protobuf_kernels.cu

+  if (num_rows == 0 || field_indices.empty()) { return; }
+
+  bool has_required  = false;
+  auto h_is_required = cudf::detail::make_host_vector<uint8_t>(field_indices.size(), stream);


We need make_pinned_vector specifically, do not use make_host_vector. make_host_vector is a generic factory function which can allocate from either pinned pool or pageable (just normal host memory) based on an internal threshold value. By default, the threshold to use pinned pool in cudf is 0, which mean just allocate from pageable memory, which mean the output host_vector is no different from std::vector.

Suggested change

auto h_is_required = cudf::detail::make_host_vector<uint8_t>(field_indices.size(), stream);

auto h_is_required = cudf::detail::make_pinned_vector<uint8_t>(field_indices.size(), stream);

Note: This is applied to all places using make_host_vector.

Thanks, updated.

ttnghia · 2026-04-01T18:01:18Z

src/main/cpp/src/protobuf/protobuf_kernels.cu

+  rmm::device_uvector<int32_t> invalid_rows(num_items, stream, mr);
+  thrust::transform(rmm::exec_policy_nosync(stream),
+                    thrust::make_counting_iterator(0),
+                    thrust::make_counting_iterator(num_items),
+                    invalid_rows.begin(),
+                    [item_invalid = item_invalid.data(), top_row_indices] __device__(int idx) {
+                      return item_invalid[idx] ? top_row_indices[idx] : -1;
+                    });
+
+  auto valid_end =
+    thrust::remove(rmm::exec_policy_nosync(stream), invalid_rows.begin(), invalid_rows.end(), -1);
+  thrust::sort(rmm::exec_policy_nosync(stream), invalid_rows.begin(), valid_end);
+  auto unique_end =
+    thrust::unique(rmm::exec_policy_nosync(stream), invalid_rows.begin(), valid_end);
+  thrust::for_each(rmm::exec_policy_nosync(stream),
+                   invalid_rows.begin(),
+                   unique_end,
+                   [row_invalid = row_invalid.data()] __device__(int32_t row_idx) {
+                     row_invalid[row_idx] = true;
+                   });


This seems too overkilled (it uses transform→remove→sort→unique→for_each (5 passes) when a single scatter suffices. Bool writes are idempotent — no dedup needed):

#include <cuda/atomic> - rmm::device_uvector<int32_t> invalid_rows(num_items, stream, mr); - thrust::transform(...); // pass 1 - thrust::remove(...); // pass 2 - thrust::sort(...); // pass 3 - thrust::unique(...); // pass 4 - thrust::for_each(...); // pass 5 thrust::for_each(rmm::exec_policy_nosync(stream), thrust::make_counting_iterator(0), thrust::make_counting_iterator(num_items), [item_invalid, top_row_indices, row_invalid] __device__(int idx) { if (item_invalid[idx]) { cuda::atomic_ref<bool, cuda::thread_scope_device> ref(row_invalid[top_row_indices[idx]]); ref.store(true, cuda::memory_order_relaxed); } });

👍 nice catch

src/main/cpp/src/protobuf/protobuf_kernels.cu

ttnghia · 2026-04-01T18:05:36Z

src/main/cpp/src/protobuf/protobuf_kernels.cu

+                                 int32_t const* top_row_indices,
+                                 int* error_flag,
+                                 rmm::cuda_stream_view stream,
+                                 rmm::device_async_resource_ref mr)


Similarly, there is no return memory thus mr seems redundant.

ttnghia · 2026-04-01T18:18:32Z

src/main/cpp/src/protobuf/protobuf_kernels.cuh

+  }
+}
+template <typename LocationProvider>
+CUDF_KERNEL void copy_varlen_data_kernel(uint8_t const* message_data,


Does this kernel do copying on medium/large size strings? If so, generating the arrays src_ptr, dst_ptr, sizes using a thrust::for_each kernel then using cub batch memcpy would be much faster.

Yes it might happen. Updated as you suggested.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

ttnghia · 2026-04-03T03:58:44Z

src/main/cpp/src/protobuf/protobuf_kernels.cu

+  if (!has_required) { return; }
+
+  auto d_is_required = cudf::detail::make_device_uvector_async(
+    h_is_required, stream, rmm::mr::get_current_device_resource());


The lines 89, 151, 153 and protobuf_kernels.cuh (line 570): our convention mandates usage cudf::get_current_device_resource_ref().

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

yinqingh · 2026-04-03T05:09:18Z

Please disregard the pre-commit.ci failure. It's introduced from #4430 and I just disabled pre-commit.ci for the JNI repo and will re-enable once 4430 gets merged. Sorry for the confusion caused.

Correction: the correct PR link should be #4430, not 14525

liurenjie1024 · 2026-04-03T05:15:35Z

Please disregard the pre-commit.ci failure. It's introduced from NVIDIA/spark-rapids#14525 and I just disabled pre-commit.ci for the JNI repo and will re-enable once 14525 gets merged. Sorry for the confusion caused.

Why premerge ci failure is introduced by [NVIDIA/spark-rapids#14525]?

yinqingh · 2026-04-03T05:31:58Z

Please disregard the pre-commit.ci failure. It's introduced from NVIDIA/spark-rapids#14525 and I just disabled pre-commit.ci for the JNI repo and will re-enable once 14525 gets merged. Sorry for the confusion caused.

Why premerge ci failure is introduced by [NVIDIA/spark-rapids#14525]?

Sorry, I pasted the wrong PR link. It should be #4430

src/main/cpp/src/protobuf/protobuf_kernels.cu

src/main/cpp/src/protobuf/protobuf_kernels.cuh

Co-authored-by: Nghia Truong <7416935+ttnghia@users.noreply.github.com>

ttnghia · 2026-04-03T20:14:41Z

build

Resolve conflicts from part0 (PR NVIDIA#4373) merge. Take main's reviewed conventions: detail namespace, snake_case, exec_policy_nosync, cub::DeviceMemcpy, pinned vectors, no-brace style. Keep dev branch's full implementation code. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

First PR in a series adding a GPU-accelerated protobuf decoder to cuIO. Establishes the public API, schema validation, JNI bridge, Python/Cython bindings, and a stub decode_protobuf() entry point. The stub returns correctly-typed all-null columns for each schema field; actual data extraction is added in follow-up PRs. Includes: - C++ public API (protobuf.hpp): decode_protobuf_options, nested_field_descriptor, typed proto_encoding/proto_wire_type enums, validate_decode_options() - Shared CUDA infrastructure (device_helpers, host_helpers, kernels, types) - Java API (ProtobufSchemaDescriptor) and JNI bridge (ProtobufJni.cpp) - Python/Cython bindings (pylibcudf.io.protobuf) - 7 C++ tests covering output shape, type structure, and null propagation Migrated from NVIDIA/spark-rapids-jni#4373. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

thirtiseven self-assigned this Mar 16, 2026

thirtiseven changed the title ~~from_protocol kernel (part0): framework code for protocol buffer decoder~~ from_protocol kernel (part0): framework, API, and schema validation Mar 16, 2026

thirtiseven changed the title ~~from_protocol kernel (part0): framework, API, and schema validation~~ from_protobuf kernel (part0): framework, API, and schema validation Mar 16, 2026

greptile-apps bot reviewed Mar 16, 2026

View reviewed changes

src/main/cpp/src/protobuf.cu Outdated Show resolved Hide resolved

src/main/cpp/src/protobuf.hpp Outdated Show resolved Hide resolved

src/test/java/com/nvidia/spark/rapids/jni/ProtobufTest.java Show resolved Hide resolved

greptile-apps bot reviewed Mar 17, 2026

View reviewed changes

src/main/cpp/src/protobuf.cu Outdated Show resolved Hide resolved

src/test/java/com/nvidia/spark/rapids/jni/ProtobufTest.java Show resolved Hide resolved

greptile-apps bot reviewed Mar 17, 2026

View reviewed changes

src/main/cpp/src/ProtobufJni.cpp Outdated Show resolved Hide resolved

thirtiseven changed the base branch from main to release/26.04 March 17, 2026 08:03

thirtiseven requested a review from a team as a code owner March 17, 2026 08:03

greptile-apps bot reviewed Mar 17, 2026

View reviewed changes

src/main/cpp/src/protobuf/protobuf_builders.cu Show resolved Hide resolved

thirtiseven added 6 commits March 17, 2026 16:13

framework code for protocol buffer decoder kernels

3c25016

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

style

3a47f5a

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

address comments

5800387

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

address comments

6992e9a

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

address comment

0f76877

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

port enum refactor

b619524

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven force-pushed the protobuf_pr0_framework branch from 9e440d7 to b619524 Compare March 17, 2026 08:14

greptile-apps bot reviewed Mar 17, 2026

View reviewed changes

src/main/cpp/src/protobuf_common.cuh Outdated Show resolved Hide resolved

thirtiseven added 4 commits March 17, 2026 16:22

address comments

d6997db

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

address comments

c1fed48

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

address comments

d25fba1

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

fix compile

0e82ca8

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven requested a review from res-life March 17, 2026 09:18

ttnghia reviewed Mar 17, 2026

View reviewed changes

src/main/cpp/src/protobuf.hpp Outdated Show resolved Hide resolved

ttnghia reviewed Mar 17, 2026

View reviewed changes

src/main/cpp/src/protobuf.hpp Outdated Show resolved Hide resolved

thirtiseven added 2 commits March 20, 2026 18:10

address comments

2caf5d8

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

cudf

e0a990c

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

ttnghia reviewed Mar 20, 2026

View reviewed changes

address comments

d8b3ede

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

ttnghia reviewed Mar 23, 2026

View reviewed changes

address coemments, use pinned memory

b56aef4

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven changed the base branch from release/26.04 to main March 30, 2026 02:07

thirtiseven requested a review from ttnghia March 31, 2026 01:43

ttnghia requested changes Apr 1, 2026

View reviewed changes

thirtiseven added 2 commits April 2, 2026 14:21

Merge remote-tracking branch 'origin/main' into protobuf_pr0_framework

ba011f7

address comments

456da81

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

ttnghia reviewed Apr 3, 2026

View reviewed changes

thirtiseven added 2 commits April 3, 2026 12:17

using get_current_device_resource_ref

e748ea7

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

ci

1ab75a7

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

ttnghia reviewed Apr 3, 2026

View reviewed changes

ttnghia added 4 commits April 3, 2026 13:11

Update src/main/cpp/src/protobuf/protobuf_kernels.cu

176dfba

Update src/main/cpp/src/protobuf/protobuf_kernels.cu

1f7f4ff

Update src/main/cpp/src/protobuf/protobuf_kernels.cu

a8cd81c

Update src/main/cpp/src/protobuf/protobuf_kernels.cuh

aa9c1ca

ttnghia reviewed Apr 3, 2026

View reviewed changes

src/main/cpp/src/protobuf/protobuf_kernels.cuh Show resolved Hide resolved

Apply suggestions from code review

0cc87ee

Co-authored-by: Nghia Truong <7416935+ttnghia@users.noreply.github.com>

ttnghia approved these changes Apr 3, 2026

View reviewed changes

thirtiseven merged commit 59fb954 into NVIDIA:main Apr 4, 2026
5 checks passed

thirtiseven mentioned this pull request Apr 9, 2026

Add GPU protobuf decoder to cuIO [part 0]: framework, API, and stub decode rapidsai/cudf#22077

Open

3 tasks

	cudf::data_type const& output_type;
	cudf::data_type const output_type;

		CUDF_CUDA_TRY(cudaMemcpyAsync(
		d_default.data(), default_bytes.data(), def_len, cudaMemcpyHostToDevice, stream.value()));


		namespace spark_rapids_jni::protobuf {

		using namespace detail;

-  std::set<std::pair<int, int>> seen_field_numbers;
+#include <unordered_set>
+  std::unordered_set<std::pair<int, int>> seen_field_numbers;

	auto h_is_required = cudf::detail::make_host_vector<uint8_t>(field_indices.size(), stream);
	auto h_is_required = cudf::detail::make_pinned_vector<uint8_t>(field_indices.size(), stream);

Conversation

thirtiseven commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Add GPU protobuf decoder: framework, API, and schema validation [1/4]

Summary

What is included

Test coverage (26 tests)

Follow-up PRs

Review guide

Uh oh!

greptile-apps bot commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ttnghia left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ttnghia Mar 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ttnghia left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

thirtiseven commented Mar 16, 2026 •

edited

Loading

greptile-apps bot commented Mar 16, 2026 •

edited

Loading

ttnghia Mar 17, 2026 •

edited

Loading