
from_protobuf kernel (part0): framework, API, and schema validation #4373

Merged
thirtiseven merged 30 commits into NVIDIA:main from thirtiseven:protobuf_pr0_framework
Apr 4, 2026

Conversation

@thirtiseven
Collaborator

@thirtiseven thirtiseven commented Mar 16, 2026

Add GPU protobuf decoder: framework, API, and schema validation [1/4]

Summary

First PR in a four-part series adding a GPU-accelerated protobuf decoder to spark-rapids-jni. This PR establishes the public API, schema validation, JNI bridge, shared CUDA infrastructure, and a stub decode_protobuf_to_struct entry point.

The decoder converts LIST<INT8/UINT8> columns (one serialized protobuf message per row) into nested cuDF STRUCT columns. The stub in this PR returns correctly-typed all-null columns for each schema field; actual data extraction is added in follow-up PRs.

What is included

  • C++ public API (protobuf.hpp): nested_field_descriptor, ProtobufDecodeContext, typed proto_encoding / proto_wire_type enums, validate_decode_context() with comprehensive schema invariant checks (field numbers, parent-child topology, depth, wire type / encoding / output type compatibility, repeated && required rejection, sorted enum_valid_values).

  • Java schema API (Protobuf.java + ProtobufSchemaDescriptor.java): immutable schema descriptor with defensive deep copies, full validation mirroring C++, Serializable with re-validation on deserialization. Public decodeToStruct() method with PERMISSIVE / FAILFAST mode support.

  • JNI bridge (ProtobufJni.cpp): converts 15 Java arrays to C++ ProtobufDecodeContext with null checks, field number range validation, wire type validation, and proper DeleteLocalRef for all JNI local references including triple-nested enum_names.

  • Shared CUDA header (protobuf_common.cuh): types, device helpers (read_varint, skip_field, decode_tag), LocationProvider structs, template extraction kernels, and forward declarations. Included in full so that follow-up PRs do not need to modify this header — eliminating merge conflicts across the PR chain.

  • Stub decode (protobuf.cu): validates context, handles empty-schema and zero-row edge cases with correct nested type construction, propagates input null mask to output struct, and assembles a STRUCT with all-null children via recursive make_null_column_with_schema (respects nested STRUCT children and repeated LIST wrapping).

  • Column utilities (protobuf_builders.cu): make_null_column (all types), make_empty_column_safe, make_null_list_column_with_child, make_empty_list_column.
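The device helpers named above (read_varint, decode_tag, skip_field) center on protobuf's wire format. As a purely illustrative host-side sketch of that format, not the actual CUDA code, a varint reader and tag decoder can be written as:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Decode a base-128 varint starting at data[pos]; advances pos past it.
// Returns std::nullopt on truncated input or an overlong varint.
std::optional<uint64_t> read_varint(uint8_t const* data, size_t len, size_t& pos)
{
  uint64_t value = 0;
  for (int shift = 0; shift < 64; shift += 7) {
    if (pos >= len) return std::nullopt;  // truncated message
    uint8_t byte = data[pos++];
    value |= static_cast<uint64_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) return value;  // MSB clear: last byte
  }
  return std::nullopt;  // more than 10 continuation bytes: invalid
}

// A field tag is itself a varint: (field_number << 3) | wire_type.
struct tag {
  uint32_t field_number;
  uint32_t wire_type;
};

std::optional<tag> decode_tag(uint8_t const* data, size_t len, size_t& pos)
{
  auto v = read_varint(data, len, pos);
  if (!v) return std::nullopt;
  return tag{static_cast<uint32_t>(*v >> 3), static_cast<uint32_t>(*v & 0x7)};
}
```

For example, the classic encoding of field 1 (varint wire type) with value 150 is the three bytes 0x08 0x96 0x01.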

Test coverage (26 tests)

  • Schema validation (13 tests): repeated+default rejection, repeated+required rejection, struct/list default rejection, enum metadata requirements, duplicate field numbers, child-parent constraints, encoding compatibility, depth limit, serialization roundtrip.

  • Output shape and null semantics (13 tests): empty schema, single/multi-field schemas, multiple rows, null input row propagation (verified with isNull assertions), all-null input, zero-row with flat/nested/repeated schemas (including grandchild type verification), repeated scalar LIST wrapping, input validation.

Follow-up PRs

  1. This PR — Framework + API + stub decode
  2. Scalar extraction — scan_all_fields_kernel, batched extraction, all scalar types, defaults, required fields
  3. Repeated + nested — count/scan/build pipeline, recursive nested decode up to depth 10
  4. Enum-as-string + PERMISSIVE — enum validation, varint-to-UTF8 conversion, null propagation

Each follow-up inserts decode logic into the column_map section of protobuf.cu without modifying existing code. protobuf_common.cuh is frozen after this PR.

Review guide

  1. protobuf.hpp — Start here. The API contract: nested_field_descriptor, ProtobufDecodeContext, validate_decode_context().
  2. ProtobufSchemaDescriptor.java — Java-side validation mirror. Check defensive copies and deserialization re-validation.
  3. ProtobufJni.cpp — Focus on local reference cleanup in the enum_names triple-nested loop.
  4. protobuf.cu — Stub decode. Verify: empty-schema returns num_rows with 0 children; zero-row builds correct nested types; null mask is propagated from input; make_null_column_with_schema recursively builds STRUCT children and LIST wrapping.
  5. protobuf_common.cuh — For this PR, focus on types and make_empty_struct_column_with_schema / find_child_field_indices. The rest (device helpers, template kernels, LocationProviders) is infrastructure for follow-up PRs.
  6. Tests — ProtobufSchemaDescriptorTest for validation, ProtobufTest for output shape and null semantics.
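To make the validation contract concrete, here is a host-side sketch of two of the invariants listed above (duplicate field numbers within a parent, and repeated && required rejection). field_desc and validate_schema are hypothetical names for illustration, not the real nested_field_descriptor API:

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <unordered_set>
#include <vector>

// Illustrative subset of a schema field descriptor.
struct field_desc {
  int parent_idx;    // -1 for top-level fields
  int field_number;  // protobuf field number, unique per parent
  bool is_repeated;
  bool is_required;
};

void validate_schema(std::vector<field_desc> const& schema)
{
  // Pack (parent_idx, field_number) into one key; std::pair has no default hash.
  std::unordered_set<uint64_t> seen;
  for (auto const& f : schema) {
    if (f.is_repeated && f.is_required) {
      throw std::invalid_argument("field cannot be both repeated and required");
    }
    uint64_t key = (static_cast<uint64_t>(static_cast<uint32_t>(f.parent_idx)) << 32) |
                   static_cast<uint32_t>(f.field_number);
    if (!seen.insert(key).second) {
      throw std::invalid_argument("duplicate field number within the same parent");
    }
  }
}
```

The real validate_decode_context() additionally checks field-number ranges, parent-child topology, depth, and wire-type/encoding/output-type compatibility.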

@thirtiseven thirtiseven self-assigned this Mar 16, 2026
@thirtiseven thirtiseven changed the title from_protocol kernel (part0): framework code for protocol buffer decoder from_protocol kernel (part0): framework, API, and schema validation Mar 16, 2026
@thirtiseven thirtiseven changed the title from_protocol kernel (part0): framework, API, and schema validation from_protobuf kernel (part0): framework, API, and schema validation Mar 16, 2026
@greptile-apps
Contributor

greptile-apps bot commented Mar 16, 2026

Greptile Summary

This PR establishes the full framework for a GPU-accelerated protobuf decoder: public C++ API (protobuf.hpp), an immutable Java schema descriptor with deep-copy and re-validation on deserialization, a JNI bridge, shared CUDA infrastructure headers, and a stub decode_protobuf_to_struct that returns correctly-typed all-null columns. All structural issues raised in prior rounds (zero-row repeated scalar shape, nested STRUCT child construction, repeated && required and sorted-enum-values validation) have been addressed and are covered by the expanded test suite.

Confidence Score: 5/5

Safe to merge; all prior P0/P1 issues are resolved and the one remaining finding is a minor P2 resource-safety note in JNI helpers.

The major structural bugs from earlier review rounds (wrong column type for zero-row repeated scalars, zero-child STRUCT assembly, missing repeated+required and enum sort checks) are all fixed and tested. The only new finding is a low-probability resource leak in jni_byte_array_to_vector / jni_int_array_to_vector if make_host_vector throws under OOM — a P2 hardening item that does not block merge.

src/main/cpp/src/ProtobufJni.cpp — minor RAII hardening for JNI array element release.
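The P2 note concerns releasing JNI array elements on every exit path, including when an allocation throws between Get*ArrayElements and Release*ArrayElements. A minimal scope-guard sketch (plain C++, no actual JNI types) shows the usual RAII hardening:

```cpp
#include <cassert>
#include <utility>

// Minimal scope guard: runs the release action on every exit path,
// including stack unwinding when an exception is thrown after acquisition.
template <typename F>
class scope_guard {
  F release_;
  bool active_ = true;

 public:
  explicit scope_guard(F f) : release_(std::move(f)) {}
  ~scope_guard() { if (active_) release_(); }
  void dismiss() { active_ = false; }
  scope_guard(scope_guard const&)            = delete;
  scope_guard& operator=(scope_guard const&) = delete;
};

// In the JNI helpers this would be used roughly like (illustrative only):
//   jbyte* bytes = env->GetByteArrayElements(arr, nullptr);
//   scope_guard release{[&] { env->ReleaseByteArrayElements(arr, bytes, JNI_ABORT); }};
//   auto vec = make_host_vector(...);  // may throw under OOM; release still runs
```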

Important Files Changed

Filename Overview
src/main/cpp/src/ProtobufJni.cpp JNI bridge: null checks, array-size validation, and local-ref cleanup are correct. Minor resource-leak risk in jni_byte_array_to_vector / jni_int_array_to_vector if make_host_vector throws.
src/main/cpp/src/protobuf/protobuf.cu Stub decode: zero-row repeated scalar/STRUCT, non-zero-row nested STRUCT children, repeated+required, and enum sorted-order checks all addressed. make_null_column_with_schema correctly recurses.
src/main/cpp/src/protobuf/protobuf.hpp Public C++ API: clean enum/struct definitions, MAX_FIELD_NUMBER/MAX_NESTING_DEPTH constants, and forward declarations. No issues found.
src/main/cpp/src/protobuf/protobuf_builders.cu Builder utilities: make_null_column, make_empty_column_safe, make_null_list_column_with_child, make_empty_list_column all look correct. make_lists_column signature confirmed not to accept stream/mr.
src/main/java/com/nvidia/spark/rapids/jni/ProtobufSchemaDescriptor.java Immutable schema descriptor with defensive deep copies, full validation mirroring C++, and re-validation on deserialization. All invariants (repeated+required, sorted enum values, etc.) correctly enforced.
src/main/java/com/nvidia/spark/rapids/jni/Protobuf.java Clean public Java API with null checks and correct native method signature matching the JNI implementation.
src/main/cpp/src/protobuf/protobuf_kernels.cuh CUDA kernel infrastructure: LocationProviders, extract_varint_kernel, and related templates. Kernel code is forward-looking infrastructure for follow-up PRs.
src/main/cpp/src/protobuf/protobuf_types.cuh Device-side type definitions including field_location, field_descriptor, repeated_field_info, and device_nested_field_descriptor. Clean and complete.
src/test/java/com/nvidia/spark/rapids/jni/ProtobufTest.java 13 output-shape and null-semantic tests including zero-row repeated scalar (testZeroRowRepeatedScalarShape) and nested struct child assertions (testNestedMessageOutputShape). Previous coverage gaps addressed.
src/test/java/com/nvidia/spark/rapids/jni/ProtobufSchemaDescriptorTest.java 13 schema validation tests covering repeated+default rejection, repeated+required rejection, enum metadata requirements, duplicate field numbers, and serialization roundtrip.
src/main/cpp/src/protobuf/protobuf_host_helpers.hpp Host-side helpers: build_lookup_table, find_child_field_indices, make_empty_struct_column_with_schema forward declarations. Clean template design.
src/main/cpp/src/protobuf/protobuf_device_helpers.cuh Device helper functions (read_varint, skip_field, decode_tag, set_error_once). Infrastructure for follow-up PRs; current PR only validates the headers compile.
src/main/cpp/src/protobuf/protobuf_kernels.cu Kernel translation unit — primarily a compilation unit for protobuf_kernels.cuh infrastructure used by follow-up PRs.
src/main/cpp/CMakeLists.txt Adds protobuf source files to the build. Minimal diff, no issues.

Sequence Diagram

sequenceDiagram
    participant Java as Protobuf.java
    participant JNI as ProtobufJni.cpp
    participant Validate as validate_decode_context()
    participant Stub as decode_protobuf_to_struct()
    participant Builders as protobuf_builders.cu

    Java->>JNI: decodeToStruct(binaryInputView, 15 arrays...)
    JNI->>JNI: null-check all inputs
    JNI->>JNI: validate array lengths match num_fields
    JNI->>JNI: build nested_field_descriptor[] + context
    JNI->>Stub: decode_protobuf_to_struct(input, context, stream, mr)
    Stub->>Validate: validate_decode_context(context)
    Validate-->>Stub: OK / throws invalid_argument
    alt num_fields == 0
        Stub->>Stub: return empty STRUCT<> with input null mask
    else num_rows == 0
        Stub->>Builders: make_empty_struct/list/column_safe per top-level field
        Builders-->>Stub: typed empty columns
        Stub->>Stub: return STRUCT<...> with 0 rows
    else normal path (stub)
        Stub->>Builders: make_null_column_with_schema per top-level field
        Builders->>Builders: recurse into STRUCT children / wrap repeated in LIST
        Builders-->>Stub: all-null columns with correct type hierarchy
        Stub->>Stub: propagate input null mask → return STRUCT<...>
    end
    Stub-->>JNI: unique_ptr<cudf::column>
    JNI-->>Java: jlong handle → ColumnVector

Reviews (22): Last reviewed commit: "Apply suggestions from code review"

@thirtiseven thirtiseven changed the base branch from main to release/26.04 March 17, 2026 08:03
@thirtiseven thirtiseven requested a review from a team as a code owner March 17, 2026 08:03
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven thirtiseven force-pushed the protobuf_pr0_framework branch from 9e440d7 to b619524 Compare March 17, 2026 08:14
@thirtiseven thirtiseven requested a review from res-life March 17, 2026 09:18
Collaborator

@ttnghia ttnghia left a comment

Can you put the new files (except the *Jni.cpp file) into a separate folder, please? Such as src/protobuf/*. This way we can organize code by module and get a better structural view of the project. We should do the same to reorganize other modules soon.

Comment on lines +31 to +34
namespace spark_rapids_jni {

// Encoding constants
enum class proto_encoding : int {
Collaborator

For better encapsulation, please wrap all the protobuf code into a sub-namespace, namespace protobuf. By doing so, we can more easily identify which module a piece of code belongs to. For example: protobuf::proto_encoding etc. It will make the code less confusing and cleaner.

Collaborator Author

done

Comment on lines +312 to +314
std::unique_ptr<cudf::column> decode_protobuf_to_struct(cudf::column_view const& binary_input,
ProtobufDecodeContext const& context,
rmm::cuda_stream_view stream);
Collaborator

Missing memory resource parameter.

Collaborator Author

Added

}
}

inline void validate_decode_context(ProtobufDecodeContext const& context)
Collaborator

Do not define inline functions in a public header. Put them in a source file and leave only the declaration here.

Collaborator Author

Moved to protobuf.cu

Comment on lines +152 to +154
throw std::invalid_argument(std::string("protobuf decode context: ") + name +
" size mismatch with schema (" + std::to_string(actual) + " vs " +
std::to_string(num_fields) + ")");
Collaborator

@ttnghia ttnghia Mar 17, 2026

Use CUDF_EXPECTS(..., std::invalid_argument) for the error check.

Collaborator Author

done.


struct ProtobufFieldMetaView {
nested_field_descriptor const& schema;
cudf::data_type const& output_type;
Collaborator

Dangling reference: This is binding to a temporary at protobuf.cu:225.

Suggested change
cudf::data_type const& output_type;
cudf::data_type const output_type;

* given an array of schema indices and the schema itself.
* Returns an empty vector if the max field number exceeds the threshold.
*/
inline std::vector<int> build_index_lookup_table(nested_field_descriptor const* schema,
Collaborator

build_index_lookup_table and build_field_lookup_table have identical algorithm bodies, differing only in the field-number accessor.

Consider extracting their contents into a shared function.

Collaborator Author

Nice catch, done.
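The de-duplication agreed on here can be sketched as a single builder parameterized over the field-number accessor. build_lookup_table and its signature below are illustrative, not the actual helper:

```cpp
#include <cassert>
#include <vector>

// Build a field_number -> schema-index lookup table over [0, max_field_number],
// with -1 marking absent field numbers. The accessor abstracts over the two
// call sites that previously duplicated this loop body.
template <typename GetFieldNumber>
std::vector<int> build_lookup_table(int num_fields,
                                    int max_field_number,
                                    GetFieldNumber get_field_number)
{
  std::vector<int> table(max_field_number + 1, -1);
  for (int i = 0; i < num_fields; ++i) {
    int fn = get_field_number(i);
    if (fn >= 0 && fn <= max_field_number) { table[fn] = i; }
  }
  return table;
}
```

The two original functions then become thin wrappers that pass different accessor lambdas.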

constexpr int ERR_REPEATED_COUNT_MISMATCH = 12;

// Maximum supported nesting depth for recursive struct decoding.
constexpr int MAX_NESTED_STRUCT_DECODE_DEPTH = 10;
Collaborator

This seems duplicate of MAX_NESTING_DEPTH=10 in protobuf.hpp, right?

Collaborator Author

Oh, nice catch!

Collaborator

@ttnghia ttnghia left a comment

We are very close.

Comment on lines +115 to +156
// Convert default string values
std::vector<std::vector<uint8_t>> default_string_values;
default_string_values.reserve(num_fields);
for (int i = 0; i < num_fields; ++i) {
jbyteArray byte_arr = static_cast<jbyteArray>(env->GetObjectArrayElement(default_strings, i));
if (env->ExceptionCheck()) { return 0; }
if (byte_arr == nullptr) {
default_string_values.emplace_back();
} else {
jsize len = env->GetArrayLength(byte_arr);
jbyte* bytes = env->GetByteArrayElements(byte_arr, nullptr);
if (bytes == nullptr) {
env->DeleteLocalRef(byte_arr);
return 0;
}
default_string_values.emplace_back(reinterpret_cast<uint8_t*>(bytes),
reinterpret_cast<uint8_t*>(bytes) + len);
env->ReleaseByteArrayElements(byte_arr, bytes, JNI_ABORT);
env->DeleteLocalRef(byte_arr);
}
}

// Convert enum valid values
std::vector<std::vector<int32_t>> enum_values;
enum_values.reserve(num_fields);
for (int i = 0; i < num_fields; ++i) {
jintArray int_arr = static_cast<jintArray>(env->GetObjectArrayElement(enum_valid_values, i));
if (env->ExceptionCheck()) { return 0; }
if (int_arr == nullptr) {
enum_values.emplace_back();
} else {
jsize len = env->GetArrayLength(int_arr);
jint* ints = env->GetIntArrayElements(int_arr, nullptr);
if (ints == nullptr) {
env->DeleteLocalRef(int_arr);
return 0;
}
enum_values.emplace_back(ints, ints + len);
env->ReleaseIntArrayElements(int_arr, ints, JNI_ABORT);
env->DeleteLocalRef(int_arr);
}
}
Collaborator

Nit: this byte-array-to-vector conversion pattern is repeated almost exactly for default_string_values and enum_values. Can we extract a common function for both?

Collaborator Author

done.

Comment on lines +35 to +42
#include <thrust/fill.h>
#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/remove.h>
#include <thrust/scan.h>
#include <thrust/sort.h>
#include <thrust/transform.h>
#include <thrust/unique.h>
Collaborator

This still seems to be a CUDA header with a lot of device code and kernels, so I don't see how it differs from protobuf_device_helpers.cuh.

Collaborator Author

Good point. I moved those device code and kernels to protobuf_kernels.

Comment on lines +476 to +477
CUDF_CUDA_TRY(cudaMemcpyAsync(
d_default.data(), default_bytes.data(), def_len, cudaMemcpyHostToDevice, stream.value()));
Collaborator

Using memcpy from pageable memory is known to have limitations. Can we pass default_bytes to this function as a pinned memory vector instead?

Collaborator Author

It turned out to be a quite large refactor, but I think it's done.


auto const blocks = static_cast<int>((num_items + THREADS_PER_BLOCK - 1u) / THREADS_PER_BLOCK);
rmm::device_uvector<int32_t> d_valid_enums(valid_enums.size(), stream, mr);
CUDF_CUDA_TRY(cudaMemcpyAsync(d_valid_enums.data(),
Collaborator

This copies from pageable memory. Can we pass valid_enums as a pinned memory vector instead?

Collaborator Author

Yes. I did it along with the default_bytes change.

@thirtiseven thirtiseven changed the base branch from release/26.04 to main March 30, 2026 02:07
@thirtiseven thirtiseven requested a review from ttnghia March 31, 2026 01:43

namespace spark_rapids_jni::protobuf {

using namespace detail;
Collaborator

Why is this needed?

Collaborator Author

Changed to the namespace detail { way.

Comment on lines +109 to +138
CUDF_EXPECTS(context.default_ints.size() == num_fields,
"protobuf decode context: default_ints size mismatch with schema (" +
std::to_string(context.default_ints.size()) + " vs " + std::to_string(num_fields) +
")",
std::invalid_argument);
CUDF_EXPECTS(context.default_floats.size() == num_fields,
"protobuf decode context: default_floats size mismatch with schema (" +
std::to_string(context.default_floats.size()) + " vs " +
std::to_string(num_fields) + ")",
std::invalid_argument);
CUDF_EXPECTS(context.default_bools.size() == num_fields,
"protobuf decode context: default_bools size mismatch with schema (" +
std::to_string(context.default_bools.size()) + " vs " +
std::to_string(num_fields) + ")",
std::invalid_argument);
CUDF_EXPECTS(context.default_strings.size() == num_fields,
"protobuf decode context: default_strings size mismatch with schema (" +
std::to_string(context.default_strings.size()) + " vs " +
std::to_string(num_fields) + ")",
std::invalid_argument);
CUDF_EXPECTS(context.enum_valid_values.size() == num_fields,
"protobuf decode context: enum_valid_values size mismatch with schema (" +
std::to_string(context.enum_valid_values.size()) + " vs " +
std::to_string(num_fields) + ")",
std::invalid_argument);
CUDF_EXPECTS(context.enum_names.size() == num_fields,
"protobuf decode context: enum_names size mismatch with schema (" +
std::to_string(context.enum_names.size()) + " vs " + std::to_string(num_fields) +
")",
std::invalid_argument);
Collaborator

These calls would be expensive, as the strings are always concatenated to form the error messages unconditionally. The better approach, which only constructs the error message on failure, is:

if (!cond) {
 CUDF_FAIL(<err_msg>);
}

Collaborator Author

Thanks, done.
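The suggested pattern (only constructing the message on the failure path) can be shown without cudf; check_size here is a hypothetical stand-in for the CUDF_FAIL-based checks, not the actual helper:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Builds the (relatively expensive) error string only when the check
// actually fails, unlike passing a pre-concatenated message into a
// CUDF_EXPECTS-style macro that evaluates its arguments unconditionally.
void check_size(std::size_t actual, std::size_t expected, char const* name)
{
  if (actual != expected) {
    throw std::invalid_argument(std::string("protobuf decode context: ") + name +
                                " size mismatch with schema (" + std::to_string(actual) +
                                " vs " + std::to_string(expected) + ")");
  }
}
```

On the happy path no std::string is ever allocated, which matters when the validation runs over many context members.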

")",
std::invalid_argument);

std::set<std::pair<int, int>> seen_field_numbers;
Collaborator

Using an unsorted hash table is better, since each insertion into a sorted table is more expensive.

Suggested change
std::set<std::pair<int, int>> seen_field_numbers;
#include <unordered_set>
std::unordered_set<std::pair<int, int>> seen_field_numbers;

Collaborator Author

Done. Now using std::unordered_set<uint64_t> seen_field_numbers; because pair does not have a default hash function.

Collaborator Author

I used to think std::set was basically faster than unordered_set for small data sizes because of the large constant in the O(1) lookup and possible collisions in the hash table. It turns out I was wrong. Thanks!

if (num_rows == 0 || field_indices.empty()) { return; }

bool has_required = false;
auto h_is_required = cudf::detail::make_host_vector<uint8_t>(field_indices.size(), stream);
Collaborator

We need make_pinned_vector specifically; do not use make_host_vector. make_host_vector is a generic factory function which can allocate from either the pinned pool or pageable (just normal host) memory based on an internal threshold value. By default, the threshold for using the pinned pool in cudf is 0, which means it just allocates from pageable memory, making the resulting host_vector no different from a std::vector.

Suggested change
auto h_is_required = cudf::detail::make_host_vector<uint8_t>(field_indices.size(), stream);
auto h_is_required = cudf::detail::make_pinned_vector<uint8_t>(field_indices.size(), stream);

Note: this applies to all places using make_host_vector.

Collaborator Author

Thanks, updated.

Comment on lines +131 to +150
rmm::device_uvector<int32_t> invalid_rows(num_items, stream, mr);
thrust::transform(rmm::exec_policy_nosync(stream),
thrust::make_counting_iterator(0),
thrust::make_counting_iterator(num_items),
invalid_rows.begin(),
[item_invalid = item_invalid.data(), top_row_indices] __device__(int idx) {
return item_invalid[idx] ? top_row_indices[idx] : -1;
});

auto valid_end =
thrust::remove(rmm::exec_policy_nosync(stream), invalid_rows.begin(), invalid_rows.end(), -1);
thrust::sort(rmm::exec_policy_nosync(stream), invalid_rows.begin(), valid_end);
auto unique_end =
thrust::unique(rmm::exec_policy_nosync(stream), invalid_rows.begin(), valid_end);
thrust::for_each(rmm::exec_policy_nosync(stream),
invalid_rows.begin(),
unique_end,
[row_invalid = row_invalid.data()] __device__(int32_t row_idx) {
row_invalid[row_idx] = true;
});
Collaborator

This seems like overkill: it uses transform → remove → sort → unique → for_each (5 passes) when a single scatter suffices. Bool writes are idempotent, so no dedup is needed:

#include <cuda/atomic>

 - rmm::device_uvector<int32_t> invalid_rows(num_items, stream, mr);
  - thrust::transform(...);  // pass 1
  - thrust::remove(...);     // pass 2
  - thrust::sort(...);       // pass 3
  - thrust::unique(...);     // pass 4
  - thrust::for_each(...);   // pass 5
thrust::for_each(rmm::exec_policy_nosync(stream),
   thrust::make_counting_iterator(0),
   thrust::make_counting_iterator(num_items),
   [item_invalid, top_row_indices, row_invalid] __device__(int idx) {
     if (item_invalid[idx]) {
       cuda::atomic_ref<bool, cuda::thread_scope_device> ref(row_invalid[top_row_indices[idx]]);
       ref.store(true, cuda::memory_order_relaxed);
     }
   });

Collaborator Author

👍 nice catch

int32_t const* top_row_indices,
int* error_flag,
rmm::cuda_stream_view stream,
rmm::device_async_resource_ref mr)
Collaborator

Similarly, no memory is returned, so mr seems redundant.

Collaborator Author

removed.

}
}
template <typename LocationProvider>
CUDF_KERNEL void copy_varlen_data_kernel(uint8_t const* message_data,
Collaborator

Does this kernel copy medium/large strings? If so, generating the src_ptr, dst_ptr, and sizes arrays with a thrust::for_each kernel and then using cub's batched memcpy would be much faster.

Collaborator Author

Yes it might happen. Updated as you suggested.

if (!has_required) { return; }

auto d_is_required = cudf::detail::make_device_uvector_async(
h_is_required, stream, rmm::mr::get_current_device_resource());
Collaborator

Lines 89, 151, 153 and protobuf_kernels.cuh (line 570): our convention mandates using cudf::get_current_device_resource_ref().

Collaborator Author

done.

@yinqingh
Collaborator

yinqingh commented Apr 3, 2026

Please disregard the pre-commit.ci failure. It's introduced from #4430 and I just disabled pre-commit.ci for the JNI repo and will re-enable once 4430 gets merged. Sorry for the confusion caused.

Correction: the correct PR link should be #4430, not 14525

@liurenjie1024
Collaborator

Please disregard the pre-commit.ci failure. It's introduced from NVIDIA/spark-rapids#14525 and I just disabled pre-commit.ci for the JNI repo and will re-enable once 14525 gets merged. Sorry for the confusion caused.

Why is the premerge CI failure introduced by NVIDIA/spark-rapids#14525?

@yinqingh
Collaborator

yinqingh commented Apr 3, 2026

Please disregard the pre-commit.ci failure. It's introduced from NVIDIA/spark-rapids#14525 and I just disabled pre-commit.ci for the JNI repo and will re-enable once 14525 gets merged. Sorry for the confusion caused.

Why is the premerge CI failure introduced by NVIDIA/spark-rapids#14525?

Sorry, I pasted the wrong PR link. It should be #4430

Co-authored-by: Nghia Truong <7416935+ttnghia@users.noreply.github.com>
@ttnghia
Collaborator

ttnghia commented Apr 3, 2026

build

@thirtiseven thirtiseven merged commit 59fb954 into NVIDIA:main Apr 4, 2026
5 checks passed
thirtiseven added a commit to thirtiseven/spark-rapids-jni that referenced this pull request Apr 7, 2026
Resolve conflicts from part0 (PR NVIDIA#4373) merge.
Take main's reviewed conventions: detail namespace, snake_case,
exec_policy_nosync, cub::DeviceMemcpy, pinned vectors, no-brace style.
Keep dev branch's full implementation code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
thirtiseven added a commit to thirtiseven/cudf that referenced this pull request Apr 9, 2026
First PR in a series adding a GPU-accelerated protobuf decoder to cuIO.
Establishes the public API, schema validation, JNI bridge, Python/Cython
bindings, and a stub decode_protobuf() entry point.

The stub returns correctly-typed all-null columns for each schema field;
actual data extraction is added in follow-up PRs.

Includes:
- C++ public API (protobuf.hpp): decode_protobuf_options, nested_field_descriptor,
  typed proto_encoding/proto_wire_type enums, validate_decode_options()
- Shared CUDA infrastructure (device_helpers, host_helpers, kernels, types)
- Java API (ProtobufSchemaDescriptor) and JNI bridge (ProtobufJni.cpp)
- Python/Cython bindings (pylibcudf.io.protobuf)
- 7 C++ tests covering output shape, type structure, and null propagation

Migrated from NVIDIA/spark-rapids-jni#4373.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>