Add GPU protobuf decoder to cuIO [part 0]: framework, API, and stub decode #22077
Open
thirtiseven wants to merge 4 commits into rapidsai:main
First PR in a series adding a GPU-accelerated protobuf decoder to cuIO. Establishes the public API, schema validation, JNI bridge, Python/Cython bindings, and a stub decode_protobuf() entry point. The stub returns correctly-typed all-null columns for each schema field; actual data extraction is added in follow-up PRs.

Includes:
- C++ public API (protobuf.hpp): decode_protobuf_options, nested_field_descriptor, typed proto_encoding/proto_wire_type enums, validate_decode_options()
- Shared CUDA infrastructure (device_helpers, host_helpers, kernels, types)
- Java API (ProtobufSchemaDescriptor) and JNI bridge (ProtobufJni.cpp)
- Python/Cython bindings (pylibcudf.io.protobuf)
- 7 C++ tests covering output shape, type structure, and null propagation

Migrated from NVIDIA/spark-rapids-jni#4373.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
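
For readers unfamiliar with the wire format that a proto_wire_type enum would describe, here is a minimal host-side sketch in Python of how a protobuf field tag is split into a field number and a wire type. It follows the public Protocol Buffers encoding rules and is not the code in this PR; the names WireType, read_varint, and read_tag are hypothetical and chosen only for illustration.

```python
from enum import IntEnum
from typing import Tuple


class WireType(IntEnum):
    """Wire types defined by the Protocol Buffers encoding (illustrative names)."""
    VARINT = 0
    I64 = 1
    LEN = 2
    SGROUP = 3
    EGROUP = 4
    I32 = 5


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a base-128 varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # low 7 bits carry the payload
        if not (byte & 0x80):             # high bit clear means last byte
            return result, pos
        shift += 7
        if shift >= 70:                   # a 64-bit varint is at most 10 bytes
            raise ValueError("varint too long")


def read_tag(buf: bytes, pos: int) -> Tuple[int, WireType, int]:
    """Decode a field tag: key = (field_number << 3) | wire_type."""
    key, pos = read_varint(buf, pos)
    return key >> 3, WireType(key & 0x7), pos


# Field 1 holding the value 150 is encoded as the bytes 08 96 01.
message = bytes([0x08, 0x96, 0x01])
field_number, wire_type, pos = read_tag(message, 0)
value, pos = read_varint(message, pos)
```

An unknown wire type (6 or 7) makes WireType(...) raise ValueError, which is the kind of malformed input a real decoder has to handle explicitly.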
Description
Add cudf::io::protobuf::decode_protobuf(), which takes a LIST<UINT8> column of serialized protobuf messages and decodes them into a STRUCT column.

This is a part 0 (framework) PR: the stub currently returns correctly-typed all-null columns. Actual data extraction will be added in follow-up PRs.
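
To make the stub's contract concrete, the following pure-Python model sketches the behavior described above: one output field per schema entry, each carrying its declared type and one null per input row. It does not call cudf or pylibcudf; stub_decode, the (name, type) schema tuples, and the dictionary layout are all hypothetical stand-ins, and it assumes the output row count equals the input row count.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical schema representation: (field name, type name) pairs.
Schema = Sequence[Tuple[str, str]]


def stub_decode(messages: Sequence[bytes], schema: Schema) -> Dict[str, dict]:
    """Model of a stub decode: ignore the payload bytes entirely and return
    one correctly-typed, all-null child column per schema field."""
    num_rows = len(messages)
    return {
        name: {"type": type_name, "values": [None] * num_rows}
        for name, type_name in schema
    }


# Two serialized messages in, a two-row "struct" of all-null fields out.
rows: List[bytes] = [bytes([0x08, 0x96, 0x01]), b""]
result = stub_decode(rows, [("id", "INT64"), ("name", "STRING")])
```

A model like this is useful mainly as a reference for shape and type tests: downstream callers can be wired up and validated before any real field extraction exists.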
Background
We have an ongoing effort in spark-rapids-jni building a full GPU protobuf decoder for Spark's from_protobuf. The kernel code is functional and under review there (tracking PR: NVIDIA/spark-rapids-jni#4107). We'd like to explore contributing the core decode logic to cudf so it can serve both Spark-RAPIDS and other consumers (cudf-polars, Python users).

This PR mirrors the already-merged NVIDIA/spark-rapids-jni#4373 (part 0), adapted to cudf conventions.
What's included
- C++ public API (cudf/io/protobuf.hpp): decode_protobuf_options, nested_field_descriptor, typed enums, decode_protobuf()
- Java API: ColumnView.decodeProtobuf() + ProtobufSchemaDescriptor
- Python/Cython bindings: pylibcudf.io.protobuf.decode_protobuf()
Checklist