Add GPU protobuf decoder to cuIO [part 0]: framework, API, and stub decode #22077

Open
thirtiseven wants to merge 4 commits into rapidsai:main from thirtiseven:fea/protobuf-decode-part0


@thirtiseven thirtiseven commented Apr 9, 2026

Description

Add cudf::io::protobuf::decode_protobuf() that takes a LIST<UINT8> column of serialized protobuf messages and decodes them into a STRUCT column.

This is the part 0 (framework) PR: the stub currently returns correctly typed all-null columns. Actual data extraction will be added in follow-up PRs.
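To make the input shape concrete, here is a minimal host-side sketch of how a LIST<UINT8> column can hold one serialized message per row. It assumes the usual Arrow-style list layout (an offsets array plus one concatenated byte buffer). The names `binary_list_column`, `message_span`, and `row` are hypothetical and are not part of this PR or of the cudf API, and the per-row `valid` vector stands in for a real validity bitmask.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side model of a LIST<UINT8> column (not a cudf type).
// Row i owns the bytes in [offsets[i], offsets[i + 1]) of the child buffer.
struct binary_list_column {
  std::vector<std::int32_t> offsets;  // num_rows + 1 entries
  std::vector<std::uint8_t> bytes;    // all serialized messages, concatenated
  std::vector<bool> valid;            // one flag per row; false marks a null row

  std::size_t num_rows() const { return valid.size(); }
};

// A non-owning view of one row's serialized message.
struct message_span {
  std::uint8_t const* data;
  std::size_t size;
  bool is_null;
};

// Return the bytes belonging to row i, or a null span if the row is null.
message_span row(binary_list_column const& col, std::size_t i)
{
  assert(i < col.num_rows());
  if (!col.valid[i]) { return message_span{nullptr, 0, true}; }
  auto const begin = static_cast<std::size_t>(col.offsets[i]);
  auto const end   = static_cast<std::size_t>(col.offsets[i + 1]);
  return message_span{col.bytes.data() + begin, end - begin, false};
}
```

In this model a decoder would handle each row independently, and a null input row would map to a null output struct row, which matches the null-propagation behavior described for the stub.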

Background

We have an ongoing effort in spark-rapids-jni building a full GPU protobuf decoder for Spark's from_protobuf. The kernel code is functional and under review there (tracking PR: NVIDIA/spark-rapids-jni#4107). We'd like to explore contributing the core decode logic to cudf so it can serve both Spark-RAPIDS and other consumers (cudf-polars, Python users).

This PR mirrors the already-merged NVIDIA/spark-rapids-jni#4373 (part 0), adapted to cudf conventions.

What's included

  • C++ public API (cudf/io/protobuf.hpp): decode_protobuf_options, nested_field_descriptor, typed enums, decode_protobuf()
  • Schema validation: field number range, parent-child topology, depth limits, wire type / encoding / output type compatibility
  • Shared CUDA infrastructure: varint codec, field scanning helpers, tag decoding, location providers (headers included in full to avoid merge conflicts with follow-up PRs)
  • Stub decode: handles empty-schema, zero-row, and null-input edge cases with correct nested type construction; returns all-null children for non-empty inputs
  • JNI bindings: ColumnView.decodeProtobuf() + ProtobufSchemaDescriptor
  • Python/Cython bindings: pylibcudf.io.protobuf.decode_protobuf()
  • 7 C++ tests: output shape, type structure, nested/repeated schema, null propagation
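For readers unfamiliar with the wire format, the varint codec, tag decoding, and field-number range check listed above can be sketched on the host as follows. This is an illustration of the standard protobuf encoding only, not the CUDA code in this PR. All names (`read_varint`, `read_tag`, `varint_result`, `tag_result`, `max_field_number`) are hypothetical, and the actual headers may structure this quite differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// All names below are hypothetical and for illustration only.

// Wire types defined by the protobuf encoding (3 and 4 are the deprecated group markers).
enum class wire_type : std::uint8_t { VARINT = 0, I64 = 1, LEN = 2, SGROUP = 3, EGROUP = 4, I32 = 5 };

// Largest field number the protobuf encoding allows: 2^29 - 1.
constexpr std::uint32_t max_field_number = (1u << 29) - 1;

struct varint_result {
  bool ok;                 // false if the input is truncated or longer than 10 bytes
  std::uint64_t value;     // decoded value
  std::size_t bytes_read;  // number of bytes consumed
};

// Decode a base-128 varint: 7 payload bits per byte, least significant group
// first, with the high bit of each byte set when more bytes follow.
varint_result read_varint(std::vector<std::uint8_t> const& data, std::size_t pos)
{
  assert(pos <= data.size());
  std::uint64_t value = 0;
  for (std::size_t i = 0; i < 10 && pos + i < data.size(); ++i) {
    std::uint8_t const byte = data[pos + i];
    value |= static_cast<std::uint64_t>(byte & 0x7F) << (7 * i);
    if ((byte & 0x80) == 0) { return varint_result{true, value, i + 1}; }
  }
  return varint_result{false, 0, 0};
}

struct tag_result {
  bool ok;
  std::uint32_t field_number;
  wire_type type;
  std::size_t bytes_read;
};

// Decode a field tag: a varint whose low 3 bits are the wire type and whose
// remaining bits are the field number. Out-of-range values are rejected.
tag_result read_tag(std::vector<std::uint8_t> const& data, std::size_t pos)
{
  tag_result const invalid{false, 0, wire_type::VARINT, 0};
  varint_result const key = read_varint(data, pos);
  if (!key.ok) { return invalid; }
  std::uint64_t const field_number = key.value >> 3;
  auto const type_bits = static_cast<std::uint8_t>(key.value & 0x7);
  if (field_number < 1 || field_number > max_field_number) { return invalid; }
  if (type_bits > 5) { return invalid; }
  return tag_result{true,
                    static_cast<std::uint32_t>(field_number),
                    static_cast<wire_type>(type_bits),
                    key.bytes_read};
}
```

For example, the three bytes 0x08 0x96 0x01 decode to a tag for field 1 with wire type VARINT, followed by the varint value 150. Returning a status flag instead of throwing keeps the same shape usable in device code, where exceptions are unavailable.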

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

thirtiseven and others added 2 commits April 9, 2026 14:17
First PR in a series adding a GPU-accelerated protobuf decoder to cuIO.
Establishes the public API, schema validation, JNI bridge, Python/Cython
bindings, and a stub decode_protobuf() entry point.

The stub returns correctly-typed all-null columns for each schema field;
actual data extraction is added in follow-up PRs.

Includes:
- C++ public API (protobuf.hpp): decode_protobuf_options, nested_field_descriptor,
  typed proto_encoding/proto_wire_type enums, validate_decode_options()
- Shared CUDA infrastructure (device_helpers, host_helpers, kernels, types)
- Java API (ProtobufSchemaDescriptor) and JNI bridge (ProtobufJni.cpp)
- Python/Cython bindings (pylibcudf.io.protobuf)
- 7 C++ tests covering output shape, type structure, and null propagation

Migrated from NVIDIA/spark-rapids-jni#4373.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven thirtiseven requested review from a team as code owners April 9, 2026 06:49
@thirtiseven thirtiseven requested review from Matt711 and wence- April 9, 2026 06:49
copy-pr-bot bot commented Apr 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@thirtiseven thirtiseven requested a review from vyasr April 9, 2026 06:49
@github-actions github-actions bot added labels Apr 9, 2026: libcudf (affects libcudf C++/CUDA code), Python (affects Python cuDF API), CMake (CMake build issue), Java (affects Java cuDF API), pylibcudf (issues specific to the pylibcudf package)
@GPUtester GPUtester moved this to In Progress in cuDF Python Apr 9, 2026


2 participants