feat(extensions): add unsigned integer extension types (u8, u16, u32, u64) #953

Open
kadinrabo wants to merge 24 commits into substrait-io:main from kadinrabo:feat/unsigned-extension-types

@kadinrabo
Contributor

@kadinrabo kadinrabo commented Jan 29, 2026

Description

Adds unsigned integer types (u8, u16, u32, u64) as first-class extension types with arithmetic function support and test coverage.

  • Self-contained unsigned_integers.yaml with type definitions (string structure encoding) and arithmetic function overloads (add, subtract, multiply, divide, modulus, sum, min, max)
  • functions_arithmetic.yaml is untouched
  • Test cases in tests/cases/arithmetic_unsigned/, following the arithmetic_decimal convention
  • Generic udtArg grammar rule for parsing UDT literals in test cases
  • Test framework updated to scan all extension YAML files for function definitions

Closes #944 and follows up on the community agreement recorded in the Substrait Meeting Notes of 28 Jan 2026: type variations are not appropriate for unsigned integers because their semantics differ.

@kadinrabo kadinrabo force-pushed the feat/unsigned-extension-types branch from e4584ca to daccb86 Compare January 30, 2026 15:36
@kadinrabo kadinrabo force-pushed the feat/unsigned-extension-types branch from daccb86 to d19c87a Compare January 30, 2026 15:38
@kadinrabo kadinrabo closed this Jan 30, 2026
@kadinrabo kadinrabo reopened this Jan 30, 2026
Comment thread grammar/FuncTestCaseParser.g4
@kadinrabo kadinrabo force-pushed the feat/unsigned-extension-types branch from 6d28b59 to 4af53f8 Compare January 30, 2026 20:24
@kadinrabo kadinrabo closed this Jan 30, 2026
@kadinrabo kadinrabo reopened this Jan 30, 2026
@kadinrabo kadinrabo force-pushed the feat/unsigned-extension-types branch from 4af53f8 to 7d6f7f9 Compare January 30, 2026 20:53
@kadinrabo kadinrabo closed this Jan 30, 2026
@kadinrabo kadinrabo marked this pull request as ready for review January 30, 2026 21:06
@kadinrabo kadinrabo reopened this Jan 30, 2026
@kadinrabo kadinrabo marked this pull request as draft January 30, 2026 21:07
@kadinrabo kadinrabo force-pushed the feat/unsigned-extension-types branch 3 times, most recently from 8650358 to 10baf81 Compare February 4, 2026 15:53
@kadinrabo kadinrabo marked this pull request as ready for review February 4, 2026 15:59
Member

@vbarua vbarua left a comment

This looks good overall to me.

The one thing I would like to see in the test files: let's map over not just the basic tests but also the overflow, null handling, etc. tests that are applicable to unsigned integers. I think the floating exception tests are the only ones that don't apply.

Comment thread extensions/unsigned_integers.yaml
Comment thread extensions/unsigned_integers.yaml Outdated
  -
    name: "divide"
    description: >
      Divide x by y. Partial values are truncated (rounded towards 0).
Member

We should include the divide description from

  -
    name: "divide"
    description: >
      Divide x by y. In the case of integer division, partial values are truncated (i.e. rounded towards 0).
      The `on_division_by_zero` option governs behavior in cases where y is 0. If the option is IEEE then
      the IEEE754 standard is followed: all values except +/-infinity return NaN and +/-infinity are unchanged.
      If the option is LIMIT then the result is +/-infinity in all cases.
      If either x or y are NaN then behavior will be governed by `on_domain_error`.
      If x and y are both +/-infinity, behavior will be governed by `on_domain_error`.
    impls:
here.

Comment thread extensions/unsigned_integers.yaml
Comment thread tests/cases/arithmetic_unsigned/sum.test Outdated
Comment thread tests/cases/arithmetic_unsigned/add.test
Member

@vbarua vbarua left a comment

Left one comment on the definition of divide, and I think we should pull modulus out of this entirely because the existing definition could use a once-over, but otherwise this looks good to me.

The core set of functions you've added, along with the test files, sets a good example both for adding future unsigned integer functions and for adding new types.

Comment thread extensions/unsigned_integers.yaml Outdated
      division_type:
        values: [ TRUNCATE, FLOOR ]
      overflow:
        values: [ SILENT, SATURATE, ERROR ]
Member

Can... modulus even overflow?

Comment thread extensions/unsigned_integers.yaml Outdated
        value: u!u8
    options:
      division_type:
        values: [ TRUNCATE, FLOOR ]
Member

Is division_type an option we need here? The signed modulus operator defines two tests for it:

# division_type: Examples demonstrating truncate and floor division types
modulus(8::i8, -3::i8) [division_type:TRUNCATE] = 2::i8
modulus(8::i8, -3::i8) [division_type:FLOOR] = -1::i8
I'm not sure this is applicable for unsigned ints.
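
To make the question concrete, here is a minimal Python sketch (not part of the PR) that compares the two remainder conventions. The helper names `mod_truncate` and `mod_floor` are hypothetical, and plain Python integers stand in for i8/u8 values. It suggests the conventions only disagree when an operand is negative, which cannot happen for unsigned inputs.

```python
def mod_truncate(x: int, y: int) -> int:
    """Remainder when the quotient is rounded toward zero."""
    q = abs(x) // abs(y)
    if (x < 0) != (y < 0):
        q = -q
    return x - q * y


def mod_floor(x: int, y: int) -> int:
    """Remainder when the quotient is rounded toward negative infinity."""
    return x - (x // y) * y


# Signed operands: the two conventions disagree, as in the i8 tests quoted above.
signed_example = (mod_truncate(8, -3), mod_floor(8, -3))

# Unsigned operands: exhaustively compare every u8 pair with a non-zero divisor.
u8_mismatches = [
    (x, y)
    for x in range(256)
    for y in range(1, 256)
    if mod_truncate(x, y) != mod_floor(x, y)
]

print(signed_example)
print(len(u8_mismatches))
```

The exhaustive loop covers only the u8 range; the same argument applies to wider unsigned types because both quotients coincide whenever x and y are non-negative.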

Member

I'm okay with not doing this for now.

Comment thread extensions/unsigned_integers.yaml Outdated
        value: u!u8
    options:
      overflow:
        values: [ SILENT, SATURATE, ERROR ]
Member

With signed integers, the overflow cases involve a sign change.

# overflow: Examples demonstrating overflow behavior
divide(-9223372036854775808::i64, -1::i64) [overflow:ERROR] = <!ERROR>
divide(-128::i8, -1::i8) [overflow:SATURATE] = 127::i8

With unsigned integers, this case isn't applicable, and in general overflow shouldn't be possible so we can drop this option fully for all of the impls.
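
As a quick sanity check of this claim, the following Python sketch (an illustration, not code from the PR) enumerates every 8-bit division and records the operand pairs whose quotient falls outside the type's range. The names `trunc_div`, `signed_overflows`, and `unsigned_overflows` are assumptions made for this example.

```python
I8_MIN, I8_MAX = -128, 127
U8_MAX = 255


def trunc_div(x: int, y: int) -> int:
    """Integer division that rounds toward zero."""
    q = abs(x) // abs(y)
    return -q if (x < 0) != (y < 0) else q


# Every i8 pair (non-zero divisor) whose quotient does not fit in i8.
signed_overflows = [
    (x, y)
    for x in range(I8_MIN, I8_MAX + 1)
    for y in range(I8_MIN, I8_MAX + 1)
    if y != 0 and not I8_MIN <= trunc_div(x, y) <= I8_MAX
]

# Every u8 pair (non-zero divisor) whose quotient does not fit in u8.
unsigned_overflows = [
    (x, y)
    for x in range(U8_MAX + 1)
    for y in range(1, U8_MAX + 1)
    if not 0 <= x // y <= U8_MAX
]

print(signed_overflows)
print(unsigned_overflows)
```

This only covers divide. Operations such as add or multiply can still exceed the u8 range, so the observation should not be read as applying to every function in the extension.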

Comment thread extensions/unsigned_integers.yaml Outdated
@vbarua vbarua added the PMC Ready PRs ready for review by PMCs label Mar 6, 2026
Member

@vbarua vbarua left a comment

Changes look good to me, thanks for working on this Kadin.

Looking at this has definitely made me notice some stuff in the existing arithmetic extension that could use improvements as well 🧹

Comment thread tests/coverage/visitor.py
Comment on lines +435 to +436
# Type is "u!" + identifier, e.g., "u!u8"
type_str = "u!" + ctx.Identifier().getText().lower()
Contributor

Does this handle YAML dependency references, or is that out of scope for the tests?

Member

By yaml dependency reference, are you talking about this feature for extensions?

dependencies:
  # For reusing type classes and type variations from other extension files.
  # The keys are namespace identifiers that you can then use as dot-separated
  # prefix for type class and type variation names in functions and the base
  # type class for variations. The values must be extension URNs, following
  # the same format and conventions as those used in the proto plans.
  type: object
  patternProperties:
    "^[a-zA-Z_\\$][a-zA-Z0-9_\\$]*$":
      type: string

I don't think the test cases support dependency references, yet

substrait/tests/README.md

Lines 92 to 101 in 00bc3c2

### Spec
```
doc := <version>
       <include>
       (<dependency>)*
       ((<test_group>)?(<test_case>)+\n)+
version := ### SUBSTRAIT_SCALAR_TEST: <test_library_version>
include := ### SUBSTRAIT_INCLUDE: <uri>
dependency := ### SUBSTRAIT_DEPENDENCY: <uri>
```

Contributor

@yongchul yongchul left a comment

+0 using string for representation (because we go at length in defining representation in core types (e.g., decimal)) but overall looks good to me.

@vbarua
Member

vbarua commented Mar 25, 2026

@yongchul what does

+0 using string for representation ...

from you mean here? Is that an approval with no vote?

@jacques-n
Contributor

jacques-n commented Mar 30, 2026

+0 using string for representation (because we go at length in defining representation in core types (e.g., decimal)) but overall looks good to me.

I'm more like a -0.1 for the string representation. why not use the corresponding same sized signed types OR fixed binary types (we have those, right?)? (asking for a friend)

@vbarua
Member

vbarua commented Mar 31, 2026

I'm more like a -0.1 for the string representation. why not use the corresponding same sized signed types OR fixed binary types (we have those, right?)? (asking for a friend)

There's a bit of tension here between a good representation for the test cases, and a good representation for actual systems.

For tests, I find it hard to read them when I have to convert the i8 into u8 values in my head. Like this is what a test for overflow would look like:

add(-1::u!u8, 1::u!u8)

What I want is to be able to use the string format in the tests, which makes them readable

add('255'::u!u8, '1'::u!u8)

and use the signed type encoding for actual systems. Right now we're piggy-backing off of the struct representation defined for the UDT to define literals of UDTs in the tests. I'm mulling some changes around this, either to allow for multiple struct encodings or to define test encodings separately.
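
For readers unfamiliar with the reinterpretation being discussed, here is a small Python sketch of mapping a u8 value onto the i8 with the same bit pattern and back. The function names are hypothetical and nothing here is taken from the PR; it only illustrates why a readable value like 255 shows up as -1 under a signed encoding.

```python
def u8_to_i8_bits(value: int) -> int:
    """Return the i8 that shares its 8-bit pattern with the given u8 value."""
    if not 0 <= value <= 255:
        raise ValueError(f"{value} is not a valid u8")
    raw = value.to_bytes(1, "little", signed=False)
    return int.from_bytes(raw, "little", signed=True)


def i8_bits_to_u8(value: int) -> int:
    """Return the u8 that shares its 8-bit pattern with the given i8 value."""
    if not -128 <= value <= 127:
        raise ValueError(f"{value} is not a valid i8")
    raw = value.to_bytes(1, "little", signed=True)
    return int.from_bytes(raw, "little", signed=False)


print(u8_to_i8_bits(255))
print(u8_to_i8_bits(127))
print(i8_bits_to_u8(-2))
```

Values up to 127 are unchanged by the mapping, so only the upper half of the u8 range becomes hard to read.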

@jacques-n
Contributor

There's a bit of tension here between a good representation for the test cases...

Agree that -1 would suck for test cases. I don't have great ideas on how to resolve the mismatch. One other option I could see is to use the next integer size up for the smaller ones and decimal for the biggest one. Then the representation isn't ugly and is always safe (e.g. a u8 always fits in a u16 so a system that doesn't understand u8 but understands u16 could work with it).

@benbellick
Member

benbellick commented Apr 3, 2026

On this topic of test case representation, we do already abuse things a bit by having decimal represented via bytes:

message Decimal {
// little-endian twos-complement integer representation of complete value
// (ignoring precision) Always 16 bytes in length
bytes value = 1;
// The maximum number of digits allowed in the value.
// the maximum precision is 38.
int32 precision = 2;
// declared scale of decimal literal
int32 scale = 3;

However, in test cases, we just freely use a stringified decimal representation:

sum((2.5000007152557373046875, 7.0000007152557373046875, 0, 7.0000007152557373046875)::dec<23, 22>) = 16.5000021457672119140625::dec?<38, 22>

I'm leaning towards @jacques-n's suggestion now.

How about we do:

  • u8 literals represented as i16
  • u16 literals represented as i32
  • u32 literals represented as i64
  • u64 literals represented as bytes <-- breaks pattern

This way, the first 3 types have simple representations in both plans and in test cases.

Unfortunately, u64 would be a bit of an awkward outsider here, but we could just leave a documentation blurb somewhere specifying that for UDTs with unnatural struct-based representations, we can describe custom representations in text for human readability.

@jacques-n
Contributor

How about we do:

  • u8 literals represented as i16
  • u16 literals represented as i32
  • u32 literals represented as i64
  • u64 literals represented as bytes <-- breaks pattern

you could use DECIMAL(20,0) for u64 I think (if I can count characters correctly).
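
The digit count can be checked with a few lines of Python. This is only an arithmetic illustration; the dictionary names are made up for the example. It computes how many decimal digits each unsigned maximum needs, and how many values a zero-scale decimal of that precision can hold beyond the unsigned range.

```python
UNSIGNED_BITS = (8, 16, 32, 64)

# Largest value of each unsigned type.
max_values = {f"u{bits}": 2**bits - 1 for bits in UNSIGNED_BITS}

# Decimal digits (precision with scale 0) needed to hold each maximum.
digits_needed = {name: len(str(value)) for name, value in max_values.items()}

# A decimal with precision p and scale 0 holds values up to 10**p - 1, so this
# counts the decimal values that are representable but invalid for the type.
headroom = {
    name: (10 ** digits_needed[name] - 1) - max_values[name]
    for name in max_values
}

print(digits_needed)
print(headroom["u8"])
```

A precision of 20 is therefore enough for u64, and in every case the decimal can also represent values that exceed the unsigned maximum.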

@vbarua
Member

vbarua commented Apr 3, 2026

I'm not a fan of the encode-as-the-next-size-up scheme. It feels inelegant, breaks down at u64, and it lets people potentially set literals that are too big for the type they are sending.

I've been thinking about a bunch of complicated things, but maybe we can just do something like:

urn: "extension:io.substrait:unsigned_integers"

types:
  - name: u8
    description: >
      Unsigned 8-bit integer (0 to 255).
      Values are encoded as i8 to be reinterpreted as u8.
    structure:
      value: i8
    # Decimal string to be interpreted as u8 (e.g. '0', '42', '255')
    test_encoding: str

We get the nice, boring "u8 is just an i8 that needs to be reinterpreted" for the wire format, and an easy-to-read string format for tests.

test_encoding would just use a standard type string (i8, list<any1>, etc.)

If this seems reasonable, I can expand on it.

@vbarua vbarua self-requested a review April 3, 2026 20:55
@benbellick
Member

IMO, representing u8 as an i8 that needs to be reinterpreted is more confusing. Doing this means that in JSON renderings of substrait plans, I would see a number which is not the number I am expecting. At that point I would rather the number be opaque so as not to cause confusion.

@vbarua
Member

vbarua commented Apr 6, 2026

It sounds like folks are generally okay with the concept of a string encoding for tests.

That leaves the encoding for values. The options thus far are:

1: String

u8 / u16 / u32 / u64 -> str

Easy to read in plans, but not necessarily trivial to define an interpretation for. Allows for encoding larger values than are valid for the type. For example '300' for u8.
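
One possible interpretation rule for the string option is sketched below in Python. It is purely an assumption for illustration: the thread does not define how strings would be parsed, and `parse_unsigned` and `UNSIGNED_MAX` are hypothetical names. The sketch accepts only ASCII decimal digits and rejects values above the type's maximum.

```python
UNSIGNED_MAX = {
    "u8": 2**8 - 1,
    "u16": 2**16 - 1,
    "u32": 2**32 - 1,
    "u64": 2**64 - 1,
}


def parse_unsigned(text: str, type_name: str) -> int:
    """Parse a decimal string as the named unsigned type (hypothetical rule)."""
    # Assumed rule: ASCII digits only, so no sign, whitespace, or separators.
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"{text!r} is not an unsigned decimal string")
    value = int(text)
    if value > UNSIGNED_MAX[type_name]:
        raise ValueError(f"{text!r} is out of range for {type_name}")
    return value


print(parse_unsigned("255", "u8"))
```

Under this rule '300' is a well-formed string that still has to be rejected for u8, which is the extra validation step the string option requires of consumers.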

2: Integer Mapping

u8 / u16 / u32 / u64
i8 / i16 / i32 / i64

Relatively easy to interpret, and impossible to declare values larger than permitted. Potentially difficult to read in plans because something like 254:u8 would be encoded as -2:i8.

3: Decimal

u8       / u16      / u32       / u64
dec<3,0> / dec<5,0> / dec<10,0> / dec<20,0>

Effectively byte encoded in plans so not easy to read. Also allows for encoding larger values than are valid. For example 300 is a valid dec<3,0>, but not a valid u8.

4: Next Size Up

 u8 / u16 / u32 / u64
i16 / i32 / i64 / fbin<16>

Two different encodings really, which is itself a bit weird to explain to users.

u8 / u16 / u32 just use the next size up integer. Easy to read, but allows for encoding invalid values. For example, 300 is a valid i16 but not a valid u8. Commentary below applies to the u64 encoding.

5: Bytes

 u8 -> fbin<2>
u16 -> fbin<4>
u32 -> fbin<8>
u64 -> fbin<16>

Not the easiest to read in plans. Interpretation is mostly clear as long as we declare an endianness, which we've done before for decimal literals.
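
A minimal Python sketch of the bytes option follows, assuming little-endian order by analogy with the decimal literal comment quoted earlier in the thread. The function names are hypothetical, and the byte width is left as a parameter because the thread does not settle the widths.

```python
def encode_unsigned(value: int, width_bytes: int) -> bytes:
    """Encode a non-negative integer as fixed-width little-endian bytes.

    int.to_bytes raises OverflowError when the value is negative or does not
    fit in the requested width.
    """
    return value.to_bytes(width_bytes, "little", signed=False)


def decode_unsigned(data: bytes) -> int:
    """Decode little-endian bytes back into a non-negative integer."""
    return int.from_bytes(data, "little", signed=False)


encoded = encode_unsigned(258, 2)
print(encoded.hex())
print(decode_unsigned(encoded))
```

With a declared byte order the round trip is unambiguous, but the hex form of the literal is not something a reader can interpret at a glance, which is the readability cost noted above.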

6: Make unsigned integers first class entities

Not discussed in thread, but we could potentially just make unsigned integers part of the core type system. They're common enough 🤷

Personal Opinion

For the value encodings used in the user-defined literal message I'm partial to either:

  • Signed Integers
  • Bytes

Decimal feels like a slightly more constrained String, but we still need to handle invalid values, and the literal representation is just as opaque as bytes. The next size up signed integers also allow invalid values, and fall back to bytes anyway.

cc: @benbellick @yongchul @jacques-n

@jacques-n
Contributor

I think that what is really coming up here is that we have no vocabulary/specification to define a literal representation of a user defined type. Anybody done any research on how other systems solve this?

Labels

PMC Ready PRs ready for review by PMCs

Development

Successfully merging this pull request may close these issues.

Handling of unsigned numeric types

5 participants