feat(extensions): add unsigned integer extension types (u8, u16, u32, u64) by kadinrabo · Pull Request #953 · substrait-io/substrait

kadinrabo · 2026-01-29T18:12:01Z

Description

Adds unsigned integer types (u8, u16, u32, u64) as first-class extension types with arithmetic function support and test coverage.

Self-contained unsigned_integers.yaml with type definitions (string structure encoding) and arithmetic function overloads (add, subtract, multiply, divide, modulus, sum, min, max)
functions_arithmetic.yaml is untouched
Test cases in tests/cases/arithmetic_unsigned/, following the arithmetic_decimal convention
Generic udtArg grammar rule for parsing UDT literals in test cases
Test framework updated to scan all extension YAML files for function definitions

Closes #944 and follows up community agreement from Substrait Meeting Notes on 28 Jan 2026 that type variations are not appropriate for unsigned integers due to differing semantics.


vbarua

This looks good overall to me.

The one thing I would like to see is that for the test files, let's map over not just the basic tests but also the overflow, null handling, etc. tests that are applicable to unsigned integers. I think the floating exception tests are the only ones that don't apply.

vbarua · 2026-02-27T21:13:52Z

+  -
+    name: "divide"
+    description: >
+      Divide x by y. Partial values are truncated (rounded towards 0).


We should include the divide description from substrait/extensions/functions_arithmetic.yaml (lines 177 to 186 in 2705258) here:

    -
      name: "divide"
      description: >
        Divide x by y. In the case of integer division, partial values are truncated (i.e. rounded towards 0).
        The `on_division_by_zero` option governs behavior in cases where y is 0. If the option is IEEE then
        the IEEE754 standard is followed: all values except +/-infinity return NaN and +/-infinity are unchanged.
        If the option is LIMIT then the result is +/-infinity in all cases.
        If either x or y are NaN then behavior will be governed by `on_domain_error`.
        If x and y are both +/-infinity, behavior will be governed by `on_domain_error`.
      impls:

vbarua

Left one comment on the definition of divide, and I think we should pull modulus out of this entirely because the existing definition could use a once-over, but otherwise looks good to me.

The core set of functions you've added, along with the test files, sets a good example for how to add both future functions for unsigned integers and new types.

vbarua · 2026-03-05T00:12:27Z

+          division_type:
+            values: [ TRUNCATE, FLOOR ]
+          overflow:
+            values: [ SILENT, SATURATE, ERROR ]


Can... modulus even overflow?

vbarua · 2026-03-05T00:13:10Z

+            value: u!u8
+        options:
+          division_type:
+            values: [ TRUNCATE, FLOOR ]


Is division_type an option we need here? The signed modulus operator defines 2 tests for it in substrait/tests/cases/arithmetic/modulus.test (lines 18 to 20 in e4ce3f8):

    # division_type: Examples demonstrating truncate and floor division types
    modulus(8::i8, -3::i8) [division_type:TRUNCATE] = 2::i8
    modulus(8::i8, -3::i8) [division_type:FLOOR] = -1::i8

I'm not sure this is applicable for unsigned ints.
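A minimal Python sketch of why `division_type` looks redundant for unsigned operands, assuming the usual definitions of truncated and floored remainder. The helper names `mod_truncate` and `mod_floor` are illustrative only and are not part of Substrait. The two definitions disagree only when the operands have opposite signs, which cannot happen for unsigned values.

```python
def mod_truncate(x: int, y: int) -> int:
    """Remainder for a quotient rounded towards zero (result takes the sign of x)."""
    q = abs(x) // abs(y)
    if (x < 0) != (y < 0):
        q = -q
    return x - y * q


def mod_floor(x: int, y: int) -> int:
    """Remainder for a quotient rounded towards negative infinity (result takes the sign of y)."""
    return x - y * (x // y)


# Signed operands: the two definitions can differ, as in modulus.test.
signed_examples = (mod_truncate(8, -3), mod_floor(8, -3))

# Unsigned operands: exhaustively compare every u8 pair with a non-zero divisor.
u8_disagreements = [
    (x, y)
    for x in range(256)
    for y in range(1, 256)
    if mod_truncate(x, y) != mod_floor(x, y)
]
```

Here `signed_examples` reproduces the two signed test cases, while `u8_disagreements` stays empty.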

vbarua · 2026-03-05T00:17:04Z

I'm okay with not doing this for now.

vbarua · 2026-03-05T00:19:12Z

+            value: u!u8
+        options:
+          overflow:
+            values: [ SILENT, SATURATE, ERROR ]


With signed integers, the overflow cases involve a sign change.

    # overflow: Examples demonstrating overflow behavior
    divide(-9223372036854775808::i64, -1::i64) [overflow:ERROR] = <!ERROR>
    divide(-128::i8, -1::i8) [overflow:SATURATE] = 127::i8

With unsigned integers, this case isn't applicable, and in general overflow shouldn't be possible so we can drop this option fully for all of the impls.
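The claim can be checked exhaustively for the 8-bit case. The sketch below is not from the PR; it assumes truncating integer division and the standard i8 and u8 ranges, and `trunc_div` is a hypothetical helper. It covers divide only, since other operations such as add can still overflow for unsigned types.

```python
I8_MIN, I8_MAX = -128, 127
U8_MAX = 255


def trunc_div(x: int, y: int) -> int:
    """Integer division rounded towards zero."""
    q = abs(x) // abs(y)
    return -q if (x < 0) != (y < 0) else q


# Signed i8: collect every (x, y) whose quotient falls outside the i8 range.
i8_overflows = [
    (x, y)
    for x in range(I8_MIN, I8_MAX + 1)
    for y in range(I8_MIN, I8_MAX + 1)
    if y != 0 and not I8_MIN <= trunc_div(x, y) <= I8_MAX
]

# Unsigned u8: the quotient always lies between 0 and x, so it always fits.
u8_overflows = [
    (x, y)
    for x in range(U8_MAX + 1)
    for y in range(1, U8_MAX + 1)
    if not 0 <= trunc_div(x, y) <= U8_MAX
]
```

The only signed overflow is the sign-change case quoted above, `(-128, -1)`, and the unsigned list is empty.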

vbarua

Changes look good to me, thanks for working on this Kadin.

Looking at this has definitely made me notice some stuff in the existing arithmetic extension that could use improvements as well 🧹

yongchul · 2026-03-18T17:05:14Z

+        # Type is "u!" + identifier, e.g., "u!u8"
+        type_str = "u!" + ctx.Identifier().getText().lower()


Does this handle YAML dependency references, or is that out of scope for the tests?

By yaml dependency reference, are you talking about this feature for extensions?

substrait/text/simple_extensions_schema.yaml (lines 10 to 19 in 00bc3c2):

    dependencies:
      # For reusing type classes and type variations from other extension files.
      # The keys are namespace identifiers that you can then use as dot-separated
      # prefix for type class and type variation names in functions and the base
      # type class for variations. The values must be extension URNs, following
      # the same format and conventions as those used in the proto plans.
      type: object
      patternProperties:
        "^[a-zA-Z_\\$][a-zA-Z0-9_\\$]*$":
          type: string

I don't think the test cases support dependency references, yet

substrait/tests/README.md (lines 92 to 101 in 00bc3c2):

### Spec

```
doc := <version>
<include>
(<dependency>)*
((<test_group>)?(<test_case>)+\n)+
version := ### SUBSTRAIT_SCALAR_TEST: <test_library_version>
include := ### SUBSTRAIT_INCLUDE: <uri>
dependency := ### SUBSTRAIT_DEPENDENCY: <uri>
```

yongchul

+0 using string for representation (because we go at length in defining representation in core types (e.g., decimal)) but overall looks good to me.

vbarua · 2026-03-25T01:02:20Z

@yongchul what does

+0 using string for representation ...

from you mean here? Is that an approval with no vote?

jacques-n · 2026-03-30T21:20:19Z

+0 using string for representation (because we go at length in defining representation in core types (e.g., decimal)) but overall looks good to me.

I'm more like a -0.1 for the string representation. why not use the corresponding same sized signed types OR fixed binary types (we have those, right?)? (asking for a friend)

vbarua · 2026-03-31T04:42:21Z

I'm more like a -0.1 for the string representation. why not use the corresponding same sized signed types OR fixed binary types (we have those, right?)? (asking for a friend)

There's a bit of tension here between a good representation for the test cases, and a good representation for actual systems.

For tests, I find them hard to read when I have to convert the i8 into u8 values in my head. For example, this is what a test for overflow would look like:

add(-1::u!u8, 1::u!u8)

What I want is to be able to use the string format in the tests, which makes them readable

add('255'::u!u8, '1'::u!u8)

and use the signed type encoding for actual systems. Right now we're piggy-backing off of the struct representation defined for the UDT to define literals of UDTs in the tests. I'm mulling some changes around this, either to allow for multiple struct encodings OR to define test encodings separately.
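To make the readability problem concrete, here is a small Python sketch of the reinterpretation between a u8 value and the i8 that shares its bit pattern. It assumes a plain two's-complement mapping; the function names are hypothetical and not part of the test framework.

```python
def u8_to_wire_i8(value: int) -> int:
    """Reinterpret a u8 value (0..255) as the i8 with the same bit pattern."""
    if not 0 <= value <= 255:
        raise ValueError(f"{value} is out of range for u8")
    raw = value.to_bytes(1, "little", signed=False)
    return int.from_bytes(raw, "little", signed=True)


def wire_i8_to_u8(value: int) -> int:
    """Reinterpret an i8 value (-128..127) as the u8 with the same bit pattern."""
    if not -128 <= value <= 127:
        raise ValueError(f"{value} is out of range for i8")
    raw = value.to_bytes(1, "little", signed=True)
    return int.from_bytes(raw, "little", signed=False)


# The readable value 255 is carried as -1 in the signed encoding.
wire_for_255 = u8_to_wire_i8(255)

# Every u8 value survives the round trip through its i8 carrier.
round_trip_ok = all(wire_i8_to_u8(u8_to_wire_i8(v)) == v for v in range(256))
```

Values up to 127 look the same in both forms; only the upper half of the u8 range shows up as negative numbers.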

jacques-n · 2026-03-31T17:20:09Z

There's a bit of tension here between a good representation for the test cases...

Agree that -1 would suck for test cases. I don't have great ideas on how to resolve the mismatch. One other option I could see is to use the next integer size up for the smaller ones and decimal for the biggest one. then representation isn't ugly and always safe (e.g. a u8 always fits in a u16 so a system that doesn't understand u8 but understands u16 could work with it).

benbellick · 2026-04-03T16:10:09Z

On this topic of test case representation, we do already abuse things a bit by having decimal represented via bytes:

substrait/proto/substrait/algebra.proto (lines 1060 to 1068 in 046633f):

    message Decimal {
      // little-endian twos-complement integer representation of complete value
      // (ignoring precision) Always 16 bytes in length
      bytes value = 1;
      // The maximum number of digits allowed in the value.
      // the maximum precision is 38.
      int32 precision = 2;
      // declared scale of decimal literal
      int32 scale = 3;

However, in test cases, we just freely use a stringified decimal representation:

substrait/tests/cases/arithmetic_decimal/sum_decimal.test (line 8 in 046633f):

    sum((2.5000007152557373046875, 7.0000007152557373046875, 0, 7.0000007152557373046875)::dec<23, 22>) = 16.5000021457672119140625::dec?<38, 22>

I'm leaning towards @jacques-n's suggestion now.

How about we do:

u8 literals represented as i16
u16 literals represented as i32
u32 literals represented as i64
u64 literals represented as bytes <-- breaks pattern

This way, the first 3 types have simple representations in both plans and in test cases.

Unfortunately, u64 would be a bit of an awkward outsider here, but we could just leave a documentation blurb somewhere specifying that for UDTs with unnatural struct-based representations, we can describe custom representations in text for human readability.

jacques-n · 2026-04-03T19:12:33Z


How about we do:

u8 literals represented as i16

u16 literals represented as i32

u32 literals represented as i64

u64 literals represented as bytes <-- breaks pattern

This way, the first 3 types have simple representations in both plans and in test cases.

Unfortunately, u64 would be a bit of an awkward outsider here, but we could just leave a documentation blurb somewhere specifying that for UDTs with unnatural struct-based representations, we can describe custom representations in text for human readability.

you could use DECIMAL(20,0) for u64 I think (if I can count characters correctly).
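The digit count can be verified with a short Python sketch. The helper name is illustrative; the only assumption is that the required precision is the number of decimal digits in the largest value of each unsigned width.

```python
def decimal_precision_for_unsigned(bits: int) -> int:
    """Decimal digits needed to write the largest `bits`-wide unsigned integer."""
    return len(str(2**bits - 1))


precisions = {f"u{bits}": decimal_precision_for_unsigned(bits) for bits in (8, 16, 32, 64)}

# A 20-digit decimal can hold values well beyond the u64 maximum.
largest_u64 = 2**64 - 1
largest_dec20 = 10**20 - 1
```

This gives 3, 5, 10, and 20 digits for u8, u16, u32, and u64. Because `largest_dec20` exceeds `largest_u64`, a 20-digit decimal still admits values that are not valid u64.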

vbarua · 2026-04-03T20:51:23Z

I'm not a fan of the encode-as-the-next-size-up scheme. It feels inelegant, breaks down at u64, and it lets people potentially set literals that are too big for the type they are sending.

I've been thinking about a bunch of complicated things, but maybe we can just do something like:

urn: "extension:io.substrait:unsigned_integers"

types:
  - name: u8
    description: >
      Unsigned 8-bit integer (0 to 255).
      Values are encoded as i8 to be reinterpreted as u8.
    structure:
      value: i8
    # Decimal string to be interpreted as u8 (e.g. '0', '42', '255')
    test_encoding: str

We get the nice, boring "u8 is just i8 that needs to be reinterpreted" for the wire format, and an easy-to-read string format for tests.

test_encoding would just use a standard type string (i8, list<any1>, etc.).

If this seems reasonable, I can expand on it.

benbellick · 2026-04-03T21:08:31Z

IMO, representing u8 as an i8 that needs to be reinterpreted is more confusing. Doing this means that in JSON renderings of substrait plans, I would see a number which is not the number I am expecting. At that point I would rather the number be opaque so as not to cause confusion.

vbarua · 2026-04-06T21:19:48Z

It sounds like folks are generally okay with the concept of a string encoding for tests.

That leaves the encoding for values. The options thus far are:

1: String

u8 / u16 / u32 / u64 -> str

Easy to read in plans, but not necessarily trivial to define an interpretation for. Allows for encoding larger values than are valid for the type. For example '300' for u8.

2: Integer Mapping

u8 / u16 / u32 / u64
i8 / i16 / i32 / i64

Relatively easy to interpret, and impossible to declare values larger than permitted. Potentially difficult to read in plans because something like 254:u8 would be encoded as -2:i8.

3: Decimal

u8       / u16      / u32       / u64
dec<3,0> / dec<5,0> / dec<10,0> / dec<20,0>

Effectively byte encoded in plans so not easy to read. Also allows for encoding larger values than are valid. For example 300 is a valid dec<3,0>, but not a valid u8.

4: Next Size Up

 u8 / u16 / u32 / u64
i16 / i32 / i64 / fbin<16>

Two different encodings really, which is itself a bit weird to explain to users.

u8 / u16 / u32 just use the next size up integer. Easy to read, but allows for encoding invalid values. For example, 300 is a valid i16 but not a valid u8. The commentary under Bytes below applies to the u64 encoding.

5: Bytes

 u8 -> fbin<2>
u16 -> fbin<4>
u32 -> fbin<8>
u64 -> fbin<16>

Not the easiest to read in plans. Interpretation is mostly clear as long as we declare an endianness, which we've done before for decimal literals.
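A short Python sketch of why the byte order has to be declared. It is an illustration only: the byte width is passed in explicitly, so it takes no position on which fbin width is chosen, and the helper names are hypothetical.

```python
def encode_unsigned(value: int, width_bytes: int, byteorder: str) -> bytes:
    """Encode a non-negative integer into exactly `width_bytes` bytes."""
    return value.to_bytes(width_bytes, byteorder, signed=False)


def decode_unsigned(raw: bytes, byteorder: str) -> int:
    """Decode bytes produced by `encode_unsigned` using the given byte order."""
    return int.from_bytes(raw, byteorder, signed=False)


# The same value yields different bytes under each byte order.
little = encode_unsigned(258, 4, "little")
big = encode_unsigned(258, 4, "big")

# Reading little-endian bytes as big-endian silently produces another number.
misread = decode_unsigned(little, "big")
```

A consumer that guesses the wrong byte order gets a valid-looking but wrong integer rather than an error, which is why the encoding needs the same kind of statement the Decimal message carries.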

6: Make unsigned integers first class entities

Not discussed in thread, but we could potentially just make unsigned integers part of the core type system. They're common enough 🤷

Personal Opinion

For the value encodings used in the user-defined literal message I'm partial to either:

Signed Integers
Bytes

Decimal feels like a slightly more constrained String, but we still need to handle invalid values, and the literal representation is just as opaque as bytes. The next size up signed integers also allows invalid values, and falls back to bytes anyways.

cc: @benbellick @yongchul @jacques-n
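As a rough check of the "allows invalid values" point across the integer-like options, the Python sketch below counts how many carrier values would not be a valid u8. It assumes a signed dec<3,0> range of -999 to 999 and a two's-complement reading for the integer mapping; the dictionary keys and the `decode` helper are illustrative only. String is left out because it is unbounded.

```python
U8_RANGE = range(0, 256)

# Each entry: the integers the carrier type can hold for a u8 literal.
carriers = {
    "integer mapping (i8, reinterpreted)": range(-128, 128),
    "decimal (dec<3,0>)": range(-999, 1000),
    "next size up (i16)": range(-32768, 32768),
}


def decode(name: str, raw: int) -> int:
    """Map a carrier value to the integer it denotes as a u8 candidate."""
    if name.startswith("integer mapping"):
        return raw % 256  # same bit pattern, read as unsigned
    return raw  # decimal and next-size-up carry the value directly


invalid_counts = {
    name: sum(1 for raw in values if decode(name, raw) not in U8_RANGE)
    for name, values in carriers.items()
}
```

Only the same-width integer mapping has no invalid carrier values; the other two need a range check on the consumer side.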

jacques-n · 2026-04-07T23:37:22Z

I think that what is really coming up here is that we have no vocabulary/specification to define a literal representation of a user defined type. Anybody done any research on how other systems solve this?

kadinrabo force-pushed the feat/unsigned-extension-types branch from e4584ca to daccb86 Compare January 30, 2026 15:36

feat(extensions): add unsigned integer extension types (u8, u16, u32,…

d19c87a

… u64)

kadinrabo force-pushed the feat/unsigned-extension-types branch from daccb86 to d19c87a Compare January 30, 2026 15:38

feat(extensions): add arithmetic function impls for unsigned types

a413f3e

kadinrabo closed this Jan 30, 2026

kadinrabo reopened this Jan 30, 2026

benbellick reviewed Jan 30, 2026


Comment thread grammar/FuncTestCaseParser.g4

kadinrabo force-pushed the feat/unsigned-extension-types branch from 6d28b59 to 4af53f8 Compare January 30, 2026 20:24

kadinrabo closed this Jan 30, 2026

kadinrabo reopened this Jan 30, 2026

kadinrabo added 4 commits January 30, 2026 15:53

feat(tests): add UDT argument support in test framework

b4bd79b

chore: regenerate ANTLR parsers

719ddb5

feat(tests): add unsigned integer test cases

37a1a40

chore: update test counts and baseline

41cee40

kadinrabo force-pushed the feat/unsigned-extension-types branch from 4af53f8 to 7d6f7f9 Compare January 30, 2026 20:53

kadinrabo closed this Jan 30, 2026

kadinrabo marked this pull request as ready for review January 30, 2026 21:06

kadinrabo requested review from EpsilonPrime, cpcloud, jacques-n, vbarua, westonpace and yongchul as code owners January 30, 2026 21:06

kadinrabo reopened this Jan 30, 2026

kadinrabo marked this pull request as draft January 30, 2026 21:07

chore: add dependency on extension_types_numeric

10baf81

kadinrabo force-pushed the feat/unsigned-extension-types branch 3 times, most recently from 8650358 to 10baf81 Compare February 4, 2026 15:53

kadinrabo marked this pull request as ready for review February 4, 2026 15:59

vbarua reviewed Feb 27, 2026


kadinrabo added 2 commits March 2, 2026 18:26

improve type descriptions and divide description

215fcee

add overflow and null handling test cases

cd6b2a8

vbarua reviewed Mar 5, 2026


kadinrabo added 2 commits March 5, 2026 12:38

remove overflow option from unsigned divide

fce64b0

remove modulus function and tests

04a8882

vbarua added the "PMC Ready" label (PRs ready for review by PMCs) Mar 6, 2026

vbarua approved these changes Mar 6, 2026


merge upstream/main, regenerate parser and baseline

7c4c54a

yongchul reviewed Mar 18, 2026


yongchul approved these changes Mar 18, 2026


merge upstream/main, fix UDT nullability and grammar ordering

3ac870f

vbarua self-requested a review April 3, 2026 20:55

benbellick mentioned this pull request Apr 23, 2026

feat: add user-defined type literal syntax to test case grammar #1057

Open

