feat(extensions): add unsigned integer extension types (u8, u16, u32, u64)#953
feat(extensions): add unsigned integer extension types (u8, u16, u32, u64)#953kadinrabo wants to merge 24 commits intosubstrait-io:mainfrom
Conversation
e4584ca to
daccb86
Compare
daccb86 to
d19c87a
Compare
6d28b59 to
4af53f8
Compare
4af53f8 to
7d6f7f9
Compare
8650358 to
10baf81
Compare
vbarua
left a comment
There was a problem hiding this comment.
This looks good overall to me.
The one thing I would like to see is that for the test files, let's map over not just the basic tests but also the overflow, null handling, etc tests that are applicable to usigned integers. I think the floating exception tests are the only ones that don't make apply.
| - | ||
| name: "divide" | ||
| description: > | ||
| Divide x by y. Partial values are truncated (rounded towards 0). |
There was a problem hiding this comment.
We should include the divide description from
substrait/extensions/functions_arithmetic.yaml
Lines 177 to 186 in 2705258
vbarua
left a comment
There was a problem hiding this comment.
Left one comment on the definition of divide, and I think we should pull modulus out of this entirely because the existing definition could use a once over, but otherwise looks good to me.
The core set of functions you've added, along with the tests files, set a good example for how to add both future functions for unsigned integers, and also for adding new types.
| division_type: | ||
| values: [ TRUNCATE, FLOOR ] | ||
| overflow: | ||
| values: [ SILENT, SATURATE, ERROR ] |
| value: u!u8 | ||
| options: | ||
| division_type: | ||
| values: [ TRUNCATE, FLOOR ] |
There was a problem hiding this comment.
Is division_type an option we need here. The signed modulus operator defines 2 tests for them:
substrait/tests/cases/arithmetic/modulus.test
Lines 18 to 20 in e4ce3f8
There was a problem hiding this comment.
I'm okay with not doing this for now.
| value: u!u8 | ||
| options: | ||
| overflow: | ||
| values: [ SILENT, SATURATE, ERROR ] |
There was a problem hiding this comment.
With signed integers, the overflow cases involve a sign change.
# overflow: Examples demonstrating overflow behavior
divide(-9223372036854775808::i64, -1::i64) [overflow:ERROR] = <!ERROR>
divide(-128::i8, -1::i8) [overflow:SATURATE] = 127::i8
With unsigned integers, this case isn't applicable, and in general overflow shouldn't be possible so we can drop this option fully for all of the impls.
vbarua
left a comment
There was a problem hiding this comment.
Changes look good to me, thanks for working on this Kadin.
Looking at this has definitely made me notice some stuff in the existing arithmetic extension that could use improvements as well 🧹
| # Type is "u!" + identifier, e.g., "u!u8" | ||
| type_str = "u!" + ctx.Identifier().getText().lower() |
There was a problem hiding this comment.
does this handle YAML dependency reference or that is out of scope in the test?
There was a problem hiding this comment.
By yaml dependency reference, are you talking about this feature for extensions?
substrait/text/simple_extensions_schema.yaml
Lines 10 to 19 in 00bc3c2
I don't think the test cases support dependency references, yet
Lines 92 to 101 in 00bc3c2
yongchul
left a comment
There was a problem hiding this comment.
+0 using string for representation (because we go at length in defining representation in core types (e.g., decimal)) but overall looks good to me.
|
@yongchul what does
from you mean here? Is that an approval with no vote? |
I'm more like a -0.1 for the string representation. why not use the corresponding same sized signed types OR fixed binary types (we have those, right?)? (asking for a friend) |
There's a bit of tension here between a good representation for the test cases, and a good representation for actual systems. For tests, I find it hard to read them when I have to convert the i8 into u8 values in my head. Like this is what a test for overflow would look like: What I want is to be able to use the string format in the tests, which makes them readable and use the signed type encoding for actual systems. Right now we're piggy-backing off of the struct representation defined for the udt to define literals of udt's in the tests. I'm mulling some changes around this, either to allow for multiple struct encodings OR to define test encodings separately. |
Agree that -1 would suck for test cases. I don't have great ideas on how to resolve the mismatch. One other option I could see is to use the next integer size up for the smaller ones and decimal for the biggest one. then representation isn't ugly and always safe (e.g. a u8 always fits in a u16 so a system that doesn't understand u8 but understands u16 could work with it). |
|
On this topic of test case representation, we do already abuse things a bit by having substrait/proto/substrait/algebra.proto Lines 1060 to 1068 in 046633f However, in test cases, we just freely use a stringified decimal representation: I'm leaning towards @jacques-n's suggestion now. How about we do:
This way, the first 3 types have simple representations in both plans and in test cases. Unfortunately, |
you could use DECIMAL(20,0) for u64 I think (if I can count characters correctly). |
|
I'm not a fan of the encode as the next size up scheme. It feels inelegant, breaks down at u64, and it let's people potentially set literals that are too big for the type they are sending. I've been thinking about a bunch of compilcated things, but maybe we can just do something like: urn: "extension:io.substrait:unsigned_integers"
types:
- name: u8
description: >
Unsigned 8-bit integer (0 to 255).
Values are encoded as i8 to be reintepreted as u8
structure:
value: i8
# Decimal string to be interpreted as u8 (i.e '0', '42', '255')
test_encoding: str #We get the nice boring
If this seems reasonable, I can expand on it. |
|
IMO, representing |
|
It sounds like folks are generally okay with the concept of a string encoding for tests. That leaves the encoding for values. The options thus far are: 1: String
Easy to read in plans, but not necessarily trivial to define an interpretation for. Allows for encoding larger values than are valid for the type. For example '300' for u8. 2: Integer MappingRelatively easy to interpret, and impossible to declare values larger than permitted. Potentially difficult to read in plans 3: DecimalEffectively byte encoded in plans so not easy to read. Also allows for encoding larger values than are valid. For example 4: Next Size UpTwo different encodings really, which is itself a bit weird to explain to users. u8 / u16 / u32 just use the next size up integer. Easy to read, but allows for encoding invalid values. For example, 5: BytesNot the easiest to read in plans. Interpretation is mostly clear as long as we declare an endianess, which we've done before for decimal literals. 6: Make unsigned integers first class entitiesNot discussed in thread, but we could potentially just make unsigned integers part of the core type system. They're common enough 🤷 Personal OpinionFor the value encodings used in the user-defined literal message I'm partial to either:
Decimal feels like a slightly more constrained String, but we still need to handle invalid values, and the literal representation is just as opaque as bytes. The next size up signed integers also allows invalid values, and falls back to bytes anyways. |
|
I think that what is really coming up here is that we have no vocabulary/specification to define a literal representation of a user defined type. Anybody done any research on how other systems solve this? |
Description
Adds unsigned integer types (u8, u16, u32, u64) as first-class extension types with arithmetic function support and test coverage.
unsigned_integers.yamlwith type definitions (string structure encoding) and arithmetic function overloads (add, subtract, multiply, divide, modulus, sum, min, max)functions_arithmetic.yamlis untouchedtests/cases/arithmetic_unsigned/, following the arithmetic_decimal conventionCloses #944 and follows up community agreement from Substrait Meeting Notes on 28 Jan 2026 that type variations are not appropriate for unsigned integers due to differing semantics.
This change is