From 790fd803fbd1be967d0fa534552a9827ca566666 Mon Sep 17 00:00:00 2001
From: Ben Bellick
Date: Thu, 12 Mar 2026 12:30:23 -0400
Subject: [PATCH 1/2] feat(docs): clarify distinction between enumeration arguments and options

---
 proto/substrait/algebra.proto             | 25 +++++++++++++----------
 site/docs/expressions/scalar_functions.md | 22 ++++++++++++++++----
 site/docs/extensions/index.md             |  8 +++++---
 3 files changed, 37 insertions(+), 18 deletions(-)

diff --git a/proto/substrait/algebra.proto b/proto/substrait/algebra.proto
index 6cea2c849..90ee482e3 100644
--- a/proto/substrait/algebra.proto
+++ b/proto/substrait/algebra.proto
@@ -943,8 +943,11 @@ message FunctionArgument {
   }
 }
 
-// An optional function argument. Typically used for specifying behavior in
-// invalid or corner cases.
+// A named, optional behavioral preference for a function call. Options allow
+// producers to express preferences about how consumers handle corner cases or
+// engine-specific behavior (e.g. overflow or rounding mode). Unlike function
+// arguments, options are not required. If omitted, the consumer is free to
+// choose any supported behavior.
 message FunctionOption {
   // Name of the option to set. If the consumer does not recognize the
   // option, it must reject the plan. The name is matched case-insensitively
@@ -1237,9 +1240,9 @@ message Expression {
     // matches one of the allowed options.
     repeated FunctionArgument arguments = 4;
 
-    // Options to specify behavior for corner cases, or leave behavior
-    // unspecified if the consumer does not need specific behavior in these
-    // cases.
+    // Optional behavioral preferences for the function call. Unlike
+    // arguments, these are not required and are not part of the function
+    // signature. If omitted, the consumer chooses its own behavior.
     repeated FunctionOption options = 5;
 
     // Must be set to the return type of the function, exactly as derived
@@ -1276,9 +1279,9 @@ message Expression {
     // matches one of the allowed options.
     repeated FunctionArgument arguments = 9;
 
-    // Options to specify behavior for corner cases, or leave behavior
-    // unspecified if the consumer does not need specific behavior in these
-    // cases.
+    // Optional behavioral preferences for the function call. Unlike
+    // arguments, these are not required and are not part of the function
+    // signature. If omitted, the consumer chooses its own behavior.
     repeated FunctionOption options = 11;
 
     // Must be set to the return type of the function, exactly as derived
@@ -1773,9 +1776,9 @@ message AggregateFunction {
   // the string must case-insensitively match one of the allowed options.
   repeated FunctionArgument arguments = 7;
 
-  // Options to specify behavior for corner cases, or leave behavior
-  // unspecified if the consumer does not need specific behavior in these
-  // cases.
+  // Optional behavioral preferences for the function call. Unlike
+  // arguments, these are not required and are not part of the function
+  // signature. If omitted, the consumer chooses its own behavior.
   repeated FunctionOption options = 8;
 
   // Must be set to the return type of the function, exactly as derived
diff --git a/site/docs/expressions/scalar_functions.md b/site/docs/expressions/scalar_functions.md
index ade0607b3..987a6522d 100644
--- a/site/docs/expressions/scalar_functions.md
+++ b/site/docs/expressions/scalar_functions.md
@@ -22,7 +22,7 @@ There are three main types of arguments: value arguments, type arguments, and en
 
 * Value arguments: arguments that refer to a data value. These could be constants (literal expressions defined in the plan) or variables (a reference expression that references data being processed by the plan). This is the most common type of argument. The value of a value argument is not available in output derivation, but its type is. Value arguments can be declared in one of two ways: concrete or parameterized. Concrete types are either simple types or compound types with all parameters fully defined (without referencing any type arguments). Examples include `i32`, `fp32`, `VARCHAR<20>`, `List`, etc. Parameterized types are discussed further below.
 * Type arguments: arguments that are used only to inform the evaluation and/or type derivation of the function. For example, you might have a function which is `truncate( DECIMAL, DECIMAL, i32)`. This function declares two value arguments and a type argument. The difference between them is that the type argument has no value at runtime, while the value arguments do.
-* Enumeration: arguments that support a fixed set of declared values as constant arguments. These arguments must be specified as part of an expression. While these could also have been implemented as constant string value arguments, they are formally included to improve validation/contextual help/etc. for frontend processors and IDEs. An example might be `extract([DAY|YEAR|MONTH], )`. In this example, a producer must specify a type of date part to extract. Note, the value of a required enumeration cannot be used in type derivation.
+* Enumeration: arguments that require the caller to specify exactly one value from a fixed set of declared string values. They represent choices that are integral to the function's core semantics. An example is `extract([DAY|YEAR|MONTH], )`, where the caller must specify which date part to extract. Note, the value of an enumeration argument cannot be used in type derivation.
 
 #### Value Argument Properties
 
@@ -41,7 +41,7 @@ There are three main types of arguments: value arguments, type arguments, and en
 | Name | A human-readable name for this argument to help clarify use. | Optional, defaults to a name based on position (e.g. `arg0`) |
 | Description | Additional description of this argument. | Optional |
 
-#### Required Enumeration Properties
+#### Enumeration Argument Properties
 
 | Property | Description | Required |
 | ----------- | ------------------------------------------------------------- | ------------------------------------------------------------ |
@@ -51,19 +51,33 @@ There are three main types of arguments: value arguments, type arguments, and en
 
 ## Options
 
-In addition to arguments each call may specify zero or more options. These are similar to a required enumeration but more focused on supporting alternative behaviors. Options can be left unspecified and the consumer is free to choose which implementation to use. An example use case might be `OVERFLOW_BEHAVIOR:[OVERFLOW, SATURATE, ERROR]` If unspecified, an engine is free to use any of the three choices or even some alternative behavior (e.g. setting the value to null on overflow). If specified, the engine would be expected to behave as specified or fail. Note, the value of an optional enumeration cannot be used in type derivation.
+In addition to arguments, each function call may specify zero or more options. Options allow a producer to express preferences about how a consumer handles corner cases or engine-specific behavior. Unlike enumeration arguments, options are not required. If a producer omits an option, the consumer is free to choose any supported behavior. Options are named (not positional) and are not part of the function signature. See [Enumeration Arguments vs Options](#enumeration-arguments-vs-options) for a detailed comparison.
+
+An example is `OVERFLOW_BEHAVIOR:[OVERFLOW, SATURATE, ERROR]`. If unspecified, the engine is free to use any of the three choices or even some alternative behavior (e.g. setting the value to null on overflow). If specified, the engine would be expected to behave as specified or reject the plan. Note, the value of an option cannot be used in type derivation.
 
 ### Option Preference
 
 A producer may specify multiple values for an option. If the producer does so then the consumer must deliver the first behavior in the list of values that the consumer is capable of delivering. For example, considering overflow as defined above, if a producer specified `[ERROR, SATURATE]` then the consumer must deliver `ERROR` if it is capable of doing so. If it is not then it may deliver `SATURATE`. If the consumer cannot deliver either behavior then it is an error and the consumer must reject the plan.
 
-#### Optional Properties
+#### Option Properties
 
 | Property | Description | Required |
 | -------- | ---------------------------------------- | -------- |
 | Values | A list of valid strings for this option. | Required |
 | Name | A human-readable name for this option. | Required |
 
+### Enumeration Arguments vs Options
+
+Both enumeration arguments and options accept values from a fixed set of strings, but they serve different purposes. As a general rule, prefer enumeration arguments for anything the caller must decide for the function to be meaningful. Options are best suited for engine-level behavioral properties (like overflow or rounding mode) where a reasonable default exists and the caller may not care which behavior the engine picks.
+
+| | Enumeration Argument | Option |
+| --- | --- | --- |
+| Required? | Yes, must be specified by the producer | No, may be omitted |
+| Semantics | Core to the function's operation (e.g., which date component to extract) | Behavioral preference for corner cases or engine-specific behavior (e.g., overflow handling) |
+| Position | Positional, part of the argument list | Named, separate from arguments |
+| In function signature? | Yes (as `req`) | No |
+| If omitted | Invalid plan | Consumer chooses its own behavior |
+
 ## Nullability Handling
 
diff --git a/site/docs/extensions/index.md b/site/docs/extensions/index.md
index 8129d0524..88462df4c 100644
--- a/site/docs/extensions/index.md
+++ b/site/docs/extensions/index.md
@@ -88,7 +88,7 @@ A function signature uniquely identifies a function implementation within a sing
 
 | Argument Type | Signature Name |
 |---------------------------------|----------------|
-| Required Enumeration | req |
+| Enumeration | req |
 | i8 | i8 |
 | i16 | i16 |
 | i32 | i32 |
@@ -122,11 +122,13 @@ A function signature uniquely identifies a function implementation within a sing
 
 #### Examples
 
+Function-level [options](../expressions/scalar_functions.md#options) are not part of the function signature and do not appear here. Only enumeration arguments (which are positional and required) contribute to the signature as `req`.
+
 | Function Signature | Function Name |
 | ------------------------------------------------- | ------------------- |
-| `add(optional enumeration, i8, i8) => i8` | `add:i8_i8` |
+| `add(i8, i8) => i8` | `add:i8_i8` |
 | `avg(fp32) => fp32` | `avg:fp32` |
-| `extract(required enumeration, timestamp) => i64` | `extract:req_ts` |
+| `extract(enumeration, timestamp) => i64` | `extract:req_ts` |
 | `sum(any1) => any1` | `sum:any` |
 | `concat(str...) => str` | `concat:str` |
 | `transform(list, func any2>) => list` | `transform:list_func` |

From 583bc2c32253289b30a06a975095814cfb176b83 Mon Sep 17 00:00:00 2001
From: Ben Bellick
Date: Tue, 17 Mar 2026 10:02:39 -0400
Subject: [PATCH 2/2] docs: add warning about YAML keyword overlap between enums and options

---
 site/docs/expressions/scalar_functions.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/site/docs/expressions/scalar_functions.md b/site/docs/expressions/scalar_functions.md
index 987a6522d..257cb9624 100644
--- a/site/docs/expressions/scalar_functions.md
+++ b/site/docs/expressions/scalar_functions.md
@@ -78,6 +78,10 @@ Both enumeration arguments and options accept values from a fixed set of strings
 | In function signature? | Yes (as `req`) | No |
 | If omitted | Invalid plan | Consumer chooses its own behavior |
 
+!!! warning "YAML keyword overlap"
+
+    In the YAML extension format, both enumeration arguments and function-level options use the keyword `options`. An enumeration argument appears inside `args` as `options: [VAL1, VAL2, ...]`, while a function-level option appears under a top-level `options` key with named sub-keys and a `values` list. Take care not to confuse the two when reading or writing extension YAML files.
+
 ## Nullability Handling
 
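
Reviewer note: the "Option Preference" rule that this series leaves in place (deliver the first listed behavior the consumer supports, otherwise reject the plan) can be sketched in a few lines. The snippet below is only an illustration under stated assumptions: `resolve_option`, `PlanRejected`, `supported`, and `consumer_default` are hypothetical names invented here and are not part of Substrait or any Substrait library, and value matching is done by exact string comparison because the patch does not say how option values are compared.

```python
from typing import Optional, Sequence


class PlanRejected(Exception):
    """Hypothetical error: the consumer cannot honor a requested option."""


def resolve_option(
    preferences: Optional[Sequence[str]],
    supported: Sequence[str],
    consumer_default: str,
) -> str:
    """Pick the behavior a consumer delivers for a single option.

    preferences: ordered values from the producer, or None if the option
        was omitted from the function call.
    supported: behaviors this (hypothetical) consumer can deliver.
    consumer_default: what the consumer does when the option is omitted.
    """
    # Option omitted: the consumer is free to choose its own behavior.
    if preferences is None:
        return consumer_default

    # Walk the producer's list in order; the first deliverable value wins.
    for value in preferences:
        if value in supported:
            return value

    # None of the requested behaviors can be delivered: reject the plan.
    raise PlanRejected(f"cannot deliver any of {list(preferences)}")


if __name__ == "__main__":
    # Producer prefers ERROR, then SATURATE (the example from the docs).
    print(resolve_option(["ERROR", "SATURATE"], ["SATURATE", "ERROR"], "SATURATE"))
    print(resolve_option(["ERROR", "SATURATE"], ["SATURATE"], "SATURATE"))
    print(resolve_option(None, ["SATURATE"], "SATURATE"))
```

An enumeration argument has no equivalent of the `None` branch: per the comparison table in this patch, omitting it makes the plan invalid, so there is no consumer-chosen fallback to model.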