25 changes: 14 additions & 11 deletions proto/substrait/algebra.proto
@@ -943,8 +943,11 @@ message FunctionArgument {
}
}

// An optional function argument. Typically used for specifying behavior in
// invalid or corner cases.
// A named, optional behavioral preference for a function call. Options allow
// producers to express preferences about how consumers handle corner cases or
// engine-specific behavior (e.g. overflow or rounding mode). Unlike function
// arguments, options are not required. If omitted, the consumer is free to
// choose any supported behavior.
message FunctionOption {
// Name of the option to set. If the consumer does not recognize the
// option, it must reject the plan. The name is matched case-insensitively
@@ -1237,9 +1240,9 @@ message Expression {
// matches one of the allowed options.
repeated FunctionArgument arguments = 4;

// Options to specify behavior for corner cases, or leave behavior
// unspecified if the consumer does not need specific behavior in these
// cases.
// Optional behavioral preferences for the function call. Unlike
// arguments, these are not required and are not part of the function
// signature. If omitted, the consumer chooses its own behavior.
repeated FunctionOption options = 5;

// Must be set to the return type of the function, exactly as derived
@@ -1276,9 +1279,9 @@ message Expression {
// matches one of the allowed options.
repeated FunctionArgument arguments = 9;

// Options to specify behavior for corner cases, or leave behavior
// unspecified if the consumer does not need specific behavior in these
// cases.
// Optional behavioral preferences for the function call. Unlike
// arguments, these are not required and are not part of the function
// signature. If omitted, the consumer chooses its own behavior.
repeated FunctionOption options = 11;

// Must be set to the return type of the function, exactly as derived
@@ -1773,9 +1776,9 @@ message AggregateFunction {
// the string must case-insensitively match one of the allowed options.
repeated FunctionArgument arguments = 7;

// Options to specify behavior for corner cases, or leave behavior
// unspecified if the consumer does not need specific behavior in these
// cases.
// Optional behavioral preferences for the function call. Unlike
// arguments, these are not required and are not part of the function
// signature. If omitted, the consumer chooses its own behavior.
repeated FunctionOption options = 8;

// Must be set to the return type of the function, exactly as derived
26 changes: 22 additions & 4 deletions site/docs/expressions/scalar_functions.md
@@ -22,7 +22,7 @@ There are three main types of arguments: value arguments, type arguments, and en

* Value arguments: arguments that refer to a data value. These could be constants (literal expressions defined in the plan) or variables (a reference expression that references data being processed by the plan). This is the most common type of argument. The value of a value argument is not available in output derivation, but its type is. Value arguments can be declared in one of two ways: concrete or parameterized. Concrete types are either simple types or compound types with all parameters fully defined (without referencing any type arguments). Examples include `i32`, `fp32`, `VARCHAR<20>`, `List<fp32>`, etc. Parameterized types are discussed further below.
* Type arguments: arguments that are used only to inform the evaluation and/or type derivation of the function. For example, you might have a function which is `truncate(<type> DECIMAL<P0,S0>, <value> DECIMAL<P1, S1>, <value> i32)`. This function declares two value arguments and a type argument. The difference between them is that the type argument has no value at runtime, while the value arguments do.
* Enumeration: arguments that support a fixed set of declared values as constant arguments. These arguments must be specified as part of an expression. While these could also have been implemented as constant string value arguments, they are formally included to improve validation/contextual help/etc. for frontend processors and IDEs. An example might be `extract([DAY|YEAR|MONTH], <date value>)`. In this example, a producer must specify a type of date part to extract. Note, the value of a required enumeration cannot be used in type derivation.
* Enumeration: arguments that require the caller to specify exactly one value from a fixed set of declared string values. They represent choices that are integral to the function's core semantics. An example is `extract([DAY|YEAR|MONTH], <date value>)`, where the caller must specify which date part to extract. Note, the value of an enumeration argument cannot be used in type derivation.

Member Author

I thought it made sense to drop the line about "while these could also have been implemented as constant string..." since that seemed like more of an implementation detail. People are generally familiar with the concept of an enum IMO.

#### Value Argument Properties

@@ -41,7 +41,7 @@ There are three main types of arguments: value arguments, type arguments, and en
| Name | A human-readable name for this argument to help clarify use. | Optional, defaults to a name based on position (e.g. `arg0`) |
| Description | Additional description of this argument. | Optional |

#### Required Enumeration Properties
Member Author

Consistent with the above now.

#### Enumeration Argument Properties

| Property | Description | Required |
| ----------- | ------------------------------------------------------------- | ------------------------------------------------------------ |
@@ -51,19 +51,37 @@ There are three main types of arguments: value arguments, type arguments, and en

## Options

In addition to arguments each call may specify zero or more options. These are similar to a required enumeration but more focused on supporting alternative behaviors. Options can be left unspecified and the consumer is free to choose which implementation to use. An example use case might be `OVERFLOW_BEHAVIOR:[OVERFLOW, SATURATE, ERROR]` If unspecified, an engine is free to use any of the three choices or even some alternative behavior (e.g. setting the value to null on overflow). If specified, the engine would be expected to behave as specified or fail. Note, the value of an optional enumeration cannot be used in type derivation.
In addition to arguments, each function call may specify zero or more options. Options allow a producer to express preferences about how a consumer handles corner cases or engine-specific behavior. Unlike enumeration arguments, options are not required. If a producer omits an option, the consumer is free to choose any supported behavior. Options are named (not positional) and are not part of the function signature. See [Enumeration Arguments vs Options](#enumeration-arguments-vs-options) for a detailed comparison.

An example is `OVERFLOW_BEHAVIOR:[OVERFLOW, SATURATE, ERROR]`. If unspecified, the engine is free to use any of the three choices or even some alternative behavior (e.g. setting the value to null on overflow). If specified, the engine would be expected to behave as specified or reject the plan. Note, the value of an option cannot be used in type derivation.

### Option Preference

A producer may specify multiple values for an option. If the producer does so then the consumer must deliver the first behavior in the list of values that the consumer is capable of delivering. For example, considering overflow as defined above, if a producer specified `[ERROR, SATURATE]` then the consumer must deliver `ERROR` if it is capable of doing so. If it is not then it may deliver `SATURATE`. If the consumer cannot deliver either behavior then it is an error and the consumer must reject the plan.
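
The preference rule above can be sketched in a few lines of Python. This is only an illustration, not part of the specification: `resolve_option`, `UnsupportedOptionError`, and the `supported` set are hypothetical names, and the sketch assumes an empty preference list stands for an omitted option.

```python
from typing import Optional, Sequence, Set


class UnsupportedOptionError(Exception):
    """Hypothetical error: the consumer must reject the plan."""


def resolve_option(preference: Sequence[str], supported: Set[str]) -> Optional[str]:
    """Return the first preferred behavior that the consumer can deliver.

    An empty preference list is treated as an omitted option, so the
    consumer is free to choose; that case is signalled by returning None.
    """
    if not preference:
        return None
    for behavior in preference:
        # Preferences are ordered: the first deliverable one wins.
        if behavior in supported:
            return behavior
    # None of the requested behaviors can be delivered.
    raise UnsupportedOptionError(
        f"cannot deliver any of {list(preference)}; plan must be rejected"
    )
```

With this sketch, `resolve_option(["ERROR", "SATURATE"], {"SATURATE"})` returns `"SATURATE"`, while a consumer that supports neither behavior raises `UnsupportedOptionError`.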

#### Optional Properties
#### Option Properties

| Property | Description | Required |
| -------- | ---------------------------------------- | -------- |
| Values | A list of valid strings for this option. | Required |
| Name | A human-readable name for this option. | Required |

### Enumeration Arguments vs Options

Both enumeration arguments and options accept values from a fixed set of strings, but they serve different purposes. As a general rule, prefer enumeration arguments for anything the caller must decide for the function to be meaningful. Options are best suited for engine-level behavioral properties (like overflow or rounding mode) where a reasonable default exists and the caller may not care which behavior the engine picks.

| | Enumeration Argument | Option |
| --- | --- | --- |
| Required? | Yes, must be specified by the producer | No, may be omitted |
| Semantics | Core to the function's operation (e.g., which date component to extract) | Behavioral preference for corner cases or engine-specific behavior (e.g., overflow handling) |
| Position | Positional, part of the argument list | Named, separate from arguments |
| In function signature? | Yes (as `req`) | No |
| If omitted | Invalid plan | Consumer chooses its own behavior |

!!! warning "YAML keyword overlap"

In the YAML extension format, both enumeration arguments and function-level options use the keyword `options`. An enumeration argument appears inside `args` as `options: [VAL1, VAL2, ...]`, while a function-level option appears under a top-level `options` key with named sub-keys and a `values` list. Take care not to confuse the two when reading or writing extension YAML files.
Contributor

I am confused. 😆 perhaps a short snippet with comment would work?

Member Author

Sorry for getting back to you late here. I can include an example. Another option would be to drop this section entirely and open a ticket to make this clearer in the future. Here is an example:

  -
    name: "add"
    description: "Add two values."
    impls:
      - args:
          - name: x
            value: i8
          - name: y
            value: i8
        options: # this is an option
          overflow:
            values: [ SILENT, SATURATE, ERROR ]
        return: i8
  -
    name: extract
    # ...
      - args:
          - name: component
            options: [ YEAR, ISO_YEAR, US_YEAR, HOUR, MINUTE, SECOND,
                       MILLISECOND, MICROSECOND, SUBSECOND, UNIX_TIME ] # and this is an enum
            description: The part of the value to extract.
          - name: x
            value: timestamp

I was just trying to call out that both use the YAML key name options which could be confusing 🤔
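
The difference between the two `options` keys can also be shown in code. The following is a minimal Python sketch, assuming the extension YAML has already been parsed into plain dicts; the dict literals are hand-written and trimmed copies of the `add` and `extract` entries above, and the helper names `enumeration_args` and `function_options` are hypothetical.

```python
from typing import Any, Dict, List

# Hand-written dicts mirroring the YAML entries above (trimmed for brevity).
ADD_IMPL: Dict[str, Any] = {
    "args": [{"name": "x", "value": "i8"}, {"name": "y", "value": "i8"}],
    "options": {"overflow": {"values": ["SILENT", "SATURATE", "ERROR"]}},
    "return": "i8",
}
EXTRACT_IMPL: Dict[str, Any] = {
    "args": [
        {"name": "component", "options": ["YEAR", "ISO_YEAR", "US_YEAR"]},
        {"name": "x", "value": "timestamp"},
    ],
}


def enumeration_args(impl: Dict[str, Any]) -> Dict[str, List[str]]:
    """`options` inside an `args` entry marks an enumeration argument."""
    return {
        arg["name"]: list(arg["options"])
        for arg in impl.get("args", [])
        if "options" in arg
    }


def function_options(impl: Dict[str, Any]) -> Dict[str, List[str]]:
    """`options` at the impl level holds named options with a `values` list."""
    return {
        name: list(spec["values"])
        for name, spec in impl.get("options", {}).items()
    }
```

Under these assumptions, `add` has no enumeration arguments and one function-level option (`overflow`), while `extract` has one enumeration argument (`component`) and no function-level options.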




## Nullability Handling
8 changes: 5 additions & 3 deletions site/docs/extensions/index.md
@@ -88,7 +88,7 @@ A function signature uniquely identifies a function implementation within a sing

| Argument Type | Signature Name |
|---------------------------------|----------------|
| Required Enumeration | req |
| Enumeration | req |
| i8 | i8 |
| i16 | i16 |
| i32 | i32 |
@@ -122,11 +122,13 @@ A function signature uniquely identifies a function implementation within a sing

#### Examples

Function-level [options](../expressions/scalar_functions.md#options) are not part of the function signature and do not appear here. Only enumeration arguments (which are positional and required) contribute to the signature as `req`.

| Function Signature | Function Name |
| ------------------------------------------------- | ------------------- |
| `add(optional enumeration, i8, i8) => i8` | `add:i8_i8` |
Member Author

`add` doesn't take an enum, it takes an option. It is confusing to refer to this as an "optional enumeration".

| `add(i8, i8) => i8` | `add:i8_i8` |
| `avg(fp32) => fp32` | `avg:fp32` |
| `extract(required enumeration, timestamp) => i64` | `extract:req_ts` |
| `extract(enumeration, timestamp) => i64` | `extract:req_ts` |
| `sum(any1) => any1` | `sum:any` |
| `concat(str...) => str` | `concat:str` |
| `transform(list<any1>, func<any1 -> any2>) => list<any2>` | `transform:list_func` |
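
As an illustration of how the names in the table are formed, here is a minimal Python sketch. It is not an official implementation: `SIGNATURE_NAMES` lists only the short names that appear in the rows above, and `function_name` is a hypothetical helper. Function-level options are deliberately absent from the input, since they are not part of the signature.

```python
from typing import Sequence

# Short names taken from the table and examples above; only the entries
# this sketch needs are listed.
SIGNATURE_NAMES = {
    "enumeration": "req",
    "i8": "i8",
    "fp32": "fp32",
    "timestamp": "ts",
}


def function_name(function: str, arg_types: Sequence[str]) -> str:
    """Join the short names of the positional arguments with underscores."""
    short_names = [SIGNATURE_NAMES[arg_type] for arg_type in arg_types]
    return f"{function}:{'_'.join(short_names)}"
```

For example, `function_name("extract", ["enumeration", "timestamp"])` gives `extract:req_ts`, and `function_name("add", ["i8", "i8"])` gives `add:i8_i8` whether or not the call sets an overflow option.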