diff --git a/src/content/docs/guides/collecting/get-data-from-the-network.mdx b/src/content/docs/guides/collecting/get-data-from-the-network.mdx index 661211b96..5e8ef8105 100644 --- a/src/content/docs/guides/collecting/get-data-from-the-network.mdx +++ b/src/content/docs/guides/collecting/get-data-from-the-network.mdx @@ -27,6 +27,22 @@ pipeline to convert incoming bytes to events. Inside the nested pipeline, `$peer.ip` and `$peer.port` identify the connecting client. Set `resolve_hostnames=true` to also expose `$peer.hostname` from reverse DNS. +### Accept multiple input formats + +Use read_auto when a TCP endpoint receives data from producers that +don't all use the same format: + +```tql +accept_tcp "0.0.0.0:9000" { + read_auto fallback="lines" +} +``` + +The detector runs once per connection, so one client can send NDJSON while +another sends CSV, Syslog, or another supported format. This pattern is useful +for rapid prototyping, shared intake endpoints, and package pipelines where you +want to normalize different producer formats after parsing. + ### Connect to a remote server Use from_tcp to connect to an existing server: @@ -223,6 +239,10 @@ select ## See also +- accept_tcp +- accept_udp +- from_nic +- read_auto - tcp - udp - nic diff --git a/src/content/docs/guides/collecting/read-and-watch-files.mdx b/src/content/docs/guides/collecting/read-and-watch-files.mdx index 20105cb8d..dd6c19296 100644 --- a/src/content/docs/guides/collecting/read-and-watch-files.mdx +++ b/src/content/docs/guides/collecting/read-and-watch-files.mdx @@ -46,6 +46,19 @@ from_file "/path/to/file.log" { The parsing pipeline runs on the file content and must return events. +When file names or extensions don't identify the format reliably, use +read_auto to detect the format from the file content: + +```tql +from_file "/dropzone/**" { + read_auto fallback="lines" +} +``` + +This pattern helps with upload directories, partner file drops, and rapid +prototyping with mixed sample files. 
Keep `fallback="none"` if unknown formats +should fail instead of becoming line-oriented text. + ## Directory processing You can process multiple files efficiently using glob patterns. This section diff --git a/src/content/docs/guides/parsing/parse-delimited-text.mdx b/src/content/docs/guides/parsing/parse-delimited-text.mdx index 04d8420d5..b0ee23cf0 100644 --- a/src/content/docs/guides/parsing/parse-delimited-text.mdx +++ b/src/content/docs/guides/parsing/parse-delimited-text.mdx @@ -10,6 +10,24 @@ The examples use from_file with a [parsing subpipeline](/reference/programs#parsing-subpipelines) to illustrate each technique. +## Auto-detect structured text + +Use read_auto when you don't know the text format yet, or when you want +to prototype against sample files before choosing a concrete reader. It detects +common structured text formats such as NDJSON, CSV, TSV, key-value text, YAML, +Syslog, CEF, and LEEF from the first bytes of the stream: + +```tql +from_file "sample.log" { + read_auto fallback="lines" +} +``` + +With `fallback="lines"`, unsupported UTF-8 input still becomes one event per +line. Keep the default `fallback="none"` when unknown formats should fail fast. +After you know the exact format, switch to the concrete reader when you need +format-specific options. + ## Split on newlines Use read_lines to split a byte stream on newline characters. Given this @@ -250,6 +268,7 @@ from_file "syslog.txt" { ## See also +- read_auto - parsing/parse-binary-data - parsing/parse-string-fields - collecting/read-and-watch-files diff --git a/src/content/docs/reference/operators.mdx b/src/content/docs/reference/operators.mdx index 47536eb36..0c00024c0 100644 --- a/src/content/docs/reference/operators.mdx +++ b/src/content/docs/reference/operators.mdx @@ -531,6 +531,10 @@ operators: description: 'Parses an incoming bytes stream into a single event.' 
example: 'read_all binary=true' path: 'reference/operators/read_all' + - name: 'read_auto' + description: 'Detects the input format of a byte stream and selects a matching reader.' + example: 'read_auto fallback="lines"' + path: 'reference/operators/read_auto' - name: 'read_bitz' description: 'Parses bytes as *BITZ* format.' example: 'read_bitz' @@ -2231,6 +2235,14 @@ read_all binary=true + + +```tql +read_auto fallback="lines" +``` + + + ```tql diff --git a/src/content/docs/reference/operators/read_auto.mdx b/src/content/docs/reference/operators/read_auto.mdx new file mode 100644 index 000000000..bfa2100e6 --- /dev/null +++ b/src/content/docs/reference/operators/read_auto.mdx @@ -0,0 +1,187 @@ +--- +title: read_auto +category: Parsing +example: 'read_auto fallback="lines"' +--- + +Detects the input format of a byte stream and selects a matching reader. + +```tql +read_auto [fallback=string, max_probe_bytes=uint] +``` + +## Description + +The `read_auto` operator buffers the first bytes of its input as a probe and +asks every reader whether it can parse them. Use it when the input format is +unknown at authoring time, but should still be one of Tenzir's structured +formats. + +1. Probe the first bytes of the input, up to `max_probe_bytes`. +2. Dry-run every reader's parser on the probe to find capable readers. +3. Start the most specific capable reader. Without a capable reader, use the + `fallback` reader or fail; when two formats are equally specific, fail + with an ambiguity error. + +Detection works in two layers: + +1. **Capability**: Every reader dry-runs its actual parser on the probe. For + example, YAML detection runs the YAML parser and requires a structured + document, and CSV detection tokenizes complete lines with the reader's + quoting rules and requires a stable number of fields. A reader only + becomes a candidate when it would accept the probed input. +2. 
**Specificity**: When several readers are capable of parsing the same + bytes, the most specific format wins. Magic-byte formats such as PCAP or + Parquet rank above JSON dialects such as Suricata EVE or GELF, which rank + above generic NDJSON, which ranks above key-value, delimited, Syslog, and + YAML input. For example, a GELF stream is also valid NDJSON, but the GELF + reader wins because it describes the input more precisely. + +Detection is strict by default. If no reader is capable, or if two formats +with equal specificity match the same probe, `read_auto` emits an error +instead of guessing. A reader that needs more evidence delays the decision +until more input arrives, the input ends, or the probe reaches +`max_probe_bytes`. Once a single best candidate exists, `read_auto` starts +that reader, replays the buffered bytes, and forwards the rest of the stream +unchanged. + +The built-in detectors cover common JSON, delimited text, security log, and +magic-byte formats, including NDJSON, JSON objects, JSON arrays of objects, +CSV, TSV, key-value text, YAML, Syslog, CEF, LEEF, Zeek TSV, Suricata EVE +JSON, Zeek JSON, GELF, PCAP, Feather, BITZ, and Parquet. Formats that accept +nearly arbitrary text never participate in detection: space-separated values +look like prose, so select read_ssv explicitly, and Syslog messages +without a `` prefix look like free-form text, so they only match via +`fallback`. + +The output uses the schema name that the selected reader would normally assign. +For example, detected CSV input produces the same schema name as +read_csv, and detected NDJSON input produces the same schema name as +read_ndjson. Inspect `@name` to see the schema name. `read_auto` does +not add a separate field with the detected format. + +Use `read_auto` for exploratory pipelines where you want to try sample data +quickly, for file drops where names don't reliably encode the format, and for +multi-format ingestion endpoints. 
For example, accept_tcp can run +`read_auto` per connection so one client sends NDJSON while another sends CSV, +Syslog, or another supported format. + +Prefer a concrete reader when you already know the format or need reader-specific +options such as `unflatten_separator` for read_ndjson. `read_auto` +selects the reader once for each byte stream and expects the remaining bytes in +that stream to use the same format. + +### `fallback = string (optional)` + +Controls what happens when no detector matches. + +Valid values are: + +- `"none"`: Emit an error. This is the default. +- `"lines"`: Use read_lines. The input must be valid UTF-8. +- `"all"`: Use read_all. `read_auto` uses the current probe to + choose between text and binary output: valid UTF-8 probe bytes select + `read_all`, while invalid probe bytes select `read_all binary=true`. If + binary input can start with a valid UTF-8 prefix longer than + `max_probe_bytes`, use a larger probe limit or read_all with + `binary=true` directly. + +`read_auto` uses a fallback only after the probe is final, either because the +input ended or because the probe reached `max_probe_bytes`. For long-lived +streams with unknown plain-text input, lower `max_probe_bytes` to reduce startup +latency or use read_lines directly. + +### `max_probe_bytes = uint (optional)` + +The maximum number of bytes to inspect before forcing a detection decision. + +Defaults to `1Mi` bytes. 
+ +## Examples + +### Detect JSON lines + +Given this input: + +```json title="events.ndjson" +{"x":1} +{"x":2} +``` + +Use `read_auto` where you would normally use a concrete reader: + +```tql +from_file "events.ndjson" { + read_auto +} +``` + +```tql +{x: 1} +{x: 2} +``` + +### Fall back to lines + +For arbitrary UTF-8 text, opt into line-based parsing explicitly: + +```txt title="messages.txt" +hello +world +``` + +```tql +from_file "messages.txt" { + read_auto fallback="lines" +} +``` + +```tql +{line: "hello"} +{line: "world"} +``` + +### Fall back to a single event + +Use `fallback="all"` when unknown input should become one event instead of one +event per line: + +```tql +from_file "payload.bin" { + read_auto fallback="all" +} +``` + +If the input is binary, the resulting event contains a `blob` value in the +`data` field. + +### Accept multiple formats over TCP + +Use `read_auto` in a network listener when the endpoint accepts producers with +different formats: + +```tql +accept_tcp "0.0.0.0:9000" { + read_auto fallback="lines" +} +``` + +The detector runs separately for each connection. This makes the pattern useful +for rapid prototyping, intake endpoints shared by several teams, and package +pipelines that normalize data only after the parser has selected the input +format. + +## See Also + +- accept_tcp +- from_file +- read_all +- read_csv +- read_json +- read_lines +- read_ndjson +- read_syslog +- read_yaml +- collecting/get-data-from-the-network +- collecting/read-and-watch-files +- parsing/parse-delimited-text
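Note on the `read_auto` reference added in this diff: the two detection layers it describes (capability, then specificity) can be illustrated with a small Python sketch. This is not Tenzir's implementation. The `Detector` type, the `select_reader` function, the numeric specificity values, and the toy NDJSON and GELF checks are all hypothetical stand-ins chosen for illustration; only the selection rules (capable readers first, most specific wins, ties are an error, fallback applies only when nothing matches) come from the text above.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


class DetectionError(Exception):
    """Raised when no reader matches or the best match is ambiguous."""


@dataclass(frozen=True)
class Detector:
    name: str                         # illustrative label, not an operator name
    specificity: int                  # assumed ranking: higher is more specific
    accepts: Callable[[bytes], bool]  # toy capability check on the probe


def _json_objects(probe: bytes) -> Optional[list]:
    """Return the probe's lines as JSON objects, or None if any line fails."""
    try:
        lines = probe.decode("utf-8").splitlines()
        rows = [json.loads(line) for line in lines if line.strip()]
    except ValueError:  # covers UnicodeDecodeError and json.JSONDecodeError
        return None
    if not rows or not all(isinstance(row, dict) for row in rows):
        return None
    return rows


def is_ndjson(probe: bytes) -> bool:
    return _json_objects(probe) is not None


def is_gelf(probe: bytes) -> bool:
    # Toy check: every object carries the fields a GELF message requires.
    rows = _json_objects(probe)
    required = {"version", "host", "short_message"}
    return rows is not None and all(required <= row.keys() for row in rows)


def select_reader(probe: bytes, detectors: list,
                  fallback: Optional[str] = None) -> str:
    # Layer 1 (capability): keep detectors whose check accepts the probe.
    capable = [d for d in detectors if d.accepts(probe)]
    if not capable:
        if fallback is None:  # models fallback="none"
            raise DetectionError("no reader can parse the probe")
        return fallback
    # Layer 2 (specificity): the most specific capable reader wins.
    best = max(d.specificity for d in capable)
    winners = sorted(d.name for d in capable if d.specificity == best)
    if len(winners) > 1:
        raise DetectionError(f"ambiguous input: {winners}")
    return winners[0]


# Assumed ranking: a JSON dialect outranks generic NDJSON.
DETECTORS = [
    Detector("ndjson", specificity=1, accepts=is_ndjson),
    Detector("gelf", specificity=2, accepts=is_gelf),
]
```

Calling `select_reader(b'{"x":1}\n{"x":2}\n', DETECTORS)` picks the generic NDJSON label, while a probe whose objects all carry `version`, `host`, and `short_message` picks the GELF label even though the NDJSON check accepts it too. Plain text such as `b"hello\nworld\n"` matches neither check, so the call raises unless a fallback label is passed.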