20 changes: 20 additions & 0 deletions src/content/docs/guides/collecting/get-data-from-the-network.mdx
Original file line number Diff line number Diff line change
@@ -27,6 +27,22 @@ pipeline to convert incoming bytes to events. Inside the nested pipeline,
`$peer.ip` and `$peer.port` identify the connecting client. Set
`resolve_hostnames=true` to also expose `$peer.hostname` from reverse DNS.

### Accept multiple input formats

Use <Op>read_auto</Op> when a TCP endpoint receives data from producers that
don't all use the same format:

```tql
accept_tcp "0.0.0.0:9000" {
  read_auto fallback="lines"
}
```

The detector runs once per connection, so one client can send NDJSON while
another sends CSV, Syslog, or another supported format. This pattern is useful
for rapid prototyping, shared intake endpoints, and package pipelines where you
want to normalize different producer formats after parsing.

### Connect to a remote server

Use <Op>from_tcp</Op> to connect to an existing server:
@@ -223,6 +239,10 @@ select

## See also

- <Op>accept_tcp</Op>
- <Op>accept_udp</Op>
- <Op>from_nic</Op>
- <Op>read_auto</Op>
- <Integration>tcp</Integration>
- <Integration>udp</Integration>
- <Integration>nic</Integration>
13 changes: 13 additions & 0 deletions src/content/docs/guides/collecting/read-and-watch-files.mdx
@@ -46,6 +46,19 @@ from_file "/path/to/file.log" {

The parsing pipeline runs on the file content and must return events.

When file names or extensions don't identify the format reliably, use
<Op>read_auto</Op> to detect the format from the file content:

```tql
from_file "/dropzone/**" {
  read_auto fallback="lines"
}
```

This pattern helps with upload directories, partner file drops, and rapid
prototyping with mixed sample files. Keep the default `fallback="none"` if
unknown formats should fail instead of becoming line-oriented text.

## Directory processing

You can process multiple files efficiently using glob patterns. This section
19 changes: 19 additions & 0 deletions src/content/docs/guides/parsing/parse-delimited-text.mdx
@@ -10,6 +10,24 @@ The examples use <Op>from_file</Op> with a [parsing
subpipeline](/reference/programs#parsing-subpipelines) to illustrate each
technique.

## Auto-detect structured text

Use <Op>read_auto</Op> when you don't know the text format yet, or when you want
to prototype against sample files before choosing a concrete reader. It detects
common structured text formats such as NDJSON, CSV, TSV, key-value text, YAML,
Syslog, CEF, and LEEF from the first bytes of the stream:

```tql
from_file "sample.log" {
  read_auto fallback="lines"
}
```

With `fallback="lines"`, unsupported UTF-8 input still becomes one event per
line. Keep the default `fallback="none"` when unknown formats should fail fast.
After you know the exact format, switch to the concrete reader when you need
format-specific options.

## Split on newlines

Use <Op>read_lines</Op> to split a byte stream on newline characters. Given this
@@ -250,6 +268,7 @@ from_file "syslog.txt" {

## See also

- <Op>read_auto</Op>
- <Guide>parsing/parse-binary-data</Guide>
- <Guide>parsing/parse-string-fields</Guide>
- <Guide>collecting/read-and-watch-files</Guide>
12 changes: 12 additions & 0 deletions src/content/docs/reference/operators.mdx
@@ -531,6 +531,10 @@ operators:
    description: 'Parses an incoming bytes stream into a single event.'
    example: 'read_all binary=true'
    path: 'reference/operators/read_all'
  - name: 'read_auto'
    description: 'Detects the input format of a byte stream and selects a matching reader.'
    example: 'read_auto fallback="lines"'
    path: 'reference/operators/read_auto'
  - name: 'read_bitz'
    description: 'Parses bytes as *BITZ* format.'
    example: 'read_bitz'
@@ -2231,6 +2235,14 @@ read_all binary=true

</ReferenceCard>

<ReferenceCard title="read_auto" description="Detects the input format of a byte stream and selects a matching reader." href="/reference/operators/read_auto">

```tql
read_auto fallback="lines"
```

</ReferenceCard>

<ReferenceCard title="read_bitz" description="Parses bytes as *BITZ* format." href="/reference/operators/read_bitz">

```tql
187 changes: 187 additions & 0 deletions src/content/docs/reference/operators/read_auto.mdx
@@ -0,0 +1,187 @@
---
title: read_auto
category: Parsing
example: 'read_auto fallback="lines"'
---

Detects the input format of a byte stream and selects a matching reader.

```tql
read_auto [fallback=string, max_probe_bytes=uint]
```

## Description

The `read_auto` operator buffers the first bytes of its input as a probe and
asks every reader whether it can parse them. Use it when the input format is
unknown at authoring time, but should still be one of Tenzir's structured
formats.

Detection proceeds in three steps:

1. Probe the first bytes of the input, up to `max_probe_bytes`.
2. Dry-run every reader's parser on the probe to find capable readers.
3. Start the most specific capable reader. Without a capable reader, use the
   `fallback` reader or fail; when two formats are equally specific, fail
   with an ambiguity error.

Detection works in two layers:

1. **Capability**: Every reader dry-runs its actual parser on the probe. For
   example, YAML detection runs the YAML parser and requires a structured
   document, and CSV detection tokenizes complete lines with the reader's
   quoting rules and requires a stable number of fields. A reader only
   becomes a candidate when it would accept the probed input.
2. **Specificity**: When several readers are capable of parsing the same
   bytes, the most specific format wins. Magic-byte formats such as PCAP or
   Parquet rank above JSON dialects such as Suricata EVE or GELF, which rank
   above generic NDJSON, which ranks above key-value, delimited, Syslog, and
   YAML input. For example, a GELF stream is also valid NDJSON, but the GELF
   reader wins because it describes the input more precisely.

Detection is strict by default. If no reader is capable, or if two formats
with equal specificity match the same probe, `read_auto` emits an error
instead of guessing. A reader that needs more evidence delays the decision
until more input arrives, the input ends, or the probe reaches
`max_probe_bytes`. Once a single best candidate exists, `read_auto` starts
that reader, replays the buffered bytes, and forwards the rest of the stream
unchanged.

The built-in detectors cover common JSON, delimited text, security log, and
magic-byte formats, including NDJSON, JSON objects, JSON arrays of objects,
CSV, TSV, key-value text, YAML, Syslog, CEF, LEEF, Zeek TSV, Suricata EVE
JSON, Zeek JSON, GELF, PCAP, Feather, BITZ, and Parquet. Formats that accept
nearly arbitrary text never participate in detection: space-separated values
look like prose, so select <Op>read_ssv</Op> explicitly, and Syslog messages
without a `<PRI>` prefix look like free-form text, so they only match via
`fallback`.

The output uses the schema name that the selected reader would normally assign.
For example, detected CSV input produces the same schema name as
<Op>read_csv</Op>, and detected NDJSON input produces the same schema name as
<Op>read_ndjson</Op>. Inspect `@name` to see the schema name. `read_auto` does
not add a separate field with the detected format.
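
To check which reader `read_auto` picked, you can copy the schema name into a
regular field. The following is a minimal sketch: the field name
`detected_schema` is hypothetical, and it assumes that `@name` can be read on
the right-hand side of an assignment.

```tql
from_file "sample.log" {
  read_auto
}
// Copy the schema name assigned by the selected reader into a field.
detected_schema = @name
```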

Use `read_auto` for exploratory pipelines where you want to try sample data
quickly, for file drops where names don't reliably encode the format, and for
multi-format ingestion endpoints. For example, <Op>accept_tcp</Op> can run
`read_auto` per connection so one client sends NDJSON while another sends CSV,
Syslog, or another supported format.

Prefer a concrete reader when you already know the format or need reader-specific
options such as `unflatten_separator` for <Op>read_ndjson</Op>. `read_auto`
selects the reader once for each byte stream and expects the remaining bytes in
that stream to use the same format.
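
Once the format is known to be NDJSON, the switch to a concrete reader might
look like the sketch below. The separator value `"."` is an assumption for
illustration; choose the one that matches your field names.

```tql
from_file "events.ndjson" {
  // Concrete reader with a reader-specific option that read_auto does not expose.
  read_ndjson unflatten_separator="."
}
```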

### `fallback = string (optional)`

Controls what happens when no detector matches.

Valid values are:

- `"none"`: Emit an error. This is the default.
- `"lines"`: Use <Op>read_lines</Op>. The input must be valid UTF-8.
- `"all"`: Use <Op>read_all</Op>. `read_auto` uses the current probe to
  choose between text and binary output: valid UTF-8 probe bytes select
  `read_all`, while invalid probe bytes select `read_all binary=true`. If
  binary input can start with a valid UTF-8 prefix longer than
  `max_probe_bytes`, use a larger probe limit or <Op>read_all</Op> with
  `binary=true` directly.

`read_auto` uses a fallback only after the probe is final, either because the
input ended or because the probe reached `max_probe_bytes`. For long-lived
streams with unknown plain-text input, lower `max_probe_bytes` to reduce startup
latency or use <Op>read_lines</Op> directly.
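
For a long-lived TCP stream, a lower probe limit could look like this sketch.
The value `4096` is an illustrative assumption, not a recommendation; pick a
limit that still leaves enough bytes for the formats you expect.

```tql
accept_tcp "0.0.0.0:9000" {
  // Finalize the probe after at most 4096 bytes; use read_lines if no detector matched.
  read_auto fallback="lines", max_probe_bytes=4096
}
```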

### `max_probe_bytes = uint (optional)`

The maximum number of bytes to inspect before forcing a detection decision.

Defaults to `1Mi` bytes.

## Examples

### Detect JSON lines

Given this input:

```json title="events.ndjson"
{"x":1}
{"x":2}
```

Use `read_auto` where you would normally use a concrete reader:

```tql
from_file "events.ndjson" {
  read_auto
}
```

```tql
{x: 1}
{x: 2}
```

### Fall back to lines

For arbitrary UTF-8 text, opt into line-based parsing explicitly:

```txt title="messages.txt"
hello
world
```

```tql
from_file "messages.txt" {
  read_auto fallback="lines"
}
```

```tql
{line: "hello"}
{line: "world"}
```

### Fall back to a single event

Use `fallback="all"` when unknown input should become one event instead of one
event per line:

```tql
from_file "payload.bin" {
  read_auto fallback="all"
}
```

If the probe contains invalid UTF-8, `read_auto` selects
`read_all binary=true`, and the resulting event contains a `blob` value in the
`data` field.
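
When you already know that the payload is binary, the description of
`fallback="all"` suggests skipping detection. A minimal sketch, reusing the
example file name `payload.bin` from above:

```tql
from_file "payload.bin" {
  // Skip detection and always emit a single event with binary data.
  read_all binary=true
}
```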

### Accept multiple formats over TCP

Use `read_auto` in a network listener when the endpoint accepts producers with
different formats:

```tql
accept_tcp "0.0.0.0:9000" {
  read_auto fallback="lines"
}
```

The detector runs separately for each connection. This makes the pattern useful
for rapid prototyping, intake endpoints shared by several teams, and package
pipelines that normalize data only after the parser has selected the input
format.

## See Also

- <Op>accept_tcp</Op>
- <Op>from_file</Op>
- <Op>read_all</Op>
- <Op>read_csv</Op>
- <Op>read_json</Op>
- <Op>read_lines</Op>
- <Op>read_ndjson</Op>
- <Op>read_syslog</Op>
- <Op>read_yaml</Op>
- <Guide>collecting/get-data-from-the-network</Guide>
- <Guide>collecting/read-and-watch-files</Guide>
- <Guide>parsing/parse-delimited-text</Guide>