fix(clickhouse sink, aws_s3 sink): use dedicated batch_encoding types #25340

Merged
pront merged 13 commits into vectordotdev:master from flaviofcruz:fix-aws-clickhouse-batch-encoding
May 1, 2026
Conversation

@flaviofcruz
Contributor

@flaviofcruz flaviofcruz commented May 1, 2026

Summary

The clickhouse and aws_s3 sinks previously shared the BatchSerializerConfig schema for their batch_encoding field. That schema advertised every batch codec (arrow_stream, parquet, proto_batch) even though each sink supports only one; unsupported variants were rejected at config-build time, but still showed up in generated docs and YAML schemas.

This PR introduces dedicated, per-sink batch encoding types:

  • clickhouse sink: ClickhouseBatchEncoding — only arrow_stream.
  • aws_s3 sink: S3BatchEncoding — only parquet.

The generated component docs and config schemas now reflect what each sink actually accepts.
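The shape of the change can be sketched without Vector's config-derive machinery. The enum names below follow the PR; the `from_codec` helper is hypothetical, standing in for serde deserialization, so the sketch is runnable on its own:

```rust
// Minimal stand-alone model of the dedicated per-sink encoding types.
// The real types are deserialized by serde from the `codec` field; here a
// plain parse function plays that role.

/// clickhouse sink: the only supported batch codec is `arrow_stream`.
#[derive(Debug, PartialEq)]
enum ClickhouseBatchEncoding {
    ArrowStream,
}

/// aws_s3 sink: the only supported batch codec is `parquet`.
#[derive(Debug, PartialEq)]
enum S3BatchEncoding {
    Parquet,
}

impl ClickhouseBatchEncoding {
    fn from_codec(codec: &str) -> Result<Self, String> {
        match codec {
            "arrow_stream" => Ok(Self::ArrowStream),
            other => Err(format!("unsupported clickhouse batch codec: {other}")),
        }
    }
}

impl S3BatchEncoding {
    fn from_codec(codec: &str) -> Result<Self, String> {
        match codec {
            "parquet" => Ok(Self::Parquet),
            other => Err(format!("unsupported aws_s3 batch codec: {other}")),
        }
    }
}

fn main() {
    // Supported codecs parse; anything else is an error up front.
    assert!(ClickhouseBatchEncoding::from_codec("arrow_stream").is_ok());
    assert!(S3BatchEncoding::from_codec("proto_batch").is_err());
    println!("ok");
}
```

Because each enum has exactly the variants its sink supports, the generated schema for each sink can only ever list those variants.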

Vector configuration

sinks:
  my_clickhouse:
    type: clickhouse
    endpoint: http://localhost:8123
    table: events
    format: arrow_stream
    batch_encoding:
      codec: arrow_stream

  my_s3:
    type: aws_s3
    bucket: my-bucket
    batch_encoding:
      codec: parquet

How did you test this PR?

  • make check-clippy (clean)
  • make check-generated-docs
  • Existing clickhouse and aws_s3 unit tests

Change Type

  • Bug fix
  • New feature
  • Dependencies
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

(Configs that already used codec: arrow_stream / codec: parquet continue to work; only the schema surface changes.)

Does this PR include user facing changes?

  • Yes. Changelog fragment: changelog.d/clickhouse_aws_s3_dedicated_batch_encoding.fix.md.

References

Closes #25323

@github-actions github-actions Bot added the domain: sinks, domain: external docs, and work in progress labels May 1, 2026
@flaviofcruz flaviofcruz marked this pull request as ready for review May 1, 2026 14:53
@flaviofcruz flaviofcruz requested review from a team as code owners May 1, 2026 14:53
pront and others added 2 commits May 1, 2026 11:22
…sion

So adding a future S3BatchEncoding variant is a compile error rather than
silently defaulting to "parquet".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
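The guarantee this commit describes relies on exhaustive matching. A stand-alone illustration (names mirror the PR; the function body is invented for the sketch):

```rust
// With no wildcard `_` arm, the compiler requires every variant to be
// handled, so adding a future variant to S3BatchEncoding turns this match
// into a compile error instead of silently falling through to parquet.

#[derive(Debug)]
enum S3BatchEncoding {
    Parquet,
    // Adding e.g. `ArrowStream` here would break `codec_name` at compile
    // time, which is exactly the behavior the commit wants.
}

fn codec_name(enc: &S3BatchEncoding) -> &'static str {
    match enc {
        // No `_ => "parquet"` fallback arm on purpose.
        S3BatchEncoding::Parquet => "parquet",
    }
}

fn main() {
    println!("{}", codec_name(&S3BatchEncoding::Parquet));
}
```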
Avoid rebinding parquet_config to a borrow of itself from
config.batch_encoding; use a short p binding instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
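The rebinding issue can be shown in miniature. All types here are hypothetical stand-ins, not the PR's actual code:

```rust
// Sketch of the style fix: rather than shadowing an owned `parquet_config`
// binding with a borrow of itself taken from `config.batch_encoding`, bind
// the borrow to a short fresh name (`p`).

struct ParquetConfig {
    compression: String,
}

enum S3BatchEncoding {
    Parquet(ParquetConfig),
}

struct Config {
    batch_encoding: S3BatchEncoding,
}

fn compression_of(config: &Config) -> &str {
    // Single-variant enum, so this pattern is irrefutable; `p` borrows the
    // inner ParquetConfig without shadowing any earlier binding.
    let S3BatchEncoding::Parquet(p) = &config.batch_encoding;
    &p.compression
}

fn main() {
    let config = Config {
        batch_encoding: S3BatchEncoding::Parquet(ParquetConfig {
            compression: "snappy".into(),
        }),
    };
    println!("{}", compression_of(&config));
}
```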
Drop internal type names and dev jargon; lead with the user-visible
behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Member

@pront pront left a comment


Made some commits to fix a few issues. LGTM now!

@flaviofcruz
Contributor Author

> Made some commits to fix a few issues. LGTM now!

Thanks, appreciate it!

pront and others added 2 commits May 1, 2026 13:47
… codec at parse time

Add per-sink deserialization-failure tests that pin the schema-tightening
behavior introduced by the dedicated wrapper enums:

- aws_s3 rejects codec: arrow_stream
- clickhouse rejects codec: parquet

Previously both codecs were accepted by serde and rejected later at
sink-build time; the new wrapper enums move rejection up to parse time,
and these tests prevent silent regression of that contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
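The contract those tests pin can be modeled without Vector's crates. The real tests deserialize sink config YAML through serde; this sketch substitutes a plain parse function per sink (an assumption, for runnability) and asserts the same two rejections:

```rust
// Stand-alone model of the regression tests: unsupported codecs now fail
// when the config is parsed, not later when the sink is built.

fn parse_clickhouse_codec(codec: &str) -> Result<&'static str, String> {
    match codec {
        "arrow_stream" => Ok("arrow_stream"),
        other => Err(format!("clickhouse batch_encoding: unknown codec `{other}`")),
    }
}

fn parse_s3_codec(codec: &str) -> Result<&'static str, String> {
    match codec {
        "parquet" => Ok("parquet"),
        other => Err(format!("aws_s3 batch_encoding: unknown codec `{other}`")),
    }
}

fn main() {
    // aws_s3 rejects codec: arrow_stream at parse time.
    assert!(parse_s3_codec("arrow_stream").is_err());
    // clickhouse rejects codec: parquet at parse time.
    assert!(parse_clickhouse_codec("parquet").is_err());
    // Each sink's single supported codec still parses.
    assert!(parse_s3_codec("parquet").is_ok());
    assert!(parse_clickhouse_codec("arrow_stream").is_ok());
    println!("ok");
}
```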
@pront pront enabled auto-merge May 1, 2026 18:14

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 07e05e44ac


Comment thread src/sinks/aws_s3/config.rs
pront and others added 2 commits May 1, 2026 14:40
Programmatic users constructing S3SinkConfig directly need to be able
to set the batch_encoding field. With config kept private in
src/sinks/aws_s3/mod.rs, S3BatchEncoding was unnameable outside the
crate, regressing callers that previously used
BatchSerializerConfig::Parquet(...) at the same field.

Gated by codecs-parquet to mirror the type and field.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
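The visibility problem reduces to a private-module re-export. The module layout below is illustrative (and omits the `codecs-parquet` feature gate the commit mentions):

```rust
// Sketch: S3BatchEncoding lives in a private `config` module, so a public
// re-export is what keeps it nameable by programmatic users of the crate.

mod aws_s3 {
    mod config {
        // Private module: without a re-export, nothing outside `aws_s3`
        // can name this type, even though the enum itself is `pub`.
        #[derive(Debug, PartialEq)]
        pub enum S3BatchEncoding {
            Parquet,
        }
    }

    // The fix: re-export the type at a public path.
    pub use self::config::S3BatchEncoding;
}

fn main() {
    // A programmatic user constructing the batch_encoding value directly.
    let enc = aws_s3::S3BatchEncoding::Parquet;
    println!("{enc:?}");
}
```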
@pront pront added this pull request to the merge queue May 1, 2026
Merged via the queue into vectordotdev:master with commit 7923556 May 1, 2026
59 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators May 1, 2026
Labels

  • domain: external docs (Anything related to Vector's external, public documentation)
  • domain: sinks (Anything related to the Vector's sinks)
  • work in progress

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Refactor batch_encoding to sink-specific config types

2 participants