Work on Version 1.17.0 by tilo · Pull Request #331 · tilo/smarter_csv

tilo · 2026-04-13T16:11:08Z

Previous versions of SmarterCSV required an IO.rewind after automatic detection of row_sep and col_sep. This limited the gem's use cases,
because we also want to be able to use it for streaming input.

Work on this branch:

No more IO.rewind when doing auto-detection.

This adds a peekable IO buffer, which:

fetches the first N bytes from the input into the buffer
does the auto-detection within the buffer
rewinds the buffer (not the IO)
starts the actual CSV processing by replaying the buffer first and then continuing from the IO stream's already-advanced position
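
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the gem's PeekableIO: the class name `SketchPeekableIO`, its method names, and the single-character separator restriction are all assumptions made for the example.

```ruby
require "stringio"

# Hypothetical wrapper for illustration only; not the gem's PeekableIO.
class SketchPeekableIO
  def initialize(io, peek_bytes = 16_384)
    @io = io
    # Step 1: fetch the first N bytes once. The underlying IO is never rewound.
    @buffer = StringIO.new(io.read(peek_bytes) || "")
  end

  # Step 2: auto-detection inspects the buffered bytes only.
  def peek
    @buffer.string
  end

  # Step 3: rewind the buffer, not the IO.
  def rewind_buffer
    @buffer.rewind
  end

  # Step 4: serve buffered data first, then continue from the IO's
  # current position. Assumes a single-character separator.
  def gets(sep = "\n")
    line = @buffer.gets(sep)
    return @io.gets(sep) if line.nil?

    # A buffered line without a trailing separator straddles the buffer
    # edge, so it is completed from the underlying stream.
    unless line.end_with?(sep)
      rest = @io.gets(sep)
      line << rest if rest
    end
    line
  end
end
```

A caller would read `peek` to guess the separators, call `rewind_buffer`, and then loop over `gets` as with any other IO.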

SmarterCSV 1.17.0

A features-and-quality release — non-seekable streaming inputs, a structured warnings system, Rails-friendly defaults, and a round of parser performance work. No breaking changes; all 1.16.x code works unmodified. Backwards-compatible to Ruby 2.5.

RSpec tests: 1,434 → 2,201 (+767 since 1.16.4).

Headline features

Non-seekable streaming inputs — SmarterCSV now reads directly from any IO, including streams that don't support rewind/seek (pipes, STDIN, Zlib::GzipReader, HTTP/S3 response bodies). No need to materialize the file on disk.
Auto-detection of row_sep/col_sep continues to work on these inputs via an internal rewindable buffer, so the underlying source never needs to rewind.
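
To make "non-seekable" concrete, the snippet below uses only the Ruby standard library to show that a pipe cannot be rewound but can still be read forward. It does not call SmarterCSV; the sample data and variable names are made up for the illustration, and the behavior shown is that of POSIX platforms.

```ruby
reader, writer = IO.pipe
writer.write("name,age\nAda,36\n")
writer.close

# Pipes have no file position, so rewinding fails (Errno::ESPIPE on POSIX).
seekable = begin
  reader.rewind
  true
rescue Errno::ESPIPE, NotImplementedError
  false
end

# Forward-only reading still works, which is all a streaming parser needs.
rows = reader.each_line.map { |line| line.chomp.split(",") }
```

Any detection step that has already consumed bytes from such a stream has to keep them in memory, which is the job of the internal buffer described above.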
Structured warnings — auto-detection and config warnings are collected on the Reader as a de-duped histogram (reader.warnings), in addition to being emitted to a log sink.
Class-level SmarterCSV.warnings mirrors SmarterCSV.errors (per-thread, cleared per call).
When Rails.logger is present, warnings route through it at the declared severity; otherwise Kernel#warn. Codes: :chunk_size_default, :header_a_method, :utf8_missing_binary_mode, :no_clear_row_sep, :no_row_sep_found.
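
A de-duplicated warning histogram with a pluggable log sink could look roughly like this. The class `SketchWarnings`, its `add` method, and the choice to emit each code only on its first occurrence are assumptions for the sketch; the gem's actual internals may differ.

```ruby
# Hypothetical collector for illustration; not the gem's implementation.
class SketchWarnings
  attr_reader :histogram

  # logger: any object responding to severity methods such as #warn or
  # #info (for example Rails.logger). Falls back to Kernel#warn when nil.
  def initialize(logger: nil)
    @logger = logger
    @histogram = Hash.new(0)
  end

  def add(code, message, severity: :warn)
    @histogram[code] += 1
    # Assumption: emit only the first occurrence; repeats are just counted.
    return unless @histogram[code] == 1

    if @logger
      @logger.public_send(severity, message)
    else
      Kernel.warn(message)
    end
  end
end
```

Keying the histogram by code keeps the collected output small even when the same condition triggers on every chunk.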

New / changed options

buffer_size — now public; peek-buffer chunk size for non-seekable inputs. Default 16_384. Out-of-range values warn and clamp. No effect on seekable inputs (paths, File, StringIO).
auto_row_sep_chars — default changed 500 → 4096 (sized to cover wide headers in one read). Behavior change: nil/0 no longer means "scan whole file" — they fall back to the default with a warning; the total scan is hard-capped at 64KB.
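
The fallback for nil/0 and the capped, chunked scan can be sketched as below. The helper names, the warning text, the reading of "64KB" as 65,536 bytes, and the "clear majority" rule (at least three hits and more than twice all other candidates combined) are assumptions for illustration only; this is not the gem's guess_line_ending.

```ruby
SKETCH_DEFAULT_CHUNK = 4096      # default from the notes above
SKETCH_SCAN_CAP = 64 * 1024      # assumed meaning of the 64KB hard cap

# nil, 0 and other unusable values fall back to the default with a warning.
def sketch_chunk_size(value)
  return value if value.is_a?(Integer) && value > 0

  warn "auto_row_sep_chars=#{value.inspect} is not usable; using #{SKETCH_DEFAULT_CHUNK}"
  SKETCH_DEFAULT_CHUNK
end

def sketch_guess_row_sep(io, chunk_chars)
  chunk = sketch_chunk_size(chunk_chars)
  counts = Hash.new(0)
  scanned = 0
  carry = ""
  while scanned < SKETCH_SCAN_CAP
    data = io.read([chunk, SKETCH_SCAN_CAP - scanned].min)
    break if data.nil? || data.empty?

    scanned += data.bytesize
    text = carry + data
    # Hold back a trailing "\r" until we know whether "\n" follows it.
    if text.end_with?("\r")
      carry = "\r"
      text = text[0...-1]
    else
      carry = ""
    end
    text.scan(/\r\n|\r|\n/) { |sep| counts[sep] += 1 }

    # Assumed majority rule: stop early once one separator clearly leads.
    leader = counts.values.max.to_i
    others = counts.values.sum - leader
    return counts.key(leader) if leader >= 3 && leader > 2 * others
  end
  counts["\r"] += 1 unless carry.empty? # input ended in a lone "\r"
  best = counts.max_by { |_sep, n| n }
  best && best.first
end
```

Holding back the trailing "\r" is what lets the sketch tell a "\r\n" split across two chunks apart from a file that really ends in a lone "\r".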

Bug fixes

Files ending in a lone \r are now correctly detected as \r-terminated (was falling through to a "no clear row separator" warning).
remove_empty_values now treats Unicode whitespace as empty: a field containing only whitespace, including NBSP (U+00A0), ideographic space (U+3000), etc., is dropped, matching ActiveSupport's String#blank?. (Previously only ASCII whitespace counted, and only Rails apps got the Unicode behavior via blank?, an inconsistency that's now gone.) Behavior is identical with or without the C extension.
remove_zero_values now also removes signed zeros (+0, -0, -0.0, +0.00, …), like 0 / 0.0. (Only applies with remove_zero_values: true, off by default.)
Better auto-detection of row_sep/col_sep on files with comment headers / irregular starts; guess_line_ending now scans in chunks up to a 64KB hard cap, returning as soon as one separator has a clear majority.
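
The two value-level fixes above (Unicode-aware blanks and signed zeros) can be illustrated with plain regular expressions. The constants and helper names are hypothetical, and the zero pattern is only a guess at which spellings count as zero; the gem's own matching rules may differ.

```ruby
# [[:space:]] is Unicode-aware for UTF-8 strings in Ruby, so it also
# matches NBSP (U+00A0) and ideographic space (U+3000).
SKETCH_BLANK = /\A[[:space:]]*\z/

# Assumed pattern: optional sign, one or more zeros, optional ".000" part.
SKETCH_ZERO = /\A[+-]?0+(?:\.0+)?\z/

def sketch_blank?(value)
  value.nil? || SKETCH_BLANK.match?(value)
end

def sketch_zero?(value)
  SKETCH_ZERO.match?(value)
end

# Drops blank values always, and zero-like values only when asked to.
def sketch_clean(row, remove_zero_values: false)
  row.reject do |_key, value|
    sketch_blank?(value) || (remove_zero_values && sketch_zero?(value))
  end
end
```

For instance, `sketch_clean({ a: "\u00A0", b: "-0.0", c: "7" }, remove_zero_values: true)` keeps only the `:c` entry.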

Performance — 1.16.4 → 1.17.0

Apple M3, Ruby 3.4.7, 40 iterations × 8 runs, median across runs (p10-trimmed). The C parser's core line-parsing (separator splitting, quote/escape handling, multiline stitching) is unchanged from 1.16.0; the C-path changes this cycle are a faster code path for quoted-field-heavy files (the wins) and Unicode-aware blank detection. ~N% error = within ±3%, the run-to-run margin — effectively unchanged.

C-accelerated path (the default)

Quote-heavy / large-field / wide files run 7–22% faster; everything else is within ±3% (effectively unchanged — the short-line / many-tiny-field files show a small consistent uptick from the larger default auto-detection scan window, auto_row_sep_chars; dial it down if that matters).

Ruby fallback path (acceleration: false)

Faster on nearly every file (4–20%), biggest on wide / many-small-field CSVs — from in-place stripping in the no-quote split path, a first-byte fast-reject before numeric conversion, and per-row / per-value overhead removed from the hash transformations. (long_fields_40k and sensor_data sit at parity — not per-field-transform-bound.)
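
The idea behind a first-byte fast-reject before numeric conversion is sketched below. It is an illustration of the general technique, not the gem's code: the function name and the exact integer and float patterns are assumptions.

```ruby
# Only values whose first byte is a digit, sign or dot can be numeric, so
# everything else skips the (more expensive) regex checks entirely.
def sketch_convert_numeric(value)
  first = value.getbyte(0)
  numeric_start = first && ((first >= 48 && first <= 57) || # "0".."9"
                            first == 43 || first == 45 ||   # "+" or "-"
                            first == 46)                    # "."
  return value unless numeric_start

  case value
  when /\A[+-]?\d+\z/ then value.to_i
  when /\A[+-]?\d*\.\d+\z/ then value.to_f
  else value
  end
end
```

On text-heavy files most fields start with a letter, so the single byte comparison replaces two regex matches per field.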

See CHANGELOG.md and docs/releases/1.17.0/ (changes.md, performance_notes.md) for full detail.

PeekableIO:
buffer stores raw bytes in external encoding; maybe_transcode applies ext→int conversion on read-out.
@buffer_frozen flag prevents premature delegation to @io before rewind, so all bytes consumed during auto-detection are replayable.
each_char falls back to ASCII_8BIT (not UTF-8) for sources with no declared encoding.

Reader:
enforce_utf8_encoding was using force_encoding(utf-8) which silently dropped non-ASCII bytes from ISO-8859-1, Windows-1252, Shift-JIS and other encodings. Now uses line.encoding as the source in encode() for correct transcoding.

Use Encoding::ASCII_8BIT consistently (Encoding::BINARY is an alias).
Add comprehensive tests: multi-encoding unit tests, transcoding pair integration tests, non-seekable stream transcoding.
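
The difference between relabeling and transcoding is easy to reproduce with core Ruby alone. The snippet below is a standalone illustration using a made-up ISO-8859-1 sample; it does not call the gem's enforce_utf8_encoding.

```ruby
# "café" encoded as ISO-8859-1: the final byte 0xE9 is "é".
latin1 = "caf\xE9".b.force_encoding(Encoding::ISO_8859_1)

# force_encoding only changes the label; the bytes are now invalid UTF-8.
relabeled = latin1.dup.force_encoding(Encoding::UTF_8)
relabeled_valid = relabeled.valid_encoding?

# encode with the line's real encoding as the source converts the bytes.
transcoded = latin1.encode(Encoding::UTF_8, latin1.encoding)
transcoded_valid = transcoded.valid_encoding?

# Both constant names refer to the same Encoding object.
same_constant = Encoding::BINARY.equal?(Encoding::ASCII_8BIT)
```

Here `relabeled` still holds the single byte 0xE9, while `transcoded` holds the two-byte UTF-8 sequence for "é".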

…uotes (C and Ruby paths) - make buffer_size a public option - auto_row_sep_chars defaults to 4096 - files ending in a lone \r now correctly detected as \r-terminated - minor internal refactor in parse_csv_line_c - added test coverage for quoted-field behavior - added test coverage for line-ending detection - cleaned up Gemfile and some RSpec files - updated documentation and CHANGELOG

tilo and others added 30 commits April 11, 2026 17:46

adding append capability to the Writer class

ae6f9ff

minor refactor

639567e

add failing specs for PeekableIO wrapper (TDD)

05a2e3d

Work on 1.17.0 : no rewind for auto-detection

8d399ea

fixing issues in peekable_io

ccba5c9

fix issues in peekable_io

a5b3df7

removing unused file

42813a8

fixes to the buffer logic

0ec2ce1

legacy ruby compatibility

d98df0c

legacy ruby compatibility

dc228cb

update

d354fd0

simplify

c20d4ef

buffer improvements + self-extending buffer

62f8e8c

Improve auto-detection flow

9ffd7a1

merge main into branch

ad1e263

buffer_size; more tests

af3c0e3

append strings to buffer instead of new strings; pass options in Peek…

4556284

…ableIO.new call

Fix PeekableIO: frozen-phase straddle bug, IO contract, empty-input h…

7bfc81e

…andling, and full test coverage

updated options handling

199a19c

adding guards

671773d

allow-list in PeekableIO

2daf89e

docs update

e365e4d

fix issue with errors handling

0c071b2

gitignore

9cc254c

conditional encoding

88bed16

only use PeekableIO buffer in streams without rewind capability

5d09d7c

Merge branch 'main' into version-1.17.0

600531a

fix jrubyissue

49a0ed1

improve auto-detection for row_sep

0a5512c

tilo added 25 commits April 23, 2026 08:13

pre4

16a61b6

add warnings to @warnings

a10aaa7

Rails logger handling

6b00c47

improve warnings; add Rails logger support

2252c09

silence warn messages in rspec tests

fb2a0d6

updating docs

1b579ff

updating docs tree

93f3b20

update CHANGELOG

f3f7fb9

updating docs

f36043b

docs update; more tests

c2abafc

docs update

b9d7302

version 1.17.0.pre6 fixing a small performance regression

3bba63c

version 1.17.0.pre11 small improvement on ruby path

f003aec

strip -> strip!

65ecf0f

update

b32d2af

clean-up rspec tests that use parse

c89252b

version 1.17.0.pre12 smaller Ruby improvements

478df77

version pre13; using Rails regex for .blank? that is unicode aware; a…

6047917

…llowing +0/-0 for remove_zero_values; performance update for numeric conversion

docs update; more tests

eff81b5

meh

03065f3

meh

cd09c2e

update rspec tests

9c5ff2c

update test

0d81443

version 1.17.0

0bb8760

tilo wants to merge 55 commits into main from version-1.17.0
