Work on Version 1.17.0 #331

Open
tilo wants to merge 55 commits into main from version-1.17.0

@tilo tilo commented Apr 13, 2026

Previous versions of SmarterCSV required an IO.rewind after automatic detection of row_sep and col_sep. This limits the use cases of the gem, because inputs that cannot rewind could not be used.
We also want to be able to use it for streaming input.

Work on this branch:

  • no more IO.rewind when doing auto-detection

This adds a peekable IO buffer

  • it fetches the first N bytes from the input into the buffer
  • does the autodetection within the buffer
  • rewinds the buffer (not the IO)
  • starts the actual CSV processing by prepending the buffered bytes to the rest of the IO stream, which continues from its already-advanced position
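
The four steps above can be sketched in a few lines of Ruby. This is a minimal illustration under stated assumptions, not the PeekableIO class from this branch: the class name `PeekableSketch` and its methods `peek`, `rewind_buffer`, and `read` are made up for the example, and bytes are kept binary (no transcoding).

```ruby
require "stringio"

# Hypothetical sketch of a peekable buffer in front of a non-seekable IO.
# Names are illustrative only and do not mirror SmarterCSV internals.
class PeekableSketch
  def initialize(io, buffer_size: 16_384)
    @io = io
    # Step 1: fetch the first N bytes. IO#read(n) returns nil at EOF.
    @buffer = StringIO.new((io.read(buffer_size) || "").b)
  end

  # Step 2: detection code reads from the in-memory buffer, never from @io.
  def peek(length = nil)
    @buffer.read(length) || ""
  end

  # Step 3: rewind the buffer only; the underlying IO is left untouched.
  def rewind_buffer
    @buffer.rewind
  end

  # Step 4: serve the buffered bytes first, then the rest of the stream.
  def read
    @buffer.read + @io.read.b
  end
end

# A pipe cannot seek, so it stands in for STDIN or an HTTP body here.
reader, writer = IO.pipe
writer.write("a;b\r\n1;2\r\n")
writer.close

input = PeekableSketch.new(reader, buffer_size: 4)
sample = input.peek                 # bytes available for auto-detection
uses_semicolon = sample.include?(";")
input.rewind_buffer
data = input.read                   # full content, nothing lost to detection
```

The underlying pipe is read exactly once, front to back, which is the property that makes the approach usable for streams.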

SmarterCSV 1.17.0

A features-and-quality release — non-seekable streaming inputs, a structured warnings system, Rails-friendly defaults, and a round of parser performance work. No breaking changes; all 1.16.x code works unmodified. Backwards-compatible to Ruby 2.5.

RSpec tests: 1,434 → 2,201 (+767 since 1.16.4).

Headline features

  • Non-seekable streaming inputs — SmarterCSV now reads directly from any IO, including streams that don't support rewind/seek (pipes, STDIN, Zlib::GzipReader, HTTP/S3 response bodies). No need to materialize the file on disk.

  • Auto-detection of row_sep/col_sep continues to work on these inputs via an internal rewindable buffer, so the underlying source never needs to rewind.

  • Structured warnings — auto-detection and config warnings are collected on the Reader as a de-duped histogram (reader.warnings), in addition to being emitted to a log sink.
    Class-level SmarterCSV.warnings mirrors SmarterCSV.errors (per-thread, cleared per call).
    When Rails.logger is present, warnings route through it at the declared severity; otherwise Kernel#warn. Codes: :chunk_size_default, :header_a_method, :utf8_missing_binary_mode, :no_clear_row_sep, :no_row_sep_found.
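
To make the "de-duped histogram plus log sink" idea concrete, here is a small sketch. It is an assumption-laden illustration, not the Reader's implementation: `WarningCollector`, its `add` method, and the message strings are hypothetical, and emitting only the first occurrence of each code is a choice made for this example. Only the warning codes and the Rails.logger / Kernel#warn routing come from the notes above.

```ruby
# Hypothetical collector: counts warnings per code and forwards them to a sink.
class WarningCollector
  def initialize(sink: nil)
    @counts = Hash.new(0)
    @sink = sink || default_sink
  end

  # Record one occurrence; forward to the sink the first time a code is seen
  # (assumption for this sketch, the release notes do not specify this).
  def add(code, message, severity: :warn)
    @counts[code] += 1
    @sink.call(severity, "[#{code}] #{message}") if @counts[code] == 1
  end

  # The histogram: warning code => number of occurrences.
  def to_h
    @counts.dup
  end

  private

  # Use Rails.logger at the given severity when present, otherwise Kernel#warn.
  def default_sink
    if defined?(Rails) && Rails.respond_to?(:logger) && Rails.logger
      ->(severity, msg) { Rails.logger.public_send(severity, msg) }
    else
      ->(_severity, msg) { warn(msg) }
    end
  end
end

seen = []
collector = WarningCollector.new(sink: ->(severity, msg) { seen << [severity, msg] })
collector.add(:no_clear_row_sep, "no clear row separator found")
collector.add(:no_clear_row_sep, "no clear row separator found")
collector.add(:chunk_size_default, "chunk_size not set", severity: :info)
histogram = collector.to_h
```

A histogram keeps the warnings inspectable after the call without flooding the log when the same condition repeats.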

New / changed options

  • buffer_size — now public; peek-buffer chunk size for non-seekable inputs. Default 16_384. Out-of-range values warn and clamp. No effect on seekable inputs (paths, File, StringIO).
  • auto_row_sep_chars — default changed 500 → 4096 (sized to cover wide headers in one read). Behavior change: nil/0 no longer means "scan whole file"; they fall back to the default with a warning, and the total scan is hard-capped at 64KB.
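
The "warn and clamp" and "fall back with a warning" behaviors could look roughly like this. It is a sketch only: the helper names are hypothetical, and the lower and upper bounds for buffer_size are placeholders, since the notes state the defaults but not the valid range.

```ruby
BUFFER_SIZE_DEFAULT = 16_384
BUFFER_SIZE_MIN = 1_024          # placeholder bound, not taken from the gem
BUFFER_SIZE_MAX = 1_048_576      # placeholder bound, not taken from the gem
AUTO_ROW_SEP_CHARS_DEFAULT = 4_096

# Out-of-range values are clamped and a warning is recorded.
def normalize_buffer_size(value, warnings)
  return BUFFER_SIZE_DEFAULT if value.nil?

  clamped = value.clamp(BUFFER_SIZE_MIN, BUFFER_SIZE_MAX)
  warnings << "buffer_size #{value} out of range, using #{clamped}" if clamped != value
  clamped
end

# nil and 0 used to mean "scan whole file"; now they select the default.
def normalize_auto_row_sep_chars(value, warnings)
  if value.nil? || value.zero?
    warnings << "auto_row_sep_chars #{value.inspect} falls back to #{AUTO_ROW_SEP_CHARS_DEFAULT}"
    return AUTO_ROW_SEP_CHARS_DEFAULT
  end
  value
end

warnings = []
small = normalize_buffer_size(16, warnings)            # clamped up to the minimum
scan  = normalize_auto_row_sep_chars(0, warnings)      # falls back to the default
```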

Bug fixes

  • Files ending in a lone \r are now correctly detected as \r-terminated (was falling through to a "no clear row separator" warning).
  • remove_empty_values now treats Unicode whitespace as empty: a field containing only whitespace, including NBSP (U+00A0), ideographic space (U+3000), etc., is dropped, matching ActiveSupport's String#blank?. (Previously only ASCII whitespace counted, and only Rails apps got the Unicode behavior via blank?; that inconsistency is now gone.) Identical with or without the C extension.
  • remove_zero_values now also removes signed zeros (+0, -0, -0.0, +0.00, …), like 0 / 0.0. (Only applies with remove_zero_values: true, off by default.)
  • Better auto-detection of row_sep/col_sep on files with comment headers / irregular starts; guess_line_ending now scans in chunks up to a 64KB hard cap, returning as soon as one separator has a clear majority.
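
The two value-filter fixes above can be illustrated in plain Ruby. This is a sketch, not the gem's code path: the helper names and both regular expressions are assumptions for the example. It relies on the fact that `\s` in a Ruby regexp matches only ASCII whitespace, while the POSIX bracket `[[:space:]]` is Unicode-aware on UTF-8 strings.

```ruby
# Unicode-aware blank check, same idea as ActiveSupport's String#blank?.
UNICODE_BLANK = /\A[[:space:]]*\z/
# Zero with an optional sign and optional all-zero fraction (illustrative).
SIGNED_ZERO = /\A[+-]?0+(?:\.0+)?\z/

def blank_value?(value)
  value.nil? || (value.is_a?(String) && value.match?(UNICODE_BLANK))
end

def zero_value?(value)
  value.is_a?(String) && value.match?(SIGNED_ZERO)
end

# Drop blank fields always, and zero fields only when asked to.
def prune(row, remove_zero_values: false)
  row.reject do |_key, value|
    blank_value?(value) || (remove_zero_values && zero_value?(value))
  end
end

row = { name: "Ann", note: "\u00A0\u3000", balance: "-0.0" }
without_blanks = prune(row)
without_zeros  = prune(row, remove_zero_values: true)
```

With an ASCII-only check (`/\A\s*\z/`), the `note` field would have survived, which is the inconsistency the fix removes.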

Performance — 1.16.4 → 1.17.0

Apple M3, Ruby 3.4.7, 40 iterations × 8 runs, median across runs (p10-trimmed). The C parser's core line-parsing (separator splitting, quote/escape handling, multiline stitching) is unchanged from 1.16.0; the C-path changes this cycle are a faster code path for quoted-field-heavy files (the wins) and Unicode-aware blank detection. ~N% error = within ±3%, the run-to-run margin — effectively unchanged.

C-accelerated path (the default)

[Image: benchmark results table, C-accelerated path (screenshot, 2026-05-11)]

Quote-heavy / large-field / wide files run 7–22% faster; everything else is within ±3% (effectively unchanged — the short-line / many-tiny-field files show a small consistent uptick from the larger default auto-detection scan window, auto_row_sep_chars; dial it down if that matters).

Ruby fallback path (acceleration: false)

[Image: benchmark results table, Ruby fallback path (screenshot, 2026-05-11)]

Faster on nearly every file (4–20%), biggest on wide / many-small-field CSVs — from in-place stripping in the no-quote split path, a first-byte fast-reject before numeric conversion, and per-row / per-value overhead removed from the hash transformations. (long_fields_40k and sensor_data sit at parity — not per-field-transform-bound.)
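
As an illustration of the "first-byte fast-reject before numeric conversion" idea, the sketch below checks a single byte before running any regular expression. The helper names, the set of accepted first bytes, and the numeric patterns are assumptions made for this example; the notes do not spell out what the gem actually checks.

```ruby
INTEGER_RE = /\A[+-]?\d+\z/
FLOAT_RE   = /\A[+-]?\d+\.\d+\z/

# Cheap pre-check: a numeric field must start with a digit or a sign.
def numeric_candidate?(str)
  return false if str.empty?

  byte = str.getbyte(0)
  (byte >= 48 && byte <= 57) || byte == 43 || byte == 45 # '0'..'9', '+', '-'
end

# Only fields that pass the pre-check pay for the regexp matches.
def convert_numeric(str)
  return str unless numeric_candidate?(str)

  case str
  when INTEGER_RE then str.to_i
  when FLOAT_RE   then str.to_f
  else str
  end
end

values = ["42", "-3.5", "Berlin", "12ab", ""].map { |v| convert_numeric(v) }
```

On text-heavy columns most fields fail the byte check, so the regexp work is skipped for them entirely.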


See CHANGELOG.md and docs/releases/1.17.0/ (changes.md, performance_notes.md) for full detail.

tilo and others added 30 commits April 11, 2026 17:46
  PeekableIO: buffer stores raw bytes in external encoding; maybe_transcode
  applies ext→int conversion on read-out. @buffer_frozen flag prevents
  premature delegation to @io before rewind, so all bytes consumed during
  auto-detection are replayable. each_char falls back to ASCII_8BIT (not
  UTF-8) for sources with no declared encoding.

  Reader: enforce_utf8_encoding was using force_encoding(utf-8) which
  silently dropped non-ASCII bytes from ISO-8859-1, Windows-1252, Shift-JIS
  and other encodings. Now uses line.encoding as the source in encode() for
  correct transcoding.

  Use Encoding::ASCII_8BIT consistently (Encoding::BINARY is an alias).

  Add comprehensive tests: multi-encoding unit tests, transcoding pair
  integration tests, non-seekable stream transcoding.
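
The difference between relabelling and transcoding, which the commit message above describes, can be reproduced with plain Ruby strings. This is a sketch of the general technique, not the Reader's code: `to_utf8` is a hypothetical helper, and `scrub("")` is used here only to show how non-ASCII bytes end up dropped once a string has been mislabelled.

```ruby
# "café" as ISO-8859-1 bytes: the é is the single byte 0xE9.
latin1 = "caf\xE9".b.force_encoding(Encoding::ISO_8859_1)

# force_encoding only changes the label, so 0xE9 becomes invalid UTF-8.
relabelled = latin1.dup.force_encoding(Encoding::UTF_8)
still_valid = relabelled.valid_encoding?
lossy = relabelled.scrub("")          # the é is gone

# encode with the real source encoding converts the bytes instead.
def to_utf8(line)
  line.encode(Encoding::UTF_8, line.encoding)
end

transcoded = to_utf8(latin1)
```

Here `transcoded` holds a valid UTF-8 "café", while `lossy` has been reduced to "caf".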