Merged
61 commits
ae6f9ff
adding append capability to the Writer class
tilo Apr 12, 2026
639567e
minor refactor
tilo Apr 12, 2026
05a2e3d
add failing specs for PeekableIO wrapper (TDD)
tilo Apr 12, 2026
8d399ea
Work on 1.17.0 : no rewind for auto-detection
tilo Apr 13, 2026
ccba5c9
fixing issues in peekable_io
tilo Apr 13, 2026
a5b3df7
fix issues in peekable_io
tilo Apr 13, 2026
42813a8
removing unused file
tilo Apr 13, 2026
3726bc5
Fix encoding/transcoding correctness in PeekableIO and Reader
tilo Apr 14, 2026
0ec2ce1
fixes to the buffer logic
tilo Apr 14, 2026
d98df0c
legacy ruby compatibility
tilo Apr 14, 2026
dc228cb
legacy ruby compatibility
tilo Apr 14, 2026
d354fd0
update
tilo Apr 14, 2026
c20d4ef
simplify
tilo Apr 14, 2026
62f8e8c
buffer improvements + self-extending buffer
tilo Apr 14, 2026
9ffd7a1
Improve auto-detection flow
tilo Apr 14, 2026
ad1e263
merge main into branch
tilo Apr 14, 2026
af3c0e3
buffer_size; more tests
tilo Apr 15, 2026
4556284
append strings to buffer instead of new strings; pass options in Peek…
tilo Apr 15, 2026
7bfc81e
Fix PeekableIO: frozen-phase straddle bug, IO contract, empty-input h…
tilo Apr 16, 2026
199a19c
updated options handling
tilo Apr 16, 2026
671773d
adding guards
tilo Apr 17, 2026
2daf89e
allow-list in PeekableIO
tilo Apr 18, 2026
e365e4d
docs update
tilo Apr 19, 2026
0c071b2
fix issue with errors handling
tilo Apr 19, 2026
9cc254c
gitignore
tilo Apr 19, 2026
88bed16
conditional encoding
tilo Apr 20, 2026
5d09d7c
only use PeekableIO buffer in streams without rewind capability
tilo Apr 21, 2026
600531a
Merge branch 'main' into version-1.17.0
tilo Apr 21, 2026
49a0ed1
fix jrubyissue
tilo Apr 21, 2026
0a5512c
improve auto-detection for row_sep
tilo Apr 22, 2026
16a61b6
pre4
tilo Apr 23, 2026
a10aaa7
add warnings to @warnings
tilo Apr 24, 2026
6b00c47
Rails logger handling
tilo Apr 24, 2026
2252c09
improve warnings; add Rails logger support
tilo Apr 28, 2026
fb2a0d6
silence warn messages in rspec tests
tilo Apr 28, 2026
1b579ff
updating docs
tilo Apr 28, 2026
93f3b20
updating docs tree
tilo Apr 28, 2026
f3f7fb9
update CHANGELOG
tilo Apr 28, 2026
f36043b
updating docs
tilo May 6, 2026
c2abafc
docs update; more tests
tilo May 7, 2026
b9d7302
docs update
tilo May 7, 2026
3bba63c
version 1.17.0.pre6 fixing a small performance regression
tilo May 7, 2026
6fbc272
- performance: short-circuit when quoted fields contain no doubled …
tilo May 11, 2026
f003aec
version 1.17.0.pre11 small improvement on ruby path
tilo May 11, 2026
65ecf0f
strip -> strip!
tilo May 11, 2026
b32d2af
update
tilo May 11, 2026
c89252b
clean-up rspec tests that use parse
tilo May 11, 2026
478df77
version 1.17.0.pre12 smaller Ruby improvements
tilo May 11, 2026
6047917
version pre13; using Rails regex for .blank? that is unicode aware; a…
tilo May 11, 2026
eff81b5
docs update; more tests
tilo May 11, 2026
03065f3
meh
tilo May 11, 2026
cd09c2e
meh
tilo May 11, 2026
9c5ff2c
update rspec tests
tilo May 11, 2026
0d81443
update test
tilo May 11, 2026
0bb8760
version 1.17.0
tilo May 12, 2026
2a49e46
update docs
tilo May 13, 2026
4a83a83
more tests
tilo May 14, 2026
681a1ef
update
tilo May 14, 2026
d42916f
update
tilo May 14, 2026
6c5e3dd
update
tilo May 14, 2026
15270c5
update CHANGELOG
tilo May 14, 2026
1 change: 1 addition & 0 deletions .github/workflows/ruby.yml
@@ -27,6 +27,7 @@ jobs:
- 3.2
- 3.3
- "3.4"
- "4.0"
- head
- truffleruby
- truffleruby-head
11 changes: 10 additions & 1 deletion .rubocop.yml
@@ -13,6 +13,12 @@ Layout/SpaceInsideHashLiteralBraces:
Layout/SpaceAroundOperators:
  Enabled: false

Lint/ConstantDefinitionInBlock:
  Enabled: false

Lint/UnderscorePrefixedVariableName:
  Enabled: false

Metrics/AbcSize:
  Enabled: false

@@ -37,6 +43,9 @@ Metrics/ModuleLength:
Metrics/PerceivedComplexity:
  Enabled: false

Naming/MethodParameterName:
  Enabled: false

Naming/PredicateName:
  Enabled: false

@@ -156,7 +165,7 @@ Style/SymbolArray:
Style/SymbolProc: # old Ruby versions can't do this
  Enabled: false

Style/TernaryParentheses:
Style/TernaryParentheses: # parentheses are good!
  Enabled: false

Style/TrailingCommaInArrayLiteral:
54 changes: 54 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,60 @@

# SmarterCSV 1.x Change Log

## 1.17.0 (NOT RELEASED)

RSpec tests: **1,434 → 2,210** (+776 tests)

### New Features

* **Streaming IO support** — SmarterCSV now works with non-seekable IO sources such as pipes, STDIN, and Zlib streams.
A rewindable peek buffer transparently captures the first bytes of the stream so that `row_sep` and `col_sep` auto-detection can replay them without requiring the underlying source to support `rewind` or `seek`.

* **Structured warnings** — auto-detection and configuration warnings are now collected on the Reader as a deduped histogram:

```ruby
reader = SmarterCSV::Reader.new('data.csv')
reader.process
reader.warnings # => [{ type:, code:, severity:, message:, count: }, ...]
```

Repeated warnings of the same `(type, code)` are deduped — `count` tracks occurrences. Available codes today: `:chunk_size_default`, `:header_a_method`, `:utf8_missing_binary_mode`, `:no_clear_row_sep`, `:no_row_sep_found`.

* **Class-level `SmarterCSV.warnings`** accessor — mirrors `SmarterCSV.errors`. Per-thread, cleared at the start of each `.process` / `.parse` / `.each` / `.each_chunk` call. Safe under Puma/Sidekiq.

* **Rails.logger routing**: when `Rails.logger` is present, warnings are routed through it at the severity declared at the call site (`:debug` / `:info` / `:warn` / `:error` / `:fatal`); otherwise `Kernel#warn` is used as a fallback. Detection is cached at construction time, so there is no per-call overhead.
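
The deduplication described under "Structured warnings" can be pictured with a small collector keyed by `(type, code)`. This is only a sketch: the `WarningCollector` class, its method names, and the `:auto_detection` / `:config` type values are hypothetical and are not taken from SmarterCSV's source. Only the entry shape (`type`, `code`, `severity`, `message`, `count`) follows the example above.

```ruby
# Hypothetical collector; not SmarterCSV's implementation.
class WarningCollector
  def initialize
    # Ruby hashes keep insertion order, so warnings come back in first-seen order.
    @by_key = {}
  end

  # Records a warning; a repeat of the same (type, code) only bumps the count.
  def add(type:, code:, severity:, message:)
    entry = (@by_key[[type, code]] ||=
      { type: type, code: code, severity: severity, message: message, count: 0 })
    entry[:count] += 1
    entry
  end

  # Returns copies so callers cannot mutate the internal histogram.
  def to_a
    @by_key.values.map(&:dup)
  end
end

collector = WarningCollector.new
3.times do
  collector.add(type: :auto_detection, code: :no_clear_row_sep,
                severity: :error, message: "no clear row separator")
end
collector.add(type: :config, code: :chunk_size_default,
              severity: :warn, message: "chunk_size not set")

warnings = collector.to_a
warnings.each { |w| puts "#{w[:code]} x#{w[:count]}" }
```

Keeping the first message and severity for a key is a design choice of this sketch; a real collector might prefer the most recent or the most severe.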

### Improvements

* Improved auto-detection of `row_sep` and `col_sep` — giving more accurate results on files with comment headers.

* Larger scan window for accurate row separator detection on files with wide headers or long first lines.

* `guess_line_ending` now scans the input in chunks up to a 64KB hard cap, returning as soon as one separator has a clear majority. Near-tie chunk-boundary artifacts no longer cause spurious warnings; only true ties at the hard cap fall back to `"\n"` and emit a `:no_clear_row_sep` warning at `:error` severity (because a wrong guess risks a silent mis-parse).
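
To illustrate the chunked scan described above, here is a rough sketch. It is not the gem's `guess_line_ending`: the function name, the 4096-byte chunk size, the "more than twice the runner-up" definition of a clear majority, and the mapping of outcomes to warning codes are all assumptions, and quoted fields containing newlines are ignored. It does show the two details the changelog mentions: a `\r` at a chunk boundary is held until the next byte is seen, and a lone trailing `\r` is counted at the end.

```ruby
require 'stringio'

CR = 13 # byte value of "\r"
LF = 10 # byte value of "\n"

# Returns the leading separator if it has more than twice the runner-up's
# count (an assumed definition of "clear majority"), otherwise nil.
def clear_winner(counts)
  (leader, top), (_other, second) = counts.sort_by { |_sep, n| -n }
  top > 0 && top > 2 * second ? leader : nil
end

# Returns [separator, warning_code_or_nil].
def guess_line_ending_sketch(io, chunk_size: 4096, hard_cap: 64 * 1024)
  counts = { "\r\n" => 0, "\n" => 0, "\r" => 0 }
  pending_cr = false # a "\r" whose following byte has not been seen yet
  scanned = 0

  while scanned < hard_cap && (chunk = io.read([chunk_size, hard_cap - scanned].min))
    break if chunk.empty?
    scanned += chunk.bytesize
    chunk.each_byte do |byte|
      if pending_cr
        pending_cr = false
        if byte == LF
          counts["\r\n"] += 1
          next
        end
        counts["\r"] += 1
      end
      if byte == CR
        pending_cr = true
      elsif byte == LF
        counts["\n"] += 1
      end
    end
    winner = clear_winner(counts)
    return [winner, nil] if winner # stop early on a clear majority
  end

  counts["\r"] += 1 if pending_cr # input ended in a lone "\r"
  winner = clear_winner(counts)
  return [winner, nil] if winner

  code = counts.values.all?(&:zero?) ? :no_row_sep_found : :no_clear_row_sep
  ["\n", code] # fall back to "\n" and report why
end

crlf     = guess_line_ending_sketch(StringIO.new("a,b\r\n1,2\r\n"))
lone_cr  = guess_line_ending_sketch(StringIO.new("a,b\r"))
straddle = guess_line_ending_sketch(StringIO.new("a\r\nb\r\n"), chunk_size: 2)
tie      = guess_line_ending_sketch(StringIO.new("a\rb\n"))
empty    = guess_line_ending_sketch(StringIO.new(""))
```

In the `straddle` call the first chunk ends on `\r`, so nothing is counted until the `\n` in the second chunk arrives. For simplicity the sketch falls back on any unresolved scan, not only on ties at the hard cap.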

### New / Changed Options

* **`buffer_size` is now a public option** — peek buffer chunk size for non-seekable inputs (pipes, gzip readers, HTTP/S3 bodies). Default `16_384`. Out-of-range values warn and clamp to the supported range rather than raising.

* **`auto_row_sep_chars` default changed to `4096`** (was `500` in 1.16.x). Sized to cover wide-header CSVs in a single read. Bump it higher if your files have very wide headers or long comment preambles.
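
The "warn and clamp" behavior described for `buffer_size` could look roughly like the sketch below. The default of `16_384` comes from the entry above; the minimum and maximum bounds, the function name, and the warning text are placeholders, since the supported range is not stated here.

```ruby
DEFAULT_BUFFER_SIZE = 16_384    # default from the changelog entry
MIN_BUFFER_SIZE     = 1_024     # assumed lower bound, for illustration only
MAX_BUFFER_SIZE     = 1_048_576 # assumed upper bound, for illustration only

# Returns a usable buffer size; out-of-range values are clamped, and a
# message is appended to `warnings` instead of raising.
def normalize_buffer_size(value, warnings)
  size = value.nil? ? DEFAULT_BUFFER_SIZE : Integer(value)
  clamped = [[size, MIN_BUFFER_SIZE].max, MAX_BUFFER_SIZE].min
  if clamped != size
    warnings << "buffer_size #{size} is out of range, using #{clamped}"
  end
  clamped
end

warnings = []
default_size = normalize_buffer_size(nil, warnings)
too_small    = normalize_buffer_size(16, warnings)
too_large    = normalize_buffer_size(10_000_000, warnings)
in_range     = normalize_buffer_size(32_768, warnings)
```

Clamping keeps a misconfigured import running with a sane buffer, at the cost of silently differing from what the caller asked for, which is why the warning matters.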

### Bug Fixes

* **Files ending in a lone `\r`** are now correctly detected as `\r`-terminated instead of falling through to a "no clear row separator" warning.

* **`remove_empty_values` now treats Unicode whitespace as empty**: a field containing only whitespace, including characters like non-breaking space (U+00A0) or ideographic space (U+3000), is now dropped, matching the behavior of ActiveSupport's `String#blank?`. Previously only ASCII whitespace counted, and only Rails apps got the Unicode behavior (via `blank?`); that inconsistency is now gone. Behavior is identical with or without the C extension.

* **`remove_zero_values` now also removes signed zeros** — `+0`, `-0`, `-0.0`, `+0.00`, etc. are recognized as zero and dropped, just like `0` and `0.0`. (Only applies when `remove_zero_values: true`, which is off by default.)
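
The two value checks above can be approximated with plain regular expressions. In this sketch, `BLANK_RE` uses the POSIX `[[:space:]]` class, which in Ruby matches Unicode whitespace on UTF-8 strings; `ZERO_RE` is a hypothetical pattern written for this example and may differ from what the gem actually uses.

```ruby
# Unicode-aware "only whitespace" check.
BLANK_RE = /\A[[:space:]]*\z/

# Assumed pattern for zero values with an optional sign and optional decimals.
ZERO_RE = /\A[+-]?0+(?:\.0+)?\z/

def blank_value?(value)
  value.nil? || BLANK_RE.match?(value)
end

def zero_value?(value)
  ZERO_RE.match?(value)
end

nbsp_blank   = blank_value?("\u00A0")        # non-breaking space
ideo_blank   = blank_value?(" \u3000\t")     # mixed ASCII and ideographic space
text_blank   = blank_value?(" x ")
ascii_strip  = "\u00A0".strip.empty?         # String#strip leaves U+00A0 in place
signed_zeros = %w[+0 -0 -0.0 +0.00 0 0.0].all? { |s| zero_value?(s) }
non_zeros    = %w[0.5 10 -0.01 abc].none? { |s| zero_value?(s) }
```

The `ascii_strip` line shows why a check based on `strip.empty?` misses these characters.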

### Performance

Measured against 1.16.4 (Apple M4, Ruby 3.4.7):

* **C-accelerated path (the default):** quote-heavy, large-field, and wide CSVs parse meaningfully faster — roughly **7–22% faster** (city/address-style files ~10–12%; long-field and wide files the most). CSVs with very short lines and many tiny fields are up to ~3% slower — a side effect of the larger default auto-detection scan window (see `auto_row_sep_chars`); set it back to a smaller value if that matters for your workload. Net: solid wins where there's real per-row work, a small cost on the trivially-cheap cases.
* **Ruby fallback path (`acceleration: false`):** faster on nearly every file — typically **3–20% faster** than 1.16.4, with the biggest gains on wide and many-small-field CSVs.

Per-file breakdown: [`docs/releases/1.17.0/performance_notes.md`](docs/releases/1.17.0/performance_notes.md).

## 1.16.4 (2026-04-21) — Bug Fixes

RSpec tests: **1,434 → 1,467** (+33 tests)
15 changes: 10 additions & 5 deletions Gemfile
@@ -5,12 +5,17 @@ source 'https://rubygems.org'
# Specify your gem's dependencies in smarter_csv.gemspec
gemspec

gem "rake"
gem "rake-compiler"
group :development do
  gem "rake"
  gem "rake-compiler"
  gem "ostruct" # silences rake's stdlib-deprecation warning during dev
  gem "rubocop"
end

gem "awesome_print"
gem 'pry'
gem "rubocop"
group :development, :test do
  gem "awesome_print"
  gem "pry" # required in spec_helper.rb; also useful in dev console
end

group :test do
  gem "rspec"
112 changes: 98 additions & 14 deletions README.md
@@ -14,9 +14,13 @@

Beyond raw speed, SmarterCSV is designed to provide a significantly more convenient and developer-friendly interface than traditional CSV libraries. Instead of returning raw arrays that require substantial post-processing, SmarterCSV produces Rails-ready hashes for each row, making the data immediately usable with ActiveRecord, Sidekiq pipelines, parallel processing, and JSON-based workflows such as S3.

In a Rails app, warnings auto-route through `Rails.logger` and instrumentation hooks compose with `ActiveSupport::Notifications` — no setup required. Outside Rails, warnings fall back to `$stderr` and the same APIs work without any framework dependency.
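
As a rough picture of that routing, the sketch below resolves a logger once at construction time and falls back to `Kernel#warn` (which writes to `$stderr`) when none is available. The `WarningRouter` class and its `emit` method are hypothetical names for this example, not SmarterCSV's API; the stdlib `Logger` stands in for `Rails.logger`.

```ruby
require 'logger'
require 'stringio'

# Hypothetical router; not SmarterCSV's implementation.
class WarningRouter
  SEVERITIES = %i[debug info warn error fatal].freeze

  def initialize(logger = nil)
    # Detection happens once here, so emit does no per-call lookup.
    @logger = logger ||
              (defined?(::Rails) && ::Rails.respond_to?(:logger) ? ::Rails.logger : nil)
  end

  def emit(message, severity: :warn)
    raise ArgumentError, "unknown severity #{severity}" unless SEVERITIES.include?(severity)

    if @logger
      @logger.public_send(severity, message) # e.g. logger.error(message)
    else
      Kernel.warn(message)                   # framework-free fallback
    end
  end
end

# With a logger: the message arrives at the declared severity.
log = StringIO.new
WarningRouter.new(Logger.new(log)).emit("no clear row separator", severity: :error)

# Without a logger: the message goes to $stderr (captured here for the demo).
original_stderr = $stderr
$stderr = StringIO.new
WarningRouter.new.emit("chunk_size not set")
fallback_output = $stderr.string
$stderr = original_stderr
```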

The library includes intelligent defaults, automatic detection of column and row separators, and flexible header/value transformations. These features eliminate much of the boilerplate typically required when working with CSV data and help keep ingestion code concise and maintainable.

For large files, SmarterCSV supports both chunked processing (arrays of hashes) and streaming via Enumerable APIs, enabling efficient batch jobs and low-memory pipelines. The C acceleration further optimizes the full ingestion path — including parsing, hash construction, and conversions — so performance gains reflect real-world workloads, not just tokenizer benchmarks.
For large files, SmarterCSV supports both chunked processing (arrays of hashes) and streaming via Enumerable APIs, enabling efficient batch jobs and low-memory pipelines.
As of 1.17.0, SmarterCSV also accepts **non-seekable streaming inputs** — pipes, `STDIN`, `Zlib::GzipReader`, and HTTP responses — with no need to materialize the file on disk first.
The C acceleration further optimizes the full ingestion path — including parsing, hash construction, and conversions — so performance gains reflect real-world workloads, not just tokenizer benchmarks.

The interface is intentionally designed to robustly handle messy real-world CSV while keeping application code clean. Developers can easily map headers, skip unwanted rows, quarantine problematic data, and transform values on the fly without building custom post-processing pipelines. See [Real-World CSV Files](docs/real_world_csv.md) for a comprehensive guide to production CSV patterns.

@@ -33,22 +37,33 @@ SmarterCSV is designed for **real-world CSV processing**, returning fully usable

For a fair comparison, `CSV.table` is the closest Ruby CSV equivalent to SmarterCSV.

| Comparison (SmarterCSV 1.16.0, C-accelerated) | Range |
| Comparison (SmarterCSV 1.17.0, C-accelerated) | Range |
|-------------------------------------------------|-------------------------|
| vs SmarterCSV 1.15.2 (with C acceleration) | up to 2.4× faster |
| vs SmarterCSV 1.14.4 (with C acceleration) | 9×–65× faster |
| vs SmarterCSV 1.14.4 (Ruby path) | 1.7×–10.6× faster |
| vs CSV.read (arrays of arrays) | 1.7×–8.6× faster |
| vs CSV.table (arrays of hashes) | 7×–129× faster |
| vs ZSV (arrays of hashes, equiv. output) | 1.1×–6.6× faster † |
| vs SmarterCSV 1.15.2 (with C acceleration) | up to 2.8× faster |
| vs SmarterCSV 1.14.4 (with C acceleration) | 9×–82× faster |
| vs SmarterCSV 1.14.4 (Ruby path) | 2.4×–19.8× faster |
| vs CSV.read (arrays of arrays) | 1.3×–7.9× faster |
| vs CSV.table (arrays of hashes) | 4.9×–132× faster |
| vs ZSV 1.3.0 (arrays of hashes, equiv. output) | 1.1×–6.6× faster † |

† SmarterCSV faster on 15 of 16 files. ZSV raw arrays (no hashes, no conversions) are 2×–14× faster — but that omits the post-processing work needed to produce usable output. ZSV row carried over from the 1.16.0 benchmark; not re-measured for 1.17.0.

_Benchmarks: 19 CSV files (20k–240k rows), Ruby 3.4.7, Apple M4._

† SmarterCSV faster on 15 of 16 files. ZSV raw arrays (no hashes, no conversions) are 2×–14× faster — but that omits the post-processing work needed to produce usable output.
> ⁉️ **Why do these numbers look a touch lower than the 1.16.0 charts?**
> TL;DR: because we now use a different statistical method.
>
> Earlier versions of these benchmarks reported the best-of-N sample (the absolute `min`, i.e. the fastest run) for each measurement. A single lucky run (empty caches lining up, no scheduler interrupts) could shave up to ~10% off and become the headline number, which would be misleading.
> Because of that, we've switched to the 10th percentile (`p10`) of multiple runs of 40 samples, which discards roughly the four luckiest runs and reports a time much closer to what you'll actually observe in production. On noisier fixtures `p10` is ~5–10% above `min`; on quiet ones it's within 1%. The relative ordering between versions and adapters is unchanged; the absolute speedup figures are simply more honest.
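
The difference between `min` and `p10` is easy to see on a toy sample set. The sketch below uses the nearest-rank percentile definition, which is an assumption (the benchmark scripts may interpolate differently), and the timings are made-up integers chosen only to show the effect of two unusually fast runs.

```ruby
# Nearest-rank percentile: the smallest sample such that at least pct% of
# the samples are less than or equal to it.
def percentile(samples, pct)
  sorted = samples.sort
  rank = (pct * sorted.size + 99) / 100 # integer ceiling of pct/100 * n
  sorted[[rank, 1].max - 1]
end

# 40 made-up timings in microseconds: two "lucky" runs and 38 typical ones.
samples = [900, 910] + (0...38).map { |i| 1000 + i }

fastest = samples.min
p10     = percentile(samples, 10)

puts "min: #{fastest}, p10: #{p10}"
```

With 40 samples the nearest-rank `p10` is the fourth-smallest value, so the two lucky runs no longer set the headline number.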

_Benchmarks: 19 CSV files (20k–80k rows), Ruby 3.4.7, Apple M1._
### SmarterCSV vs Ruby CSV
![SmarterCSV 1.17.0 vs Ruby CSV 3.3.5 speedup](images/SmarterCSV_1.17.0_vs_RubyCSV_3.3.5_speedup.svg)

![SmarterCSV 1.16.0 vs Ruby CSV 3.3.5 speedup](images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.png)
### SmarterCSV C Path
![SmarterCSV 1.17.0 vs previous versions — C-accelerated path](images/SmarterCSV_1.17.0_vs_previous_C-speedup.svg)

![SmarterCSV 1.16.0 vs previous versions — C-accelerated path](images/SmarterCSV_1.16.0_vs_previous_C-speedup.svg)
### SmarterCSV Ruby Path
![SmarterCSV 1.17.0 vs previous versions — Ruby path](images/SmarterCSV_1.17.0_vs_previous_Rb-speedup.svg)

See [SmarterCSV 1.15.2: Faster Than Raw CSV Arrays](https://tilo-sloboda.medium.com/smartercsv-1-15-2-faster-than-raw-csv-arrays-benchmarks-zsv-and-the-full-pipeline-2c12a798032e) and [PR #319](https://github.com/tilo/smarter_csv/pull/319) for more details.

@@ -61,7 +76,7 @@ It's a one-line change:
# Before
rows = CSV.table('data.csv').map(&:to_h)

# After — up to 129× faster, same symbol keys
# After — up to 132× faster, same symbol keys
rows = SmarterCSV.process('data.csv')
```

@@ -124,6 +139,23 @@ strip_whitespace → nil_values_matching → remove_empty_values → remove_zero

Each step is individually configurable. See [Data Transformations](docs/data_transformations.md) and [Value Converters](docs/value_converters.md) for details.

### Value Converters

Per-column lambdas convert raw strings into typed values — dates, currency, booleans:

```ruby
require 'date'

data = SmarterCSV.process('orders.csv',
  value_converters: {
    dob:    ->(v) { v && Date.strptime(v, '%m/%d/%Y') },
    price:  ->(v) { v&.delete('$,')&.to_f },
    active: ->(v) { v&.match?(/\Atrue\z/i) },
  })
```

See [Value Converters](docs/value_converters.md).

### Batch Processing:

Processing large CSV files in chunks minimizes memory usage and enables powerful workflows:
@@ -147,6 +179,8 @@ SmarterCSV.process(filename, chunk_size: 100) do |chunk|
end
```

See [Batch Processing](docs/batch_processing.md) for chunk sizing, `each_chunk`, and parallel-worker patterns.

### Modern Enumerator API:

`Reader#each` is the modern, idiomatic way to process rows — `Reader` includes `Enumerable`, so all standard Ruby methods work:
@@ -166,6 +200,29 @@ first_ten = reader.lazy.select { |h| h[:active] }.first(10)
reader.each_slice(500) { |batch| MyModel.insert_all(batch) }
```

See [The Basic Read API](docs/basic_read_api.md) for the full `Reader` interface.
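
For readers unfamiliar with how `Enumerable` makes this work, the stand-alone sketch below shows the general Ruby mechanism: a class that defines `each` and includes `Enumerable` gets `lazy`, `each_slice`, `select`, and the rest for free. `RowSource` is a hypothetical stand-in, not SmarterCSV's `Reader`, and it iterates over an in-memory array rather than a file.

```ruby
# Hypothetical stand-in for a row reader.
class RowSource
  include Enumerable

  def initialize(rows)
    @rows = rows
  end

  # Enumerable only needs #each; returning an enumerator when no block is
  # given lets callers chain calls such as source.each.with_index.
  def each
    return enum_for(:each) unless block_given?

    @rows.each { |row| yield row }
    self
  end
end

source = RowSource.new([
  { id: 1, active: true },
  { id: 2, active: false },
  { id: 3, active: true },
  { id: 4, active: true },
])

# lazy stops pulling rows as soon as two matches are found.
first_two_active = source.lazy.select { |h| h[:active] }.first(2)

# each_slice groups rows into batches of two.
batches = source.each_slice(2).to_a
```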

### Streaming / Non-Seekable Inputs (1.17.0+):

SmarterCSV reads directly from any IO — no need to materialize the file on disk first. Auto-detection works on streaming inputs without rewinding; the first chunk is buffered transparently.

```ruby
# Gzipped CSV: stream-decompressed, never written to disk
require 'zlib'
Zlib::GzipReader.open('huge.csv.gz') do |io|
  # the block receives an array of row hashes (one hash when no chunk_size is set)
  SmarterCSV.process(io) { |chunk| MyModel.upsert(chunk.first) }
end

# STDIN / pipes
SmarterCSV.process($stdin) do |chunk, _index|
  # handle chunk.first (the row hash) here
end

# HTTP response body
require 'open-uri'
URI.open('https://example.com/data.csv') { |io| SmarterCSV.process(io) }
```

See [Row and Column Separators](docs/row_col_sep.md) for how `:auto` detection works on non-seekable streams, and [Configuration Options](docs/options.md) for `buffer_size` (the peek-buffer chunk size).
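
The "first chunk is buffered transparently" idea can be sketched as a thin wrapper that records what it reads while detection runs, lets the reader go back to the start of that recording, and then passes through to the underlying IO. This is an illustration only: `PeekBufferSketch` and its method names are hypothetical, it is not the PR's `PeekableIO`, and it handles just `read(length)` with a positive length.

```ruby
# Hypothetical wrapper; not the PR's PeekableIO.
class PeekBufferSketch
  def initialize(io)
    @io = io
    @buffer = String.new # bytes captured during the detection phase
    @pos = 0             # read position inside @buffer
    @capturing = true
  end

  # Reads up to `length` bytes, replaying buffered bytes before touching @io.
  def read(length)
    out = String.new
    if @pos < @buffer.bytesize
      out << @buffer.byteslice(@pos, length)
      @pos += out.bytesize
    end
    missing = length - out.bytesize
    if missing > 0 && (fresh = @io.read(missing))
      if @capturing
        @buffer << fresh
        @pos = @buffer.bytesize
      end
      out << fresh
    end
    out.empty? ? nil : out
  end

  # Only the captured prefix can be replayed; the source is never rewound.
  def rewind
    raise IOError, "capture phase is over" unless @capturing

    @pos = 0
  end

  # Called once detection is done; later reads are no longer recorded.
  def stop_capturing!
    @capturing = false
  end
end

# A pipe is a convenient non-seekable source for the demo.
reader, writer = IO.pipe
writer.write("a,b\n1,2\n")
writer.close

peek = PeekBufferSketch.new(reader)
head = peek.read(4)      # what auto-detection would inspect
peek.rewind              # replay from the buffer, not from the pipe
peek.stop_capturing!
everything = peek.read(100)
reader.close
```

After `stop_capturing!` the buffer stops growing, so memory use stays bounded by however much was read during detection.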

### Bad Row Handling:

SmarterCSV can quarantine malformed rows instead of crashing the entire import:
@@ -182,7 +239,33 @@ end

See [Bad Row Quarantine](docs/bad_row_quarantine.md) for full details including `bad_row_limit` and `field_size_limit`.

See [13 Examples](docs/examples.md) for more, including value converters, header validation, writing CSV, encoding handling, and resumable Rails ActiveJob imports.
### Header Validation:

Raise early if the file is missing required columns, before any data row is processed:

```ruby
begin
  SmarterCSV.process('transactions.csv',
    required_keys: [:account_id, :amount, :currency])
rescue SmarterCSV::MissingKeys => e
  abort "CSV missing columns: #{e.keys.join(', ')}"
end
```

See [Header Validations](docs/header_validations.md).

### Writing CSV:

```ruby
SmarterCSV.generate('output.csv') do |csv|
  csv << { name: 'Alice', age: 30, city: 'New York' }
  csv << { name: 'Bob', age: 25, city: 'Chicago' }
end
```

Hashes (not arrays) make column-shift bugs impossible — adding a column never silently misaligns existing rows. See [The Basic Write API](docs/basic_write_api.md) for header renaming, value converters, and ordered output.

See [18 Examples](docs/examples.md) for more, including encoding and preamble handling, key mapping, instrumentation hooks, and resumable Rails ActiveJob imports.

## Requirements

@@ -223,6 +306,7 @@ Or install it yourself as:
* [Data Transformations](docs/data_transformations.md)
* [Value Converters](docs/value_converters.md)
* [Bad Row Quarantine](docs/bad_row_quarantine.md)
* [Warnings](docs/warnings.md)
* [Instrumentation Hooks](docs/instrumentation.md)
* [Examples](docs/examples.md)
* [Real-World CSV Files](docs/real_world_csv.md)