Skip to content

Avoid some copies in Arrow IPC#10044

Open
Rich-T-kid wants to merge 15 commits into
apache:mainfrom
Rich-T-kid:rich-T-kid/optimize-arrow-ipc-copies
Open

Avoid some copies in Arrow IPC#10044
Rich-T-kid wants to merge 15 commits into
apache:mainfrom
Rich-T-kid:rich-T-kid/optimize-arrow-ipc-copies

Conversation

@Rich-T-kid
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

Compression is the most compute and memory intensive part of the arrow-ipc encoding pipeline. It runs per buffer, not per record batch. For a Flight stream of 10 batches with 5 primitive arrays each, that is 100 compression calls minimum, more for string and struct arrays. Each of those calls produced an owned compressed Vec that was then copied a second
time into a flat arrow_data accumulator before being written to the output. For the uncompressed path the situation was the same: Arc-backed buffer slices that required no compression were still copied into that accumulator unnecessarily.

Separately, the original write_message() function flushed after every dictionary and every record batch, causing repeated small OS write calls per batch.
The goal was to eliminate both problems: stop copying buffers that do not need to be copied, and stop flushing on every message.

What changes are included in this PR?

  • Introduced EncodedBuffer, an enum that wraps either a raw Arc-backed Buffer for the uncompressed path or an owned Vec for the compressed path, so both can be held in a uniform collection without an extra copy into a flat accumulator
  • Changed write_array_data to push EncodedBuffer segments instead of copying bytes into arrow_data
  • Added write_batch_direct on IpcDataGenerator which writes the FlatBuffer metadata header first, then streams each EncodedBuffer segment directly to the writer with per-buffer alignment padding, never assembling an intermediate flat Vec for the body
  • FileWriter and StreamWriter both now call write_batch_direct(), eliminating the flush-per-message behavior and the intermediate copy on the hot path

Are these changes tested?

These changes are intended to be completely seamless. I didn't write new unit test for the code as nothing externally changed. all test still pass

benchmarks

[main -> cargo bench --bench ipc_writer -- "StreamWriter/write_10$" --sample-size 100]
[my branch -> cargo bench --bench ipc_writer -- "StreamWriter/write_10$" --sample-size 100 ]
Image 6-1-26 at 3 19 PM

[main -> cargo bench --bench ipc_writer -- --sample-size 1000]
[my branch -> cargo bench --bench ipc_writer -- --sample-size 1000]
Image 6-1-26 at 3 20 PM

Are there any user-facing changes?

no

@github-actions github-actions Bot added the arrow Changes to the arrow crate label Jun 1, 2026
Comment thread arrow-ipc/src/writer.rs Outdated
Copy link
Copy Markdown
Contributor

@gabotechs gabotechs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking pretty good. Good job! left some comments mainly directed towards exploring more reuse and bringing a bit more clarity to this file, let me know if you have other ideas.

Comment thread arrow-ipc/src/writer.rs Outdated
Comment thread arrow-ipc/src/writer.rs Outdated
Comment thread arrow-ipc/src/writer.rs Outdated
}

let (encoded_dictionaries, encoded_message) = self.data_gen.encode(
let (dict_sizes, (meta, data)) = self.data_gen.write_batch_direct(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two last (meta, data) fields returned by write_batch_direct are named (aligned_size, body_len) in that function.

Is this correct? not sure if this is just a naming thing, but it's hard to know if this is correct given the different naming. Is meta == aligned_size and data == body_len? they sound like completely different things.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the struct & variable names to hopefully make this clearer.

Comment thread arrow-ipc/src/writer.rs Outdated
/// each buffer is compressed into a per-buffer scratch `Vec<u8>` and written from
/// there, eliminating the extra copy that `write_buffer` -> `arrow_data` ->
/// `write_body_buffers` would otherwise incur.
fn write_batch_direct<W: Write>(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see most contents of this function are essentially copy-pastes from record_batch_to_bytes, duplication seems too much here. Is there any chance to:

  • Completely replace record_batch_to_bytes and keep just a single function for writing batches in IPC format
  • Factoring out some ergonomic helpers that could be reused in both functions?

Also, it seems like the .encode() method and the new .write_batch_direct() are both doing the same thing with slightly different ergonomics. Do you see any opportunity to collapse them into just 1 method?

This file is overall pretty bloated with complex logic and a relatively arbitrary separation of concerns between methods, the more we can do for debloating it the better it will be for future maintainers.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me. the main issue is that FileWriter needs metadata while both StreamWriter and arrow-flight do not. Its better to not compute metadata the caller will not use but the slowdown should be negligible.

@Rich-T-kid
Copy link
Copy Markdown
Contributor Author

Rich-T-kid commented Jun 3, 2026

Benchmark results from #10031

ran cargo bench --bench flight encode -- --sample-size 100
Image 6-3-26 at 1 49 PM
Image 6-3-26 at 1 49 PM
Image 6-3-26 at 1 49 PM (1)

@Rich-T-kid
Copy link
Copy Markdown
Contributor Author

going to look into why fixed/9182x1 regressed. Might just be noise

@Rich-T-kid Rich-T-kid force-pushed the rich-T-kid/optimize-arrow-ipc-copies branch from fc1dbb8 to ecb37f2 Compare June 3, 2026 18:46
@Rich-T-kid
Copy link
Copy Markdown
Contributor Author

I think its worth mentioning that no dictionary optimizations were made in the PR, could make to make that a follow up ticket.

@Rich-T-kid
Copy link
Copy Markdown
Contributor Author

Ran benchmarks again and the results still look good. Test are passing for arrow-flight & arrow-ipc.
@alamb could you run the benchmarks for this PR on the CI bot when you get a chance? thank you!

@alamb
Copy link
Copy Markdown
Contributor

alamb commented Jun 4, 2026

run benchmark flight

@adriangbot
Copy link
Copy Markdown

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4621154654-431-g2mb7 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing rich-T-kid/optimize-arrow-ipc-copies (04e7992) to 97f4b14 (merge-base) diff
BENCH_NAME=flight
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench flight
BENCH_FILTER=
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot
Copy link
Copy Markdown

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected
Details

group                         main                                   rich-T-kid_optimize-arrow-ipc-copies
-----                         ----                                   ------------------------------------
encode/dict/65536x1           1.02    274.7±1.07µs   915.2 MB/sec    1.00    270.3±1.55µs   930.2 MB/sec
encode/dict/65536x8           1.30      5.5±0.04ms   365.9 MB/sec    1.00      4.2±0.06ms   474.4 MB/sec
encode/dict/8192x1            1.01     35.3±0.04µs   924.9 MB/sec    1.00     35.1±0.05µs   930.1 MB/sec
encode/dict/8192x8            1.10    315.3±1.66µs   828.7 MB/sec    1.00    286.1±2.01µs   913.3 MB/sec
encode/fixed/65536x1          1.00     10.0±0.02µs    48.8 GB/sec    1.00     10.0±0.04µs    49.1 GB/sec
encode/fixed/65536x8          1.01   1092.1±2.65µs     3.6 GB/sec    1.00   1076.9±5.83µs     3.6 GB/sec
encode/fixed/8192x1           1.03      3.2±0.01µs    19.3 GB/sec    1.00      3.1±0.02µs    19.8 GB/sec
encode/fixed/8192x8           1.07     17.5±0.04µs    27.9 GB/sec    1.00     16.4±0.07µs    29.8 GB/sec
encode/nested/65536x1         1.00     29.0±0.24µs    42.1 GB/sec    1.04     30.2±0.30µs    40.4 GB/sec
encode/nested/65536x8         1.12      2.5±0.08ms     3.9 GB/sec    1.00      2.2±0.13ms     4.4 GB/sec
encode/nested/8192x1          1.00      5.8±0.01µs    26.5 GB/sec    1.00      5.8±0.01µs    26.6 GB/sec
encode/nested/8192x8          1.01     46.3±0.10µs    26.4 GB/sec    1.00     45.9±0.56µs    26.6 GB/sec
encode/variable/65536x1       1.04     51.5±0.55µs    42.7 GB/sec    1.00     49.3±0.36µs    44.6 GB/sec
encode/variable/65536x8       1.24      5.7±0.09ms     3.1 GB/sec    1.00      4.6±0.06ms     3.8 GB/sec
encode/variable/8192x1        1.18      7.1±0.01µs    38.8 GB/sec    1.00      6.0±0.01µs    45.8 GB/sec
encode/variable/8192x8        1.32     83.3±1.96µs    26.4 GB/sec    1.00     63.3±0.20µs    34.7 GB/sec
roundtrip/dict/65536x1        1.01  1288.9±47.00µs   195.1 MB/sec    1.00  1279.9±51.20µs   196.4 MB/sec
roundtrip/dict/65536x8        1.00     14.4±0.55ms   139.8 MB/sec    1.01     14.5±0.53ms   138.4 MB/sec
roundtrip/dict/8192x1         1.00    206.6±5.60µs   158.1 MB/sec    1.00    206.0±5.73µs   158.6 MB/sec
roundtrip/dict/8192x8         1.00  1322.8±44.85µs   197.5 MB/sec    1.00  1324.0±44.79µs   197.3 MB/sec
roundtrip/fixed/65536x1       1.02    314.2±4.02µs  1591.6 MB/sec    1.00    307.7±4.21µs  1625.2 MB/sec
roundtrip/fixed/65536x8       1.00      2.2±0.03ms  1856.3 MB/sec    1.15      2.5±0.13ms  1608.9 MB/sec
roundtrip/fixed/8192x1        1.00     89.9±0.94µs   696.0 MB/sec    1.00     89.6±1.04µs   698.6 MB/sec
roundtrip/fixed/8192x8        1.01    333.8±3.84µs  1499.9 MB/sec    1.00    330.7±3.14µs  1514.2 MB/sec
roundtrip/nested/65536x1      1.00   859.6±40.22µs  1454.4 MB/sec    1.00   862.4±44.79µs  1449.6 MB/sec
roundtrip/nested/65536x8      1.01      8.6±0.36ms  1156.8 MB/sec    1.00      8.6±0.36ms  1168.2 MB/sec
roundtrip/nested/8192x1       1.01    159.3±5.82µs   981.9 MB/sec    1.00    157.6±5.86µs   993.0 MB/sec
roundtrip/nested/8192x8       1.02   925.9±41.73µs  1351.8 MB/sec    1.00   910.3±44.09µs  1374.9 MB/sec
roundtrip/variable/65536x1    1.00  1246.2±34.07µs  1805.6 MB/sec    1.54  1922.3±132.51µs  1170.5 MB/sec
roundtrip/variable/65536x8    1.17     16.1±0.50ms  1119.3 MB/sec    1.00     13.7±0.50ms  1313.5 MB/sec
roundtrip/variable/8192x1     1.00    204.0±5.40µs  1379.8 MB/sec    1.00    203.1±5.65µs  1385.4 MB/sec
roundtrip/variable/8192x8     1.00  1211.3±26.93µs  1858.7 MB/sec    1.57  1897.2±120.59µs  1186.7 MB/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 345.1s
Peak memory 3.4 GiB
Avg memory 3.4 GiB
CPU user 350.2s
CPU sys 74.8s
Peak spill 0 B

branch

Metric Value
Wall time 335.1s
Peak memory 3.4 GiB
Avg memory 3.4 GiB
CPU user 335.9s
CPU sys 78.4s
Peak spill 0 B

File an issue against this benchmark runner

@gabotechs
Copy link
Copy Markdown
Contributor

run benchmarks ipc_writer

@adriangbot
Copy link
Copy Markdown

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4621805130-432-xw4gv 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing rich-T-kid/optimize-arrow-ipc-copies (04e7992) to 97f4b14 (merge-base) diff
BENCH_NAME=ipc_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench ipc_writer
BENCH_FILTER=
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot
Copy link
Copy Markdown

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected
Details

group                                                 main                                   rich-T-kid_optimize-arrow-ipc-copies
-----                                                 ----                                   ------------------------------------
arrow_ipc_stream_writer/FileWriter/write_10           1.90    185.2±2.17µs        ? ?/sec    1.00     97.3±4.50µs        ? ?/sec
arrow_ipc_stream_writer/StreamWriter/write_10         1.94    185.1±1.96µs        ? ?/sec    1.00     95.4±4.87µs        ? ?/sec
arrow_ipc_stream_writer/StreamWriter/write_10/zstd    1.01      7.3±0.02ms        ? ?/sec    1.00      7.2±0.03ms        ? ?/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 30.0s
Peak memory 2.7 GiB
Avg memory 2.6 GiB
CPU user 27.5s
CPU sys 0.6s
Peak spill 0 B

branch

Metric Value
Wall time 30.0s
Peak memory 2.7 GiB
Avg memory 2.6 GiB
CPU user 29.7s
CPU sys 0.1s
Peak spill 0 B

File an issue against this benchmark runner

@Rich-T-kid
Copy link
Copy Markdown
Contributor Author

Rich-T-kid commented Jun 4, 2026

🤔 results look good, Im curious as to why two of the roundtrip benchmarks were slightly slower even thought encode() is faster across the board.

@gabotechs
Copy link
Copy Markdown
Contributor

run benchmark flight

@adriangbot
Copy link
Copy Markdown

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4622787465-441-9sh6d 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing rich-T-kid/optimize-arrow-ipc-copies (04e7992) to 97f4b14 (merge-base) diff
BENCH_NAME=flight
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench flight
BENCH_FILTER=
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot
Copy link
Copy Markdown

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected
Details

group                         main                                   rich-T-kid_optimize-arrow-ipc-copies
-----                         ----                                   ------------------------------------
encode/dict/65536x1           1.01    271.4±1.78µs   926.3 MB/sec    1.00    267.9±0.66µs   938.6 MB/sec
encode/dict/65536x8           1.00      5.1±0.12ms   395.2 MB/sec    1.11      5.6±0.21ms   357.0 MB/sec
encode/dict/8192x1            1.04     36.4±0.04µs   896.4 MB/sec    1.00     35.0±0.03µs   932.9 MB/sec
encode/dict/8192x8            1.06    305.3±1.34µs   855.9 MB/sec    1.00    287.6±1.80µs   908.7 MB/sec
encode/fixed/65536x1          1.04     10.3±0.02µs    47.6 GB/sec    1.00      9.9±0.02µs    49.3 GB/sec
encode/fixed/65536x8          1.00   1084.9±6.70µs     3.6 GB/sec    1.01   1096.8±4.18µs     3.6 GB/sec
encode/fixed/8192x1           1.00      3.0±0.01µs    20.3 GB/sec    1.03      3.1±0.01µs    19.7 GB/sec
encode/fixed/8192x8           1.06     17.4±0.02µs    28.0 GB/sec    1.00     16.4±0.06µs    29.8 GB/sec
encode/nested/65536x1         1.29     38.5±0.19µs    31.7 GB/sec    1.00     29.8±0.58µs    41.0 GB/sec
encode/nested/65536x8         1.32      2.7±0.04ms     3.6 GB/sec    1.00      2.1±0.20ms     4.8 GB/sec
encode/nested/8192x1          1.01      5.8±0.01µs    26.2 GB/sec    1.00      5.8±0.02µs    26.4 GB/sec
encode/nested/8192x8          1.04     47.3±0.08µs    25.8 GB/sec    1.00     45.5±0.14µs    26.9 GB/sec
encode/variable/65536x1       1.63     80.5±0.47µs    27.3 GB/sec    1.00     49.3±0.34µs    44.6 GB/sec
encode/variable/65536x8       1.10      5.4±0.19ms     3.2 GB/sec    1.00      4.9±0.11ms     3.6 GB/sec
encode/variable/8192x1        1.81     10.7±0.01µs    25.7 GB/sec    1.00      5.9±0.01µs    46.6 GB/sec
encode/variable/8192x8        1.37     87.5±0.23µs    25.1 GB/sec    1.00     63.9±0.26µs    34.4 GB/sec
roundtrip/dict/65536x1        1.00  1319.9±43.70µs   190.5 MB/sec    1.00  1316.4±41.25µs   191.0 MB/sec
roundtrip/dict/65536x8        1.00     14.4±0.59ms   139.9 MB/sec    1.02     14.6±0.57ms   137.9 MB/sec
roundtrip/dict/8192x1         1.01    213.3±5.87µs   153.1 MB/sec    1.00    211.4±6.11µs   154.5 MB/sec
roundtrip/dict/8192x8         1.01  1346.6±44.16µs   194.0 MB/sec    1.00  1338.9±42.85µs   195.1 MB/sec
roundtrip/fixed/65536x1       1.00    318.9±4.39µs  1568.0 MB/sec    1.00    319.6±4.24µs  1564.8 MB/sec
roundtrip/fixed/65536x8       1.00      2.2±0.03ms  1821.4 MB/sec    1.19      2.6±0.16ms  1532.6 MB/sec
roundtrip/fixed/8192x1        1.00     93.5±1.03µs   669.7 MB/sec    1.01     94.6±1.56µs   661.8 MB/sec
roundtrip/fixed/8192x8        1.01    342.0±6.09µs  1464.3 MB/sec    1.00    338.7±4.37µs  1478.2 MB/sec
roundtrip/nested/65536x1      1.01   885.1±37.56µs  1412.4 MB/sec    1.00   879.6±39.35µs  1421.4 MB/sec
roundtrip/nested/65536x8      1.00     10.0±0.39ms  1002.2 MB/sec    1.02     10.2±0.43ms   979.4 MB/sec
roundtrip/nested/8192x1       1.01    163.0±5.56µs   960.0 MB/sec    1.00    161.1±5.13µs   971.1 MB/sec
roundtrip/nested/8192x8       1.00   927.1±39.57µs  1350.1 MB/sec    1.02   941.8±46.61µs  1329.0 MB/sec
roundtrip/variable/65536x1    1.00  1285.3±51.72µs  1750.7 MB/sec    1.49  1910.5±128.29µs  1177.8 MB/sec
roundtrip/variable/65536x8    1.00     14.8±0.74ms  1217.3 MB/sec    1.01     15.0±0.53ms  1199.3 MB/sec
roundtrip/variable/8192x1     1.00    210.4±5.49µs  1337.6 MB/sec    1.00    209.4±6.81µs  1344.0 MB/sec
roundtrip/variable/8192x8     1.00  1251.7±29.66µs  1798.6 MB/sec    1.53  1914.9±122.75µs  1175.7 MB/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 345.1s
Peak memory 3.4 GiB
Avg memory 3.4 GiB
CPU user 352.2s
CPU sys 71.6s
Peak spill 0 B

branch

Metric Value
Wall time 340.1s
Peak memory 3.4 GiB
Avg memory 3.4 GiB
CPU user 333.1s
CPU sys 84.7s
Peak spill 0 B

File an issue against this benchmark runner

@alamb
Copy link
Copy Markdown
Contributor

alamb commented Jun 4, 2026

🤔 results look good, Im curious as to why two of the roundtrip benchmarks were slightly slower even thought encode() is faster across the board.

Looks like the results are reproducable -- next step would be to profile it to see if you can find the answer

I wonder if we are missing a Vec::with_capacity or Vec::reserve to avoid extra allocations / copies 🤔

@Rich-T-kid
Copy link
Copy Markdown
Contributor Author

Rich-T-kid commented Jun 4, 2026

i'm suspecting the issue has to do with
let (client, server) = tokio::io::duplex(1024 * 1024);
per the docs "The max_buf_size argument is the maximum amount of bytes that can be written to a side before the write returns Poll::Pending."

The two regression cases both involve large variable-length data where the encoded payload can be huge:
roundtrip/variable/8192x8 — 8 columns × 8192 rows
roundtrip/variable/65536x1 — 65536 rows, large values buffer

This also shows up in the regression cases,
roundtrip/variable/8192x8 1.00 1251.7±29.66µs 1798.6 MB/sec 1.53 1914.9±122.75µs 1175.7 MB/sec
roundtrip/variable/65536x1 1.00 1285.3±51.72µs 1750.7 MB/sec 1.49 1910.5±128.29µs 1177.8 MB/sec
throughput falls flat.

taking a look at the other benchmark results this seems consistant,
roundtrip/fixed/65536x8 1.00 2.2±0.03ms 1821.4 MB/sec 1.19 2.6±0.16ms 1532.6 MB/sec throughput shrinks and as such causes more blocking to happen.

Even in the event where this isn't the reason for the slow down I think 1MB is still to small for realistic max throughput.

@Rich-T-kid
Copy link
Copy Markdown
Contributor Author

I wonder if we are missing a Vec::with_capacity or Vec::reserve to avoid extra allocations / copies

I think this is a strong possibility after looking at the profile, from my understanding this is mostly in arrow-flight itself and not arrow-ipc. Since arrow-flight is very dependent on arrow-ipc it make sense to start from the ground up with these optimizations.

alamb pushed a commit that referenced this pull request Jun 4, 2026
# Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.
-->
- Closes #10029.

# Rationale for this change
Increase the duplex buffer from 1 MB to 64 MB to eliminate artificial
back-pressure in the roundtrip benchmarks.
See rational in this
[comment](#10044 (comment))
<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

# What changes are included in this PR?
bumps `max_buf_size` to 64**MB**
<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

# Are these changes tested?
n/a
<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?

If this PR claims a performance improvement, please include evidence
such as benchmark results.
-->

# Are there any user-facing changes?
n/a
<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.
-->
@alamb alamb changed the title Rich t kid/optimize arrow ipc copies Avoid Arrow IPC Copies Jun 5, 2026
@Rich-T-kid
Copy link
Copy Markdown
Contributor Author

Rich-T-kid commented Jun 5, 2026

I've been working on this all day and any further changes risk scope creep, so I'd like to split the IPC StreamWriter/FileWriter improvements and the arrow-flight work into separate PRs. since benchmarks for those look good & they are unrelated to the arrow-flight work.
Due to async polling it's hard to distinguish what copies happen at the tonic level versus in arrow-flight. I've been profiling this locally with an additional encode_to_send benchmark that measures the full path via do_put [will include in follow up PR] @alamb does that sound like a good idea?
Image 6-5-26 at 4 58 PM
Image 6-5-26 at 4 57 PM (1)
Image 6-5-26 at 4 57 PM
One question: does anyone have insight into why the benchmarks behave differently on the CI workers versus locally? I saw something similar in distributed DataFusion and it came down to thread count, but beyond thread count and CPU cache size I'm not sure what else could explain the difference at this scale.

@github-actions github-actions Bot added the arrow-flight Changes to the arrow-flight crate label Jun 7, 2026
@Rich-T-kid
Copy link
Copy Markdown
Contributor Author

Image 6-7-26 at 4 16 PM results here seem in line with [this](https://github.com//pull/10044#issuecomment-4621838941)

I removed any logic that was touching arrow-flights path. This PR focuses on removing intermediary buffer allocations for the StreamWriter and FIleWriter. Instead of accumulating all buffer bytes into a heap allocation before writing, buffer pointers are collected during encoding and streamed directly to the underlying writer once the FlatBuffer header is built.

@github-actions github-actions Bot removed the arrow-flight Changes to the arrow-flight crate label Jun 7, 2026
Comment thread arrow-ipc/src/writer.rs
@Rich-T-kid
Copy link
Copy Markdown
Contributor Author

The diff looks larger than it is due to estimate_encoded_buffer_count() (which estimates how many slots to pre-allocate for the buffer vector) and the if statements that determine whether write_buffers() or collect_encoded_buffers() is called based on whether a Writer was supplied. This avoids duplicating logic by reusing shared functionality across both code paths.
@gabotechs could you take another look at this when you get a chance 🚀

Comment thread arrow-ipc/src/writer.rs
@Rich-T-kid Rich-T-kid force-pushed the rich-T-kid/optimize-arrow-ipc-copies branch from cbc1d46 to ce6b828 Compare June 8, 2026 00:03
@Rich-T-kid Rich-T-kid force-pushed the rich-T-kid/optimize-arrow-ipc-copies branch from ce6b828 to 095d7fd Compare June 8, 2026 00:05
@gabotechs
Copy link
Copy Markdown
Contributor

run benchmarks flight

@adriangbot
Copy link
Copy Markdown

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4645915589-475-qqxw6 6.12.68+ #1 SMP Sat May 2 07:49:07 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing rich-T-kid/optimize-arrow-ipc-copies (a3f9c53) to d7ef673 (merge-base) diff
BENCH_NAME=flight
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench flight
BENCH_FILTER=
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot
Copy link
Copy Markdown

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected
Details

group                         main                                   rich-T-kid_optimize-arrow-ipc-copies
-----                         ----                                   ------------------------------------
encode/dict/65536x1           1.00    269.0±1.27µs   934.5 MB/sec    1.05    282.8±1.13µs   889.2 MB/sec
encode/dict/65536x8           1.00      4.0±0.03ms   499.8 MB/sec    1.43      5.7±0.03ms   350.5 MB/sec
encode/dict/8192x1            1.00     35.5±0.43µs   920.9 MB/sec    1.03     36.6±0.04µs   891.5 MB/sec
encode/dict/8192x8            1.01    306.2±4.60µs   853.4 MB/sec    1.00    301.9±2.32µs   865.4 MB/sec
encode/fixed/65536x1          1.00      9.9±0.02µs    49.5 GB/sec    1.03     10.1±0.02µs    48.3 GB/sec
encode/fixed/65536x8          1.03   1121.5±1.42µs     3.5 GB/sec    1.00   1087.5±2.55µs     3.6 GB/sec
encode/fixed/8192x1           1.00      3.2±0.01µs    19.3 GB/sec    1.00      3.2±0.01µs    19.4 GB/sec
encode/fixed/8192x8           1.00     17.6±0.03µs    27.7 GB/sec    1.01     17.8±0.05µs    27.5 GB/sec
encode/nested/65536x1         1.00     37.5±0.19µs    32.6 GB/sec    1.15     42.9±0.36µs    28.4 GB/sec
encode/nested/65536x8         1.15      3.0±0.01ms     3.3 GB/sec    1.00      2.6±0.01ms     3.8 GB/sec
encode/nested/8192x1          1.00      5.8±0.01µs    26.4 GB/sec    1.02      5.9±0.01µs    26.0 GB/sec
encode/nested/8192x8          1.00     46.5±0.08µs    26.3 GB/sec    1.05     49.1±0.13µs    24.9 GB/sec
encode/variable/65536x1       1.00     72.7±0.39µs    30.2 GB/sec    1.12     81.5±0.33µs    27.0 GB/sec
encode/variable/65536x8       1.00      5.0±0.03ms     3.5 GB/sec    1.12      5.6±0.05ms     3.1 GB/sec
encode/variable/8192x1        1.00      6.8±0.02µs    40.5 GB/sec    1.03      7.0±0.01µs    39.1 GB/sec
encode/variable/8192x8        1.00     80.9±0.16µs    27.2 GB/sec    1.02     82.7±0.21µs    26.6 GB/sec
roundtrip/dict/65536x1        1.00  1281.9±43.73µs   196.1 MB/sec    1.01  1300.0±43.37µs   193.4 MB/sec
roundtrip/dict/65536x8        1.02     14.7±0.61ms   137.1 MB/sec    1.00     14.3±0.60ms   140.4 MB/sec
roundtrip/dict/8192x1         1.00    204.0±5.97µs   160.1 MB/sec    1.03    209.5±5.67µs   155.9 MB/sec
roundtrip/dict/8192x8         1.00  1333.1±41.92µs   196.0 MB/sec    1.01  1350.3±43.61µs   193.5 MB/sec
roundtrip/fixed/65536x1       1.00    311.1±3.95µs  1607.4 MB/sec    1.01    315.5±3.73µs  1585.1 MB/sec
roundtrip/fixed/65536x8       1.00      2.1±0.02ms  1873.4 MB/sec    1.01      2.1±0.03ms  1861.8 MB/sec
roundtrip/fixed/8192x1        1.00     91.2±1.40µs   686.4 MB/sec    1.00     91.4±0.86µs   684.5 MB/sec
roundtrip/fixed/8192x8        1.00    328.6±4.67µs  1524.1 MB/sec    1.02    336.4±3.39µs  1488.6 MB/sec
roundtrip/nested/65536x1      1.00   849.6±40.97µs  1471.5 MB/sec    1.01   857.6±41.57µs  1457.7 MB/sec
roundtrip/nested/65536x8      1.00      8.3±0.34ms  1200.9 MB/sec    1.16      9.7±0.36ms  1034.4 MB/sec
roundtrip/nested/8192x1       1.00    158.6±5.50µs   986.4 MB/sec    1.01    159.9±5.60µs   978.3 MB/sec
roundtrip/nested/8192x8       1.00   904.4±45.78µs  1383.9 MB/sec    1.00   904.4±43.74µs  1383.9 MB/sec
roundtrip/variable/65536x1    1.00  1212.7±41.75µs  1855.5 MB/sec    1.01  1224.4±31.76µs  1837.8 MB/sec
roundtrip/variable/65536x8    1.00     16.1±0.50ms  1120.5 MB/sec    1.02     16.3±0.57ms  1101.1 MB/sec
roundtrip/variable/8192x1     1.00    204.7±5.69µs  1374.6 MB/sec    1.02    209.1±5.61µs  1345.9 MB/sec
roundtrip/variable/8192x8     1.00  1211.3±21.91µs  1858.7 MB/sec    1.01  1219.1±27.99µs  1846.7 MB/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 335.1s
Peak memory 3.4 GiB
Avg memory 3.4 GiB
CPU user 348.5s
CPU sys 65.9s
Peak spill 0 B

branch

Metric Value
Wall time 340.1s
Peak memory 3.4 GiB
Avg memory 3.4 GiB
CPU user 346.0s
CPU sys 71.2s
Peak spill 0 B

File an issue against this benchmark runner

Comment thread arrow-ipc/src/writer.rs Outdated
ipc_message: finished_data.to_vec(),
arrow_data,
})
if let Some(w) = writer {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to find another way of modeling this without falling into "if-driven-development" patterns.

Whenever you find yourself coding something switching execution branches with completely different bodies over booleans, it's an indicator that you are falling into an "if-driven-development" pattern, and this is will greatly hurt maintenance in the future.

One way of trying to improve this situation can be:

  1. Refactor things preferring code duplication, and split into different functions with different responsibilities, even if that implies copy-pasting big quantities of code.
  2. Once you have the different functions with a fair amount of copy-pasted code, try to see what are the common bits, and factor them out little by little, trying to progressively reduce the LOC count in each one.
  3. If you still see functions that need to accept bool or Option parameters that have the capability of switching how the function behaves, that means the function was the wrong abstraction on the first place, so don't be afraid of tearing existing functions apart if that allows you to reuse smaller bits in different places without switching on if statements.

Comment thread arrow-ipc/src/writer.rs Outdated
Comment on lines 1943 to 1954
arrow_data: &mut Vec<u8>,
encoded_buffers: &mut Vec<EncodedBuffer>,
nodes: &mut Vec<crate::FieldNode>,
offset: i64,
num_rows: usize,
null_count: usize,
compression_codec: Option<CompressionCodec>,
compression_context: &mut CompressionContext,
write_options: &IpcWriteOptions,
is_direct: bool,
) -> Result<i64, ArrowError> {
let mut offset = offset;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another example of my comment above, but this time reinforced by a big Clippy bypass at the top.

Ideally, we should not be contributing towards stronger Clippy violations. If you try to follow the steps above, there are good chances that we can either maintain the width of this function's signature, or even reduce it.

@Rich-T-kid Rich-T-kid force-pushed the rich-T-kid/optimize-arrow-ipc-copies branch from 9ec873d to 83102e2 Compare June 8, 2026 18:07
@Rich-T-kid
Copy link
Copy Markdown
Contributor Author

Rich-T-kid commented Jun 8, 2026

Replaced the (arrow_data(), encoded_buffers(), is_direct()) triple threaded through write_array_data() with a single BufferSink passed by the caller, eliminating every if is_direct { collect_encoded_buffers(...) } else { write_buffer(...) } branch that was repeated at each buffer write site. All path-specific logic is now encapsulated once in encode_sink_buffer(), which write_array_data() calls uniformly regardless of whether the output is headed for an in-memory EncodedData or a direct stream write. About -400 LOC @gabotechs

@Rich-T-kid Rich-T-kid force-pushed the rich-T-kid/optimize-arrow-ipc-copies branch from 83102e2 to 70a7567 Compare June 8, 2026 18:14
@gabotechs
Copy link
Copy Markdown
Contributor

run benchmarks flight

@adriangbot
Copy link
Copy Markdown

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4652613611-485-hmsf5 6.12.68+ #1 SMP Sat May 2 07:49:07 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing rich-T-kid/optimize-arrow-ipc-copies (70a7567) to d7ef673 (merge-base) diff
BENCH_NAME=flight
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench flight
BENCH_FILTER=
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot
Copy link
Copy Markdown

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected
Details

group                         main                                   rich-T-kid_optimize-arrow-ipc-copies
-----                         ----                                   ------------------------------------
encode/dict/65536x1           1.00    285.2±1.80µs   881.4 MB/sec    1.00    285.3±1.50µs   881.1 MB/sec
encode/dict/65536x8           1.34      9.6±0.14ms   209.0 MB/sec    1.00      7.2±0.15ms   280.1 MB/sec
encode/dict/8192x1            1.05     37.1±0.06µs   881.5 MB/sec    1.00     35.2±0.11µs   926.7 MB/sec
encode/dict/8192x8            1.00    305.6±1.28µs   854.9 MB/sec    1.03    315.1±1.20µs   829.1 MB/sec
encode/fixed/65536x1          1.01     10.4±0.04µs    47.1 GB/sec    1.00     10.2±0.01µs    47.8 GB/sec
encode/fixed/65536x8          1.00   1085.1±7.18µs     3.6 GB/sec    1.03   1112.7±9.29µs     3.5 GB/sec
encode/fixed/8192x1           1.00      3.0±0.01µs    20.1 GB/sec    1.06      3.2±0.01µs    18.9 GB/sec
encode/fixed/8192x8           1.02     17.8±0.04µs    27.5 GB/sec    1.00     17.5±0.03µs    28.0 GB/sec
encode/nested/65536x1         1.35     38.4±0.26µs    31.8 GB/sec    1.00     28.5±0.18µs    42.8 GB/sec
encode/nested/65536x8         1.00      3.1±0.09ms     3.1 GB/sec    1.03      3.2±0.08ms     3.0 GB/sec
encode/nested/8192x1          1.01      5.8±0.01µs    26.2 GB/sec    1.00      5.8±0.01µs    26.5 GB/sec
encode/nested/8192x8          1.00     45.8±0.19µs    26.7 GB/sec    1.05     48.1±0.14µs    25.4 GB/sec
encode/variable/65536x1       1.00     62.9±0.37µs    35.0 GB/sec    1.11     69.6±0.43µs    31.6 GB/sec
encode/variable/65536x8       1.00      5.7±0.24ms     3.1 GB/sec    1.03      5.9±0.16ms     3.0 GB/sec
encode/variable/8192x1        1.00      7.1±0.01µs    38.4 GB/sec    1.00      7.1±0.02µs    38.5 GB/sec
encode/variable/8192x8        1.00     79.7±0.58µs    27.6 GB/sec    1.00     79.6±0.26µs    27.6 GB/sec
roundtrip/dict/65536x1        1.01  1319.3±49.97µs   190.6 MB/sec    1.00  1311.8±42.32µs   191.7 MB/sec
roundtrip/dict/65536x8        1.00     15.0±0.58ms   134.3 MB/sec    1.07     16.0±0.67ms   125.9 MB/sec
roundtrip/dict/8192x1         1.00    207.5±5.56µs   157.4 MB/sec    1.00    208.5±5.51µs   156.7 MB/sec
roundtrip/dict/8192x8         1.00  1346.9±43.49µs   194.0 MB/sec    1.00  1345.0±45.04µs   194.3 MB/sec
roundtrip/fixed/65536x1       1.00    305.0±4.83µs  1639.8 MB/sec    1.03    314.3±3.73µs  1591.2 MB/sec
roundtrip/fixed/65536x8       1.00      2.2±0.04ms  1791.3 MB/sec    1.00      2.2±0.05ms  1799.0 MB/sec
roundtrip/fixed/8192x1        1.00     90.4±1.16µs   692.6 MB/sec    1.01     91.7±1.52µs   682.7 MB/sec
roundtrip/fixed/8192x8        1.01    329.1±3.56µs  1521.5 MB/sec    1.00    327.4±4.78µs  1529.3 MB/sec
roundtrip/nested/65536x1      1.00   871.2±51.26µs  1435.0 MB/sec    1.00   868.4±50.87µs  1439.7 MB/sec
roundtrip/nested/65536x8      1.00     10.1±0.67ms   992.0 MB/sec    1.05     10.6±0.49ms   946.4 MB/sec
roundtrip/nested/8192x1       1.00    158.4±5.60µs   987.7 MB/sec    1.00    158.5±5.00µs   987.2 MB/sec
roundtrip/nested/8192x8       1.00   916.8±42.83µs  1365.2 MB/sec    1.00   915.9±45.38µs  1366.6 MB/sec
roundtrip/variable/65536x1    1.00  1247.8±36.58µs  1803.3 MB/sec    1.05  1304.0±100.89µs  1725.6 MB/sec
roundtrip/variable/65536x8    1.00     16.5±0.56ms  1092.2 MB/sec    1.04     17.1±0.62ms  1051.2 MB/sec
roundtrip/variable/8192x1     1.00    206.9±6.52µs  1360.3 MB/sec    1.01    209.6±6.14µs  1342.6 MB/sec
roundtrip/variable/8192x8     1.00  1237.0±25.68µs  1820.1 MB/sec    1.00  1241.4±39.06µs  1813.6 MB/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 345.1s
Peak memory 3.4 GiB
Avg memory 3.4 GiB
CPU user 348.1s
CPU sys 76.0s
Peak spill 0 B

branch

Metric Value
Wall time 335.1s
Peak memory 3.4 GiB
Avg memory 3.4 GiB
CPU user 339.6s
CPU sys 73.9s
Peak spill 0 B

File an issue against this benchmark runner

@alamb
Copy link
Copy Markdown
Contributor

alamb commented Jun 8, 2026

I am excited by this one -- I am starting to check it out

@alamb alamb changed the title Avoid Arrow IPC Copies Avoid some copies in Arrow IPC Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Optimize arrow-ipc

4 participants