Avoid some copies in Arrow IPC by Rich-T-kid · Pull Request #10044 · apache/arrow-rs

Rich-T-kid · 2026-06-01T19:27:04Z

Which issue does this PR close?

Closes #10029 (Optimize arrow-ipc).
A document that provides a bit of context

Rationale for this change

Compression is the most compute- and memory-intensive part of the arrow-ipc encoding pipeline. It runs per buffer, not per record batch. For a Flight stream of 10 batches with 5 primitive arrays each, that is at least 100 compression calls (each primitive array has a validity buffer and a values buffer), and more for string and struct arrays. Each of those calls produced an owned compressed Vec that was then copied a second time into a flat arrow_data accumulator before being written to the output. The uncompressed path had the same problem: Arc-backed buffer slices that required no compression were still copied into that accumulator unnecessarily.
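
The figure of 100 calls follows from the buffer layout of each array type. The sketch below is only an illustration of that arithmetic; `Layout`, `buffer_count`, and `stream_buffer_count` are hypothetical names, and the layouts are simplified (primitive: validity + values; string: validity + offsets + values; struct: validity plus the buffers of its children).

```rust
/// Simplified array layouts, for illustration only (not an arrow-rs type).
enum Layout {
    /// Validity bitmap + values buffer.
    Primitive,
    /// Validity bitmap + offsets buffer + values buffer.
    Utf8,
    /// Validity bitmap, plus whatever buffers each child contributes.
    Struct(Vec<Layout>),
}

/// Number of body buffers one array of this layout contributes.
fn buffer_count(layout: &Layout) -> usize {
    match layout {
        Layout::Primitive => 2,
        Layout::Utf8 => 3,
        Layout::Struct(children) => 1 + children.iter().map(buffer_count).sum::<usize>(),
    }
}

/// Total buffers (and so per-buffer compression calls) across a whole stream.
fn stream_buffer_count(batches: usize, columns: &[Layout]) -> usize {
    batches * columns.iter().map(buffer_count).sum::<usize>()
}
```

Calling `stream_buffer_count(10, &columns)` with five `Layout::Primitive` columns gives the 100 quoted above; swapping in string or struct columns raises the count.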

Separately, the original write_message() function flushed after every dictionary and every record batch, causing repeated small OS write calls per batch.
The goal was to eliminate both problems: stop copying buffers that do not need to be copied, and stop flushing on every message.
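
To make the flush cost concrete, here is a minimal std-only sketch. `CountingWriter`, `write_flush_each`, and `write_flush_once` are hypothetical names, not arrow-rs APIs. Where exactly the PR flushes is not shown in this thread, so a single flush once all messages are written is an assumption.

```rust
use std::io::{self, Write};

/// Test writer that records how many times `flush` is called.
#[derive(Default)]
struct CountingWriter {
    bytes: Vec<u8>,
    flushes: usize,
}

impl Write for CountingWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.bytes.extend_from_slice(buf);
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        self.flushes += 1;
        Ok(())
    }
}

/// Old shape: flush after every message.
fn write_flush_each<W: Write>(w: &mut W, messages: &[&[u8]]) -> io::Result<()> {
    for m in messages {
        w.write_all(m)?;
        w.flush()?;
    }
    Ok(())
}

/// Assumed new shape: write every message, then flush once.
fn write_flush_once<W: Write>(w: &mut W, messages: &[&[u8]]) -> io::Result<()> {
    for m in messages {
        w.write_all(m)?;
    }
    w.flush()
}
```

Both functions produce the same bytes; only the number of `flush` calls differs (one per message versus one in total). With a real file or socket behind a buffered writer, each flush can turn into a separate OS write.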

What changes are included in this PR?

Introduced EncodedBuffer, an enum that wraps either a raw Arc-backed Buffer for the uncompressed path or an owned Vec for the compressed path, so both can be held in a uniform collection without an extra copy into a flat accumulator
Changed write_array_data to push EncodedBuffer segments instead of copying bytes into arrow_data
Added write_batch_direct on IpcDataGenerator which writes the FlatBuffer metadata header first, then streams each EncodedBuffer segment directly to the writer with per-buffer alignment padding, never assembling an intermediate flat Vec for the body
FileWriter and StreamWriter both now call write_batch_direct(), eliminating the flush-per-message behavior and the intermediate copy on the hot path
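
A rough, std-only sketch of the idea behind the list above. This is not the PR's code: `Arc<[u8]>` stands in for arrow's `Buffer`, `write_segments` is a hypothetical helper, and header padding is left out for brevity.

```rust
use std::io::{self, Write};
use std::sync::Arc;

/// Sketch of an enum in the spirit of the PR's `EncodedBuffer`.
enum EncodedBuffer {
    /// Uncompressed path: shares the original allocation, no copy.
    Raw(Arc<[u8]>),
    /// Compressed path: owns the bytes produced by the codec.
    Compressed(Vec<u8>),
}

impl EncodedBuffer {
    fn as_slice(&self) -> &[u8] {
        match self {
            EncodedBuffer::Raw(b) => &b[..],
            EncodedBuffer::Compressed(v) => &v[..],
        }
    }
}

/// Writes the header, then each segment followed by zero padding up to
/// `alignment`. Returns the padded body length. No flat body Vec is built.
fn write_segments<W: Write>(
    w: &mut W,
    header: &[u8],
    segments: &[EncodedBuffer],
    alignment: usize,
) -> io::Result<usize> {
    const ZEROS: [u8; 64] = [0; 64];
    assert!(alignment.is_power_of_two() && alignment <= ZEROS.len());

    w.write_all(header)?;
    let mut body_len = 0;
    for seg in segments {
        let bytes = seg.as_slice();
        w.write_all(bytes)?;
        // Bytes needed to reach the next multiple of `alignment` (0 if aligned).
        let pad = (alignment - bytes.len() % alignment) % alignment;
        w.write_all(&ZEROS[..pad])?;
        body_len += bytes.len() + pad;
    }
    Ok(body_len)
}
```

The `Raw` variant only bumps a reference count when a segment is created, so the uncompressed bytes are read straight from their original allocation at write time.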

Are these changes tested?

These changes are intended to be completely seamless. I didn't write new unit tests for the code, as nothing changed externally. All existing tests still pass.

benchmarks

[main -> cargo bench --bench ipc_writer -- "StreamWriter/write_10$" --sample-size 100]
[my branch -> cargo bench --bench ipc_writer -- "StreamWriter/write_10$" --sample-size 100 ]

[main -> cargo bench --bench ipc_writer -- --sample-size 1000]
[my branch -> cargo bench --bench ipc_writer -- --sample-size 1000]

Are there any user-facing changes?

no

gabotechs

This is looking pretty good. Good job! I left some comments, mainly directed towards exploring more reuse and bringing a bit more clarity to this file. Let me know if you have other ideas.

gabotechs · 2026-06-03T14:20:29Z

        }

-        let (encoded_dictionaries, encoded_message) = self.data_gen.encode(
+        let (dict_sizes, (meta, data)) = self.data_gen.write_batch_direct(


The last two fields, (meta, data), returned by write_batch_direct are named (aligned_size, body_len) in that function.

Is this correct? I'm not sure if this is just a naming thing, but it's hard to tell whether this is correct given the different names. Is meta == aligned_size and data == body_len? They sound like completely different things.

I updated the struct & variable names to hopefully make this clearer.
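
One way to make a return value like this self-describing is a small named struct instead of a tuple. The sketch below is hypothetical: `WrittenSizes`, its field names, and `written_sizes` are not the names used in the PR, and it assumes the first value is the metadata length after alignment padding and the second is the body length.

```rust
/// Hypothetical named return type; field names are illustrative only.
#[derive(Debug, PartialEq)]
struct WrittenSizes {
    /// Length of the FlatBuffer metadata after alignment padding.
    metadata_len: usize,
    /// Length of the message body (buffers plus their padding).
    body_len: usize,
}

/// Rounds `len` up to the next multiple of `alignment`.
fn round_up(len: usize, alignment: usize) -> usize {
    (len + alignment - 1) / alignment * alignment
}

/// Builds the named result from a raw metadata length and a body length.
fn written_sizes(raw_metadata_len: usize, body_len: usize, alignment: usize) -> WrittenSizes {
    WrittenSizes {
        metadata_len: round_up(raw_metadata_len, alignment),
        body_len,
    }
}
```

A caller then reads `sizes.metadata_len` and `sizes.body_len` rather than destructuring an anonymous pair, so a mix-up between the two shows up at the call site.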

gabotechs · 2026-06-03T16:23:59Z

+    /// each buffer is compressed into a per-buffer scratch `Vec<u8>` and written from
+    /// there, eliminating the extra copy that `write_buffer` -> `arrow_data` ->
+    /// `write_body_buffers` would otherwise incur.
+    fn write_batch_direct<W: Write>(


I see that most of this function is essentially copy-pasted from record_batch_to_bytes; that seems like too much duplication. Is there any chance to:

Completely replace record_batch_to_bytes and keep just a single function for writing batches in IPC format

Factor out some ergonomic helpers that could be reused in both functions?

Also, it seems like the .encode() method and the new .write_batch_direct() are both doing the same thing with slightly different ergonomics. Do you see any opportunity to collapse them into just 1 method?

This file is overall pretty bloated with complex logic and a relatively arbitrary separation of concerns between methods. The more we can do to debloat it, the better it will be for future maintainers.

This makes sense to me. The main issue is that FileWriter needs metadata while both StreamWriter and arrow-flight do not. It's better not to compute metadata the caller will not use, but the slowdown should be negligible.

Rich-T-kid · 2026-06-03T17:50:51Z

Benchmark results from #10031

ran cargo bench --bench flight encode -- --sample-size 100

Rich-T-kid · 2026-06-03T17:53:05Z

Going to look into why fixed/8192x1 regressed. Might just be noise.

Rich-T-kid · 2026-06-03T18:49:45Z

I think it's worth mentioning that no dictionary optimizations were made in this PR; that could be a follow-up ticket.

Rich-T-kid · 2026-06-03T19:03:14Z

Ran the benchmarks again and the results still look good. Tests are passing for arrow-flight & arrow-ipc.
@alamb could you run the benchmarks for this PR on the CI bot when you get a chance? Thank you!

alamb · 2026-06-04T10:11:54Z

run benchmark flight

adriangbot · 2026-06-04T10:14:11Z

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4621154654-431-g2mb7 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing rich-T-kid/optimize-arrow-ipc-copies (04e7992) to 97f4b14 (merge-base) diff
BENCH_NAME=flight
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench flight
BENCH_FILTER=
Results will be posted here when complete

File an issue against this benchmark runner

adriangbot · 2026-06-04T10:26:39Z

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                         main                                   rich-T-kid_optimize-arrow-ipc-copies
-----                         ----                                   ------------------------------------
encode/dict/65536x1           1.02    274.7±1.07µs   915.2 MB/sec    1.00    270.3±1.55µs   930.2 MB/sec
encode/dict/65536x8           1.30      5.5±0.04ms   365.9 MB/sec    1.00      4.2±0.06ms   474.4 MB/sec
encode/dict/8192x1            1.01     35.3±0.04µs   924.9 MB/sec    1.00     35.1±0.05µs   930.1 MB/sec
encode/dict/8192x8            1.10    315.3±1.66µs   828.7 MB/sec    1.00    286.1±2.01µs   913.3 MB/sec
encode/fixed/65536x1          1.00     10.0±0.02µs    48.8 GB/sec    1.00     10.0±0.04µs    49.1 GB/sec
encode/fixed/65536x8          1.01   1092.1±2.65µs     3.6 GB/sec    1.00   1076.9±5.83µs     3.6 GB/sec
encode/fixed/8192x1           1.03      3.2±0.01µs    19.3 GB/sec    1.00      3.1±0.02µs    19.8 GB/sec
encode/fixed/8192x8           1.07     17.5±0.04µs    27.9 GB/sec    1.00     16.4±0.07µs    29.8 GB/sec
encode/nested/65536x1         1.00     29.0±0.24µs    42.1 GB/sec    1.04     30.2±0.30µs    40.4 GB/sec
encode/nested/65536x8         1.12      2.5±0.08ms     3.9 GB/sec    1.00      2.2±0.13ms     4.4 GB/sec
encode/nested/8192x1          1.00      5.8±0.01µs    26.5 GB/sec    1.00      5.8±0.01µs    26.6 GB/sec
encode/nested/8192x8          1.01     46.3±0.10µs    26.4 GB/sec    1.00     45.9±0.56µs    26.6 GB/sec
encode/variable/65536x1       1.04     51.5±0.55µs    42.7 GB/sec    1.00     49.3±0.36µs    44.6 GB/sec
encode/variable/65536x8       1.24      5.7±0.09ms     3.1 GB/sec    1.00      4.6±0.06ms     3.8 GB/sec
encode/variable/8192x1        1.18      7.1±0.01µs    38.8 GB/sec    1.00      6.0±0.01µs    45.8 GB/sec
encode/variable/8192x8        1.32     83.3±1.96µs    26.4 GB/sec    1.00     63.3±0.20µs    34.7 GB/sec
roundtrip/dict/65536x1        1.01  1288.9±47.00µs   195.1 MB/sec    1.00  1279.9±51.20µs   196.4 MB/sec
roundtrip/dict/65536x8        1.00     14.4±0.55ms   139.8 MB/sec    1.01     14.5±0.53ms   138.4 MB/sec
roundtrip/dict/8192x1         1.00    206.6±5.60µs   158.1 MB/sec    1.00    206.0±5.73µs   158.6 MB/sec
roundtrip/dict/8192x8         1.00  1322.8±44.85µs   197.5 MB/sec    1.00  1324.0±44.79µs   197.3 MB/sec
roundtrip/fixed/65536x1       1.02    314.2±4.02µs  1591.6 MB/sec    1.00    307.7±4.21µs  1625.2 MB/sec
roundtrip/fixed/65536x8       1.00      2.2±0.03ms  1856.3 MB/sec    1.15      2.5±0.13ms  1608.9 MB/sec
roundtrip/fixed/8192x1        1.00     89.9±0.94µs   696.0 MB/sec    1.00     89.6±1.04µs   698.6 MB/sec
roundtrip/fixed/8192x8        1.01    333.8±3.84µs  1499.9 MB/sec    1.00    330.7±3.14µs  1514.2 MB/sec
roundtrip/nested/65536x1      1.00   859.6±40.22µs  1454.4 MB/sec    1.00   862.4±44.79µs  1449.6 MB/sec
roundtrip/nested/65536x8      1.01      8.6±0.36ms  1156.8 MB/sec    1.00      8.6±0.36ms  1168.2 MB/sec
roundtrip/nested/8192x1       1.01    159.3±5.82µs   981.9 MB/sec    1.00    157.6±5.86µs   993.0 MB/sec
roundtrip/nested/8192x8       1.02   925.9±41.73µs  1351.8 MB/sec    1.00   910.3±44.09µs  1374.9 MB/sec
roundtrip/variable/65536x1    1.00  1246.2±34.07µs  1805.6 MB/sec    1.54  1922.3±132.51µs  1170.5 MB/sec
roundtrip/variable/65536x8    1.17     16.1±0.50ms  1119.3 MB/sec    1.00     13.7±0.50ms  1313.5 MB/sec
roundtrip/variable/8192x1     1.00    204.0±5.40µs  1379.8 MB/sec    1.00    203.1±5.65µs  1385.4 MB/sec
roundtrip/variable/8192x8     1.00  1211.3±26.93µs  1858.7 MB/sec    1.57  1897.2±120.59µs  1186.7 MB/sec

Resource Usage

base (merge-base)

Metric	Value
Wall time	345.1s
Peak memory	3.4 GiB
Avg memory	3.4 GiB
CPU user	350.2s
CPU sys	74.8s
Peak spill	0 B

branch

Metric	Value
Wall time	335.1s
Peak memory	3.4 GiB
Avg memory	3.4 GiB
CPU user	335.9s
CPU sys	78.4s
Peak spill	0 B

gabotechs · 2026-06-04T11:41:52Z

run benchmarks ipc_writer

adriangbot · 2026-06-04T11:45:38Z

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4621805130-432-xw4gv 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

Comparing rich-T-kid/optimize-arrow-ipc-copies (04e7992) to 97f4b14 (merge-base) diff
BENCH_NAME=ipc_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench ipc_writer
BENCH_FILTER=
Results will be posted here when complete

adriangbot · 2026-06-04T11:46:59Z

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                                                 main                                   rich-T-kid_optimize-arrow-ipc-copies
-----                                                 ----                                   ------------------------------------
arrow_ipc_stream_writer/FileWriter/write_10           1.90    185.2±2.17µs        ? ?/sec    1.00     97.3±4.50µs        ? ?/sec
arrow_ipc_stream_writer/StreamWriter/write_10         1.94    185.1±1.96µs        ? ?/sec    1.00     95.4±4.87µs        ? ?/sec
arrow_ipc_stream_writer/StreamWriter/write_10/zstd    1.01      7.3±0.02ms        ? ?/sec    1.00      7.2±0.03ms        ? ?/sec

Resource Usage

base (merge-base)

Metric	Value
Wall time	30.0s
Peak memory	2.7 GiB
Avg memory	2.6 GiB
CPU user	27.5s
CPU sys	0.6s
Peak spill	0 B

branch

Metric	Value
Wall time	30.0s
Peak memory	2.7 GiB
Avg memory	2.6 GiB
CPU user	29.7s
CPU sys	0.1s
Peak spill	0 B

Rich-T-kid · 2026-06-04T13:43:36Z

🤔 Results look good. I'm curious why two of the roundtrip benchmarks were slightly slower even though encode() is faster across the board.

gabotechs · 2026-06-04T13:53:21Z

run benchmark flight

adriangbot · 2026-06-04T13:56:57Z

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4622787465-441-9sh6d 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

Comparing rich-T-kid/optimize-arrow-ipc-copies (04e7992) to 97f4b14 (merge-base) diff
BENCH_NAME=flight
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench flight
BENCH_FILTER=
Results will be posted here when complete

adriangbot · 2026-06-04T14:09:30Z

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                         main                                   rich-T-kid_optimize-arrow-ipc-copies
-----                         ----                                   ------------------------------------
encode/dict/65536x1           1.01    271.4±1.78µs   926.3 MB/sec    1.00    267.9±0.66µs   938.6 MB/sec
encode/dict/65536x8           1.00      5.1±0.12ms   395.2 MB/sec    1.11      5.6±0.21ms   357.0 MB/sec
encode/dict/8192x1            1.04     36.4±0.04µs   896.4 MB/sec    1.00     35.0±0.03µs   932.9 MB/sec
encode/dict/8192x8            1.06    305.3±1.34µs   855.9 MB/sec    1.00    287.6±1.80µs   908.7 MB/sec
encode/fixed/65536x1          1.04     10.3±0.02µs    47.6 GB/sec    1.00      9.9±0.02µs    49.3 GB/sec
encode/fixed/65536x8          1.00   1084.9±6.70µs     3.6 GB/sec    1.01   1096.8±4.18µs     3.6 GB/sec
encode/fixed/8192x1           1.00      3.0±0.01µs    20.3 GB/sec    1.03      3.1±0.01µs    19.7 GB/sec
encode/fixed/8192x8           1.06     17.4±0.02µs    28.0 GB/sec    1.00     16.4±0.06µs    29.8 GB/sec
encode/nested/65536x1         1.29     38.5±0.19µs    31.7 GB/sec    1.00     29.8±0.58µs    41.0 GB/sec
encode/nested/65536x8         1.32      2.7±0.04ms     3.6 GB/sec    1.00      2.1±0.20ms     4.8 GB/sec
encode/nested/8192x1          1.01      5.8±0.01µs    26.2 GB/sec    1.00      5.8±0.02µs    26.4 GB/sec
encode/nested/8192x8          1.04     47.3±0.08µs    25.8 GB/sec    1.00     45.5±0.14µs    26.9 GB/sec
encode/variable/65536x1       1.63     80.5±0.47µs    27.3 GB/sec    1.00     49.3±0.34µs    44.6 GB/sec
encode/variable/65536x8       1.10      5.4±0.19ms     3.2 GB/sec    1.00      4.9±0.11ms     3.6 GB/sec
encode/variable/8192x1        1.81     10.7±0.01µs    25.7 GB/sec    1.00      5.9±0.01µs    46.6 GB/sec
encode/variable/8192x8        1.37     87.5±0.23µs    25.1 GB/sec    1.00     63.9±0.26µs    34.4 GB/sec
roundtrip/dict/65536x1        1.00  1319.9±43.70µs   190.5 MB/sec    1.00  1316.4±41.25µs   191.0 MB/sec
roundtrip/dict/65536x8        1.00     14.4±0.59ms   139.9 MB/sec    1.02     14.6±0.57ms   137.9 MB/sec
roundtrip/dict/8192x1         1.01    213.3±5.87µs   153.1 MB/sec    1.00    211.4±6.11µs   154.5 MB/sec
roundtrip/dict/8192x8         1.01  1346.6±44.16µs   194.0 MB/sec    1.00  1338.9±42.85µs   195.1 MB/sec
roundtrip/fixed/65536x1       1.00    318.9±4.39µs  1568.0 MB/sec    1.00    319.6±4.24µs  1564.8 MB/sec
roundtrip/fixed/65536x8       1.00      2.2±0.03ms  1821.4 MB/sec    1.19      2.6±0.16ms  1532.6 MB/sec
roundtrip/fixed/8192x1        1.00     93.5±1.03µs   669.7 MB/sec    1.01     94.6±1.56µs   661.8 MB/sec
roundtrip/fixed/8192x8        1.01    342.0±6.09µs  1464.3 MB/sec    1.00    338.7±4.37µs  1478.2 MB/sec
roundtrip/nested/65536x1      1.01   885.1±37.56µs  1412.4 MB/sec    1.00   879.6±39.35µs  1421.4 MB/sec
roundtrip/nested/65536x8      1.00     10.0±0.39ms  1002.2 MB/sec    1.02     10.2±0.43ms   979.4 MB/sec
roundtrip/nested/8192x1       1.01    163.0±5.56µs   960.0 MB/sec    1.00    161.1±5.13µs   971.1 MB/sec
roundtrip/nested/8192x8       1.00   927.1±39.57µs  1350.1 MB/sec    1.02   941.8±46.61µs  1329.0 MB/sec
roundtrip/variable/65536x1    1.00  1285.3±51.72µs  1750.7 MB/sec    1.49  1910.5±128.29µs  1177.8 MB/sec
roundtrip/variable/65536x8    1.00     14.8±0.74ms  1217.3 MB/sec    1.01     15.0±0.53ms  1199.3 MB/sec
roundtrip/variable/8192x1     1.00    210.4±5.49µs  1337.6 MB/sec    1.00    209.4±6.81µs  1344.0 MB/sec
roundtrip/variable/8192x8     1.00  1251.7±29.66µs  1798.6 MB/sec    1.53  1914.9±122.75µs  1175.7 MB/sec

Resource Usage

base (merge-base)

Metric	Value
Wall time	345.1s
Peak memory	3.4 GiB
Avg memory	3.4 GiB
CPU user	352.2s
CPU sys	71.6s
Peak spill	0 B

branch

Metric	Value
Wall time	340.1s
Peak memory	3.4 GiB
Avg memory	3.4 GiB
CPU user	333.1s
CPU sys	84.7s
Peak spill	0 B

alamb · 2026-06-04T14:58:59Z

🤔 Results look good. I'm curious why two of the roundtrip benchmarks were slightly slower even though encode() is faster across the board.

Looks like the results are reproducible. The next step would be to profile it to see if you can find the answer.

I wonder if we are missing a Vec::with_capacity or Vec::reserve to avoid extra allocations / copies 🤔
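
For context on why a missing `Vec::with_capacity` can matter, here is a small std-only sketch. `append_counting_moves` is a hypothetical helper, not code from arrow-rs. It counts how often the Vec's backing pointer changes while chunks are appended, with and without reserving the total size up front.

```rust
/// Appends every chunk to one Vec and counts how many times the backing
/// allocation moved. With `reserve = true` the total is reserved up front,
/// so no growth (and no copy of already-written bytes) is needed.
fn append_counting_moves(chunks: &[&[u8]], reserve: bool) -> (Vec<u8>, usize) {
    let total: usize = chunks.iter().map(|c| c.len()).sum();
    let mut out: Vec<u8> = if reserve {
        Vec::with_capacity(total)
    } else {
        Vec::new()
    };

    let mut moves = 0;
    for c in chunks {
        let before = out.as_ptr();
        out.extend_from_slice(c);
        if out.as_ptr() != before {
            moves += 1;
        }
    }
    (out, moves)
}
```

With the capacity reserved, the move count is always zero. Without it, the count depends on the Vec growth policy and the allocator, but it is at least one for non-empty input, and each move can copy everything written so far.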

Rich-T-kid · 2026-06-04T15:41:05Z

I suspect the issue has to do with
let (client, server) = tokio::io::duplex(1024 * 1024);
Per the docs: "The max_buf_size argument is the maximum amount of bytes that can be written to a side before the write returns Poll::Pending."

The two regression cases both involve large variable-length data where the encoded payload can be huge:
roundtrip/variable/8192x8 — 8 columns × 8192 rows
roundtrip/variable/65536x1 — 65536 rows, large values buffer

This also shows up in the numbers for the regression cases, where throughput falls flat:
roundtrip/variable/8192x8 1.00 1251.7±29.66µs 1798.6 MB/sec 1.53 1914.9±122.75µs 1175.7 MB/sec
roundtrip/variable/65536x1 1.00 1285.3±51.72µs 1750.7 MB/sec 1.49 1910.5±128.29µs 1177.8 MB/sec

Looking at the other benchmark results, this seems consistent:
roundtrip/fixed/65536x8 1.00 2.2±0.03ms 1821.4 MB/sec 1.19 2.6±0.16ms 1532.6 MB/sec
Throughput shrinks, which causes more blocking to happen.

Even if this isn't the reason for the slowdown, I think 1 MB is still too small for realistic max throughput.
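
A rough sanity check of this hypothesis is to back out the payload size implied by each row: throughput multiplied by mean time. The sketch below does only that arithmetic. `implied_payload_mb` is a hypothetical helper, the result is in whatever "MB" unit the report uses, and the byte count the benchmark reports as throughput may not match the bytes on the wire exactly, so treat the results as order-of-magnitude estimates.

```rust
/// Payload size implied by one benchmark row, in the report's own "MB" unit:
/// throughput (MB/sec) multiplied by mean time (converted from µs to seconds).
fn implied_payload_mb(throughput_mb_per_sec: f64, mean_time_us: f64) -> f64 {
    throughput_mb_per_sec * mean_time_us / 1_000_000.0
}

/// True when the implied payload cannot fit in the duplex buffer in one go,
/// so the writer has to wait for the reader at least once.
fn exceeds_duplex(payload_mb: f64, duplex_mb: f64) -> bool {
    payload_mb > duplex_mb
}
```

Plugging in the `main` column values quoted above gives about 2.25 MB for both roundtrip/variable rows and about 4.0 MB for roundtrip/fixed/65536x8, all above the 1 MB duplex buffer, while roundtrip/variable/8192x1 (1337.6 MB/sec at 210.4µs) comes out near 0.28 MB. This fits the hypothesis for these rows but does not by itself prove the cause.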

Rich-T-kid · 2026-06-04T15:54:17Z

I wonder if we are missing a Vec::with_capacity or Vec::reserve to avoid extra allocations / copies

I think this is a strong possibility after looking at the profile. From my understanding, this is mostly in arrow-flight itself and not arrow-ipc. Since arrow-flight is very dependent on arrow-ipc, it makes sense to start from the ground up with these optimizations.

# Which issue does this PR close?

- Closes #10029.

# Rationale for this change

Increase the duplex buffer from 1 MB to 64 MB to eliminate artificial back-pressure in the roundtrip benchmarks. See the rationale in this [comment](#10044 (comment)).

# What changes are included in this PR?

Bumps `max_buf_size` to 64 MB.

# Are these changes tested?

n/a

# Are there any user-facing changes?

n/a

Rich-T-kid · 2026-06-05T21:07:21Z

I've been working on this all day and any further changes risk scope creep, so I'd like to split the IPC StreamWriter/FileWriter improvements and the arrow-flight work into separate PRs, since the benchmarks for the former look good and they are unrelated to the arrow-flight work.
Due to async polling it's hard to distinguish which copies happen at the tonic level versus in arrow-flight. I've been profiling this locally with an additional encode_to_send benchmark that measures the full path via do_put (will include in a follow-up PR). @alamb does that sound like a good idea?

One question: does anyone have insight into why the benchmarks behave differently on the CI workers versus locally? I saw something similar in distributed DataFusion and it came down to thread count, but beyond thread count and CPU cache size I'm not sure what else could explain the difference at this scale.

Rich-T-kid · 2026-06-07T20:20:59Z

results here seem in line with [this](https://github.com//pull/10044#issuecomment-4621838941)

I removed any logic that was touching arrow-flight's path. This PR focuses on removing intermediary buffer allocations for the StreamWriter and FileWriter. Instead of accumulating all buffer bytes into a heap allocation before writing, buffer pointers are collected during encoding and streamed directly to the underlying writer once the FlatBuffer header is built.
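
The idea can be sketched roughly as follows. This is not the PR's code: `Segment` is a hypothetical stand-in for `EncodedBuffer`, `Arc<[u8]>` approximates an Arc-backed arrow `Buffer`, and the 8-byte alignment is an assumed value for illustration.

```rust
use std::io::{self, Write};
use std::sync::Arc;

/// Hypothetical stand-in for the PR's `EncodedBuffer`.
enum Segment {
    /// Uncompressed path: a reference-counted view, no bytes copied.
    Shared(Arc<[u8]>),
    /// Compressed path: bytes freshly produced by the codec.
    Owned(Vec<u8>),
}

impl Segment {
    fn as_bytes(&self) -> &[u8] {
        match self {
            Segment::Shared(b) => &b[..],
            Segment::Owned(v) => v.as_slice(),
        }
    }
}

/// Assumed alignment, for illustration only.
const ALIGNMENT: usize = 8;

/// Write the header, then each segment followed by zero padding up to
/// `ALIGNMENT`, without assembling an intermediate flat body.
/// Returns the total number of bytes written.
fn write_direct<W: Write>(w: &mut W, header: &[u8], segments: &[Segment]) -> io::Result<usize> {
    const PAD: [u8; ALIGNMENT] = [0; ALIGNMENT];
    w.write_all(header)?;
    let mut written = header.len();
    for seg in segments {
        let bytes = seg.as_bytes();
        w.write_all(bytes)?;
        let pad = (ALIGNMENT - bytes.len() % ALIGNMENT) % ALIGNMENT;
        w.write_all(&PAD[..pad])?;
        written += bytes.len() + pad;
    }
    Ok(written)
}
```

Collecting a `Segment::Shared` only bumps a reference count, so the bytes are first touched when they are handed to the writer.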

Rich-T-kid · 2026-06-07T20:29:31Z

The diff looks larger than it is due to estimate_encoded_buffer_count() (which estimates how many slots to pre-allocate for the buffer vector) and the if statements that determine whether write_buffers() or collect_encoded_buffers() is called based on whether a Writer was supplied. This avoids duplicating logic by reusing shared functionality across both code paths.
@gabotechs could you take another look at this when you get a chance 🚀
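
For readers unfamiliar with the pre-allocation step, a minimal sketch of what such an estimate might look like is below. It is an assumption-laden illustration, not the body of estimate_encoded_buffer_count(): `ColumnKind`, `buffers_for`, and `collect_buffer_slots` are hypothetical, and only two layouts are modeled (validity + values for primitives, validity + offsets + values for variable-length binary).

```rust
/// Hypothetical column categories, used only for this illustration.
#[derive(Clone, Copy)]
enum ColumnKind {
    Primitive,
    VariableBinary,
}

/// Number of buffers each kind contributes in the Arrow columnar layout.
fn buffers_for(kind: ColumnKind) -> usize {
    match kind {
        ColumnKind::Primitive => 2,      // validity bitmap + values
        ColumnKind::VariableBinary => 3, // validity bitmap + offsets + values
    }
}

/// Estimate how many buffer slots a batch with these columns needs.
fn estimate_buffer_count(columns: &[ColumnKind]) -> usize {
    columns.iter().map(|&k| buffers_for(k)).sum()
}

/// Collect one placeholder entry per buffer into a pre-sized vector.
fn collect_buffer_slots(columns: &[ColumnKind]) -> Vec<usize> {
    let mut slots = Vec::with_capacity(estimate_buffer_count(columns));
    for (col_idx, &kind) in columns.iter().enumerate() {
        for _ in 0..buffers_for(kind) {
            slots.push(col_idx); // stand-in for pushing an encoded buffer
        }
    }
    slots
}
```

Because the vector is sized up front, the pushes in `collect_buffer_slots` never need to grow it.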

gabotechs · 2026-06-08T06:12:25Z

run benchmarks flight

adriangbot · 2026-06-08T06:14:49Z

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4645915589-475-qqxw6 6.12.68+ #1 SMP Sat May 2 07:49:07 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing rich-T-kid/optimize-arrow-ipc-copies (a3f9c53) to d7ef673 (merge-base) diff
BENCH_NAME=flight
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench flight
BENCH_FILTER=
Results will be posted here when complete

adriangbot · 2026-06-08T06:27:17Z

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                         main                                   rich-T-kid_optimize-arrow-ipc-copies
-----                         ----                                   ------------------------------------
encode/dict/65536x1           1.00    269.0±1.27µs   934.5 MB/sec    1.05    282.8±1.13µs   889.2 MB/sec
encode/dict/65536x8           1.00      4.0±0.03ms   499.8 MB/sec    1.43      5.7±0.03ms   350.5 MB/sec
encode/dict/8192x1            1.00     35.5±0.43µs   920.9 MB/sec    1.03     36.6±0.04µs   891.5 MB/sec
encode/dict/8192x8            1.01    306.2±4.60µs   853.4 MB/sec    1.00    301.9±2.32µs   865.4 MB/sec
encode/fixed/65536x1          1.00      9.9±0.02µs    49.5 GB/sec    1.03     10.1±0.02µs    48.3 GB/sec
encode/fixed/65536x8          1.03   1121.5±1.42µs     3.5 GB/sec    1.00   1087.5±2.55µs     3.6 GB/sec
encode/fixed/8192x1           1.00      3.2±0.01µs    19.3 GB/sec    1.00      3.2±0.01µs    19.4 GB/sec
encode/fixed/8192x8           1.00     17.6±0.03µs    27.7 GB/sec    1.01     17.8±0.05µs    27.5 GB/sec
encode/nested/65536x1         1.00     37.5±0.19µs    32.6 GB/sec    1.15     42.9±0.36µs    28.4 GB/sec
encode/nested/65536x8         1.15      3.0±0.01ms     3.3 GB/sec    1.00      2.6±0.01ms     3.8 GB/sec
encode/nested/8192x1          1.00      5.8±0.01µs    26.4 GB/sec    1.02      5.9±0.01µs    26.0 GB/sec
encode/nested/8192x8          1.00     46.5±0.08µs    26.3 GB/sec    1.05     49.1±0.13µs    24.9 GB/sec
encode/variable/65536x1       1.00     72.7±0.39µs    30.2 GB/sec    1.12     81.5±0.33µs    27.0 GB/sec
encode/variable/65536x8       1.00      5.0±0.03ms     3.5 GB/sec    1.12      5.6±0.05ms     3.1 GB/sec
encode/variable/8192x1        1.00      6.8±0.02µs    40.5 GB/sec    1.03      7.0±0.01µs    39.1 GB/sec
encode/variable/8192x8        1.00     80.9±0.16µs    27.2 GB/sec    1.02     82.7±0.21µs    26.6 GB/sec
roundtrip/dict/65536x1        1.00  1281.9±43.73µs   196.1 MB/sec    1.01  1300.0±43.37µs   193.4 MB/sec
roundtrip/dict/65536x8        1.02     14.7±0.61ms   137.1 MB/sec    1.00     14.3±0.60ms   140.4 MB/sec
roundtrip/dict/8192x1         1.00    204.0±5.97µs   160.1 MB/sec    1.03    209.5±5.67µs   155.9 MB/sec
roundtrip/dict/8192x8         1.00  1333.1±41.92µs   196.0 MB/sec    1.01  1350.3±43.61µs   193.5 MB/sec
roundtrip/fixed/65536x1       1.00    311.1±3.95µs  1607.4 MB/sec    1.01    315.5±3.73µs  1585.1 MB/sec
roundtrip/fixed/65536x8       1.00      2.1±0.02ms  1873.4 MB/sec    1.01      2.1±0.03ms  1861.8 MB/sec
roundtrip/fixed/8192x1        1.00     91.2±1.40µs   686.4 MB/sec    1.00     91.4±0.86µs   684.5 MB/sec
roundtrip/fixed/8192x8        1.00    328.6±4.67µs  1524.1 MB/sec    1.02    336.4±3.39µs  1488.6 MB/sec
roundtrip/nested/65536x1      1.00   849.6±40.97µs  1471.5 MB/sec    1.01   857.6±41.57µs  1457.7 MB/sec
roundtrip/nested/65536x8      1.00      8.3±0.34ms  1200.9 MB/sec    1.16      9.7±0.36ms  1034.4 MB/sec
roundtrip/nested/8192x1       1.00    158.6±5.50µs   986.4 MB/sec    1.01    159.9±5.60µs   978.3 MB/sec
roundtrip/nested/8192x8       1.00   904.4±45.78µs  1383.9 MB/sec    1.00   904.4±43.74µs  1383.9 MB/sec
roundtrip/variable/65536x1    1.00  1212.7±41.75µs  1855.5 MB/sec    1.01  1224.4±31.76µs  1837.8 MB/sec
roundtrip/variable/65536x8    1.00     16.1±0.50ms  1120.5 MB/sec    1.02     16.3±0.57ms  1101.1 MB/sec
roundtrip/variable/8192x1     1.00    204.7±5.69µs  1374.6 MB/sec    1.02    209.1±5.61µs  1345.9 MB/sec
roundtrip/variable/8192x8     1.00  1211.3±21.91µs  1858.7 MB/sec    1.01  1219.1±27.99µs  1846.7 MB/sec

Resource Usage

base (merge-base)

Metric	Value
Wall time	335.1s
Peak memory	3.4 GiB
Avg memory	3.4 GiB
CPU user	348.5s
CPU sys	65.9s
Peak spill	0 B

branch

Metric	Value
Wall time	340.1s
Peak memory	3.4 GiB
Avg memory	3.4 GiB
CPU user	346.0s
CPU sys	71.2s
Peak spill	0 B

gabotechs · 2026-06-08T06:37:08Z

-            ipc_message: finished_data.to_vec(),
-            arrow_data,
-        })
+        if let Some(w) = writer {


We need to find another way of modeling this without falling into "if-driven-development" patterns.

Whenever you find yourself switching between execution branches with completely different bodies based on booleans, it's an indicator that you are falling into an "if-driven-development" pattern, and this will greatly hurt maintenance in the future.

One way of trying to improve this situation can be:

1. Refactor things preferring code duplication, and split into different functions with different responsibilities, even if that implies copy-pasting large quantities of code.

2. Once you have the different functions with a fair amount of copy-pasted code, try to see what the common bits are, and factor them out little by little, trying to progressively reduce the LOC count in each one.

3. If you still see functions that need to accept bool or Option parameters that have the capability of switching how the function behaves, that means the function was the wrong abstraction in the first place, so don't be afraid of tearing existing functions apart if that allows you to reuse smaller bits in different places without switching on if statements.

gabotechs · 2026-06-08T06:39:38Z

    arrow_data: &mut Vec<u8>,
+    encoded_buffers: &mut Vec<EncodedBuffer>,
    nodes: &mut Vec<crate::FieldNode>,
    offset: i64,
    num_rows: usize,
    null_count: usize,
    compression_codec: Option<CompressionCodec>,
    compression_context: &mut CompressionContext,
    write_options: &IpcWriteOptions,
+    is_direct: bool,
 ) -> Result<i64, ArrowError> {
    let mut offset = offset;


This is another example of my comment above, but this time reinforced by a big Clippy bypass at the top.

Ideally, we should not be contributing towards stronger Clippy violations. If you try to follow the steps above, there are good chances that we can either maintain the width of this function's signature, or even reduce it.

Rich-T-kid · 2026-06-08T18:07:33Z

Replaced the (arrow_data, encoded_buffers, is_direct) triple threaded through write_array_data() with a single BufferSink passed by the caller, eliminating every if is_direct { collect_encoded_buffers(...) } else { write_buffer(...) } branch that was repeated at each buffer write site. All path-specific logic is now encapsulated once in encode_sink_buffer(), which write_array_data() calls uniformly regardless of whether the output is headed for an in-memory EncodedData or a direct stream write. About -400 LOC @gabotechs
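
A rough sketch of the shape this refactor describes is below. The actual BufferSink definition is not shown in this thread, so the variants, the `push` method, and `write_columns` are guesses for illustration only. `to_vec()` is used to keep the example short, whereas the real direct path is meant to avoid exactly that copy.

```rust
/// Hypothetical sketch of a sink; variant names and shapes are illustrative.
enum BufferSink<'a> {
    /// Append bytes to one flat body (the in-memory path).
    Flat(&'a mut Vec<u8>),
    /// Keep each buffer as its own segment for a later direct write.
    Segments(&'a mut Vec<Vec<u8>>),
}

impl BufferSink<'_> {
    /// The only place that knows which path is active.
    fn push(&mut self, bytes: &[u8]) -> usize {
        match self {
            BufferSink::Flat(body) => body.extend_from_slice(bytes),
            // Simplification: a real segment would share the buffer, not copy it.
            BufferSink::Segments(segs) => segs.push(bytes.to_vec()),
        }
        bytes.len()
    }
}

/// The caller-facing routine takes no boolean flag; it calls the sink uniformly.
fn write_columns(columns: &[&[u8]], sink: &mut BufferSink<'_>) -> usize {
    columns.iter().map(|c| sink.push(c)).sum()
}
```

The branch on the output path then lives in one `match` instead of being repeated at every buffer write site, and the function signature loses a parameter rather than gaining one.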

gabotechs · 2026-06-08T19:22:29Z

run benchmarks flight

adriangbot · 2026-06-08T19:27:15Z

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4652613611-485-hmsf5 6.12.68+ #1 SMP Sat May 2 07:49:07 UTC 2026 aarch64 GNU/Linux

Comparing rich-T-kid/optimize-arrow-ipc-copies (70a7567) to d7ef673 (merge-base) diff
BENCH_NAME=flight
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench flight
BENCH_FILTER=
Results will be posted here when complete

adriangbot · 2026-06-08T19:39:08Z

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                         main                                   rich-T-kid_optimize-arrow-ipc-copies
-----                         ----                                   ------------------------------------
encode/dict/65536x1           1.00    285.2±1.80µs   881.4 MB/sec    1.00    285.3±1.50µs   881.1 MB/sec
encode/dict/65536x8           1.34      9.6±0.14ms   209.0 MB/sec    1.00      7.2±0.15ms   280.1 MB/sec
encode/dict/8192x1            1.05     37.1±0.06µs   881.5 MB/sec    1.00     35.2±0.11µs   926.7 MB/sec
encode/dict/8192x8            1.00    305.6±1.28µs   854.9 MB/sec    1.03    315.1±1.20µs   829.1 MB/sec
encode/fixed/65536x1          1.01     10.4±0.04µs    47.1 GB/sec    1.00     10.2±0.01µs    47.8 GB/sec
encode/fixed/65536x8          1.00   1085.1±7.18µs     3.6 GB/sec    1.03   1112.7±9.29µs     3.5 GB/sec
encode/fixed/8192x1           1.00      3.0±0.01µs    20.1 GB/sec    1.06      3.2±0.01µs    18.9 GB/sec
encode/fixed/8192x8           1.02     17.8±0.04µs    27.5 GB/sec    1.00     17.5±0.03µs    28.0 GB/sec
encode/nested/65536x1         1.35     38.4±0.26µs    31.8 GB/sec    1.00     28.5±0.18µs    42.8 GB/sec
encode/nested/65536x8         1.00      3.1±0.09ms     3.1 GB/sec    1.03      3.2±0.08ms     3.0 GB/sec
encode/nested/8192x1          1.01      5.8±0.01µs    26.2 GB/sec    1.00      5.8±0.01µs    26.5 GB/sec
encode/nested/8192x8          1.00     45.8±0.19µs    26.7 GB/sec    1.05     48.1±0.14µs    25.4 GB/sec
encode/variable/65536x1       1.00     62.9±0.37µs    35.0 GB/sec    1.11     69.6±0.43µs    31.6 GB/sec
encode/variable/65536x8       1.00      5.7±0.24ms     3.1 GB/sec    1.03      5.9±0.16ms     3.0 GB/sec
encode/variable/8192x1        1.00      7.1±0.01µs    38.4 GB/sec    1.00      7.1±0.02µs    38.5 GB/sec
encode/variable/8192x8        1.00     79.7±0.58µs    27.6 GB/sec    1.00     79.6±0.26µs    27.6 GB/sec
roundtrip/dict/65536x1        1.01  1319.3±49.97µs   190.6 MB/sec    1.00  1311.8±42.32µs   191.7 MB/sec
roundtrip/dict/65536x8        1.00     15.0±0.58ms   134.3 MB/sec    1.07     16.0±0.67ms   125.9 MB/sec
roundtrip/dict/8192x1         1.00    207.5±5.56µs   157.4 MB/sec    1.00    208.5±5.51µs   156.7 MB/sec
roundtrip/dict/8192x8         1.00  1346.9±43.49µs   194.0 MB/sec    1.00  1345.0±45.04µs   194.3 MB/sec
roundtrip/fixed/65536x1       1.00    305.0±4.83µs  1639.8 MB/sec    1.03    314.3±3.73µs  1591.2 MB/sec
roundtrip/fixed/65536x8       1.00      2.2±0.04ms  1791.3 MB/sec    1.00      2.2±0.05ms  1799.0 MB/sec
roundtrip/fixed/8192x1        1.00     90.4±1.16µs   692.6 MB/sec    1.01     91.7±1.52µs   682.7 MB/sec
roundtrip/fixed/8192x8        1.01    329.1±3.56µs  1521.5 MB/sec    1.00    327.4±4.78µs  1529.3 MB/sec
roundtrip/nested/65536x1      1.00   871.2±51.26µs  1435.0 MB/sec    1.00   868.4±50.87µs  1439.7 MB/sec
roundtrip/nested/65536x8      1.00     10.1±0.67ms   992.0 MB/sec    1.05     10.6±0.49ms   946.4 MB/sec
roundtrip/nested/8192x1       1.00    158.4±5.60µs   987.7 MB/sec    1.00    158.5±5.00µs   987.2 MB/sec
roundtrip/nested/8192x8       1.00   916.8±42.83µs  1365.2 MB/sec    1.00   915.9±45.38µs  1366.6 MB/sec
roundtrip/variable/65536x1    1.00  1247.8±36.58µs  1803.3 MB/sec    1.05  1304.0±100.89µs  1725.6 MB/sec
roundtrip/variable/65536x8    1.00     16.5±0.56ms  1092.2 MB/sec    1.04     17.1±0.62ms  1051.2 MB/sec
roundtrip/variable/8192x1     1.00    206.9±6.52µs  1360.3 MB/sec    1.01    209.6±6.14µs  1342.6 MB/sec
roundtrip/variable/8192x8     1.00  1237.0±25.68µs  1820.1 MB/sec    1.00  1241.4±39.06µs  1813.6 MB/sec

Resource Usage

base (merge-base)

Metric	Value
Wall time	345.1s
Peak memory	3.4 GiB
Avg memory	3.4 GiB
CPU user	348.1s
CPU sys	76.0s
Peak spill	0 B

branch

Metric	Value
Wall time	335.1s
Peak memory	3.4 GiB
Avg memory	3.4 GiB
CPU user	339.6s
CPU sys	73.9s
Peak spill	0 B

alamb · 2026-06-08T20:26:37Z

I am excited by this one -- I am starting to check it out

github-actions Bot added the arrow Changes to the arrow crate label Jun 1, 2026

Rich-T-kid commented Jun 1, 2026

Comment thread arrow-ipc/src/writer.rs Outdated

Rich-T-kid mentioned this pull request Jun 2, 2026

[#10029][benchmarks] arrow-flight roundtrip as well as encode/decode #10031

Merged

gabotechs reviewed Jun 3, 2026

Rich-T-kid force-pushed the rich-T-kid/optimize-arrow-ipc-copies branch from fc1dbb8 to ecb37f2 Compare June 3, 2026 18:46

Rich-T-kid mentioned this pull request Jun 4, 2026

Bump max throughput in flight benchmark before blocking #10070

Merged

Rich-T-kid added 6 commits June 4, 2026 14:05

remove repeated buffer copies in encoding pipeline

c3c87a8

fixed some linting issues

fd3a475

minor clean up. good to push

213c8e7

fix CI

6df331c

refactored PR to be easier to read/ dedupe logic

781c75c

linting fix

7acd810

alamb changed the title ~~Rich t kid/optimize arrow ipc copies~~ Avoid Arrow IPC Copies Jun 5, 2026

Rich-T-kid added 3 commits June 5, 2026 17:28

sharing current work, planning to split this into two PR

742b0c0

split PR into Stream/Filter Writer focused

11e9ac9

remove minor metadata tracking overhead. Include extra return value from record_batch_to_bytes

3ff94b0

github-actions Bot added the arrow-flight Changes to the arrow-flight crate label Jun 7, 2026

remove local benchmarks

8d1fc4d

github-actions Bot removed the arrow-flight Changes to the arrow-flight crate label Jun 7, 2026

Rich-T-kid commented Jun 7, 2026

Comment thread arrow-ipc/src/writer.rs

Rich-T-kid commented Jun 7, 2026

Comment thread arrow-ipc/src/writer.rs

trigger CI

ec99b7e

Rich-T-kid force-pushed the rich-T-kid/optimize-arrow-ipc-copies branch from cbc1d46 to ce6b828 Compare June 8, 2026 00:03

trigger CI

095d7fd

Rich-T-kid force-pushed the rich-T-kid/optimize-arrow-ipc-copies branch from ce6b828 to 095d7fd Compare June 8, 2026 00:05

trigger CI

a3f9c53

gabotechs reviewed Jun 8, 2026

Rich-T-kid force-pushed the rich-T-kid/optimize-arrow-ipc-copies branch from 9ec873d to 83102e2 Compare June 8, 2026 18:07

removed un-needed code duplication in favor of an enum

70a7567

Rich-T-kid force-pushed the rich-T-kid/optimize-arrow-ipc-copies branch from 83102e2 to 70a7567 Compare June 8, 2026 18:14

alamb changed the title ~~Avoid Arrow IPC Copies~~ Avoid some copies in Arrow IPC Jun 8, 2026
