
Per-worker proxy threads for UDF callbacks (#1136) #7

Closed. otegami wants to merge 1 commit into main from feature/per-worker-proxy

otegami (Owner) commented May 31, 2026

Summary

Implements per-worker proxy threads for scalar and table function UDF
callbacks (refs suketa/ruby-duckdb#1136), so callbacks from different DuckDB worker threads
run concurrently instead of serializing through a single global executor.

This is a personal-fork PR for local verification and integration — the
living picture of the whole feature. It is sent upstream (suketa/ruby-duckdb)
one PR at a time, in order:

  1. lifecycle primitive: merged as suketa/ruby-duckdb#1364 ("Add per-worker proxy thread primitive")
  2. dispatch wiring: merged as suketa/ruby-duckdb#1365 ("feat: route the non-Ruby-thread dispatch path through an optional proxy")
  3. scalar integration: merged as suketa/ruby-duckdb#1366 ("feat: per-worker proxy for scalar function execute callback")
  4. table integration: this branch's single remaining commit (next upstream PR)

Why

With one global executor, callbacks from different workers can never overlap,
even when they release the GVL (e.g. on I/O). One proxy thread per DuckDB
worker lifts exactly that ceiling. Measured with sample/issue1136.rb
(GVL-releasing scalar callback over a 500k-row scan):

          SET threads=1                SET threads=4
  before  1.379s, 1 callback thread    0.976s, 2 callback threads
  after   1.374s, 1 callback thread    0.365s, 4 callback threads

The before run caps at 2 threads (calling thread + global executor) no matter
how many workers DuckDB spawns. Pure-CPU callbacks stay bounded by the GVL,
so the win is specific to GVL-releasing UDFs.
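
The ceiling described above can be reproduced in a toy model that does not involve DuckDB at all. The sketch below is an illustration under stated assumptions, not the extension's implementation: plain Ruby threads stand in for DuckDB workers, `SerialExecutor` is a hypothetical class that plays the role of either the global executor or one worker's proxy, and `sleep` stands in for a GVL-releasing callback. It ignores the calling-thread path, so the shared case shows one callback thread rather than two.

```ruby
# Toy model only: SerialExecutor, CALLBACK and timed_run are hypothetical
# names for illustration, not part of ruby-duckdb.

# Runs every submitted job on one dedicated thread, one job at a time.
class SerialExecutor
  def initialize
    @jobs = Queue.new
    @thread = Thread.new do
      while (job = @jobs.pop) # a nil job is the shutdown signal
        job.call
      end
    end
  end

  # Blocks the caller until the block has run on the executor thread.
  def run(&block)
    done = Queue.new
    @jobs << -> { done << block.call }
    done.pop
  end

  def shutdown
    @jobs << nil
    @thread.join
  end
end

# sleep releases the GVL, much like blocking I/O inside a UDF would.
CALLBACK = -> { sleep 0.02; Thread.current }

# Each simulated worker issues calls_per_worker blocking callbacks through
# the executor assigned to it. Returns [elapsed seconds, distinct threads].
def timed_run(executors, calls_per_worker)
  seen = Queue.new
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  workers = executors.map do |executor|
    Thread.new { calls_per_worker.times { seen << executor.run(&CALLBACK) } }
  end
  workers.each(&:join)
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  distinct = Array.new(seen.size) { seen.pop }.uniq.size
  [elapsed, distinct]
end

shared = SerialExecutor.new
shared_elapsed, shared_threads = timed_run([shared] * 4, 5)

proxies = Array.new(4) { SerialExecutor.new }
proxy_elapsed, proxy_threads = timed_run(proxies, 5)

puts format('shared executor: %.2fs, %d callback thread(s)', shared_elapsed, shared_threads)
puts format('per-worker:      %.2fs, %d callback thread(s)', proxy_elapsed, proxy_threads)

([shared] + proxies).each(&:shutdown)
```

With four simulated workers, the shared executor serializes all twenty sleeps, while the per-worker variant lets the four sleeps overlap. Replacing the `sleep` with a pure-CPU loop should erase most of the difference, which matches the GVL caveat above.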

Design notes

  • Three-path dispatch (suketa/ruby-duckdb#1280, "Fix GVL-unsafe callbacks in table_function.c") is preserved; proxies only change Case 3
    (non-Ruby thread): route through the worker's own proxy when present, else
    fall back to the global executor.
  • DuckDB 1.4.x LTS keeps the old path byte-for-byte — all proxy code is gated
    behind HAVE_DUCKDB_H_GE_V1_5_0 (set_init / set_local_init are 1.5.0
    APIs).
  • Proxy structs use calloc/free (not xcalloc/xfree): DuckDB frees
    them from non-Ruby threads. Proxy threads are GC-protected via a global
    array.
  • No public surface changes: the Ruby API (DuckDB::*) and the
    duckdb_native artifact name are untouched.
  • An earlier revision renamed function_executor.{c,h} -> executor.{c,h};
    the rename was dropped (the name is accurate and matches the
    *_function_* file family).
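
The routing rule in the first design note can be written down as a small decision function. This is a hypothetical Ruby model for illustration only; the real decision is made in the C extension. The PR text spells out only Case 3, so the descriptions of Cases 1 and 2 below (Ruby thread holding the GVL, Ruby thread without it) are an assumption, as are all of the names.

```ruby
# Hypothetical model of the dispatch decision; none of these names exist in
# ruby-duckdb. Cases 1 and 2 are assumed, only Case 3 is taken from the PR.
def choose_dispatch(ruby_thread:, holds_gvl:, proxy:)
  return :call_directly if ruby_thread && holds_gvl  # Case 1 (assumed)
  return :reacquire_gvl_then_call if ruby_thread     # Case 2 (assumed)

  # Case 3: non-Ruby thread. Prefer the worker's own proxy when one exists,
  # otherwise fall back to the single global executor.
  proxy ? :worker_proxy : :global_executor
end

puts choose_dispatch(ruby_thread: false, holds_gvl: false, proxy: :some_proxy)
puts choose_dispatch(ruby_thread: false, holds_gvl: false, proxy: nil)
```

The point of the model is that a missing proxy (older DuckDB, or a failed proxy creation) degrades to the previous behavior rather than to an error.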

Verification

  • Full suite green at the tip: 1149 runs, 0 failures (the scalar commit's
    1148 are now covered by main). rake compile clean (only the
    pre-existing -Wshorten-64-to-32 warning); RuboCop clean.
  • The lifecycle tests record which Ruby threads run callbacks and assert
    more than two distinct threads — the global executor structurally caps at
    two, so each test fails without its commit (verified red/green against a
    proxy-less build).
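
The recording technique used by the lifecycle tests can be sketched in isolation. The example below is a minimal illustration, not the project's test code: `ThreadRecorder` is a hypothetical helper, and four plain Ruby threads stand in for the threads on which DuckDB would invoke the callback.

```ruby
require 'set'

# Hypothetical helper: remembers every Ruby thread that ran the callback.
class ThreadRecorder
  def initialize
    @mutex = Mutex.new
    @threads = Set.new
  end

  # Called from inside the callback; safe to call from many threads.
  def record
    @mutex.synchronize { @threads << Thread.current }
  end

  def distinct_count
    @mutex.synchronize { @threads.size }
  end
end

recorder = ThreadRecorder.new
callback = lambda do |value|
  recorder.record
  value * 2
end

# Stand-in for DuckDB workers: four plain threads each invoke the callback.
results = Array.new(4) { |i| Thread.new { callback.call(i) } }.map(&:value)

# Check result correctness and the number of distinct callback threads.
raise 'wrong results' unless results == [0, 2, 4, 6]
raise 'expected more than two callback threads' unless recorder.distinct_count > 2

puts "callbacks ran on #{recorder.distinct_count} distinct threads"
```

Counting distinct threads does not depend on the callbacks overlapping in time, so the assertion holds regardless of how the scheduler interleaves them.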

Wire the table execute path to per-worker proxy threads on DuckDB
>= 1.5.0. A local_init callback registered via
duckdb_table_function_set_local_init runs once per worker thread, creates
a proxy (allocating its Ruby thread under the GVL through the global
executor, since local_init runs on a non-Ruby thread), and stores it as
thread-local init data via duckdb_init_set_init_data. The execute
callback retrieves that proxy with duckdb_function_get_local_init_data
and dispatches through it via
rbduckdb_function_executor_dispatch_via_proxy, so callbacks from
different workers run concurrently instead of serializing on the single
global executor. bind and init stay on the global executor. DuckDB frees
each proxy through rbduckdb_worker_proxy_destroy.

The proxy-creating wrapper runs rbduckdb_worker_proxy_create under
rb_protect, implementing the raise contract documented on that function:
the executor runs callbacks unprotected, so an uncaught raise would
longjmp past its done-signaling and block the waiting DuckDB worker
forever. On failure the proxy stays NULL and the execute callback falls
back to the global executor.
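
The raise contract has a close analog in pure Ruby. The sketch below is a loose illustration with hypothetical names throughout: `run_on_executor` models an executor that sends its completion signal only if the job returns normally, and `create_proxy_protected` plays the role of the rb_protect wrapper by turning any raise into a nil proxy.

```ruby
# Hypothetical stand-in for proxy creation, which may raise.
def create_proxy(broken:)
  raise 'could not start proxy thread' if broken

  :proxy
end

# Protected wrapper: any raise becomes a nil proxy, so the job always returns
# and the completion signal below is always sent. Rescuing Exception is a
# deliberate choice here to loosely mirror how broadly rb_protect traps.
def create_proxy_protected(broken:)
  create_proxy(broken: broken)
rescue Exception # rubocop:disable Lint/RescueException
  nil
end

# Models an executor that runs jobs unprotected: the waiter is released only
# when the job returns, because the signal is sent after job.call.
def run_on_executor(job)
  done = Queue.new
  Thread.new { done << job.call }
  done.pop
end

failed = run_on_executor(-> { create_proxy_protected(broken: true) })
created = run_on_executor(-> { create_proxy_protected(broken: false) })

# A nil proxy routes execute callbacks to the global executor instead.
failed_route = failed ? :worker_proxy : :global_executor
created_route = created ? :worker_proxy : :global_executor

puts "creation failed  -> #{failed_route}"
puts "creation worked  -> #{created_route}"
```

Passing the unprotected `create_proxy(broken: true)` to `run_on_executor` instead would skip the `done <<` step, which is the Ruby-level counterpart of the blocked DuckDB worker described above.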

On DuckDB < 1.5.0 the local_init hook is absent and the execute callback
keeps using the global executor unchanged.

The added test records which Ruby threads run the execute callback and
asserts more than two distinct threads, which the old implementation can
never produce (calling thread plus the single global executor), in
addition to result correctness. Verified to fail against a build without
this change. Simultaneity assertions are avoided as scheduler-dependent.

sample/issue1136.rb gains a table UDF section: a GVL-releasing emitter
with both planner hints (set_cardinality, max_threads) that demonstrates
the throughput win (locally about 4x at SET threads=4, execute callbacks
on 4 distinct Ruby threads instead of 1).

otegami force-pushed the feature/per-worker-proxy branch from 354fe36 to 6b0fd95 on June 7, 2026 07:46

otegami (Owner, Author) commented Jun 9, 2026

Done at upstream.

otegami closed this on Jun 9, 2026