Skip to content

feat(cluster): TP-default dispatch + TensorParallelEngine scaffold#194

Open
anupsv wants to merge 4 commits into
rdma-connection-testfrom
feat/tp-default-dispatch
Open

feat(cluster): TP-default dispatch + TensorParallelEngine scaffold#194
anupsv wants to merge 4 commits into
rdma-connection-testfrom
feat/tp-default-dispatch

Conversation

@anupsv
Copy link
Copy Markdown
Contributor

@anupsv anupsv commented May 21, 2026

Summary

Adds the parallelism dispatcher and tensor-parallel inference scaffold to provider-swift. TP becomes the default strategy on 2-Mac Thunderbolt 5 clusters; PP remains as the fallback for models without a *TP variant, hardware that doesn't support DistributedGroup (non-M5 / rdma_ctl disabled), or when heads don't shard evenly.

Why TP > PP for 2-Mac single-stream decode

PP (existing) TP (new)
Per-token rank-0 work layers 0..N/2 layers 0..N at half-width
Per-token rank-1 work layers N/2..N layers 0..N at half-width
Per-Mac GPU utilization ~50% (ranks take turns) ~100% (both ranks always working)
Per-token cross-Mac transfers 1 activation tensor 2N small allreduces
Single-stream decode latency ~T_compute ~T_compute / 2 + 2N · T_allreduce ≈ ½ × PP on TB5

Stack

This PR is the third in a 3-PR stack. The submodule pointers in this PR pick up the heads of the matching PRs:

Layer PR Submodule SHA
Cmlx library product + jaccl backend Layr-Labs/mlx-swift#2 (prerequisite, already on this branch's base)
Sharded linear primitives Layr-Labs/mlx-swift#3 libs/mlx-swifta83e602
LlamaModelTP Layr-Labs/mlx-swift-lm#25 libs/mlx-swift-lmb06fa03
This PR: dispatch + CLI flag (here)

All three forks' PRs need to merge before this lands cleanly on master.

What's in this PR

New files

File Purpose
provider-swift/Sources/ProviderCore/P2P/Parallelism.swift Parallelism enum (tp/pp/single/auto) + Parallelism.decide(DecisionInputs) returning (chosen, reason). Encapsulates the fallback table below.
provider-swift/Sources/ProviderCore/P2P/TensorParallelInference.swift TensorParallelEngine (rank 0) + TensorParallelServer (rank 1) scaffolds, parallel to EncryptedPipelineEngine / EncryptedPipelineServer. The decode loop is intentionally TODO — see "Status" section below.
provider-swift/Tests/ProviderCoreTests/ParallelismTests.swift 14 tests covering every branch of the decision table. All pass.

Modified

File Change
provider-swift/Sources/darkbloom/StartCommand.swift New --parallelism {auto, tp, pp, single} flag (default auto). Logged at RDMA-enabled startup.
libs/mlx-swift submodule Bumped to a83e602 (PR #3 head).
libs/mlx-swift-lm submodule Bumped to b06fa03 (PR #25 head).

Auto-decide fallback table

operatorChoice Condition → Chosen Reason
any worldSize < 2 single "no cluster peer connected"
single single operator explicit
pp pp operator explicit
tp distributedGroupAvailable == false single (refuses to downgrade) operator asked for TP; cap missing
tp modelHasTPVariant == false single (refuses to downgrade) operator asked for TP; model has no TP variant
tp heads don't divide single operator asked for TP; heads don't divide
tp all OK tp operator explicit + achievable
auto distributedGroupAvailable == false pp DistributedGroup unavailable
auto modelHasTPVariant == false pp loaded model has no TP variant
auto heads don't divide pp heads don't divide evenly
auto all OK tp auto → tp: all capabilities OK

Explicit --parallelism tp failing closed (rather than silently downgrading to PP) is deliberate — if an operator asked for TP, a missing capability is a misconfiguration worth surfacing instead of masking with a worse-performance fallback.

Status: scaffold, not full runtime wiring

The cluster runtime in ClusterCommand.swift is currently a stub (see line 431: "Integrate EncryptedPipelineEngine with the coordinator request queue to route inference."). Both PP and TP engines exist as code paths but neither is currently driving live inference — the request queue → cluster session integration is the missing piece.

This PR delivers:

What's deferred to a follow-up:

  • ❌ Wiring the engines into ClusterSession / ClusterPeer driven by the coordinator request queue
  • ❌ Real jaccl DistributedGroup initialization (env vars, two-Mac handshake coordination, MR-count tuning)
  • ❌ Two-Mac Thunderbolt 5 smoke test that runs the full TP loop end-to-end
  • ❌ Quantized TP support in LlamaModelTP.sanitize (currently 4-bit weights are passed through unsliced — works on a single rank but doesn't shard memory)

Test plan

  • swift build succeeds via the d-inference workspace
  • swift test --filter "parallelism" passes all 14 dispatcher tests
  • swift test --filter "LlamaTPTests" passes all 8 model-level tests (via the mlx-swift-lm submodule)
  • No regression for existing single-rank or PP code paths — engines are pure additions
  • Two-Mac TB5 smoke test once the runtime wiring lands

Related


View with Codesmith Autofix with Codesmith
Need help on this PR? Tag @codesmith with what you need. Autofix is disabled.

Adds the Parallelism dispatcher that selects between tensor-parallel,
pipeline-parallel, and single-rank inference based on operator choice
and runtime capability. TP wins by default on 2-Mac Thunderbolt 5
clusters because both Macs run all transformer layers in parallel
with per-layer allreduce, roughly halving single-stream decode latency
vs PP (where one Mac waits while the other runs).

New:
- ProviderCore/P2P/Parallelism.swift: enum {tp, pp, single, auto} +
  decide() that takes DecisionInputs (operatorChoice, worldSize,
  modelHasTPVariant, attentionHeads, kvHeads, distributedGroupAvailable)
  and returns the chosen strategy with an operator-readable reason.
- ProviderCore/P2P/TensorParallelInference.swift: TensorParallelEngine
  (rank 0) and TensorParallelServer (rank 1) scaffolds, parallel to
  EncryptedPipelineEngine/EncryptedPipelineServer. The decode loop is
  intentionally TODO — the cluster runtime is currently a stub at
  ClusterCommand.swift:431 ("Integrate ... with the coordinator request
  queue"). Wiring will land in a follow-up that touches both PP and TP
  engines together.
- darkbloom CLI: --parallelism flag on `serve` (auto, tp, pp, single).
  Defaults to auto. Logged at startup as the operator's preference;
  the final strategy is resolved at session-handshake time when the
  peer's capabilities are known.

Auto fallback (Parallelism.decide auto path):
- worldSize < 2 → single (no peer)
- DistributedGroup unavailable (non-M5, rdma_ctl disabled, macOS < 26.2)
  → pp (TB5 transfer is enough; jaccl isn't required for PP)
- modelHasTPVariant == false (non-Llama until they get *TP variants)
  → pp via callPartial
- heads % worldSize != 0 → pp
- everything OK → tp

Explicit --parallelism tp fails closed when capabilities are missing:
falls back to single rather than silently downgrading to pp, so a
misconfiguration is visible to the operator instead of masked.

Submodule bumps:
- libs/mlx-swift → a83e602 (Layr-Labs/mlx-swift#3, sharded linear primitives)
- libs/mlx-swift-lm → b06fa03 (Layr-Labs/mlx-swift-lm#25, LlamaModelTP)

Tests (Tests/ProviderCoreTests/ParallelismTests.swift):
14 tests covering the auto-decide table, single-rank short-circuit,
explicit operator overrides, fail-closed semantics for --parallelism tp,
and the canShard divisibility helper. All pass.

Related: #193 (upstream mlx-swift distributed deviation tracker).
@vercel
Copy link
Copy Markdown

vercel Bot commented May 21, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
d-inference Ready Ready Preview May 21, 2026 7:19am
d-inference-console-ui-dev Ready Ready Preview May 21, 2026 7:19am
d-inference-landing Ready Ready Preview May 21, 2026 7:19am

Request Review

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 21, 2026

Benchmark Results

Runner: macos-15 (M1 Virtual) | Date: 2026-05-21 07:20 UTC

1-provider-streaming

1 providers, 1 users, 30 requests, concurrency=5, streaming=true

Model Providers RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit 1 0.5 GB
Metric Value
Total Requests 30
Success 4
Errors 26
Total Duration 3.792s
Throughput 1.1 req/s

Latency Decomposition

Segment Count Mean P50 P95 Max
total_e2e 4 1.714s 1.715s 1.715s 1.715s
parse 4 19µs 16µs 34µs 34µs
reserve 4 2ms 3ms 4ms 4ms
route 4 380µs 394µs 448µs 448µs
coordinator_to_provider 4 1.709s 1.709s 1.71s 1.71s

Assertion Report: FAIL

Assertion Result Detail
parse:mean<=1ms PASS mean=19.25µs (threshold=1ms)
parse:p95<=5ms PASS p95=34µs (threshold=5ms)
reserve:mean<=50ms PASS mean=2.4945ms (threshold=50ms)
reserve:p95<=200ms PASS p95=3.725ms (threshold=200ms)
encrypt:present FAIL no data for segment encrypt
dispatch:present FAIL no data for segment dispatch

1-provider-non-streaming

1 providers, 1 users, 20 requests, concurrency=5, streaming=false

Model Providers RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit 1 0.5 GB
Metric Value
Total Requests 20
Success 4
Errors 16
Total Duration 2.262s
Throughput 1.8 req/s

Latency Decomposition

Segment Count Mean P50 P95 Max
total_e2e 4 2.259s 2.26s 2.262s 2.262s
parse 4 0s 1ms 1ms 1ms
reserve 4 7ms 7ms 9ms 9ms
route 4 765µs 812µs 893µs 893µs
coordinator_to_provider 4 1.682s 1.682s 1.684s 1.684s

Assertion Report: FAIL

Assertion Result Detail
parse:mean<=1ms PASS mean=455.75µs (threshold=1ms)
parse:p95<=5ms PASS p95=1.174ms (threshold=5ms)
reserve:mean<=50ms PASS mean=6.77475ms (threshold=50ms)
reserve:p95<=200ms PASS p95=9.297ms (threshold=200ms)
encrypt:present FAIL no data for segment encrypt
dispatch:present FAIL no data for segment dispatch

7-provider-multi-model

7 providers, 5 users, 50 requests, concurrency=10, streaming=true

Model Providers RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit 4 0.5 GB
mlx-community/gemma-3-270m-4bit 3 0.2 GB
Metric Value
Total Requests 50
Success 50
Errors 0
Total Duration 10.987s
Throughput 4.6 req/s

Latency Decomposition

Segment Count Mean P50 P95 Max
total_e2e 50 610ms 4ms 3.678s 3.685s
parse 49 27µs 17µs 74µs 318µs
reserve 49 1ms 1ms 3ms 3ms
route 49 0s 0s 1ms 1ms
coordinator_to_provider 50 556ms 1ms 3.671s 3.679s

Assertion Report: FAIL

Assertion Result Detail
parse:mean<=1ms PASS mean=27.122µs (threshold=1ms)
parse:p95<=5ms PASS p95=74µs (threshold=5ms)
reserve:mean<=50ms PASS mean=1.490755ms (threshold=50ms)
reserve:p95<=200ms PASS p95=2.922ms (threshold=200ms)
encrypt:present FAIL no data for segment encrypt
dispatch:present FAIL no data for segment dispatch

3-provider-high-concurrency

3 providers, 10 users, 60 requests, concurrency=20, streaming=true

Model Providers RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit 3 0.5 GB
Metric Value
Total Requests 60
Success 12
Errors 48
Total Duration 3.685s
Throughput 3.3 req/s

Latency Decomposition

Segment Count Mean P50 P95 Max
total_e2e 12 2.159s 2.155s 2.168s 2.168s
parse 12 14µs 12µs 27µs 27µs
reserve 12 2ms 2ms 3ms 3ms
route 12 516µs 492µs 772µs 772µs
coordinator_to_provider 12 2.151s 2.148s 2.161s 2.161s

Assertion Report: FAIL

Assertion Result Detail
parse:mean<=1ms PASS mean=13.583µs (threshold=1ms)
parse:p95<=5ms PASS p95=27µs (threshold=5ms)
reserve:mean<=50ms PASS mean=2.166583ms (threshold=50ms)
reserve:p95<=200ms PASS p95=2.969ms (threshold=200ms)
encrypt:present FAIL no data for segment encrypt
dispatch:present FAIL no data for segment dispatch

1-provider-queue-saturation

1 providers, 10 users, 40 requests, concurrency=15, streaming=true

Model Providers RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit 1 0.5 GB
Metric Value
Total Requests 40
Success 5
Errors 35
Total Duration 3.141s
Throughput 1.6 req/s

Latency Decomposition

Segment Count Mean P50 P95 Max
total_e2e 5 2.322s 2.164s 2.952s 2.952s
parse 5 18µs 16µs 25µs 25µs
reserve 5 3ms 3ms 3ms 3ms
route 5 588ms 1ms 2.937s 2.937s
queue_wait 1 2.937s 2.937s 2.937s 2.937s
dispatch 1 30µs 30µs 30µs 30µs
coordinator_to_provider 5 1.728s 2.158s 2.158s 2.158s

Assertion Report: FAIL

Assertion Result Detail
parse:mean<=1ms PASS mean=18µs (threshold=1ms)
parse:p95<=5ms PASS p95=25µs (threshold=5ms)
reserve:mean<=50ms PASS mean=3.1876ms (threshold=50ms)
reserve:p95<=200ms PASS p95=3.324ms (threshold=200ms)
encrypt:present FAIL no data for segment encrypt
dispatch:mean<=5ms PASS mean=30µs (threshold=5ms)
dispatch:p95<=50ms PASS p95=30µs (threshold=50ms)

3-provider-20-users

3 providers, 20 users, 60 requests, concurrency=10, streaming=true

Model Providers RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit 3 0.5 GB
Metric Value
Total Requests 60
Success 60
Errors 0
Total Duration 5.777s
Throughput 10.4 req/s

Latency Decomposition

Segment Count Mean P50 P95 Max
total_e2e 60 399ms 5ms 2.388s 2.388s
parse 60 17µs 16µs 31µs 78µs
reserve 60 1ms 1ms 2ms 2ms
route 60 453µs 411µs 783µs 901µs
coordinator_to_provider 60 396ms 2ms 2.38s 2.382s

Assertion Report: FAIL

Assertion Result Detail
parse:mean<=1ms PASS mean=17.05µs (threshold=1ms)
parse:p95<=5ms PASS p95=31µs (threshold=5ms)
reserve:mean<=50ms PASS mean=1.27505ms (threshold=50ms)
reserve:p95<=200ms PASS p95=2.378ms (threshold=200ms)
encrypt:present FAIL no data for segment encrypt
dispatch:present FAIL no data for segment dispatch

1-provider-scaling

1 providers, 5 users, 30 requests, concurrency=10, streaming=true

Model Providers RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit 1 0.5 GB
Metric Value
Total Requests 30
Success 4
Errors 26
Total Duration 3.134s
Throughput 1.3 req/s

Latency Decomposition

Segment Count Mean P50 P95 Max
total_e2e 4 2.163s 2.163s 2.163s 2.163s
parse 4 18µs 18µs 29µs 29µs
reserve 4 2ms 2ms 3ms 3ms
route 4 515µs 486µs 803µs 803µs
coordinator_to_provider 4 2.158s 2.158s 2.158s 2.158s

Assertion Report: FAIL

Assertion Result Detail
parse:mean<=1ms PASS mean=18µs (threshold=1ms)
parse:p95<=5ms PASS p95=29µs (threshold=5ms)
reserve:mean<=50ms PASS mean=2.39ms (threshold=50ms)
reserve:p95<=200ms PASS p95=2.509ms (threshold=200ms)
encrypt:present FAIL no data for segment encrypt
dispatch:present FAIL no data for segment dispatch

3-provider-scaling

3 providers, 5 users, 30 requests, concurrency=10, streaming=true

Model Providers RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit 3 0.5 GB
Metric Value
Total Requests 30
Success 30
Errors 0
Total Duration 4.106s
Throughput 7.3 req/s

Latency Decomposition

Segment Count Mean P50 P95 Max
total_e2e 30 675ms 7ms 2.028s 2.028s
parse 30 20µs 16µs 45µs 69µs
reserve 30 2ms 1ms 3ms 4ms
route 30 482µs 491µs 768µs 823µs
coordinator_to_provider 30 670ms 5ms 2.021s 2.021s

Assertion Report: FAIL

Assertion Result Detail
parse:mean<=1ms PASS mean=19.8µs (threshold=1ms)
parse:p95<=5ms PASS p95=45µs (threshold=5ms)
reserve:mean<=50ms PASS mean=1.711433ms (threshold=50ms)
reserve:p95<=200ms PASS p95=3.189ms (threshold=200ms)
encrypt:present FAIL no data for segment encrypt
dispatch:present FAIL no data for segment dispatch

5-provider-scaling

5 providers, 5 users, 30 requests, concurrency=10, streaming=true

Model Providers RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit 5 0.5 GB
Metric Value
Total Requests 30
Success 30
Errors 0
Total Duration 3.96s
Throughput 7.6 req/s

Latency Decomposition

Segment Count Mean P50 P95 Max
total_e2e 30 747ms 4ms 2.251s 2.252s
parse 30 19µs 15µs 49µs 66µs
reserve 30 2ms 1ms 3ms 5ms
route 30 431µs 395µs 771µs 789µs
coordinator_to_provider 30 743ms 1ms 2.243s 2.245s

Assertion Report: FAIL

Assertion Result Detail
parse:mean<=1ms PASS mean=19.033µs (threshold=1ms)
parse:p95<=5ms PASS p95=49µs (threshold=5ms)
reserve:mean<=50ms PASS mean=1.7278ms (threshold=50ms)
reserve:p95<=200ms PASS p95=3.346ms (threshold=200ms)
encrypt:present FAIL no data for segment encrypt
dispatch:present FAIL no data for segment dispatch

3-provider-heavy-100conc-10kb

3 providers, 20 users, 100 requests, concurrency=100, streaming=true

Model Providers RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit 3 0.5 GB
Metric Value
Total Requests 100
Success 13
Errors 87
Total Duration 3.389s
Throughput 3.8 req/s

Latency Decomposition

Segment Count Mean P50 P95 Max
total_e2e 13 2.229s 2.147s 3.22s 3.22s
parse 13 294µs 141µs 980µs 980µs
reserve 13 7ms 7ms 9ms 9ms
route 13 260ms 18ms 3.167s 3.167s
queue_wait 1 3.167s 3.167s 3.167s 3.167s
dispatch 1 59µs 59µs 59µs 59µs
coordinator_to_provider 13 1.939s 2.099s 2.101s 2.101s

Assertion Report: FAIL

Assertion Result Detail
parse:mean<=1ms PASS mean=293.769µs (threshold=1ms)
parse:p95<=5ms PASS p95=980µs (threshold=5ms)
reserve:mean<=50ms PASS mean=7.441076ms (threshold=50ms)
reserve:p95<=200ms PASS p95=9.168ms (threshold=200ms)
encrypt:present FAIL no data for segment encrypt
dispatch:mean<=5ms PASS mean=59µs (threshold=5ms)
dispatch:p95<=50ms PASS p95=59µs (threshold=50ms)

Picks up the quantized TP variant + matrix-level sharding tests on
Layr-Labs/mlx-swift-lm#25 (commit 56a5ca6). This makes makeLlamaTP()
available for routing the dispatcher to the right TP variant when
a 4-bit / 8-bit quantized checkpoint is loaded.
Picks up Layr-Labs/mlx-swift-lm#25 commit 735c28a — drops the
misleading LoRA conformance on LlamaModelTP/Q (returns [] with a
docstring pointing at LlamaModel for LoRA workflows) and renames the
testLlamaTPRejectsNonDivisibleHeads test to match what it actually
exercises.
TensorParallelEngine / TensorParallelServer held `model: LlamaModelTP`
as a concrete type, which meant the dispatcher couldn't pass the
quantized `LlamaModelTPQ` returned by `makeLlamaTP(args:quantization:group:)`.
Without this fix, the entire quantized-TP path was dead code — type
checking would reject the quantized model at construction time.

Both fields now hold `any LLMModel`. The engine needs only the
callAsFunction(_:cache:) and newCache(parameters:) operations from the
LanguageModel protocol (which LLMModel refines), and both LlamaModelTP
and LlamaModelTPQ conform.

Callers (the dispatcher in ClusterDiscovery, once wired) are
responsible for passing a TP-capable model — a non-TP `LlamaModel`
would type-check but produce single-rank semantics. Document this
explicitly in the property comment.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant