feat(cluster): TP-default dispatch + TensorParallelEngine scaffold #194
anupsv wants to merge 4 commits
Adds the Parallelism dispatcher that selects between tensor-parallel,
pipeline-parallel, and single-rank inference based on operator choice
and runtime capability. TP wins by default on 2-Mac Thunderbolt 5
clusters because both Macs run all transformer layers in parallel
with per-layer allreduce, roughly halving single-stream decode latency
vs PP (where one Mac waits while the other runs).
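
The latency claim above can be illustrated with an idealized per-token cost model. This is only a sketch: `DecodeCost`, its fields, and every number below are hypothetical, and it ignores overlap, kernel launch overhead, and memory-bandwidth effects.

```swift
// Idealized single-stream, per-token latency model for a 2-Mac cluster.
// All names and numbers here are illustrative assumptions, not measurements.
struct DecodeCost {
    let layers: Int          // transformer layers in the model
    let layerMs: Double      // time for one Mac to run one full layer alone
    let allReduceMs: Double  // allreduce cost paid once per layer under TP
    let hopMs: Double        // one activation transfer between Macs under PP

    // PP: the two Macs run their halves of the layers one after the other,
    // so a single stream still pays for every layer, plus a hop out and back.
    var pipelineMs: Double { Double(layers) * layerMs + 2 * hopMs }

    // TP: each layer's work is split across both Macs (half each),
    // followed by one allreduce per layer.
    var tensorMs: Double { Double(layers) * (layerMs / 2 + allReduceMs) }
}

let cost = DecodeCost(layers: 32, layerMs: 1.0, allReduceMs: 0.05, hopMs: 0.1)
print("PP:", cost.pipelineMs, "ms  TP:", cost.tensorMs, "ms")
```

Under this model TP approaches half the PP latency only while the per-layer allreduce stays small relative to the per-layer compute.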
New:
- ProviderCore/P2P/Parallelism.swift: enum {tp, pp, single, auto} +
decide() that takes DecisionInputs (operatorChoice, worldSize,
modelHasTPVariant, attentionHeads, kvHeads, distributedGroupAvailable)
and returns the chosen strategy with an operator-readable reason.
- ProviderCore/P2P/TensorParallelInference.swift: TensorParallelEngine
(rank 0) and TensorParallelServer (rank 1) scaffolds, parallel to
EncryptedPipelineEngine/EncryptedPipelineServer. The decode loop is
intentionally TODO — the cluster runtime is currently a stub at
ClusterCommand.swift:431 ("Integrate ... with the coordinator request
queue"). Wiring will land in a follow-up that touches both PP and TP
engines together.
- darkbloom CLI: --parallelism flag on `serve` (auto, tp, pp, single).
Defaults to auto. Logged at startup as the operator's preference;
the final strategy is resolved at session-handshake time when the
peer's capabilities are known.
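
As a rough illustration of the flag's value handling, the four accepted values map naturally onto a string-backed enum. `ParallelismChoice` and `parseParallelism` are hypothetical names for this sketch; the actual CLI wiring in darkbloom is not shown in this PR text and may use a different parser.

```swift
// Operator preference as typed on the command line (auto, tp, pp, single).
enum ParallelismChoice: String, CaseIterable {
    case auto, tp, pp, single
}

// Hypothetical helper: scans raw arguments for `--parallelism <value>`.
// Returns .auto when the flag is absent, and nil when the value is
// missing or not one of the four accepted strings.
func parseParallelism(_ args: [String]) -> ParallelismChoice? {
    guard let flagIndex = args.firstIndex(of: "--parallelism") else {
        return .auto
    }
    let valueIndex = flagIndex + 1
    guard valueIndex < args.count else { return nil }
    return ParallelismChoice(rawValue: args[valueIndex])
}

if let choice = parseParallelism(["serve", "--parallelism", "tp"]) {
    print("operator preference:", choice.rawValue)
}
```

Returning nil for an unrecognized value lets the caller reject the command line instead of quietly treating a typo as `auto`.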
Auto fallback (Parallelism.decide auto path):
- worldSize < 2 → single (no peer)
- DistributedGroup unavailable (non-M5, rdma_ctl disabled, macOS < 26.2)
→ pp (TB5 transfer is enough; jaccl isn't required for PP)
- modelHasTPVariant == false (non-Llama until they get *TP variants)
→ pp via callPartial
- heads % worldSize != 0 → pp
- everything OK → tp
Explicit --parallelism tp fails closed when capabilities are missing:
falls back to single rather than silently downgrading to pp, so a
misconfiguration is visible to the operator instead of masked.
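
The fallback rules above can be sketched as a single function. This is a reconstruction from the rules listed in this message, not the code in Parallelism.swift: the field names come from DecisionInputs as described, but the check order, the reason strings, and the assumption that both attentionHeads and kvHeads must divide evenly are guesses.

```swift
enum Parallelism { case tp, pp, single, auto }

struct DecisionInputs {
    var operatorChoice: Parallelism
    var worldSize: Int
    var modelHasTPVariant: Bool
    var attentionHeads: Int
    var kvHeads: Int
    var distributedGroupAvailable: Bool
}

// Sketch of the decision table: auto downgrades to pp, explicit tp fails
// closed to single. Reason strings are illustrative only.
func decide(_ inputs: DecisionInputs) -> (chosen: Parallelism, reason: String) {
    if inputs.worldSize < 2 { return (.single, "no peer (worldSize < 2)") }
    switch inputs.operatorChoice {
    case .single:
        return (.single, "operator requested single")
    case .pp:
        return (.pp, "operator requested pp")
    case .tp, .auto:
        // Explicit tp refuses to downgrade to pp; auto falls back to pp.
        let fallback: Parallelism = inputs.operatorChoice == .tp ? .single : .pp
        if !inputs.distributedGroupAvailable {
            return (fallback, "DistributedGroup unavailable")
        }
        if !inputs.modelHasTPVariant {
            return (fallback, "model has no TP variant")
        }
        // Assumption: both head counts must be divisible by worldSize.
        if inputs.attentionHeads % inputs.worldSize != 0
            || inputs.kvHeads % inputs.worldSize != 0 {
            return (fallback, "heads not divisible by worldSize")
        }
        return (.tp, "all TP capabilities present")
    }
}

var inputs = DecisionInputs(operatorChoice: .auto, worldSize: 2,
                            modelHasTPVariant: true, attentionHeads: 32,
                            kvHeads: 8, distributedGroupAvailable: true)
print(decide(inputs))
inputs.distributedGroupAvailable = false
print(decide(inputs))
```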
Submodule bumps:
- libs/mlx-swift → a83e602 (Layr-Labs/mlx-swift#3, sharded linear primitives)
- libs/mlx-swift-lm → b06fa03 (Layr-Labs/mlx-swift-lm#25, LlamaModelTP)
Tests (Tests/ProviderCoreTests/ParallelismTests.swift):
14 tests covering the auto-decide table, single-rank short-circuit,
explicit operator overrides, fail-closed semantics for --parallelism tp,
and the canShard divisibility helper. All pass.
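
For readers unfamiliar with the divisibility check, a minimal sketch follows. The signature is hypothetical (the real `canShard` parameters are not shown in this PR text), and it assumes that both attention heads and KV heads must split evenly across ranks.

```swift
// Hypothetical signature: true when every rank can own an equal,
// whole number of attention heads and KV heads.
func canShard(attentionHeads: Int, kvHeads: Int, worldSize: Int) -> Bool {
    guard worldSize > 0 else { return false }
    return attentionHeads % worldSize == 0 && kvHeads % worldSize == 0
}

// Table-style checks in the spirit of the dispatcher tests.
let cases: [(heads: Int, kv: Int, world: Int)] = [
    (32, 8, 2),  // both divide evenly
    (32, 8, 3),  // neither divides by 3
    (32, 1, 2),  // a single KV head cannot be split across 2 ranks
]
for c in cases {
    let ok = canShard(attentionHeads: c.heads, kvHeads: c.kv, worldSize: c.world)
    print(c, ok)
}
```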
Related: #193 (upstream mlx-swift distributed deviation tracker).
Benchmark Results

| Scenario | Providers | Users | Requests | Concurrency | Streaming | Assertion Report |
|---|---|---|---|---|---|---|
| 1-provider-streaming | 1 | 1 | 30 | 5 | true | FAIL |
| 1-provider-non-streaming | 1 | 1 | 20 | 5 | false | FAIL |
| 7-provider-multi-model | 7 | 5 | 50 | 10 | true | FAIL |
| 3-provider-high-concurrency | 3 | 10 | 60 | 20 | true | FAIL |
| 1-provider-queue-saturation | 1 | 10 | 40 | 15 | true | FAIL |
| 3-provider-20-users | 3 | 20 | 60 | 10 | true | FAIL |
| 1-provider-scaling | 1 | 5 | 30 | 10 | true | FAIL |
| 3-provider-scaling | 3 | 5 | 30 | 10 | true | FAIL |
| 5-provider-scaling | 5 | 5 | 30 | 10 | true | FAIL |
| 3-provider-heavy-100conc-10kb | 3 | 20 | 100 | 100 | true | FAIL |
Picks up the quantized TP variant + matrix-level sharding tests on Layr-Labs/mlx-swift-lm#25 (commit 56a5ca6). This makes makeLlamaTP() available for routing the dispatcher to the right TP variant when a 4-bit / 8-bit quantized checkpoint is loaded.
Picks up Layr-Labs/mlx-swift-lm#25 commit 735c28a — drops the misleading LoRA conformance on LlamaModelTP/Q (returns [] with a docstring pointing at LlamaModel for LoRA workflows) and renames the testLlamaTPRejectsNonDivisibleHeads test to match what it actually exercises.
TensorParallelEngine / TensorParallelServer held `model: LlamaModelTP` as a concrete type, which meant the dispatcher couldn't pass the quantized `LlamaModelTPQ` returned by `makeLlamaTP(args:quantization:group:)`. Without this fix, the entire quantized-TP path was dead code: type checking would reject the quantized model at construction time.

Both fields now hold `any LLMModel`. The engine needs only the `callAsFunction(_:cache:)` and `newCache(parameters:)` operations from the LanguageModel protocol (which LLMModel refines), and both LlamaModelTP and LlamaModelTPQ conform.

Callers (the dispatcher in ClusterDiscovery, once wired) are responsible for passing a TP-capable model: a non-TP `LlamaModel` would type-check but produce single-rank semantics. Document this explicitly in the property comment.
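
The difference between a concrete-typed field and an existential one can be shown with stand-in types. Everything below is hypothetical scaffolding: `TPModelStandIn`, `DenseStandIn`, `QuantizedStandIn`, and `EngineSketch` only mimic the shape of the change and do not use the MLX protocols or the real engine.

```swift
// Stand-in for the two operations the engine relies on (hypothetical,
// simplified signatures; not the LanguageModel protocol).
protocol TPModelStandIn {
    func newCache() -> [Int]
    func callAsFunction(_ token: Int, cache: inout [Int]) -> Int
}

// Plays the role of the dense TP model.
struct DenseStandIn: TPModelStandIn {
    func newCache() -> [Int] { [] }
    func callAsFunction(_ token: Int, cache: inout [Int]) -> Int {
        cache.append(token)
        return token + 1
    }
}

// Plays the role of the quantized TP model.
struct QuantizedStandIn: TPModelStandIn {
    func newCache() -> [Int] { [] }
    func callAsFunction(_ token: Int, cache: inout [Int]) -> Int {
        cache.append(token)
        return token + 2
    }
}

struct EngineSketch {
    // Had this been `let model: DenseStandIn`, passing a QuantizedStandIn
    // would fail to compile. The existential accepts any conformer.
    let model: any TPModelStandIn

    func decode(from first: Int, steps: Int) -> [Int] {
        var cache = model.newCache()
        var token = first
        var output: [Int] = []
        for _ in 0..<steps {
            token = model.callAsFunction(token, cache: &cache)
            output.append(token)
        }
        return output
    }
}

print(EngineSketch(model: DenseStandIn()).decode(from: 0, steps: 3))
print(EngineSketch(model: QuantizedStandIn()).decode(from: 0, steps: 3))
```

The trade-off is the one noted in the commit message: the existential also accepts conformers that are not TP-capable, so correctness becomes the caller's responsibility rather than the type checker's.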
Summary
Adds the parallelism dispatcher and tensor-parallel inference scaffold to provider-swift. TP becomes the default strategy on 2-Mac Thunderbolt 5 clusters; PP remains as the fallback for models without a `*TP` variant, hardware that doesn't support `DistributedGroup` (non-M5 / `rdma_ctl` disabled), or when heads don't shard evenly.
Why TP > PP for 2-Mac single-stream decode
Stack
This PR is the third in a 3-PR stack. The submodule pointers in this PR pick up the heads of the matching PRs:
- `Cmlx` library product + jaccl backend: Layr-Labs/mlx-swift#2
- Layr-Labs/mlx-swift#3: `libs/mlx-swift` → `a83e602`
- `LlamaModelTP`: Layr-Labs/mlx-swift-lm#25, `libs/mlx-swift-lm` → `b06fa03`

All three forks' PRs need to merge before this lands cleanly on master.
What's in this PR
New files
- `provider-swift/Sources/ProviderCore/P2P/Parallelism.swift`: `Parallelism` enum (`tp` / `pp` / `single` / `auto`) + `Parallelism.decide(DecisionInputs)` returning `(chosen, reason)`. Encapsulates the fallback table below.
- `provider-swift/Sources/ProviderCore/P2P/TensorParallelInference.swift`: `TensorParallelEngine` (rank 0) + `TensorParallelServer` (rank 1) scaffolds, parallel to `EncryptedPipelineEngine` / `EncryptedPipelineServer`. The decode loop is intentionally TODO; see the "Status" section below.
- `provider-swift/Tests/ProviderCoreTests/ParallelismTests.swift`
Modified
- `provider-swift/Sources/darkbloom/StartCommand.swift`: `--parallelism {auto, tp, pp, single}` flag (default `auto`). Logged at RDMA-enabled startup.
- `libs/mlx-swift` submodule: `a83e602` (PR #3 head).
- `libs/mlx-swift-lm` submodule: `b06fa03` (PR #25 head).
Auto-decide fallback table

| `operatorChoice` | Condition | Chosen |
|---|---|---|
| … | `worldSize < 2` | `single` |
| `single` | … | `single` |
| `pp` | … | `pp` |
| `tp` | `distributedGroupAvailable == false` | `single` (refuses to downgrade) |
| `tp` | `modelHasTPVariant == false` | `single` (refuses to downgrade) |
| `tp` | … | `single` |
| `tp` | … | `tp` |
| `auto` | `distributedGroupAvailable == false` | `pp` |
| `auto` | `modelHasTPVariant == false` | `pp` |
| `auto` | … | `pp` |
| `auto` | … | `tp` |

Explicit `--parallelism tp` failing closed (rather than silently downgrading to PP) is deliberate: if an operator asked for TP, a missing capability is a misconfiguration worth surfacing instead of masking with a worse-performance fallback.
Status: scaffold, not full runtime wiring
The cluster runtime in `ClusterCommand.swift` is currently a stub (see line 431: "Integrate EncryptedPipelineEngine with the coordinator request queue to route inference."). Both PP and TP engines exist as code paths but neither is currently driving live inference; the request queue → cluster session integration is the missing piece.
This PR delivers:
- `Parallelism.decide`: fully implemented, 14 tests passing
- `--parallelism` CLI flag: wired to ProviderLoopConfig
- `TensorParallelEngine` / `TensorParallelServer` API shape: matches `EncryptedPipelineEngine` / `EncryptedPipelineServer` so the wiring follow-up is a mechanical swap
What's deferred to a follow-up:
- `ClusterSession` / `ClusterPeer` driven by the coordinator request queue
- `DistributedGroup` initialization (env vars, two-Mac handshake coordination, MR-count tuning)
- `LlamaModelTP.sanitize` (currently 4-bit weights are passed through unsliced; this works on a single rank but doesn't shard memory)
Test plan
- `swift build` succeeds via the d-inference workspace
- `swift test --filter "parallelism"` passes all 14 dispatcher tests
- `swift test --filter "LlamaTPTests"` passes all 8 model-level tests (via the mlx-swift-lm submodule)
Related
ml-explore/mlx-swift#371