DoT is a high-performance distributed reasoning framework for Large Language Models. It extends the Graph of Thoughts paradigm into a directed acyclic graph (DAG) structure, enabling LLMs to solve complex multi-step problems by exploring parallel reasoning paths, merging insights across branches, and dynamically managing the search budget through a vectorized control policy.
DoT is architected for distributed execution at scale: it uses MPI (via mpi4py) for the hierarchical reasoning orchestration layer and Ray Serve with vLLM for high-throughput, GPU-accelerated model inference.
DoT bridges the gap between algorithmic reasoning and bare-metal HPC orchestration, shifting the execution bottleneck away from synchronous CPU logic and onto the physical accelerators themselves.
- Massive Throughput & Cost Reduction: Bare-metal evaluations demonstrate up to a 122.5× speedup over sequential Structured Reasoning Traversals (SRTs). Extrapolated to a 10,000-task heuristic search, DoT compresses the workload makespan from 116 days to 15.5 hours, reducing estimated cloud compute costs by 99%.
- Elimination of Parallel Starvation: By decoupling state synchronization and deterministic tool execution from the tensor backend, DoT's asynchronous control plane hides high-latency operations behind continuous GPU compute, sustaining 94.0% median GPU pressure.
- Deterministic Agentic Offloading: Requires all deterministic operations and math to be offloaded to an external Python sandbox (`CodeSandbox`). This prevents context-window saturation and hallucination loops, allowing the LLM to act purely as a stateless router for latent synthesis (see the sketch after this list).
- Scale-Out MPI Orchestration: Proven scalability across multi-node, multi-GPU clusters. Dual-payload I/O isolation cuts dispatch bandwidth by 87.5%, while strict tail-latency preemption sustains 67.5% strong-scaling and 77.0% weak-scaling makespan efficiency on 16-node multi-hop topologies.
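The offloading pattern itself is easy to picture. The sketch below is not the project's `CodeSandbox` (which adds stricter isolation, resource limits, and structured result parsing); it is a minimal stand-in showing the flow: the model emits a deterministic snippet, a subprocess executes it under a timeout, and only the short textual result re-enters the context.

```python
import subprocess
import sys

def run_in_sandbox(code: str, timeout_s: float = 5.0) -> str:
    """Execute model-emitted Python in an isolated subprocess.

    Minimal sketch only -- the real CodeSandbox is more defensive.
    Raises subprocess.TimeoutExpired if the snippet runs too long.
    """
    proc = subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores user site-packages
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    if proc.returncode != 0:
        return f"SANDBOX_ERROR: {proc.stderr.strip()}"
    return proc.stdout.strip()

# The LLM proposes code; the sandbox returns the deterministic answer,
# which is fed back into the reasoning graph as a verified result.
print(run_in_sandbox("print(sorted([9, 3, 7, 1]))"))  # [1, 3, 7, 9]
```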
- System Architecture Overview — The three-tier MPI hierarchy (GlobalOrchestrator → NodeCoordinator → MPIWorker), the Ray Serve gRPC serving layer, and end-to-end task lifecycle.
- Performance & Throughput — Asynchronous pipelining, sequence-parallel decoding, KV-cache affinity routing, and NUMA topology binding.
- Accuracy & Reasoning — The 5×5 vectorized controller, speculative ghost nodes, DAG convergence, and deterministic code sandbox verification (a toy controller sketch follows this list).
- Benchmarking & Profiling — Slurm dispatcher workflows, NUMA binding modes, Python `cProfile` analysis, and PyTorch vLLM tracing.
- Configuration Reference — A complete breakdown of all `config.yaml` parameters, default values, and operational behaviors for tuning the framework.
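To give a rough feel for what a vectorized control policy looks like, here is a toy version. The state and action names below are invented for illustration and do not match the real controller's schema; the point is that the policy scores every frontier node against every action with a single matrix gather instead of branching per node:

```python
import numpy as np

# Hypothetical labels -- the real controller's states/actions differ.
STATES = ["fresh", "expanded", "merged", "verified", "stalled"]
ACTIONS = ["expand", "merge", "verify", "prune", "halt"]

# One 5x5 score matrix: rows = node states, cols = actions.
policy = np.array([
    [0.9, 0.1, 0.2, 0.0, 0.0],   # fresh    -> mostly expand
    [0.2, 0.8, 0.4, 0.1, 0.0],   # expanded -> merge or verify
    [0.1, 0.1, 0.9, 0.2, 0.1],   # merged   -> verify
    [0.0, 0.0, 0.1, 0.3, 0.9],   # verified -> halt
    [0.1, 0.2, 0.1, 0.9, 0.3],   # stalled  -> prune
])

def choose_actions(node_states: np.ndarray) -> np.ndarray:
    """Pick one action per frontier node with a single vectorized lookup."""
    scores = policy[node_states]   # (n_nodes, 5) in one gather
    return scores.argmax(axis=1)   # best action per node

frontier = np.array([0, 1, 1, 4])  # four nodes, by state index
for s, a in zip(frontier, choose_actions(frontier)):
    print(f"{STATES[s]:>9} -> {ACTIONS[a]}")
```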
- Python 3.10+
- CUDA-compatible GPUs (for vLLM inference)
- An MPI implementation (e.g., OpenMPI 4.x+)
# Clone the repository
git clone https://github.com/eshama1/digraph-of-thoughts.git
cd digraph-of-thoughts
# Run the automated setup script (creates ENV/, installs all dependencies)
bash scripts/utils/setup_env.sh
# Activate the environment
source ENV/bin/activate
Tip
If you already have a pre-configured Python environment (e.g., a shared conda or venv from a different project), you can symlink it as ENV/ so the project scripts pick it up directly: `ln -s /path/to/your/env ENV`.
Download model weights to the project root (e.g., ./Llama-3.3-70B-Instruct). The dispatcher uses rsync to synchronize these to $SCRATCH (node-local fast storage) at job startup to reduce multi-node cold-boot time.
Important
Large models (70B+) must be stored in the project root. The dispatcher automatically detects and syncs them. Do not place weights in results/ or ENV/.
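Conceptually, the sync step reduces to an `rsync` mirror into `$SCRATCH`. A minimal sketch (the real dispatcher adds per-node fan-out and failure handling, and your cluster's scratch layout may differ):

```python
import os
import subprocess

def sync_weights(model_dir: str = "Llama-3.3-70B-Instruct") -> str:
    """Mirror model weights from the project root into node-local scratch.

    Sketch only: mirrors what the dispatcher does conceptually with rsync.
    """
    scratch = os.environ["SCRATCH"]  # node-local fast storage
    dest = os.path.join(scratch, model_dir)
    # -a preserves structure; --delete keeps the mirror exact across reruns.
    subprocess.run(
        ["rsync", "-a", "--delete", f"{model_dir}/", f"{dest}/"],
        check=True,
    )
    return dest
```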
For local debugging and small-scale verification, use the lightweight run_dot.py entry point. It runs the full DoT algorithm using a single-process subprocess sandbox — no MPI or Ray cluster required:
python scripts/core/run_dot.py --task sorting --test 032
For distributed execution across a Slurm cluster, see the Benchmarking & Profiling docs.
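In the distributed path, helper scripts such as `wait_for_model.py` block until the serving layer is live before any MPI worker issues a request. A generic version of that readiness pattern looks like this (the health URL is a placeholder; the actual script queries Ray Serve's deployment status directly):

```python
import time
import urllib.request

def wait_until_ready(url: str, timeout_s: float = 600.0, poll_s: float = 5.0) -> None:
    """Poll a health endpoint until it answers, or raise after timeout_s.

    Placeholder pattern: the project's wait_for_model.py checks Ray Serve
    deployment state rather than a plain HTTP endpoint.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=poll_s) as resp:
                if resp.status == 200:
                    return
        except OSError:
            pass  # server not up yet; keep polling
        time.sleep(poll_s)
    raise TimeoutError(f"model endpoint {url} not ready after {timeout_s}s")

# wait_until_ready("http://head-node:8000/health")  # hypothetical URL
```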
digraph_of_thoughts/ # Core package
algorithm/ # Reasoning algorithm (Scheduler, Controller, Graph, Sandbox)
core/ # AsyncPipelinedScheduler, DoTController, PromptManager
graph/ # DoTGraph (DAG) and DoTNode state machine
agentic/ # CodeSandbox, RayExecutor, LocalExecutor
utils/ # Config, serialization, XML parsing, logging
orchestration/ # Distributed execution layer
core.py # DoTEngine — high-level entry point for single/multi-task runs
mpi/ # Three-tier MPI hierarchy
orchestrator.py # GlobalOrchestrator (Rank 0) — task dispatch and hedging
coordinator.py # NodeCoordinator (local Rank 0) — per-node worker pool
worker.py # MPIWorker — per-rank reasoning execution
common.py # Shared MPI tags, serialization, and persistence helpers
model_serving/ # vLLM on Ray Serve
deployment.py # Ray Serve deployment lifecycle
vllm_service.py # gRPC streaming service with metrics collection
vllm_client.py # Async gRPC client (used by MPI workers)
local_service.py # Single-GPU AsyncLLMEngine backend (no Ray)
openai_service.py # OpenAI API compatibility layer
mock_service.py # CPU-only synthetic backend for control-plane profiling
scripts/
core/ # Primary execution entry points
run_dot.py # Lightweight local runner (subprocess sandbox)
run_distributed_dot.py # Full MPI+Ray distributed runner
orchestration/ # Cluster lifecycle management
deploy_model.sh # Deploy vLLM on Ray Serve
wait_for_model.py # Poll until Ray Serve reaches RUNNING state
wait_for_workers.py # Poll until all MPI workers are ready
shutdown_deployment.py # Gracefully terminate Ray Serve and flush metrics
analysis/ # Post-run result processing and visualization
benchmarking/ # Sweep scripts, ablations, microbenchmarks, scaling batches
utils/ # Setup, proto generation, config override helpers
examples/ # Task-specific implementations (Sorting, BBEH, etc.)
docs/ # Extended architectural documentation
benchmark_dispatcher.py # Root-level Slurm dispatch orchestrator
default_config.yaml # Default algorithm and serving configuration
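All runtime behavior is driven by `default_config.yaml`; see the Configuration Reference for the full schema. For per-run tweaks without editing the file, a simple load-and-override step works (the keys in the usage comment are placeholders, not actual parameter names):

```python
import yaml  # requires PyYAML

def load_config(path: str = "default_config.yaml", **overrides) -> dict:
    """Load the YAML config and apply flat key overrides on top.

    The keys in the usage line below are placeholders; consult the
    Configuration Reference for the real parameter names.
    """
    with open(path) as f:
        cfg = yaml.safe_load(f)
    cfg.update(overrides)  # shallow merge: overrides win
    return cfg

# cfg = load_config(max_parallel_branches=8, sandbox_timeout_s=30)
```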
This project uses evaluation datasets from the Graph of Thoughts (GoT) framework. We thank the authors for making these benchmarks openly available.
The baseline comparison scripts (IO, CoT, ToT, GoT) in scripts/benchmarking/GoT_test_scripts/ are adapted from the original GoT implementations.
If you use this framework or the included GoT-based examples in your research, please cite both our work and the original GoT paper:
@article{shama2026dot,
title = {{Digraph of Thoughts: Scalable Distributed Reasoning with Large Language Models}},
author = {Shama, Ethan and Grant, Ryan E.},
year = 2026,
note = {Under review at the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '26)}
}

@article{besta2024got,
title = {{Graph of Thoughts: Solving Elaborate Problems with Large Language Models}},
author = {Besta, Maciej and Blach, Nils and Kubicek, Ales and Gerstenberger, Robert and
Gianinazzi, Lukas and Gajda, Joanna and Lehmann, Tomasz and Podstawski, Micha{\l}
and Niewiadomski, Hubert and Nyczyk, Piotr and Hoefler, Torsten},
year = 2024,
month = {Mar},
journal = {Proceedings of the AAAI Conference on Artificial Intelligence},
volume = 38,
number = 16,
pages = {17682--17690},
publisher = {AAAI Press},
doi = {10.1609/aaai.v38i16.29720},
url = {https://ojs.aaai.org/index.php/AAAI/article/view/29720}
}

This project is licensed under a BSD-style license. See the LICENSE file for details.