"In all fictions, each time a man meets diverse alternatives, he chooses one and eliminates the others; in the work of the almost unfathomable Ts'ui Pên, he chooses — simultaneously — all of them."
— Jorge Luis Borges, "The Garden of Forking Paths" (1941)
SSD is a new LLM inference algorithm. It is exact, and it is extremely fast.
SSD is a new type of speculative decoding (SD). In normal SD, a small and fast model guesses the next few tokens that a larger slower model may generate, and the large model then verifies them in one forward pass: drafting and verification happen one after the other on the same hardware.
In SSD, they happen in parallel, on distinct hardware. The small model anticipates likely verification outcomes in advance, and speculates for all of them at once. If it guessed correctly, the speculation can be returned immediately so drafting overhead is eliminated entirely.
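The difference between the two loops can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the engine's implementation: the "models" are hypothetical greedy next-token functions (`toy_target`, `toy_draft`), verification is greedy (temperature 0), the two sides run one after the other rather than on separate devices, and the draft side anticipates only a single verification outcome (all k tokens accepted, followed by its own guess for the next token), whereas SSD speculates for several outcomes at once. All function names here are made up for the example.

```python
from typing import Callable, List, Tuple

Token = int
Model = Callable[[List[Token]], Token]  # toy stand-in: greedy next-token function


def toy_target(seq: List[Token]) -> Token:
    """Hypothetical 'large' model."""
    return (3 * seq[-1] + 1) % 17


def toy_draft(seq: List[Token]) -> Token:
    """Hypothetical 'small' model: agrees with the target except after multiples of 5."""
    t = toy_target(seq)
    return t if seq[-1] % 5 else (t + 1) % 17


def ar_generate(target: Model, prompt: List[Token], max_new: int) -> List[Token]:
    """Plain autoregressive decoding, used as the reference output."""
    seq = list(prompt)
    for _ in range(max_new):
        seq.append(target(seq))
    return seq


def draft_tokens(draft: Model, prefix: List[Token], k: int) -> List[Token]:
    """Draft k tokens autoregressively with the small model."""
    out: List[Token] = []
    for _ in range(k):
        out.append(draft(prefix + out))
    return out


def verify(target: Model, prefix: List[Token], drafted: List[Token]) -> Tuple[int, Token]:
    """Greedy verification: count the drafted tokens the target agrees with, then
    return the target's own next token. A real engine scores every drafted
    position in one forward pass; the loop here is only for clarity."""
    accepted = 0
    for tok in drafted:
        if target(prefix + drafted[:accepted]) != tok:
            break
        accepted += 1
    return accepted, target(prefix + drafted[:accepted])


def sd_generate(target: Model, draft: Model, prompt: List[Token], k: int, max_new: int) -> List[Token]:
    """Standard SD: draft, then verify, strictly one after the other."""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        drafted = draft_tokens(draft, seq, k)
        n, bonus = verify(target, seq, drafted)
        seq += drafted[:n] + [bonus]
    return seq[: len(prompt) + max_new]


def ssd_generate(target: Model, draft: Model, prompt: List[Token], k: int, max_new: int):
    """SSD-style loop: the next draft is prepared for an anticipated verification
    outcome before the actual outcome is known."""
    seq = list(prompt)
    drafted = draft_tokens(draft, seq, k)
    hits = misses = 0
    while len(seq) - len(prompt) < max_new:
        # Draft side (would overlap with verification on separate hardware):
        # assume all k tokens are accepted and guess the token that follows.
        guess = draft(seq + drafted)
        cache = {(k, guess): draft_tokens(draft, seq + drafted + [guess], k)}
        # Target side: verify the current draft.
        n, bonus = verify(target, seq, drafted)
        seq += drafted[:n] + [bonus]
        if (n, bonus) in cache:
            drafted = cache[(n, bonus)]  # hit: next draft is already available
            hits += 1
        else:
            drafted = draft_tokens(draft, seq, k)  # miss: draft again from scratch
            misses += 1
    return seq[: len(prompt) + max_new], hits, misses
```

Because every emitted token is either confirmed or produced by the target, all three functions return the same sequence in this toy; only the amount of drafting work that sits on the critical path differs.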
This custom inference engine supports:
- A reference implementation of the SSD algorithm
- Optimized SD and autoregressive baselines
- Qwen3 + Llama3 model families
- Tensor Parallelism
- PagedAttention, CUDAgraphs, torch compilation, prefix caching
As a result, SSD achieves up to 2x faster inference than some of the strongest inference baselines in the world.
Requirements: Python 3.11+, CUDA >= 12.8. This code was written and tested on H100s.
If uv is not installed:
curl -LsSf https://astral.sh/uv/install.sh | sh
# if `uv` is not found in this shell:
export PATH="$HOME/.local/bin:$PATH"
Then:
git clone https://github.com/tanishqkumar/ssd && cd ssd
uv sync # core SSD deps
# uv sync --extra scripts # add deps used by scripts/
source .venv/bin/activate
python -c "from ssd import LLM; print('ok')"
Set paths via environment variables. SSD_HF_CACHE should point to the HuggingFace hub directory, that is, the directory that contains models--org--name/ subdirectories (e.g. /data/huggingface/hub, not /data/huggingface/). SSD_DATASET_DIR should point to the directory containing the dataset subdirectories (humaneval/, alpaca/, etc).
export SSD_HF_CACHE=/path/to/huggingface/hub
export SSD_DATASET_DIR=/path/to/processed_datasets
export SSD_CUDA_ARCH=9.0 # 9.0=H100, 8.0=A100, 8.9=L40/4090
If you already have the models downloaded via huggingface-cli or similar, you can skip straight to datasets; just make sure SSD_HF_CACHE points to the right place. The download scripts require the scripts extra: uv sync --extra scripts.
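One way to confirm that SSD_HF_CACHE points at the hub directory itself (and not its parent) is to list the models--org--name/ subdirectories it contains. The helper below is a minimal sketch and not part of the repository; the function name `find_cached_models` is made up for this example.

```python
import os
from pathlib import Path


def find_cached_models(hub_dir: str) -> list[str]:
    """Return the sorted names of models--org--name/ subdirectories in hub_dir."""
    root = Path(hub_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"not a directory: {hub_dir}")
    return sorted(p.name for p in root.glob("models--*") if p.is_dir())


def describe_hf_cache() -> str:
    """Summarize what SSD_HF_CACHE currently points at."""
    hub_dir = os.environ.get("SSD_HF_CACHE")
    if hub_dir is None:
        return "SSD_HF_CACHE is not set"
    models = find_cached_models(hub_dir)
    if not models:
        return f"no models--*/ subdirectories in {hub_dir}; is this the hub directory?"
    return f"{len(models)} cached model(s): " + ", ".join(models)
```

Calling `print(describe_hf_cache())` after exporting the variables gives a quick check; an empty result usually means the path is one level too high or too low.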
# models (uses SSD_HF_CACHE)
python scripts/download_from_hf.py llama
# datasets (writes to $HF_DATASETS_CACHE/processed_datasets)
export HF_DATASETS_CACHE=/path/to # parent of SSD_DATASET_DIR
python scripts/get_data_from_hf.py --num-samples 10000
All commands below run from inside the bench/ directory. Large models (Llama-3 70B, Qwen-3 32B) take a few minutes for load/warmup/compile before generation starts. Always use python -O to disable debug overhead.
Use --all for full eval across four datasets. Since different data distributions are predictable to varying degrees, the speed of SD/SSD depends a lot on the dataset. Averaging over many prompts from many types of datasets gives an overall picture. --numseqs is per-dataset, so --numseqs 128 --all runs 128 × 4 = 512 prompts total.
cd bench
# AR — Llama 70B, 4 GPUs
python -O bench.py --llama --size 70 --gpus 4 --b 1 --temp 0 --numseqs 128 --output_len 512 --all
# Sync spec decode — 70B target + 1B draft, 4 GPUs, k=6
python -O bench.py --llama --size 70 --gpus 4 --spec --k 6 --b 1 --temp 0 --numseqs 128 --output_len 512 --all
# Async spec decode (SSD) — 70B target (4 GPUs) + 1B draft (1 GPU), k=7, f=3
python -O bench.py --llama --size 70 --gpus 5 --spec --async --k 7 --f 3 --b 1 --temp 0 --numseqs 128 --output_len 512 --all
Use --qwen --size 32 for Qwen models. See bench/bench.py for full args. For SGLang/vLLM baselines, see bench/README.md.
Interactive streaming chat with Llama-3.1 70B only. Supports AR, sync SD, and async SD (SSD). Pass --metrics to print token count, speed, and TTFT after each response.
cd bench
# AR — 4 GPUs
python -O chat.py --ssd --gpus 4
# Sync spec decode — 4 GPUs, k=6
python -O chat.py --ssd --spec --k 6 --gpus 4
# Async spec decode (SSD) — 5 GPUs, k=7, f=3
python -O chat.py --ssd --spec --async --k 7 --f 3 --gpus 5 --metrics
SGLang and vLLM chat backends are also supported (launches their servers automatically) for comparison:
python -O chat.py --sglang # spec decode
python -O chat.py --sglang --ar # autoregressive
python -O chat.py --vllm # spec decode
Features that will be supported in the near future:
- Draft data parallel (increase speculation cache size) on up to 4 devices to avoid getting compute bound
- OpenAI-compatible inference over HTTP
- New models and MoE support: GPT-OSS and Kimi-K2.5.
Contributions welcome!
Speculative Speculative Decoding will appear at ICLR 2026.
@misc{kumar2026speculativespeculativedecoding,
title={Speculative Speculative Decoding},
author={Tanishq Kumar and Tri Dao and Avner May},
year={2026},
eprint={2603.03251},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2603.03251},
}