Speculative Speculative Decoding

"In all fictions, each time a man meets diverse alternatives, he chooses one and eliminates the others; in the work of the almost unfathomable Ts'ui Pên, he chooses — simultaneously — all of them."

— Jorge Luis Borges, "The Garden of Forking Paths" (1941)

SSD is a new LLM inference algorithm. It is exact, and it is extremely fast.

SSD is a new type of speculative decoding (SD). In standard SD, a small, fast draft model guesses the next few tokens that a larger, slower target model would generate, and the target model then verifies them in a single forward pass: drafting and verification happen one after the other on the same hardware.

In SSD, they happen in parallel, on separate hardware. The draft model anticipates the likely verification outcomes in advance and speculates for all of them at once. If it anticipated correctly, the speculation is returned immediately, so drafting overhead is eliminated entirely.
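As a toy illustration of the verification step described above (this is not the engine's actual code, and the function name is made up): at temperature 0, the target accepts the longest prefix of draft tokens matching its own argmax predictions, appending its correction at the first mismatch, or a bonus token if everything matched.

```python
def verify_draft(draft_tokens, target_tokens):
    """Greedy (temp=0) speculative-decoding acceptance rule.

    `draft_tokens` are the k tokens proposed by the draft model;
    `target_tokens` are the target model's argmax tokens at the k+1
    positions scored in its single verification forward pass.
    """
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)        # draft matched the target: accept
        else:
            accepted.append(t)        # first mismatch: take the target's token
            break
    else:
        # all k draft tokens accepted; target's pass also yields a bonus token
        accepted.append(target_tokens[len(draft_tokens)])
    return accepted

# Draft proposed 4 tokens; the target agrees on the first 2, so 3 tokens
# (2 accepted + 1 correction) are emitted from one target forward pass.
print(verify_draft([5, 7, 9, 11], [5, 7, 8, 11, 13]))  # -> [5, 7, 8]
```

The SSD twist is that, while this verification runs on the target's GPUs, the draft model on its own device speculates ahead for each plausible acceptance length, so a correct guess is ready the moment verification finishes.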

This custom inference engine supports:

  • A reference implementation of the SSD algorithm
  • Optimized SD and autoregressive baselines
  • Qwen3 + Llama3 model families
  • Tensor Parallelism
  • PagedAttention, CUDAgraphs, torch compilation, prefix caching

As a result, SSD achieves up to 2x faster inference than strong baselines such as SGLang and vLLM.

Demo video: ssd_demo_final.mp4

Setup

Requirements: Python 3.11+, CUDA >= 12.8. This code was written and tested on H100s.

If uv is not installed:

curl -LsSf https://astral.sh/uv/install.sh | sh
# if `uv` is not found in this shell:
export PATH="$HOME/.local/bin:$PATH"

Then:

git clone https://github.com/tanishqkumar/ssd && cd ssd
uv sync                    # core SSD deps
# uv sync --extra scripts  # add deps used by scripts/
source .venv/bin/activate
python -c "from ssd import LLM; print('ok')"

Set paths via environment variables. SSD_HF_CACHE should point to the HuggingFace hub directory — this is the directory that contains models--org--name/ subdirectories (e.g. /data/huggingface/hub, not /data/huggingface/). SSD_DATASET_DIR should point to the directory containing the dataset subdirectories (humaneval/, alpaca/, etc).

export SSD_HF_CACHE=/path/to/huggingface/hub
export SSD_DATASET_DIR=/path/to/processed_datasets
export SSD_CUDA_ARCH=9.0   # 9.0=H100, 8.0=A100, 8.9=L40/4090
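Pointing `SSD_HF_CACHE` one level too high is an easy mistake. A quick heuristic check (not part of ssd; the helper name is made up) is to look for `models--org--name/` children, demonstrated here with a throwaway directory:

```python
import tempfile
from pathlib import Path

def looks_like_hub_dir(path: str) -> bool:
    """True if `path` appears to be the HF hub directory itself,
    i.e. it directly contains models--org--name/ subdirectories."""
    p = Path(path)
    return p.is_dir() and any(c.name.startswith("models--") for c in p.iterdir())

# Demo with a temporary directory mimicking the expected layout.
with tempfile.TemporaryDirectory() as root:
    hub = Path(root) / "hub"
    (hub / "models--meta-llama--Llama-3.1-70B-Instruct").mkdir(parents=True)
    print(looks_like_hub_dir(str(hub)))  # the hub dir itself -> True
    print(looks_like_hub_dir(root))      # its parent -> False
```

In practice, run the check against `$SSD_HF_CACHE` after downloading models.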

Download models + datasets

If you already have the models downloaded via huggingface-cli or similar, you can skip straight to datasets — just make sure SSD_HF_CACHE points to the right place. The download scripts require the scripts extra: uv sync --extra scripts.

# models (uses SSD_HF_CACHE)
python scripts/download_from_hf.py llama

# datasets (writes to $HF_DATASETS_CACHE/processed_datasets)
export HF_DATASETS_CACHE=/path/to  # parent of SSD_DATASET_DIR
python scripts/get_data_from_hf.py --num-samples 10000

Usage

All commands below run from inside the bench/ directory. Large models (Llama-3 70B, Qwen-3 32B) take a few minutes for load/warmup/compile before generation starts. Always use python -O to disable debug overhead.
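The reason `-O` matters: it sets `__debug__` to `False` and strips `assert` statements at compile time, so any debug-only checks in the hot path cost nothing. A small self-contained demonstration:

```python
import subprocess
import sys

# Run the same one-liner with and without -O to show the flag's effect:
# `python -O` sets __debug__ to False and removes `assert` statements.
snippet = "print(__debug__)"
plain = subprocess.run([sys.executable, "-c", snippet],
                       capture_output=True, text=True).stdout.strip()
optimized = subprocess.run([sys.executable, "-O", "-c", snippet],
                           capture_output=True, text=True).stdout.strip()
print(plain, optimized)  # -> True False
```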

Benchmarks

Use --all for full eval across four datasets. Since different data distributions are predictable to varying degrees, the speed of SD/SSD depends a lot on the dataset. Averaging over many prompts from many types of datasets gives an overall picture. --numseqs is per-dataset, so --numseqs 128 --all runs 128 × 4 = 512 prompts total.

cd bench

# AR — Llama 70B, 4 GPUs
python -O bench.py --llama --size 70 --gpus 4 --b 1 --temp 0 --numseqs 128 --output_len 512 --all

# Sync spec decode — 70B target + 1B draft, 4 GPUs, k=6
python -O bench.py --llama --size 70 --gpus 4 --spec --k 6 --b 1 --temp 0 --numseqs 128 --output_len 512 --all

# Async spec decode (SSD) — 70B target (4 GPUs) + 1B draft (1 GPU), k=7, f=3
python -O bench.py --llama --size 70 --gpus 5 --spec --async --k 7 --f 3 --b 1 --temp 0 --numseqs 128 --output_len 512 --all

Use --qwen --size 32 for Qwen models. See bench/bench.py for full args. For SGLang/vLLM baselines, see bench/README.md.

Chat

Interactive streaming chat with Llama-3.1 70B only. Supports AR, sync SD, and async SD (SSD). Pass --metrics to print token count, speed, and TTFT after each response.

cd bench

# AR — 4 GPUs
python -O chat.py --ssd --gpus 4

# Sync spec decode — 4 GPUs, k=6
python -O chat.py --ssd --spec --k 6 --gpus 4

# Async spec decode (SSD) — 5 GPUs, k=7, f=3
python -O chat.py --ssd --spec --async --k 7 --f 3 --gpus 5 --metrics

SGLang and vLLM chat backends are also supported for comparison (their servers are launched automatically):

python -O chat.py --sglang        # spec decode
python -O chat.py --sglang --ar   # autoregressive
python -O chat.py --vllm          # spec decode

Roadmap

Features that will be supported in the near future:

  • Draft data parallelism (to grow the speculation cache) across up to 4 devices, avoiding compute-bound drafting
  • OpenAI-compatible inference over HTTP
  • New models and MoE support: GPT-OSS and Kimi-K2.5.

Contributions welcome!

Citation

Speculative Speculative Decoding will appear at ICLR 2026.

@misc{kumar2026speculativespeculativedecoding,
      title={Speculative Speculative Decoding},
      author={Tanishq Kumar and Tri Dao and Avner May},
      year={2026},
      eprint={2603.03251},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2603.03251},
}
