Speculative Speculative Decoding

"In all fictions, each time a man meets diverse alternatives, he chooses one and eliminates the others; in the work of the almost unfathomable Ts'ui Pên, he chooses — simultaneously — all of them."

— Jorge Luis Borges, "The Garden of Forking Paths" (1941)

SSD is a new LLM inference algorithm. It is exact, and it is extremely fast.

SSD is a new type of speculative decoding (SD). In standard SD, a small, fast draft model guesses the next few tokens that a larger, slower target model would generate, and the target model then verifies them in a single forward pass: drafting and verification happen one after the other on the same hardware.

In SSD, they happen in parallel, on separate hardware. The draft model anticipates the likely verification outcomes in advance and speculates for all of them at once. If it anticipated correctly, the speculation is returned immediately, so drafting overhead is eliminated entirely.
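As a toy illustration of the verification step described above (this is not the engine's actual code, and the function name is made up): at temperature 0, the target accepts the longest prefix of draft tokens matching its own argmax predictions, appending its correction at the first mismatch, or a bonus token if everything matched.

```python
def verify_draft(draft_tokens, target_tokens):
    """Greedy (temp=0) speculative-decoding acceptance rule.

    `draft_tokens` are the k tokens proposed by the draft model;
    `target_tokens` are the target model's argmax tokens at the k+1
    positions scored in its single verification forward pass.
    """
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)        # draft matched the target: accept
        else:
            accepted.append(t)        # first mismatch: take the target's token
            break
    else:
        # all k draft tokens accepted; target's pass also yields a bonus token
        accepted.append(target_tokens[len(draft_tokens)])
    return accepted

# Draft proposed 4 tokens; the target agrees on the first 2, so 3 tokens
# (2 accepted + 1 correction) are emitted from one target forward pass.
print(verify_draft([5, 7, 9, 11], [5, 7, 8, 11, 13]))  # -> [5, 7, 8]
```

The SSD twist is that, while this verification runs on the target's GPUs, the draft model on its own device speculates ahead for each plausible acceptance length, so a correct guess is ready the moment verification finishes.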

This custom inference engine supports:

  • A reference implementation of the SSD algorithm
  • Optimized SD and autoregressive baselines
  • Qwen3 + Llama3 model families
  • Tensor Parallelism
  • PagedAttention, CUDAgraphs, torch compilation, prefix caching

As a result, SSD achieves up to 2x faster inference than strong baselines such as SGLang and vLLM.

Demo video: ssd_demo_final.mp4

Setup

Requirements: Python 3.11+, CUDA >= 12.8. This code was written and tested on H100s.

If uv is not installed:

curl -LsSf https://astral.sh/uv/install.sh | sh
# if `uv` is not found in this shell:
export PATH="$HOME/.local/bin:$PATH"

Then:

git clone https://github.com/tanishqkumar/ssd && cd ssd
uv sync                    # core SSD deps
# uv sync --extra scripts  # add deps used by scripts/
source .venv/bin/activate
python -c "from ssd import LLM; print('ok')"

Set paths via environment variables. SSD_HF_CACHE should point to the HuggingFace hub directory — this is the directory that contains models--org--name/ subdirectories (e.g. /data/huggingface/hub, not /data/huggingface/). SSD_DATASET_DIR should point to the directory containing the dataset subdirectories (humaneval/, alpaca/, etc).

export SSD_HF_CACHE=/path/to/huggingface/hub
export SSD_DATASET_DIR=/path/to/processed_datasets
export SSD_CUDA_ARCH=9.0   # 9.0=H100, 8.0=A100, 8.9=L40/4090
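Pointing `SSD_HF_CACHE` one level too high is an easy mistake. A quick heuristic check (not part of ssd; the helper name is made up) is to look for `models--org--name/` children, demonstrated here with a throwaway directory:

```python
import tempfile
from pathlib import Path

def looks_like_hub_dir(path: str) -> bool:
    """True if `path` appears to be the HF hub directory itself,
    i.e. it directly contains models--org--name/ subdirectories."""
    p = Path(path)
    return p.is_dir() and any(c.name.startswith("models--") for c in p.iterdir())

# Demo with a temporary directory mimicking the expected layout.
with tempfile.TemporaryDirectory() as root:
    hub = Path(root) / "hub"
    (hub / "models--meta-llama--Llama-3.1-70B-Instruct").mkdir(parents=True)
    print(looks_like_hub_dir(str(hub)))  # the hub dir itself -> True
    print(looks_like_hub_dir(root))      # its parent -> False
```

In practice, run the check against `$SSD_HF_CACHE` after downloading models.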

Download models + datasets

If you already have the models downloaded via huggingface-cli or similar, you can skip straight to datasets — just make sure SSD_HF_CACHE points to the right place. The download scripts require the scripts extra: uv sync --extra scripts.

# models (uses SSD_HF_CACHE)
python scripts/download_from_hf.py llama

# datasets (writes to $HF_DATASETS_CACHE/processed_datasets)
export HF_DATASETS_CACHE=/path/to  # parent of SSD_DATASET_DIR
python scripts/get_data_from_hf.py --num-samples 10000

Usage

All commands below run from inside the bench/ directory. Large models (Llama-3 70B, Qwen-3 32B) take a few minutes for load/warmup/compile before generation starts. Always use python -O to disable debug overhead.
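The reason `-O` matters: it sets `__debug__` to `False` and strips `assert` statements at compile time, so any debug-only checks in the hot path cost nothing. A small self-contained demonstration:

```python
import subprocess
import sys

# Run the same one-liner with and without -O to show the flag's effect:
# `python -O` sets __debug__ to False and removes `assert` statements.
snippet = "print(__debug__)"
plain = subprocess.run([sys.executable, "-c", snippet],
                       capture_output=True, text=True).stdout.strip()
optimized = subprocess.run([sys.executable, "-O", "-c", snippet],
                           capture_output=True, text=True).stdout.strip()
print(plain, optimized)  # -> True False
```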

Benchmarks

Use --all for full eval across four datasets. Since different data distributions are predictable to varying degrees, the speed of SD/SSD depends a lot on the dataset. Averaging over many prompts from many types of datasets gives an overall picture. --numseqs is per-dataset, so --numseqs 128 --all runs 128 × 4 = 512 prompts total.

cd bench

# AR — Llama 70B, 4 GPUs
python -O bench.py --llama --size 70 --gpus 4 --b 1 --temp 0 --numseqs 128 --output_len 512 --all

# Sync spec decode — 70B target + 1B draft, 4 GPUs, k=6
python -O bench.py --llama --size 70 --gpus 4 --spec --k 6 --b 1 --temp 0 --numseqs 128 --output_len 512 --all

# Async spec decode (SSD) — 70B target (4 GPUs) + 1B draft (1 GPU), k=7, f=3
python -O bench.py --llama --size 70 --gpus 5 --spec --async --k 7 --f 3 --b 1 --temp 0 --numseqs 128 --output_len 512 --all

Use --qwen --size 32 for Qwen models. See bench/bench.py for full args. For SGLang/vLLM baselines, see bench/README.md.

Chat

Interactive streaming chat with Llama-3.1 70B only. Supports AR, sync SD, and async SD (SSD). Pass --metrics to print token count, speed, and TTFT after each response.

cd bench

# AR — 4 GPUs
python -O chat.py --ssd --gpus 4

# Sync spec decode — 4 GPUs, k=6
python -O chat.py --ssd --spec --k 6 --gpus 4

# Async spec decode (SSD) — 5 GPUs, k=7, f=3
python -O chat.py --ssd --spec --async --k 7 --f 3 --gpus 5 --metrics

SGLang and vLLM chat backends are also supported for comparison (their servers are launched automatically):

python -O chat.py --sglang        # spec decode
python -O chat.py --sglang --ar   # autoregressive
python -O chat.py --vllm          # spec decode

Roadmap

Features that will be supported in the near future:

  • Draft data parallelism (to grow the speculation cache) across up to 4 devices, avoiding compute-bound drafting
  • OpenAI-compatible inference over HTTP
  • New models and MoE support: GPT-OSS and Kimi-K2.5.

Contributions welcome!

Citation

Speculative Speculative Decoding will appear at ICLR 2026.

@misc{kumar2026speculativespeculativedecoding,
      title={Speculative Speculative Decoding},
      author={Tanishq Kumar and Tri Dao and Avner May},
      year={2026},
      eprint={2603.03251},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2603.03251},
}
