**추론해줘** ("infer, please" in Korean) — Local AI Gateway with reranking. Like Vercel AI Gateway, but it runs on your machine.

A single OpenAI-compatible endpoint that manages multiple TEI and llama.cpp processes for you. Request any model by name — it starts, serves, and stops automatically.
Cloud embedding APIs are cheap ($0.02/M tokens). Use them when you can.
But when you need reranking, privacy, or offline operation, no single tool does it all:
| Tool | Embedding | Reranker | Chat | Multi-model | Dynamic loading |
|---|---|---|---|---|---|
| Vercel AI Gateway | ✅ | ❌ | ✅ | ✅ | ✅ |
| Workers AI | ✅ | ❌ | ✅ | ❌ | |
| HF TEI | ✅ | ✅ | ❌ | ❌ 1 per process | ❌ |
| Ollama | ✅ | ❌ | ✅ | ✅ | ✅ |
| vLLM | ✅ | ✅ | ✅ | ❌ | ❌ |
| infer-please | ✅ | ✅ | ✅ | ✅ | ✅ |
infer-please wraps TEI (Rust, Flash Attention, dynamic batching) for embedding/reranking and llama.cpp server for chat — behind one port, with automatic lifecycle management. All backends are external Rust/C++ binaries managed via Bun.spawn().
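The routing rule described above (GGUF models to a llama.cpp server, everything else to TEI) can be sketched as a pure function. This is an illustrative sketch, not the project's actual router API; `pickBackend` is a hypothetical name.

```typescript
// Illustrative sketch of the model → backend routing rule (hypothetical name,
// not infer-please's real API): GGUF repos go to a llama-server process,
// other (ONNX/safetensors) repos go to a TEI process.
type Backend = "llama" | "tei";

function pickBackend(model: string): Backend {
  // llama.cpp serves GGUF checkpoints; TEI serves embedding/reranker models.
  return model.toUpperCase().endsWith("GGUF") ? "llama" : "tei";
}
```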
```bash
# Prerequisites
brew install text-embeddings-inference  # or Docker
brew install llama.cpp                  # for chat (optional)

# Install
bun add -g @pleaseai/infer

# Start
infer start
# Server running on http://localhost:3141
```

With the OpenAI SDK:

```ts
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:3141/v1",
  apiKey: "not-needed",
});

// Embedding — first request starts TEI automatically
const embed = await client.embeddings.create({
  model: "BAAI/bge-small-en-v1.5",
  input: ["hello world", "how are you?"],
});

// Different model — a new TEI instance spins up
const embed2 = await client.embeddings.create({
  model: "Qwen/Qwen3-Embedding-0.6B",
  input: ["你好世界"],
});

// Chat — routes to the llama.cpp server
const chat = await client.chat.completions.create({
  model: "bartowski/Llama-3.2-1B-Instruct-GGUF",
  messages: [{ role: "user", content: "Hello!" }],
});
```

With the Vercel AI SDK provider:

```ts
import { embed, embedMany, generateText } from "ai";
import { createInferPlease } from "@pleaseai/infer-ai-sdk";

const infer = createInferPlease(); // defaults to localhost:3141

const { embedding } = await embed({
  model: infer.textEmbeddingModel("BAAI/bge-small-en-v1.5"),
  value: "hello world",
});

const { embeddings } = await embedMany({
  model: infer.textEmbeddingModel("Qwen/Qwen3-Embedding-0.6B"),
  values: ["hello", "world"],
});

const { text } = await generateText({
  model: infer.languageModel("bartowski/Qwen2.5-3B-Instruct-GGUF"),
  prompt: "Explain transformers",
});
```

Reranking via curl:

```bash
curl -X POST http://localhost:3141/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "model": "BAAI/bge-reranker-large",
    "query": "What is deep learning?",
    "documents": [
      "Deep learning is a subset of machine learning",
      "The weather is sunny today",
      "Neural networks have multiple layers"
    ]
  }'
```

Model management:

```bash
# List loaded models
curl http://localhost:3141/v1/models

# Unload a model (frees memory)
curl -X DELETE http://localhost:3141/v1/models/BAAI/bge-small-en-v1.5
```

Works with QMD:

```bash
# QMD uses an OpenAI-compatible API — just point it at infer-please
export OPENAI_BASE_URL="http://localhost:3141/v1"
export OPENAI_API_KEY="not-needed"
qmd embed && qmd query "search something"
```

Architecture:

```
Client (OpenAI SDK / Vercel AI SDK / curl)
  │
  │  POST /v1/embeddings        { "model": "BAAI/bge-small-en-v1.5" }
  │  POST /v1/rerank            { "model": "BAAI/bge-reranker-large" }
  │  POST /v1/chat/completions  { "model": "...-GGUF" }
  │
  ▼
infer-please (:3141)
  │
  │  Route by model + task
  │
  ├── ONNX model → TEI process (auto-spawned)
  │     ├── BAAI/bge-small-en-v1.5    → :8080
  │     ├── BAAI/bge-reranker-large   → :8081
  │     └── Qwen/Qwen3-Embedding-0.6B → :8082 (started on first request)
  │
  └── GGUF model → llama-server process (auto-spawned)
        └── bartowski/Llama-3.2-1B-Instruct-GGUF → :8090

⏱️ Idle 5 min   → TEI process auto-stopped
📡 Next request → TEI process auto-restarted (~1 s)
```
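The idle-stop / auto-restart lifecycle above can be sketched as a small manager that restarts the backend on demand and resets an idle timer on every request. The names (`IdleManager`, `startFn`, `stopFn`) are hypothetical; the real TeiManager also does health checks and crash recovery.

```typescript
// Minimal sketch of the idle-timeout lifecycle shown above (hypothetical
// names, not the real TeiManager). Each request "touches" the manager:
// a cold touch spawns the process, every touch resets the idle clock.
class IdleManager {
  private running = false;
  private timer: ReturnType<typeof setTimeout> | undefined;

  constructor(
    private startFn: () => void, // e.g. spawn the TEI process (~1 s cold start)
    private stopFn: () => void,  // e.g. kill it to free memory
    private idleMs: number,      // e.g. 5 * 60_000 per the diagram
  ) {}

  // Called on every request: start on demand, then reset the idle timeout.
  touch(): void {
    if (!this.running) {
      this.startFn();
      this.running = true;
    }
    clearTimeout(this.timer);
    this.timer = setTimeout(() => {
      this.stopFn();
      this.running = false;
    }, this.idleMs);
  }

  get isRunning(): boolean {
    return this.running;
  }
}
```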
| Scenario | Recommendation |
|---|---|
| Embedding only, no infra | ☁️ Vercel AI Gateway ($0.02/M tokens) |
| Embedding only, Cloudflare stack | ☁️ Workers AI (~free) |
| Embedding + Reranking | 🖥️ infer-please |
| Hybrid search pipeline (QMD-like) | 🖥️ infer-please |
| Privacy / air-gapped | 🖥️ infer-please |
| Low latency (<10ms) | 🖥️ infer-please |
| Flexible — cloud first, local fallback | 🖥️ @pleaseai/infer-ai-sdk |
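The "cloud first, local fallback" row can be wired up by trying providers in order. This is an illustrative sketch, not a feature of `@pleaseai/infer-ai-sdk` (the README lists gateway fallback as a roadmap item); `EmbedFn` and `embedWithFallback` are hypothetical names, and each provider would wrap a real SDK call.

```typescript
// Illustrative local-first / cloud-fallback wiring (hypothetical helper, not
// part of @pleaseai/infer-ai-sdk). Each "provider" is an async embed function,
// e.g. one backed by the local gateway and one by a cloud API.
type EmbedFn = (text: string) => Promise<number[]>;

// Try providers in order and return the first successful embedding;
// rethrow the last error if all of them fail.
async function embedWithFallback(providers: EmbedFn[], text: string): Promise<number[]> {
  let lastError: unknown = new Error("no providers configured");
  for (const provider of providers) {
    try {
      return await provider(text);
    } catch (err) {
      lastError = err; // e.g. local gateway not running → try the next one
    }
  }
  throw lastError;
}
```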
Recommended embedding models:

| Model | Dims | Speed | Notes |
|---|---|---|---|
| BAAI/bge-small-en-v1.5 | 384 | ⚡ | Good default, English |
| BAAI/bge-large-en-v1.5 | 1024 | ⚡ | Higher quality |
| Qwen/Qwen3-Embedding-0.6B | — | ⚡ | Multilingual, 119 languages |
| nomic-ai/nomic-embed-text-v1.5 | 768 | ⚡ | Open-source, good balance |
| jinaai/jina-embeddings-v3 | 1024 | 🐢 | Multilingual, not in Ollama |
Reranker models:

| Model | Notes |
|---|---|
| BAAI/bge-reranker-large | Good quality, English |
| BAAI/bge-reranker-v2-m3 | Multilingual |
| Qwen/Qwen3-Reranker-0.6B | Lightweight, used by QMD |
Chat models (GGUF):

| Model | Notes |
|---|---|
| bartowski/Llama-3.2-1B-Instruct-GGUF | Small, fast |
| bartowski/Qwen2.5-3B-Instruct-GGUF | Multilingual |
```yaml
# infer.yaml (optional)
server:
  port: 3141
  host: 127.0.0.1

auth:
  token: secret     # optional Bearer token

tei:
  runtime: auto     # auto | native | docker (default: auto)
  imageTag: "1.9"   # TEI docker image tag (default: "1.9")
  # image: ghcr.io/huggingface/text-embeddings-inference@sha256:...
  # Full image reference override — takes precedence over runtime auto-detect.

models:
  - id: bge-small-en
    type: embedding   # embedding | rerank | chat
    backend: tei      # tei | llama
    repo_id: BAAI/bge-small-en-v1.5
  - id: bge-reranker
    type: rerank
    backend: tei
    repo_id: BAAI/bge-reranker-large
```

`tei.runtime` controls how the TEI backend is launched:
| Mode | Behavior |
|---|---|
| `auto` | Prefer Docker if available; otherwise fall back to the native `text-embeddings-router` binary on `$PATH`. |
| `docker` | Require Docker. Fails fast at startup if Docker is not running. |
| `native` | Use the native `text-embeddings-router` binary only. Ignore Docker even when available. |
In Docker mode, the image variant is chosen automatically from the host's GPU compute capability and CPU architecture:

- NVIDIA compute capability 7.5 → `turing-{tag}` (experimental)
- 8.0 → `{tag}` (Ampere default), 8.6 → `86-{tag}`, 8.9 → `89-{tag}` (Ada), 9.0 → `hopper-{tag}`
- 10.0 / 12.0 / 12.1 → Blackwell variants (experimental)
- No GPU on `darwin-arm64` or `linux-arm64` → `cpu-arm64-{tag}`
- No GPU on `linux-x86_64` → `cpu-{tag}`
- Volta (7.0) and unknown compute capabilities fall back to the CPU variant, with a log line explaining why

Set `tei.image` to pin a specific digest or custom reference — auto-detection is skipped when this is present.
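The selection rules above can be sketched as a pure mapping. This is a simplified, illustrative helper (`teiImageVariant` is a hypothetical name): it omits the `tei.image` override, logging, and the Blackwell variant names, which the list above does not spell out.

```typescript
// Sketch of the TEI Docker image-variant selection rules listed above.
// Hypothetical helper, simplified from the README's description:
// Blackwell (10.x/12.x) variant names are elided here.
function teiImageVariant(
  tag: string,                 // e.g. "1.9" (tei.imageTag)
  computeCap: number | null,   // NVIDIA compute capability, null if no GPU
  platform: "darwin-arm64" | "linux-arm64" | "linux-x86_64",
): string {
  if (computeCap !== null) {
    if (computeCap === 7.5) return `turing-${tag}`; // experimental
    if (computeCap === 8.0) return tag;             // Ampere default
    if (computeCap === 8.6) return `86-${tag}`;
    if (computeCap === 8.9) return `89-${tag}`;     // Ada
    if (computeCap === 9.0) return `hopper-${tag}`;
    // Volta (7.0) and unknown caps fall through to the CPU variant below.
  }
  // CPU fallback by architecture
  return platform === "linux-x86_64" ? `cpu-${tag}` : `cpu-arm64-${tag}`;
}
```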
```
infer-please/
├── packages/
│   └── ai-sdk/             # @pleaseai/infer-ai-sdk
│       └── index.ts        # Vercel AI SDK provider
├── src/
│   ├── index.ts            # CLI entry point
│   ├── server.ts           # Hono + OpenAI-compatible routes
│   ├── tei-manager.ts      # TEI process lifecycle (Bun.spawn)
│   ├── llama-manager.ts    # llama-server process lifecycle (Bun.spawn)
│   ├── router.ts           # Model → backend routing
│   ├── providers/
│   │   ├── tei.ts          # TEI proxy (embed + rerank)
│   │   └── llama-server.ts # llama.cpp server proxy (chat)
│   └── routes/
│       ├── embeddings.ts   # /v1/embeddings
│       ├── rerank.ts       # /v1/rerank
│       ├── chat.ts         # /v1/chat/completions
│       └── models.ts       # /v1/models
├── package.json
└── tsconfig.json
```
- Runtime: Bun
- Framework: Hono
- Embedding/Rerank engine: HuggingFace TEI (Rust, via `Bun.spawn()`)
- Chat engine: llama.cpp server (C++, via `Bun.spawn()`)
- Client SDK: Vercel AI SDK provider
- Language: TypeScript
- TEI process manager (spawn, health check, idle timeout)
- OpenAI-compatible `/v1/embeddings` proxy
- OpenAI-compatible `/v1/rerank` proxy
- Model → TEI instance routing
- llama.cpp server chat integration
- `@pleaseai/infer-ai-sdk` provider package
- CLI (`infer start`, `models`, `pull`)
- Pre-load models on startup
- Docker mode (TEI as containers instead of binary)
- Streaming chat (SSE)
- Metrics & health check
- Vercel AI Gateway fallback (local-first, cloud-backup)
```bash
bun run test      # Fast unit/integration tests (no Docker required)
bun run test:e2e  # End-to-end tests against real TEI in Docker
```

The default test script runs only `src/*.test.ts` and uses in-process mocks for TEI — fast, Docker-free, safe to run on every commit.

E2E tests live in `packages/{server,ai-sdk}/test/e2e/` and exercise the full request path against a real TEI backend running in Docker. They are opt-in: the default `bun run test` skips them.

Prerequisites:

- Docker Desktop (macOS/Windows) or Docker Engine (Linux), running
- The first run pulls `ghcr.io/huggingface/text-embeddings-inference:cpu-latest` (~700 MB) and downloads the `sentence-transformers/all-MiniLM-L6-v2` model (~22 MB). Subsequent runs reuse both via the HuggingFace cache at `~/.cache/huggingface`.

Running locally:

```bash
bun run test:e2e

# or run a single suite:
cd packages/server && RUN_E2E=1 bun test test/e2e/embeddings.e2e.test.ts
```

If Docker is not running, suites skip with an actionable message instead of failing. In CI (`CI=true`), a missing Docker is a hard failure.

How it works: E2E tests inject a Docker-backed `spawnFn` into `TeiManager` in place of the default `Bun.spawn()` path. `TeiManager`'s spawn / health-check / idle-timeout / crash-recovery lifecycle runs unchanged, and `TeiClient` talks to TEI's native HTTP API at the container's mapped port. See `packages/server/test/e2e/docker-spawn.ts`.
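That injection seam can be sketched generically: a manager takes the spawn function as a constructor argument, so production code passes a `Bun.spawn()`-backed implementation while tests pass a Docker-backed one. The shapes below (`SpawnFn`, `Manager`) are assumptions for illustration, not TeiManager's actual signature.

```typescript
// Illustrative sketch of the test seam described above (hypothetical names,
// not the real TeiManager): the process-lifecycle logic is identical whichever
// spawnFn was injected.
interface SpawnedProcess {
  port: number; // where the backend's HTTP API is reachable
  kill(): void;
}

type SpawnFn = (model: string) => Promise<SpawnedProcess>;

class Manager {
  private procs = new Map<string, SpawnedProcess>();

  constructor(private spawnFn: SpawnFn) {}

  // Spawn a backend for the model on first use; reuse it afterwards.
  async ensure(model: string): Promise<SpawnedProcess> {
    const existing = this.procs.get(model);
    if (existing) return existing;
    const proc = await this.spawnFn(model);
    this.procs.set(model, proc);
    return proc;
  }
}
```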
infer-please is part of the Please Tools ecosystem.
License: FSL-1.1-ALv2