Inference, elegantly engineered

Developers

Drop in one endpoint. Get inference that feels engineered.

The DI Model is a stable endpoint for OpenAI, Anthropic, and Gemini SDKs. Change the base URL, keep your existing model string, and each request is understood, classified, and fulfilled for you.

Base URL: https://app.directinference.com/di/v1

Quickstart

Three steps to your first request

Step 1

Create a key

Step 2

Point at the endpoint

Set your SDK base URL to the DI Model endpoint. Nothing else in the call shape changes.

Step 3

Keep your model string

Send the model id your app already sends. Direct Inference selects the capability the request needs and returns the result in the same response shape.

OpenAI SDK (Python)

from openai import OpenAI

client = OpenAI(
    base_url="https://app.directinference.com/di/v1",
    api_key="YOUR_DIRECT_INFERENCE_KEY",
)

# Keep sending whatever model string your app already sends.
resp = client.chat.completions.create(
    model="gpt-5.5-mini",
    messages=[{"role": "user", "content": "Summarize this thread."}],
)
print(resp.model)  # -> "gpt-5.5-mini" (your id, echoed back)

Anthropic SDK (Python)

from anthropic import Anthropic

client = Anthropic(
    base_url="https://app.directinference.com/di/v1",
    api_key="YOUR_DIRECT_INFERENCE_KEY",
)

# Same endpoint speaks the Anthropic Messages shape too.
msg = client.messages.create(
    model="claude-haiku",
    max_tokens=512,
    messages=[{"role": "user", "content": "Extract the action items."}],
)

Gemini SDK (Python)

from google import genai

client = genai.Client(
    api_key="YOUR_DIRECT_INFERENCE_KEY",
    http_options={"base_url": "https://app.directinference.com/di/v1"},
)

# Keep your Gemini model id — Direct Inference handles the capability.
resp = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarize this thread.",
)
print(resp.text)

The effort hint

One optional knob for cost vs. quality

Effort is a hint, not homework. The request shape still decides which capability is needed; effort only tunes the serving choice within that capability and keeps reasoning controls aligned across providers. Send it as a header, a query parameter, or your SDK's native reasoning field. The default is medium.

Per-request effort

# Effort is an optional per-request hint. Medium is the default.
resp = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": "Plan a database migration."}],
    extra_headers={"X-DI-Effort": "high"},
)

# Or per request via query string:
#   POST https://app.directinference.com/di/v1/chat/completions?effort=minimal

Effort levels

minimal: Lowest latency, minimal thinking budget, simple work.
low: Light reasoning, concise answers.
medium: Balanced default behavior.
high: Deeper reasoning and more careful synthesis.
xhigh: Maximum reasoning budget for the hardest requests.
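
To make the two transports above concrete, here is a minimal sketch of a helper that validates an effort level and returns per-request options. The helper name `effort_options` is hypothetical. The header name `X-DI-Effort`, the `effort` query parameter, and the five levels come from this page. `extra_headers` and `extra_query` are the per-request options of the OpenAI Python SDK, and the sketch assumes the endpoint reads them as described here.

```python
# Effort levels as listed on this page; "medium" is the documented default.
EFFORT_LEVELS = ("minimal", "low", "medium", "high", "xhigh")


def effort_options(level: str = "medium", via: str = "header") -> dict:
    """Return extra request kwargs carrying the effort hint (hypothetical helper)."""
    if level not in EFFORT_LEVELS:
        raise ValueError(f"unknown effort level: {level!r}")
    if via == "header":
        # Sent as the X-DI-Effort request header.
        return {"extra_headers": {"X-DI-Effort": level}}
    if via == "query":
        # Sent as ?effort=<level> on the request URL.
        return {"extra_query": {"effort": level}}
    raise ValueError(f"unknown transport: {via!r}")
```

With an OpenAI client already pointed at the endpoint, the result can be splatted into a call, for example `client.chat.completions.create(model="gpt-5.5", messages=msgs, **effort_options("high"))`. Rejecting unknown levels locally is a design choice of this sketch, not documented endpoint behavior.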

Request types

What gets detected, and what it triggers

Every call is classified by its shape. Capability always outranks the model name — so a document or image request gets a capable model regardless of the id or effort you send.

Request type	Detected from	What it means
vision	Image content in the request	Handled by a vision-capable model — even if the model string says “mini”.
document	PDF or file input	Document-capable processing handles the file, regardless of the requested model.
long	Input beyond the standard context window	Uses a long-context path so nothing gets silently truncated.
code	Tool definitions, diffs, stack traces, repo paths	Code-shaped traffic gets tools-and-reasoning strength.
json	A response or output JSON schema is set	Structured-output requests use a model reliable at schema adherence.
reason	Multi-step reasoning in the prompt	Hard, multi-step problems are sent to a reasoning model.
flash	Simple request at low effort	Trivial traffic stays fast and cheap — where most of your margin hides.
pro	Everything else (default)	General requests land on a strong all-rounder.
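
As a rough illustration of shape-based detection, the sketch below guesses a request type from an OpenAI-style chat payload on the client side. It is not the service's classifier: the function name `expected_request_type` and the order of the checks are assumptions, and the `long`, `reason`, and `flash` rows are left out because they depend on signals a simple payload check cannot see.

```python
def expected_request_type(payload: dict) -> str:
    """Guess the request type for an OpenAI-style chat payload (illustrative only)."""
    part_types = set()
    for message in payload.get("messages", []):
        content = message.get("content")
        if isinstance(content, list):  # multimodal content is a list of typed parts
            part_types.update(
                part.get("type") for part in content if isinstance(part, dict)
            )
    if "image_url" in part_types:  # image content in the request
        return "vision"
    if "file" in part_types:  # PDF or file input
        return "document"
    if payload.get("tools"):  # tool definitions present
        return "code"
    response_format = payload.get("response_format") or {}
    if response_format.get("type") == "json_schema":  # output schema is set
        return "json"
    return "pro"  # everything else


# The model string says "mini", but the image part is what decides.
payload = {
    "model": "gpt-5.5-mini",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this picture?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
            ],
        }
    ],
}
print(expected_request_type(payload))
```

A local guess like this can be useful for logging what you expect a request to trigger; the actual decision is made by the endpoint.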

Built for coding agents

Drop Direct Inference into your agent.

Coding tools and agents are first-class clients. Point the base URL at Direct Inference and let machine-readable docs handle the rest — no plugin, no adapter, no bespoke client.

Any OpenAI-compatible tool

# Point any OpenAI-compatible coding tool at Direct Inference.
# (Same idea for Anthropic- or Gemini-compatible tools.)
export OPENAI_BASE_URL="https://app.directinference.com/di/v1"
export OPENAI_API_KEY="YOUR_DIRECT_INFERENCE_KEY"

# Machine-readable for agents:
#   https://directinference.com/llms.txt        (concise)
#   https://directinference.com/llms-full.txt   (full)

Machine-readable docs

A concise /llms.txt and a full /llms-full.txt let a coding agent read the whole integration — base URL, SDK shapes, request types, effort — and wire it up itself.
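
For an agent or script that wants to load these files itself, a minimal standard-library sketch might look like the following. The two URLs are the ones listed on this page; the helper names `docs_url` and `fetch_docs` are hypothetical, and the sketch assumes the files are served as UTF-8 text.

```python
from urllib.request import urlopen

DOCS_BASE = "https://directinference.com"


def docs_url(full: bool = False) -> str:
    """Return the URL of the concise or the full machine-readable docs."""
    return f"{DOCS_BASE}/{'llms-full.txt' if full else 'llms.txt'}"


def fetch_docs(full: bool = False, timeout: float = 10.0) -> str:
    """Download the docs as text (needs network access)."""
    with urlopen(docs_url(full), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling `fetch_docs()` returns the concise file and `fetch_docs(full=True)` the full one.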

Point the base URL and go

Any OpenAI-, Anthropic-, or Gemini-compatible coding tool works by changing one base URL. No plugin, no adapter, no bespoke client.

One key across your stack

The same Direct Inference key powers your editor, your agents, and your production app — with usage and caps visible across all of them.

Machine-readable docs: /llms.txt, /llms-full.txt

Compatibility

Guarantees you can build on

One line, one key, no rewrite

Point your existing client at one base URL and set your key. Your SDK, your calls, and your logging keep working as-is — there's nothing to re-architect.

No more deprecation fire drills

When a model is renamed or retired upstream, nothing in your code breaks and there's no migration to run — the endpoint keeps serving the same use cases.

Three SDK shapes, one endpoint

Point an OpenAI-, Anthropic-, or Gemini-compatible client at the same base URL — streaming, tool use, and structured output all pass through.
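
As a sketch of what streaming pass-through could look like with the OpenAI Python SDK, the example below reads a streamed chat completion and joins its text deltas. It assumes the endpoint emits standard Chat Completions stream chunks; `collect_text`, `fake_chunk`, and `stream_reply` are hypothetical names, and the model id and key placeholder are reused from the quickstart.

```python
from types import SimpleNamespace


def collect_text(chunks) -> str:
    """Join the text deltas of streamed chat-completion chunks."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # content is None on role-only and final chunks
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in chunk with the same attribute path, for offline checks."""
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )


def stream_reply(prompt: str) -> str:
    """Stream one reply through the endpoint (needs the openai package and a key)."""
    from openai import OpenAI  # imported here so the helpers work without the SDK

    client = OpenAI(
        base_url="https://app.directinference.com/di/v1",
        api_key="YOUR_DIRECT_INFERENCE_KEY",
    )
    stream = client.chat.completions.create(
        model="gpt-5.5-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # ask for incremental chunks instead of one response
    )
    return collect_text(stream)
```

The helper can be exercised without a network call, for example `collect_text([fake_chunk("Hel"), fake_chunk(None), fake_chunk("lo")])`.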

Capability outranks the name

A PDF sent to a “mini” model still gets a document-capable model. The request decides, not the string.

Nothing to configure

Capability, quality, cost, latency, and health are all weighed for you to serve the best available model on every request. There are no rules to write, no routing to tune, and no picker to maintain.

Failure handling is built in

Rate limits, transient provider errors, and unhealthy serving paths can be handled inside the endpoint so your app does not need bespoke retry trees for every model family.

Get a key and send your first request.

Open the portal