# Direct Inference — full reference

> Direct Inference is a zero-knowledge inference endpoint for AI products. You
> swap one base URL, keep the model id your app already sends, and each request is
> classified by its shape and fulfilled on a capable path. The response echoes
> your model id back and omits every serving internal: which model, candidate,
> provider, or version ran is never exposed. The only signal returned about
> routing is the request type. This is the extended companion to
> https://directinference.com/llms.txt.

Base URL: https://app.directinference.com/di/v1
Auth: send your Direct Inference API key as the SDK's API key / Bearer token.

## What it is

Direct Inference is a router with one deliberate difference from every other
router: it does not tell you which model it picked. Transparent routers return
the chosen model id and leave you managing model slugs or task configs; Direct
Inference returns only the request type and echoes your own id back, so your
integration never tracks the model market. Providers and pricing can change
behind the endpoint with zero change to your code.

## Drop-in SDK surfaces

One endpoint speaks three native SDK shapes. Change only the base URL (and key).

OpenAI (Python):

    from openai import OpenAI
    client = OpenAI(base_url="https://app.directinference.com/di/v1", api_key="YOUR_DIRECT_INFERENCE_KEY")
    resp = client.chat.completions.create(model="gpt-5.5-mini", messages=messages)
    # resp.model == "gpt-5.5-mini"  (your id, echoed back)

Anthropic (Python):

    from anthropic import Anthropic
    client = Anthropic(base_url="https://app.directinference.com/di/v1", api_key="YOUR_DIRECT_INFERENCE_KEY")
    msg = client.messages.create(model="claude-sonnet-4-5", max_tokens=1024, messages=messages)

Gemini (google-genai, Python):

    from google import genai
    client = genai.Client(api_key="YOUR_DIRECT_INFERENCE_KEY", http_options={"base_url": "https://app.directinference.com/di/v1"})
    resp = client.models.generate_content(model="gemini-2.5-flash", contents=prompt)

Streaming, tool/function calling, vision, PDFs, and structured output pass
through on all three.

## Request types

Every call is classified by its shape. Capability always outranks the model name.

- vision     - image content in the request; handled by a vision-capable model.
- document   - PDF or file input; document-capable processing.
- long       - input beyond the standard context window; long-context path.
- code       - tool definitions, diffs, stack traces, repo paths; coding/tool strength.
- json       - a response/output JSON schema is set; a schema-reliable model.
- reason     - multi-step reasoning in the prompt; a reasoning model.
- flash      - simple request at low effort; fast and cheap.
- pro        - everything else (default); a strong all-rounder.

The detected request type is the only routing signal returned, via the
X-DI-Request-Type response header.

## Effort ladder (optional)

Send X-DI-Effort: <level> as a header, ?effort=<level> as a query param, or your
SDK's native reasoning field. Default is medium. Effort biases the serving
choice; request shape still decides the needed capability, and capability
handling (vision/document/long) can promote a call above its effort level.

- minimal - lowest latency, minimal thinking budget, simple work.
- low     - light reasoning, concise answers.
- medium  - balanced default behavior.
- high    - deeper reasoning and more careful synthesis.
- xhigh   - maximum reasoning budget for the hardest requests.

## Model ids

Keep sending the model ids your app already uses - current, legacy, renamed, or
not-yet-released. They are treated as compatibility and intent signals; the
serving model stays hidden, so the id list is a sample, not a menu. Unknown ids
resolve to a capable model instead of erroring.

## Compatibility guarantees

- Your model id is echoed back unchanged; logging/dashboards/evals keyed on it keep working.
- Unknown, legacy, and future ids resolve to a capable model instead of failing.
- Three SDK shapes (OpenAI, Anthropic, Gemini), one base URL.
- Capability outranks the name: a PDF to a "mini" id still gets a document-capable model.
- More than load balancing: capability, quality, cost, latency, health, and error behavior all count.
- Failure handling (rate limits, transient errors, unhealthy paths) is handled inside the endpoint.

## Pricing

Pay per token at the rate of whichever model serves a request; no subscription,
no per-seat fees. A single effort hint biases any call toward latency, cost, or
quality. Representative starting rates per 1M tokens: Fast from $0.40 in / $0.60
out; Balanced (default) from $1.20 in / $4.80 out; Max from $2.40 in / $8.80 out.
Cached input is billed at a reduced rate automatically. Hard per-key and
per-account spend caps are enforced in the request path.

## Observability (your usage, not the model)

The portal shows usage by request type, per-application attribution, per-request
traces (tokens, latency, cost, detected request type), live request-type
classification in the playground, and hard spend caps. The serving model is the
only thing held back.

## FAQ

Q: Which model am I paying for?
A: The tokens of whichever model serves each request. You see the request type, never the specific model.

Q: Do I need to pick an effort level per call?
A: No. Medium is the default; the hint is optional and per-request.

Q: What happens to images, PDFs, or long context?
A: Capability outranks effort. Such requests are promoted to a model that can handle them, even on the Fast preset.

Q: Will a renamed or unknown model id cause an outage?
A: No. Unknown, legacy, and future ids resolve to a capable model.

## Links

- Product: https://directinference.com/
- Why Direct Inference: https://directinference.com/why
- Developers: https://directinference.com/developers
- Pricing: https://directinference.com/pricing
- Security: https://directinference.com/security
- Portal: https://app.directinference.com
- Concise index: https://directinference.com/llms.txt