# Direct Inference — full reference > Direct Inference is a zero-knowledge inference endpoint for AI products. You > swap one base URL, keep the model id your app already sends, and each request is > classified by its shape and fulfilled on a capable path. The response echoes > your model id back and omits every serving internal: which model, candidate, > provider, or version ran is never exposed. The only signal returned about > routing is the request type. This is the extended companion to > https://directinference.com/llms.txt. Base URL: https://app.directinference.com/di/v1 Auth: send your Direct Inference API key as the SDK's API key / Bearer token. ## What it is Direct Inference is a router with one deliberate difference from every other router: it does not tell you which model it picked. Transparent routers return the chosen model id and leave you managing model slugs or task configs; Direct Inference returns only the request type and echoes your own id back, so your integration never tracks the model market. Providers and pricing can change behind the endpoint with zero change to your code. ## Drop-in SDK surfaces One endpoint speaks three native SDK shapes. Change only the base URL (and key). OpenAI (Python): from openai import OpenAI client = OpenAI(base_url="https://app.directinference.com/di/v1", api_key="YOUR_DIRECT_INFERENCE_KEY") resp = client.chat.completions.create(model="gpt-5.5-mini", messages=messages) # resp.model == "gpt-5.5-mini" (your id, echoed back) Anthropic (Python): from anthropic import Anthropic client = Anthropic(base_url="https://app.directinference.com/di/v1", api_key="YOUR_DIRECT_INFERENCE_KEY") msg = client.messages.create(model="claude-sonnet-4-5", max_tokens=1024, messages=messages) Gemini (google-genai, Python): from google import genai client = genai.Client(api_key="YOUR_DIRECT_INFERENCE_KEY", http_options={"base_url": "https://app.directinference.com/di/v1"}) resp = client.models.generate_content(model="gemini-2.5-flash", contents=prompt) Streaming, tool/function calling, vision, PDFs, and structured output pass through on all three. ## Request types Every call is classified by its shape. Capability always outranks the model name. - vision - image content in the request; handled by a vision-capable model. - document - PDF or file input; document-capable processing. - long - input beyond the standard context window; long-context path. - code - tool definitions, diffs, stack traces, repo paths; coding/tool strength. - json - a response/output JSON schema is set; a schema-reliable model. - reason - multi-step reasoning in the prompt; a reasoning model. - flash - simple request at low effort; fast and cheap. - pro - everything else (default); a strong all-rounder. The detected request type is the only routing signal returned, via the X-DI-Request-Type response header. ## Effort ladder (optional) Send X-DI-Effort: as a header, ?effort= as a query param, or your SDK's native reasoning field. Default is medium. Effort biases the serving choice; request shape still decides the needed capability, and capability handling (vision/document/long) can promote a call above its effort level. - minimal - lowest latency, minimal thinking budget, simple work. - low - light reasoning, concise answers. - medium - balanced default behavior. - high - deeper reasoning and more careful synthesis. - xhigh - maximum reasoning budget for the hardest requests. ## Model ids Keep sending the model ids your app already uses - current, legacy, renamed, or not-yet-released. They are treated as compatibility and intent signals; the serving model stays hidden, so the id list is a sample, not a menu. Unknown ids resolve to a capable model instead of erroring. ## Compatibility guarantees - Your model id is echoed back unchanged; logging/dashboards/evals keyed on it keep working. - Unknown, legacy, and future ids resolve to a capable model instead of failing. - Three SDK shapes (OpenAI, Anthropic, Gemini), one base URL. - Capability outranks the name: a PDF to a "mini" id still gets a document-capable model. - More than load balancing: capability, quality, cost, latency, health, and error behavior all count. - Failure handling (rate limits, transient errors, unhealthy paths) is handled inside the endpoint. ## Pricing Pay per token at the rate of whichever model serves a request; no subscription, no per-seat fees. A single effort hint biases any call toward latency, cost, or quality. Representative starting rates per 1M tokens: Fast from $0.40 in / $0.60 out; Balanced (default) from $1.20 in / $4.80 out; Max from $2.40 in / $8.80 out. Cached input is billed at a reduced rate automatically. Hard per-key and per-account spend caps are enforced in the request path. ## Observability (your usage, not the model) The portal shows usage by request type, per-application attribution, per-request traces (tokens, latency, cost, detected request type), live request-type classification in the playground, and hard spend caps. The serving model is the only thing held back. ## FAQ Q: Which model am I paying for? A: The tokens of whichever model serves each request. You see the request type, never the specific model. Q: Do I need to pick an effort level per call? A: No. Medium is the default; the hint is optional and per-request. Q: What happens to images, PDFs, or long context? A: Capability outranks effort. Such requests are promoted to a model that can handle them, even on the Fast preset. Q: Will a renamed or unknown model id cause an outage? A: No. Unknown, legacy, and future ids resolve to a capable model. ## Links - Product: https://directinference.com/ - Why Direct Inference: https://directinference.com/why - Developers: https://directinference.com/developers - Pricing: https://directinference.com/pricing - Security: https://directinference.com/security - Portal: https://app.directinference.com - Concise index: https://directinference.com/llms.txt