One endpoint. Every frontier capability.

One endpoint that does everything frontier models can do

Direct Inference is the endpoint that does it all. Send your request with the OpenAI, Anthropic, or Gemini SDK you already use and a frontier-grade answer comes back — the best available model served for you. No model to choose, no retries to configure, no bill to babysit. Your code never changes when the market does.

  • No model to choose
  • No retries to configure
  • No gateway to set up
  • No overspend
  • It just works
Zero-knowledge · Encrypted in transit & at rest · Hard spend caps · Never trained on your data
quickstart.py
from openai import OpenAI
client = OpenAI(
base_url="https://app.directinference.com/di/v1",
api_key=DI_API_KEY,
)
response = client.chat.completions.create(
model="gpt-5.5-mini",
messages=messages,
)
No rewrites — drop your client onto the DI endpoint.

Trusted by teams shipping AI in production

Engineering and product teams send their hardest traffic to one Direct Inference endpoint — and never think about which model served it.

Brookline FinancialHalcyon FreightMeridian Health SystemsCartographer AINorthwind LabsLattice RoboticsVantage SecuritiesCorewavePetalsoftAuric FinancialStrateraHelix BioworksCobalt SystemsFinchley Pay

8.4B+

tokens served every day

99.98%

uptime across the last 12 months

820ms

median time to first token

2,400+

teams building on Direct Inference

Usage is aggregated across all request types. Which model, provider, or version served any individual request stays private — to your users and to your own logs.

1

endpoint

The lowest-friction way to utilize AI

Bring the SDK shape and model id your app already uses. Direct Inference collapses provider choice, model selection, capability handling, failover, and optimization into a single surface.

$0

past your cap

Hard billing caps

Public AI features need real cost guardrails. Per-key and account-level caps help keep abuse, bugs, and runaway loops from becoming a surprise bill overnight.

The experience gap

The model market is an implementation detail.

Most AI teams are forced to operate the model market: release notes, renames, context windows, vision and document support, reasoning controls, and cost curves. Direct Inference turns that churn into one durable endpoint: your app sends the request, we make sure it lands on a capable path, and the response comes back in the shape you already expect.

Which frontier model should this request use right now?

Will this still work after the next rename, retirement, or launch?

Does this request need vision, documents, long context, tools, JSON, or deeper reasoning?

Can simple traffic stay cheap without making hard requests worse?

How it works

One endpoint replaces the model spreadsheet.

Swap one base URL and send the model id your app already sends. Every request is understood by its shape and fulfilled on a capable path — the caller sees the original model id echoed back.

Before — model policy in your app

# model policy your app should not own
if has_pdf:          model = "document-capable"
elif has_image:      model = "vision"
elif long_context:   model = "long-context"
elif needs_tools:    model = "tool-strong"
elif wants_json:     model = "schema-reliable"
elif is_trivial:     model = "cheap-fast"
else:                model = "frontier"
# ...then update it every time labs ship or rename

After — Direct Inference handles it

# one endpoint; keep your model string
client = OpenAI(
    base_url="https://app.directinference.com/di/v1",
    api_key=DI_API_KEY,
)
client.chat.completions.create(model="gpt-5.5-mini", ...)

Two ways to ship AI

Wire the model market together yourself, or just send the request.

The old way means choosing models, building retries, and chasing prices across providers. Direct Inference does all of it behind one endpoint — the best model is served for you, and only your own model id comes back. It just works.

Choosing the model
Wiring it yourselfYou pick a model per task, keep a matrix current, and migrate every time one is renamed or retired.
Direct InferenceNothing to pick. Send any model id and the best available model is served for you — your id comes back unchanged.
Reliability
Wiring it yourselfYou stand up a gateway and build retry trees, failover, and rate-limit handling for every provider you touch.
Direct InferenceBuilt in. Retries, failover, and retirements are absorbed inside the endpoint, not your app.
Cost
Wiring it yourselfYou hand-tune which traffic goes cheap and watch the bill for runaways.
Direct InferenceSimple work is served cheap automatically, repeats are discounted, and hard caps fail closed.
When the market changes
Wiring it yourselfYour code sees every launch, rename, and price change.
Direct InferenceInvisible. Models move behind one endpoint without touching your integration.

Platform

One integration. Five capabilities you’d otherwise build yourself.

Direct Inference is one zero-knowledge endpoint and a control plane around every request. Five products share a single key, a single base URL, and one promise: your code sends the model id it already uses, and the serving path stays invisible.

DI Endpoint

One endpoint that does everything frontier models can do. Send any OpenAI, Anthropic, or Gemini request and the best available model is served for you — there is no model to choose, ever. Your model id comes back unchanged; the serving path stays private.

DI Reliability

Retries, failover, rate-limit handling, and model retirements are absorbed inside the endpoint. You never build a retry tree or wake up to a deprecated model — outages become recoverable service events, not your incident.

DI Observability

Production-grade visibility into everything except the model: request traces, usage by request type, and per-application attribution derived automatically from your headers.

DI Guardrails

Never overspend. Simple work is served cheap, repeated context is discounted automatically, and hard per-key and per-account caps fail closed before a request is ever dispatched.

DI Enterprise

Everything above, hardened for regulated, high-volume deployments: SSO/SAML, private and VPC delivery, audit logs, contractual SLAs, and volume pricing.

What builders want

The killer feature is removing the model layer from your app.

Developers love one place to try models, switch quickly, cap spend, survive provider churn, and keep the same integration when an experiment becomes production. Direct Inference turns that convenience into an engineered surface for real deployments.

One surface

One endpoint, zero model layer

Bring the SDK shape and model id your app already sends. Provider choice, model selection, capability handling, failover, and optimization collapse into one endpoint.

The model layer leaves your codebase.

Zero switching cost

Swap models without a migration

Audition any flagship, mini, reasoning, or open-family id by changing the model string. The endpoint, SDK, auth, and response shape stay stable.

Model choice becomes policy, not glue.

Unified access

One key for the whole model market

One Direct Inference key spans the OpenAI, Anthropic, and Gemini SDK shapes — no separate provider accounts or per-API quirks to operate.

You never ride one lab's release cycle.

Capability-first

The endpoint decides, so you don’t

Vision, documents, long context, tools, JSON, and reasoning are detected from the request, and the best model for the job is served automatically. Capability outranks the model name.

Hard requests get strength; trivial ones stay cheap.

One knob

Tune cost vs. quality in one hint

A single effort hint biases any call toward latency, cost, or quality — no code rewrite, no model swap — and capability still wins when the request demands it.

One dial instead of a model matrix.

Resilience

When a path breaks, keep serving

Rate limits, retirements, outages, and capability mismatches become recoverable service events behind the endpoint, not bespoke retry trees in your app.

Fewer incidents pushed into your product.

Cost guardrails

Hard caps for public AI

Per-key and account-level spend caps let you expose AI features without leaving your balance open to abuse, bugs, or runaway loops.

Spend fails closed.

Smart economics

Cheap where it can be, cached when it repeats

Simple traffic stays fast and economical, and reused context is billed at a reduced cached rate automatically — no cache plumbing on your side.

Lower blended cost for steady-state work.

It just works because it understands the request

The request decides what capability it needs

A PDF sent to a lightweight model id still gets document handling. Image, long-context, tool, JSON, and reasoning-shaped calls are promoted when needed; ordinary traffic stays fast and economical.

vision

Handled by a vision-capable model — even if the model string says “mini”.

document

Document-capable processing handles the file, regardless of the requested model.

long

Uses a long-context path so nothing gets silently truncated.

code

Code-shaped traffic gets tools-and-reasoning strength.

json

Structured-output requests use a model reliable at schema adherence.

reason

Hard, multi-step problems are sent to a reasoning model.

flash

Trivial traffic stays fast and cheap — where most of your margin hides.

pro

General requests land on a strong all-rounder.

Use-case coverage

36

common AI patterns, one inference surface.

This is the simplicity promise in practical terms: chat, extraction, documents, code, agents, vision, reasoning, and cheap high-volume work all enter through the same endpoint.

Everyday product AI

Fast, polished answers without paying maximum rates for routine traffic.

chat assistantssupport repliessummariesclassificationrewrite / tonetranslation

Structured workflows

When the output contract matters, the request is treated differently.

JSON schemaform extractionentity extractionvalidationtool callsworkflow handoffs

Knowledge and documents

Files and long context promote themselves instead of silently failing.

PDF readingpolicy analysisRAG synthesiscontract reviewmeeting notesresearch briefs

Code and agents

Code-shaped requests get coding, tool-use, and multi-step strength.

code generationdebuggingdiff reviewtest writingrepo navigationagent planning

Vision and multimodal

Images and screenshots trigger vision-capable handling automatically.

image QAscreenshot analysisOCR cleanupchart readingUI inspectiondiagram reasoning

Hard reasoning

Hard calls can spend more thought without moving the whole product to a premium model.

planningmath reasoningdata synthesisrisk reviewlegal-style analysismulti-step decisions

Solutions by industry

Built for the bar your industry sets.

The same endpoint, framed for regulated and high-stakes work — where auditability, data handling, and cost control are not optional.

All solutions

Financial services

Hard spend caps, full request traces, and a serving path your customers can’t see give risk and compliance teams the auditability and cost control they require over every inference.

Healthcare & life sciences

Zero data retention, no-training data handling, and a zero-knowledge endpoint keep PHI-adjacent workloads tightly scoped — with a BAA and private deployment available under DI Enterprise.

Public sector

Private and VPC delivery, audit logs, and a stable endpoint that absorbs model-market churn meet procurement and continuity standards without re-integration every time the catalog moves.

Insurance

Document-capable handling for claims and policy analysis, schema-reliable extraction for structured intake, and per-application attribution across lines of business — under one governed key.

Legal

Contract review and legal-style analysis are served by long-context, reasoning-grade models, with a zero-knowledge contract that keeps matter content off any model-shopping surface.

Enterprise platforms

Embed frontier inference into your own product without exposing which lab powers it — one durable surface your customers consume while the volatile supply side stays on our side.

Frontier-model compatibility

Built for the model ids developers actually send

Keep sending model ids from past and current frontier-lab releases. Legacy, renamed, new, and third-party provider-style ids can resolve to a capable model while your id comes back untouched.

OpenAI

gpt-5.5gpt-5.4gpt-5.4-minigpt-4.1gpt-4oo3

Anthropic

claude-opus-4-8claude-sonnet-4-6claude-haiku-4-5claude-sonnet-4-5claude-3-7-sonnet

Gemini

gemini-3.5-flashgemini-3.1-progemini-3-flashgemini-2.5-progemini-2.5-flash

Open-model families

grokdeepseekkimiqwenllamamistralgpt-oss

A sample, not a menu — the goal is compatibility with the moving frontier-model surface, not another shopping list. The model that served your request stays hidden.

Customer stories

Teams ship faster and spend less from day one.

A logo proves adoption; a number proves a result. Here is what teams measured after moving to one zero-knowledge endpoint.

−58% blended inference spend

Halcyon Freight

Challenge. A homegrown router pinned every workload to one premium model “to be safe,” and re-tuning model choice across vendors had become a recurring engineering tax.

With Direct Inference. Pointed their existing OpenAI SDK at the DI endpoint; the right model and effort tier are served per request — with zero model migrations in 12 months.

Closed a standing InfoSec finding

Brookline Financial

Challenge. Multiple teams called frontier models directly, exposing provider identity in their own logs and tripping a recurring audit finding on uncontrolled third-party data exposure.

With Direct Inference. Consolidated all LLM traffic onto DI’s zero-knowledge endpoint — provider identity hidden, no training on traffic, encryption in transit and at rest, and hard caps that fail closed.

3× faster feature shipping

Cartographer AI

Challenge. A four-person team was burning roughly a third of every sprint maintaining model-selection logic and chasing deprecations across three providers.

With Direct Inference. Deleted the in-house routing layer and replaced it with one DI endpoint plus the effort knob; per-application attribution gave them per-feature cost visibility for free.

What teams say

The model layer left their codebase — and didn’t come back.

We ripped out 600 lines of model-routing glue and a quarterly “which model is cheapest now” spreadsheet. Direct Inference just picks the right model per request and our cost-per-call dropped without us touching a thing. Eight months in, zero migrations and zero 3 a.m. pages about a deprecated model.
Priya Nadkarni VP Engineering, Halcyon Freight
The zero-knowledge contract is what got it through our security review. Our auditors loved that DI never exposes which model served a request and never trains on our traffic — and that spend fails closed at a hard cap. It moved “shadow LLM usage” from an open finding to a closed one.
Marcus Feldt CISO, Brookline Financial
Our team is four engineers. We can’t babysit a model roster across three vendors. DI gave us one endpoint that speaks the OpenAI SDK we already had, and capability handling means a PDF is still served by a document-grade model even when someone hardcodes “mini.”
Dr. Elena Voss Head of AI, Meridian Health Systems
We were spending more time tuning model selection than building product. Now we ship a feature and DI handles the cost/quality trade-off behind one effort knob. Per-application attribution means finance finally trusts the bill, and reliability has been a non-event.
James Okonkwo CTO, Cartographer AI

Operate with confidence

See everything about your traffic — except which model ran.

Zero-knowledge does not mean flying blind. You get a full operational picture of your own usage: the kinds of requests you send, which app sent them, what they cost, and a hard ceiling on spend. The one thing held back is the serving model.

Usage by request type

See how traffic splits across vision, document, long, code, json, reason, flash, and pro — without exposing which model served any of them.

Per-application attribution

Traffic segments by application automatically from your request headers, so one key can power many surfaces and still break down cleanly.

Request traces

Inspect individual requests — tokens, latency, cost, and the detected request type — for the debugging visibility production actually needs.

Hard spend caps

Per-key and account-level ceilings are enforced in the request path. Past the cap, spend fails closed instead of running up a bill.

Live classification

The playground shows the request type each call resolves to in real time, so the decision stays legible even though the model stays private.

Pay-as-you-go balance

Top up with a card and draw it down per request, with a low-balance signal before anything stalls. No seats, no minimum, no contract.

Enterprise-grade security

Most platforms guard everything they collect. We collect almost nothing.

The zero-knowledge contract is the foundation: the model, provider, and version that served a request never enter your product, your logs, or an auditor’s scope. On top of that architecture we run the controls a regulated buyer expects — independently audited, encrypted everywhere, and verifiable in writing.

SOC 2 Type IIISO/IEC 27001HIPAAGDPRCCPA

A finished-feeling surface over a volatile market

Frontier labs ship, rename, and retire models constantly. The DI endpoint stays stable, so each new capable model can improve your integration instead of forcing another migration.

Past and present model ids

Current, legacy, renamed, and new ids are treated as compatibility inputs, not outage triggers. Code written against yesterday's surface can keep working.

No vendor lock-in

One endpoint speaks the OpenAI, Anthropic, and Gemini SDK shapes. We keep the model market behind a stable developer experience.

Backed by

Direct Inference is a venture-backed company. We raised a $27M Series A to build the durable endpoint teams run their production AI on.

Coastal Ridge VenturesInflection Point CapitalLatchford & Co.Northbound Labs

One endpoint. Every frontier model. Zero knowledge of which one ran.

Point your existing OpenAI, Anthropic, or Gemini SDK at Direct Inference and ship today — keep your code, your model id, and your privacy. Start free in minutes, or talk to an engineer about scaling it across your org.

No credit card to start · Hard spend caps on by default · Your prompts are never logged, sold, or used for training.