Pricing
Pay by the token. Keep the experience simple.
No subscription, no per-seat fees. Direct Inference keeps the simple majority of requests economical while preserving frontier-class paths for harder work. A single effort hint biases any call toward latency, cost, or quality — no code rewrite, no model swap — and capability handling promotes a request when it needs more.
Fast
effort: low
From $0.60 / 1M output
Summaries, classification, extraction, and rewrites — the simple tail of traffic.
Balanced
effort: medium (default)
From $4.80 / 1M output
Everyday assistant and product traffic where quality and latency both matter.
Max
effort: high
From $8.80 / 1M output
Hard reasoning, long context, and answers that need frontier-grade quality.
Figures are representative starting rates per 1M output tokens. You’re billed per token at the rate of whichever model serves a given request; the effort hint biases that choice, and capability handling (vision, document, long context) can promote a call above its effort level when the request requires it.
Enterprise
For platform and engineering teams running production AI at scale — and the security, finance, and procurement partners who sign off on it.
What you actually pay
Your blended cost follows your traffic, not your worst-case model.
Most production traffic is simple. When the simple tail is served fast and cheap and only the hard tail spends frontier-grade capability, the blended rate lands far below sending every request to a top model. Here is an illustrative mix.
| Request mix | Share | Input / 1M tokens | Output / 1M tokens |
|---|---|---|---|
| Fast: simple tail (classification, extraction, short chat) | 70% | $0.40 | $0.60 |
| Balanced: everyday assistant and product traffic | 20% | $1.20 | $4.80 |
| Max: hard reasoning, long context, frontier-grade answers | 10% | $2.40 | $8.80 |
| Blended | 100% | $0.76 | $2.26 |
| If every request took the frontier path | | $2.40 | $8.80 |
68% lower input cost
74% lower output cost
Illustrative only. Shares are an example traffic mix and rates are the representative per-1M figures above; your actual blend depends on your traffic. Capability handling can still promote a request when it needs more — so a cheap simple tail never means a hard request gets shortchanged.
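To check the blended figures yourself, or to plug in your own traffic mix, the arithmetic is a share-weighted average of the per-1M rates. The sketch below reproduces the illustrative table. It assumes each share applies to token volume (that is, every tier sees roughly the same tokens per request); the dictionary keys and function names are made up for this example and are not part of any API.

```python
# Illustrative mix from the table: (share of traffic, input $/1M, output $/1M).
# Assumption: shares are applied to token volume, not just request counts.
MIX = {
    "fast": (0.70, 0.40, 0.60),
    "balanced": (0.20, 1.20, 4.80),
    "max": (0.10, 2.40, 8.80),
}


def blended_rates(mix):
    """Return the share-weighted (input, output) rate per 1M tokens."""
    total_share = sum(share for share, _, _ in mix.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError(f"shares must sum to 1.0, got {total_share}")
    blended_in = sum(share * rate_in for share, rate_in, _ in mix.values())
    blended_out = sum(share * rate_out for share, _, rate_out in mix.values())
    return blended_in, blended_out


def savings_vs(frontier_rate, blended_rate):
    """Fractional saving of the blended rate relative to the frontier rate."""
    return 1.0 - blended_rate / frontier_rate


blended_in, blended_out = blended_rates(MIX)
frontier_in, frontier_out = MIX["max"][1], MIX["max"][2]

print(f"blended input:  ${blended_in:.2f} / 1M")   # $0.76
print(f"blended output: ${blended_out:.2f} / 1M")  # $2.26
print(f"input saving:   {savings_vs(frontier_in, blended_in):.0%}")    # 68%
print(f"output saving:  {savings_vs(frontier_out, blended_out):.0%}")  # 74%
```

Swapping in your own shares shows how sensitive the blend is to the hard tail: the Max tier carries only 10% of this example's traffic but contributes about a third of the blended input rate.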
How billing works
Pay-as-you-go, by the token
Top up with a card
Add credit when you need it and draw it down per request. No monthly minimum, no contract to negotiate.
Margin on the simple tail
Trivial requests are recognized and served fast and cheap — so your blended cost drops while the same endpoint stays capable.
Cached input costs less
When a request reuses a prompt prefix, cached input is billed at a reduced rate automatically — no cache plumbing on your side.
Questions
Good to know
Which model am I paying for?
You pay for the tokens of whichever model serves each request. The DI Model is zero-knowledge by design: you see the request type, never the specific model. That lets us keep the developer experience stable while selecting capable, economical models on your behalf.
Do I need to pick an effort level per call?
No. Balanced is the default. The effort hint is optional and per-request, so you can dial a single call toward fast or max without touching the rest of your integration.
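As a rough sketch of what "optional and per-request" could look like in client code, the helper below adds an effort field only when the caller asks for one, and otherwise leaves the request untouched so the default applies. The effort values (low, medium, high) come from the presets above; the `input` and `effort` field names and the `build_request` helper are assumptions for illustration, not the documented request schema.

```python
import json

# Effort values shown on the pricing cards above.
ALLOWED_EFFORT = {"low", "medium", "high"}


def build_request(prompt, effort=None):
    """Build a request body; field names here are hypothetical."""
    body = {"input": prompt}
    if effort is not None:
        if effort not in ALLOWED_EFFORT:
            raise ValueError(f"effort must be one of {sorted(ALLOWED_EFFORT)}")
        # Only set the hint when the caller wants to override the default.
        body["effort"] = effort
    return body


# Most calls omit the hint and get the Balanced default.
default_call = build_request("Draft a reply to this customer email.")

# A single call can be dialed toward Fast without touching the rest.
fast_call = build_request("Summarize this ticket.", effort="low")

print(json.dumps(default_call))
print(json.dumps(fast_call))
```

Keeping the hint out of the body unless it is set means the rest of an integration never needs to know the field exists.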
What happens to images, PDFs, or long context?
Capability outranks the effort level. A request with an image, a document, or oversized context is promoted to a model that can handle it — even on the Fast preset — so nothing silently fails or gets truncated.
Will a renamed or unknown model id cost me an outage?
No. Unknown, legacy, and future ids resolve to a capable model instead of erroring, so a provider renaming a model does not break your code or your billing.
Ready when you are.
Start building