Pricing

Pay by the token. Keep the experience simple.

No subscription, no per-seat fees. Direct Inference keeps the simple majority of requests economical while preserving frontier-class paths for harder work. A single effort hint biases any call toward latency, cost, or quality — no code rewrite, no model swap — and capability handling promotes a request when it needs more.

Fast

effort: low

From $0.40 / 1M input · $0.60 / 1M output

Target latency ~1.2s

Summaries, classification, extraction, and rewrites — the simple tail of traffic.


Balanced

Default

effort: medium

From $1.20 / 1M input · $4.80 / 1M output

Target latency ~2s

Everyday assistant and product traffic where quality and latency both matter.


Max

effort: high

From $2.40 / 1M input · $8.80 / 1M output

Target latency ~6s

Hard reasoning, long context, and answers that need frontier-grade quality.


Figures are representative starting rates per 1M tokens. You’re billed per token at the rate of whichever model serves a given request; the effort hint biases that choice, and capability handling (vision, document, long context) can promote a call above its effort level when the request requires it.
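As a rough illustration of per-token billing, the sketch below estimates the cost of a single request from the representative rates above. The `RATES` table and the `request_cost` helper are hypothetical names made up for this example, not part of any SDK, and the real charge depends on which model actually serves the request.

```python
from decimal import Decimal

# Representative starting rates in USD per 1M tokens, copied from the tiers above.
# Assumption: the request is served at its effort level and is not promoted.
RATES = {
    "low": (Decimal("0.40"), Decimal("0.60")),
    "medium": (Decimal("1.20"), Decimal("4.80")),
    "high": (Decimal("2.40"), Decimal("8.80")),
}

def request_cost(effort: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the USD cost of one request at the representative rates."""
    in_rate, out_rate = RATES[effort]
    return (in_rate * input_tokens + out_rate * output_tokens) / Decimal(1_000_000)

# A Balanced request with 2,000 input tokens and 500 output tokens.
estimate = request_cost("medium", 2_000, 500)
print(f"estimated cost: ${estimate}")
```

Using `Decimal` rather than floats keeps small per-request amounts exact when they are summed over many calls.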

Enterprise

Custom pricing · Contact us

For platform and engineering teams running production AI at scale — and the security, finance, and procurement partners who sign off on it.

SAML / OIDC single sign-on & SCIM provisioning
99.95% uptime SLA with financially-backed credits
Dedicated capacity & reserved throughput
Private, dedicated, or VPC deployment
Org-wide audit logs & SIEM export
Account-wide & per-application spend caps
Volume-based & committed-use pricing
Annual invoicing, POs, and net terms
Signed BAA, MSA, and DPA; SOC 2 report access
Named technical account manager & priority support

What you actually pay

Your blended cost follows your traffic, not your worst-case model.

Most production traffic is simple. When the simple tail is served fast and cheap and only the hard tail spends frontier-grade capability, the blended rate lands far below the cost of sending every request to a top model. Here is an illustrative mix.

| Request mix | Share | In /1M | Out /1M |
|---|---|---|---|
| Fast: simple tail (classification, extraction, short chat) | 70% | $0.40 | $0.60 |
| Balanced: everyday assistant and product traffic | 20% | $1.20 | $4.80 |
| Max: hard reasoning, long context, frontier-grade answers | 10% | $2.40 | $8.80 |
| Blended | 100% | $0.76 | $2.26 |
| If every request took the frontier path | | $2.40 | $8.80 |
vs. an all-frontier baseline: 68% lower input cost · 74% lower output cost

Illustrative only. Shares are an example traffic mix and rates are the representative per-1M figures above; your actual blend depends on your traffic. Capability handling can still promote a request when it needs more — so a cheap simple tail never means a hard request gets shortchanged.
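The blended figures are a share-weighted average of the per-tier rates. The sketch below reproduces that arithmetic for the example mix. It assumes the shares describe token volume (if they describe request counts, the blend also depends on tokens per request), and `MIX`, `blended_rates`, and `savings_vs` are names invented for the example.

```python
# Illustrative traffic mix: (share, input $/1M, output $/1M), from the table above.
MIX = {
    "fast": (0.70, 0.40, 0.60),
    "balanced": (0.20, 1.20, 4.80),
    "max": (0.10, 2.40, 8.80),
}

def blended_rates(mix):
    """Return the share-weighted (input, output) rates per 1M tokens."""
    blended_in = sum(share * rate_in for share, rate_in, _ in mix.values())
    blended_out = sum(share * rate_out for share, _, rate_out in mix.values())
    return blended_in, blended_out

def savings_vs(baseline, blended):
    """Fractional saving of a blended rate relative to a baseline rate."""
    return 1 - blended / baseline

blended_in, blended_out = blended_rates(MIX)
frontier_in, frontier_out = MIX["max"][1], MIX["max"][2]

print(f"blended: ${blended_in:.2f} in, ${blended_out:.2f} out per 1M")
# prints "blended: $0.76 in, $2.26 out per 1M"
print(f"savings: {savings_vs(frontier_in, blended_in):.0%} in, "
      f"{savings_vs(frontier_out, blended_out):.0%} out")
# prints "savings: 68% in, 74% out"
```

Swapping in your own shares shows how sensitive the blend is to the size of the hard tail.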

How billing works

Pay-as-you-go, by the token

Top up with a card

Add credit when you need it and draw it down per request. No monthly minimum, no contract to negotiate.

Margin on the simple tail

Trivial requests are recognized and served fast and cheap — so your blended cost drops while the same endpoint stays capable.

Cached input costs less

When a request reuses a prompt prefix, cached input is billed at a reduced rate automatically — no cache plumbing on your side.
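To see how a reduced cached-input rate affects a bill, here is a minimal sketch. The reduced rate itself is not stated on this page, so `cached_discount` is a placeholder parameter and the 0.5 used in the example is purely illustrative. `input_cost` is a hypothetical helper, not an SDK function.

```python
def input_cost(input_tokens: int, cached_tokens: int,
               rate_per_million: float, cached_discount: float) -> float:
    """Estimate input cost in USD when part of the prompt is served from cache.

    cached_discount is the fraction taken off the normal input rate for cached
    tokens. It is a placeholder: the actual reduced rate is not stated here.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    if not 0 <= cached_discount <= 1:
        raise ValueError("cached_discount must be between 0 and 1")
    fresh_tokens = input_tokens - cached_tokens
    # Cached tokens count as a fraction of a full-price token.
    billable = fresh_tokens + cached_tokens * (1 - cached_discount)
    return billable * rate_per_million / 1_000_000

# Example: 10,000 input tokens, 8,000 of them a reused prefix, at the Balanced
# input rate. The 0.5 discount is an illustrative value only.
cost = input_cost(10_000, 8_000, rate_per_million=1.20, cached_discount=0.5)
print(f"estimated input cost: ${cost:.4f}")
```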

Questions

Good to know

Which model am I paying for?

You pay for the tokens of whichever model serves each request. The DI Model is zero-knowledge by design: you see the request type, never the specific model. That lets us keep the developer experience stable while selecting capable, economical models on your behalf.

Do I need to pick an effort level per call?

No. Balanced is the default. The effort hint is optional and per-request, so you can dial a single call toward fast or max without touching the rest of your integration.
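A per-request hint might look like the sketch below, which only builds request bodies and makes no network call. The field names `input` and `effort` and the `build_request` helper are assumptions for illustration. The values `low`, `medium`, and `high` are the effort levels shown on the tier cards.

```python
from typing import Optional

# Effort values as listed on the tier cards above.
VALID_EFFORTS = ("low", "medium", "high")

def build_request(prompt: str, effort: Optional[str] = None) -> dict:
    """Build a request body. The field names "input" and "effort" are assumptions."""
    body = {"input": prompt}
    if effort is not None:
        if effort not in VALID_EFFORTS:
            raise ValueError(f"effort must be one of {VALID_EFFORTS}, got {effort!r}")
        body["effort"] = effort
    return body

# No hint: the request is left to the default (Balanced) behavior.
default_call = build_request("Draft a reply to this customer email.")
# Same code path, one call biased toward speed and one toward quality.
fast_call = build_request("Classify this ticket as bug or feature.", effort="low")
max_call = build_request("Find the flaw in this proof.", effort="high")
```

Omitting the field entirely, rather than sending a default value, keeps existing calls unchanged when the hint is adopted gradually.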

What happens to images, PDFs, or long context?

Capability outranks the effort level. A request with an image, a document, or oversized context is promoted to a model that can handle it — even on the Fast preset — so nothing silently fails or gets truncated.
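One way to picture "capability outranks effort" is a selection rule that treats the effort level as a floor and the request's capabilities as a hard requirement. The sketch below is an assumption about how such a rule could work, not a description of the actual router. The `Route` type, the capability names, and the capability table are all invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    name: str
    level: int                 # 0 = low, 1 = medium, 2 = high effort
    capabilities: frozenset    # capabilities this route can handle

# Hypothetical capability table, invented for illustration. This page does not
# say which tier supports which capability.
ROUTES = (
    Route("fast", 0, frozenset()),
    Route("balanced", 1, frozenset({"vision"})),
    Route("max", 2, frozenset({"vision", "document", "long_context"})),
)

def select_route(effort_level: int, required: set) -> Route:
    """Pick the lowest route at or above the effort level that covers `required`."""
    candidates = [
        r for r in ROUTES
        if r.level >= effort_level and required <= r.capabilities
    ]
    if not candidates:
        raise LookupError(f"no route supports {sorted(required)}")
    return min(candidates, key=lambda r: r.level)

# A low-effort request with an attached document is promoted past "fast".
print(select_route(0, {"document"}).name)
```

A plain low-effort text request still resolves to the cheapest route, so promotion only costs more when the request actually needs the capability.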

Will a renamed or unknown model id cost me an outage?

No. Unknown, legacy, and future ids resolve to a capable model instead of erroring, so a provider renaming a model does not break your code or your billing.
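The behavior described here resembles a lookup with a fallback instead of an error. The sketch below shows that pattern under stated assumptions: the model ids are placeholders, and falling back to a "balanced" route is a guess for illustration, since the page only says that unknown ids resolve to a capable model.

```python
# Hypothetical mapping from model ids to routes; these ids are placeholders.
KNOWN_IDS = {
    "example-small-v1": "fast",
    "example-large-v1": "max",
}
DEFAULT_ROUTE = "balanced"  # assumed fallback, matching the default effort level

def resolve_route(model_id: str) -> str:
    """Return the route for a model id, falling back instead of raising."""
    return KNOWN_IDS.get(model_id, DEFAULT_ROUTE)

print(resolve_route("example-small-v1"))   # known id
print(resolve_route("renamed-model-2030"))  # unknown id still resolves
```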

Ready when you are.

Start building