Permanence note. This URL is canonical for this methodology version. It will not be redirected, removed, or repurposed once published. Future revisions land at v1-1, v2-0, etc.

Plain-language summary

The Pulse Inference Token Index measures posted, per-token prices for open-weight large-language-model inference at commodity hosts. For each tracked named model, the index publishes the median, P90, minimum, and a dispersion measure across eligible commodity-host endpoints. The headline visualisation is a blended price per million tokens at a 3:1 input-to-output ratio; input-only and output-only series are presented as peer surfaces, not as collapsed details.

The index is structurally distinct from a leaderboard. It is a price index of posted commercial pricing, not a quality benchmark. Where a leaderboard ranks endpoints by latency or output quality, the index records what the market is charging and how that distribution moves over time.

Scope

The index covers commercial endpoints that satisfy all of the following:

Out of scope:

Provider families

The open-weight hosting market segments into four families with materially different economics. The index respects these segments rather than blending across them.

  1. Serverless commodity hosts — pay-per-token, no minimum commitment, auto-scaled shared capacity. Examples: Together AI, Fireworks AI, DeepInfra, Novita, Hyperbolic, Lepton, Nebius AI Studio, Replicate, Parasail. This is the basket the headline index measures.
  2. Speed-tier specialists — premium-priced custom hardware (Groq, Cerebras, SambaNova). 5–20× the throughput of standard GPU hosts. Tracked separately as a supplementary series; never folded into the headline.
  3. Hyperscaler-hosted open weights — AWS Bedrock, Azure AI Foundry, Google Vertex AI. Enterprise procurement; prices do not move like a commodity. Reference series only, excluded from the headline.
  4. Aggregators and routers — OpenRouter, Unify. Not hosts themselves; pass-through providers with embedded margin. Candidate data sources, not separate series.

Headline metric

The default visualisation on each series is a blended price per million tokens at a 3:1 input-to-output ratio. This matches industry convention (OpenRouter, Artificial Analysis leaderboards) and allows cross-provider comparison in a single number.

The 3:1 blend is a readability convention, not the ground truth of any specific workload. Input-only and output-only series are methodologically primary: they are published alongside the blend, cited first in this methodology, and used as the basis for any analysis where the true input/output mix is known. Each series page must surface input-only, output-only, and blended as peer surfaces — not buried in a details expander or an API-only field.
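As an illustration of the blend, the sketch below assumes "3:1" means a token-weighted average of three parts input to one part output. The function name and the sample prices are hypothetical, not taken from the index.

```python
def blended_price(input_per_m: float, output_per_m: float,
                  input_parts: float = 3.0, output_parts: float = 1.0) -> float:
    """Token-weighted average price per million tokens.

    Assumes the blend is (3 * input + 1 * output) / 4 at the default ratio.
    """
    total_parts = input_parts + output_parts
    if total_parts <= 0:
        raise ValueError("ratio parts must sum to a positive number")
    return (input_parts * input_per_m + output_parts * output_per_m) / total_parts


# Hypothetical endpoint: $0.50 per million input tokens, $1.50 per million output tokens.
example_blend = blended_price(0.50, 1.50)
```

Passing a different ratio (for example `input_parts=1, output_parts=1`) shows how sensitive the single number is to the assumed workload mix, which is why the input-only and output-only series remain primary.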

Statistical treatment

For each series on each assessment date, Pulse calculates and publishes:

The median is unweighted across all eligible endpoints. This is explicitly framed as a breadth-weighted market read — what posted pricing actually reveals — not a usage-weighted market price. A small host counts the same as a large one. A usage-weighted variant is conditional on provider-side traffic data and is not part of v1.0.
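A minimal sketch of the unweighted statistics follows. Two choices are assumptions rather than part of this methodology: P90 is computed by linear interpolation between closest ranks, and the dispersion measure is shown as (P90 minus minimum) divided by the median, purely as a placeholder definition.

```python
import statistics


def series_stats(prices):
    """Unweighted summary of posted prices; every endpoint counts once."""
    if not prices:
        raise ValueError("no eligible endpoints")
    ordered = sorted(prices)
    # P90 by linear interpolation between closest ranks (assumed method).
    pos = 0.9 * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    p90 = ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)
    median = statistics.median(ordered)
    return {
        "median": median,
        "p90": p90,
        "min": ordered[0],
        # Placeholder dispersion definition, not the index's own.
        "dispersion": (p90 - ordered[0]) / median,
    }
```

Because no weights are involved, adding or removing a single small host shifts the median exactly as much as adding or removing a large one.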

Quantization handling

Each endpoint's quantization is recorded. For models where both FP16 and FP8 are widely available (currently Llama 3.3 70B, Llama 4 Maverick, DeepSeek V3, Qwen 3 235B), the index publishes separate series per quantization. Mixing quantizations in a single headline number hides a real product difference: FP16 and FP8 differ measurably in output quality, and FP8 costs roughly half as much to serve as FP16 on the same hardware.

Reference checkpoint pinning

Open-weight flagship models increasingly ship multiple checkpoints under the same commercial name (DeepSeek V3, V3-0324, V3.1, V3.2, etc.). Pooling endpoints by commercial name in this situation produces a dispersion series that is partly a model-upgrade series in disguise.

Within every named model, a single reference checkpoint is pinned for the headline series. Endpoints serving a non-pinned checkpoint are eligible for the underlying database and for a separate, clearly labelled "all versions" research series, but are excluded from the headline series and from any class-series composite.

Pin selection. The pinned checkpoint is the one with the most eligible commodity-host endpoints (per the eligibility rubric below) that has been in commercial hosting for at least 30 days. Ties go to the most recent checkpoint by provider-declared release date.
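The pin-selection rule can be sketched as follows. The `Checkpoint` record and its field names are hypothetical; the eligible-endpoint count is assumed to have been computed already against the eligibility rubric.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Checkpoint:
    name: str
    release_date: date       # provider-declared release date
    first_hosted: date       # first day in commercial hosting
    eligible_endpoints: int  # endpoints passing the eligibility rubric


def select_pin(candidates, as_of, min_days=30):
    """Pick the checkpoint with the most eligible endpoints among those
    hosted for at least `min_days`; ties go to the most recent release."""
    seasoned = [c for c in candidates
                if (as_of - c.first_hosted).days >= min_days]
    if not seasoned:
        return None
    return max(seasoned, key=lambda c: (c.eligible_endpoints, c.release_date))
```

Sorting on the tuple `(eligible_endpoints, release_date)` applies the tie-break only when endpoint counts are equal.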

Re-pinning. Pins are reviewed at each calendar-quarter boundary (1 January, 1 April, 1 July, 1 October). A re-pin is executed only when the candidate has held five or more eligible commodity-host endpoints for 30 consecutive days at review time. When a re-pin occurs, the prior-checkpoint series continues for 90 days as a legacy reference with a "superseded by {new pin}" banner.

Launch pins. Llama 3.3 70B Instruct is pinned to the single active checkpoint. DeepSeek V3 is pinned to V3.2. Other launch pins are set at v1.0 publication.

Eligibility rubric

An endpoint contributes to a Pulse token series if and only if it meets all of the following. The standard is enforceability, not absolute certainty; ambiguous cases are excluded by default and captured in the raw layer for research.

  1. The provider explicitly represents the endpoint as serving the named base weights, with no disclosed fine-tune or distillation.
  2. The quantization is explicitly disclosed and matches the series' declared quantization class.
  3. Pricing is public, self-serve, and available without contact-sales, commitment, or enterprise negotiation.
  4. The quoted price applies to at least the short-context tier (≤32K tokens) that the v1.0 index measures.
  5. The endpoint is stable — not a promotional, launch, or time-limited teaser price. Pricing has been in place for at least 30 days, or is explicitly documented as standard.
  6. Cached-prefix pricing, if offered, is not used to compute the endpoint's headline rate; only the uncached rate enters the index.
  7. The endpoint is generally available (not a private beta, not waitlisted).
  8. Sampling defaults are recorded as metadata. They are not part of the eligibility check unless the provider discloses a materially non-standard restriction (e.g. forced deterministic sampling, enforced system prompts that alter the output distribution, non-standard context truncation). Disclosed non-standard restrictions exclude the endpoint.
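The rubric above reads naturally as a single predicate. The sketch below is one possible encoding; the `Endpoint` record and its field names are hypothetical. Criterion 6 concerns which rate is recorded rather than pass or fail, so it appears only as a comment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Endpoint:
    serves_base_weights: bool        # 1: no disclosed fine-tune or distillation
    quantization: Optional[str]      # 2: None when undisclosed
    public_self_serve_pricing: bool  # 3
    covers_short_context: bool       # 4: price applies to the <=32K tier
    days_at_current_price: int       # 5
    documented_as_standard: bool     # 5
    generally_available: bool        # 7
    nonstandard_restriction: bool    # 8: disclosed restriction


def is_eligible(ep: Endpoint, series_quantization: str) -> bool:
    """True only if every pass/fail criterion holds.

    Criterion 6 is handled upstream: only the uncached rate is recorded.
    """
    stable = ep.days_at_current_price >= 30 or ep.documented_as_standard
    return all([
        ep.serves_base_weights,
        ep.quantization is not None and ep.quantization == series_quantization,
        ep.public_self_serve_pricing,
        ep.covers_short_context,
        stable,
        ep.generally_available,
        not ep.nonstandard_restriction,
    ])
```

An undisclosed quantization fails the check, matching the rule that ambiguous cases are excluded by default.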

Series classification: published vs provisional

An eligible series is classified by the number of eligible commodity-host endpoints on the pinned reference checkpoint:

A provisional series is promoted to published once its eligible-host count has been five or more for 30 consecutive calendar days. A published series is demoted to provisional if its count stays below five for 30 consecutive days, and is retired if its count stays below three over the same 30-day window. The 30-day stabilisation window prevents a single provider addition or removal from flipping the label back and forth.
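The stabilisation window behaves like a small state machine over daily eligible-host counts. The sketch below assumes one count per calendar day, treats retirement as terminal, and applies the below-three rule to any non-retired series; those three points are assumptions, not statements from this methodology.

```python
def classify(daily_counts, start="provisional", window=30):
    """Replay daily eligible-host counts and return the final label."""
    state = start
    at_least_5 = below_5 = below_3 = 0  # lengths of the current streaks
    for n in daily_counts:
        at_least_5 = at_least_5 + 1 if n >= 5 else 0
        below_5 = below_5 + 1 if n < 5 else 0
        below_3 = below_3 + 1 if n < 3 else 0
        if state == "retired":
            continue  # assumed terminal
        if below_3 >= window:
            state = "retired"
        elif state == "provisional" and at_least_5 >= window:
            state = "published"
        elif state == "published" and below_5 >= window:
            state = "provisional"
    return state
```

A short dip, such as ten days at four hosts followed by a recovery, resets the streak counters without changing the label.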

Basket policy

Entry. A named model enters the tracked basket when at least five serverless-tier hosts publish a price for it and it has been commercially available for 60 days. The 60-day lag avoids launch-week pricing noise.

Retirement. A named model is retired from the active basket when fewer than five serverless hosts still serve it. Historical data is retained; the series is marked inactive rather than deleted.

Replacement. When a model family ships a successor (e.g. Llama 3.3 → Llama 4), both series run in parallel for at least 180 days so users have a continuous reference.
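The entry and retirement rules reduce to a simple check, sketched below. The function name and return labels are hypothetical, and "available for 60 days" is read as at least 60 days.

```python
def basket_action(in_basket: bool, serverless_hosts: int, days_available: int) -> str:
    """Return the basket action implied by the entry and retirement rules."""
    if not in_basket:
        # Entry: five or more serverless hosts and at least 60 days on the market.
        if serverless_hosts >= 5 and days_available >= 60:
            return "enter"
        return "wait"
    # Retirement: fewer than five serverless hosts still serve the model.
    return "retire" if serverless_hosts < 5 else "keep"
```

The parallel-run rule for successors is not modelled here: a successor is simply a second named model that passes the same entry check while the original stays in the basket.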

v1.0 basket

| Series | Named model | Quantization | Pin notes |
| --- | --- | --- | --- |
| Pulse Llama 3.3 70B Blended | Llama 3.3 70B Instruct | FP8 (headline) · FP16 (supplementary) | Single active checkpoint. |
| Pulse Llama 4 Maverick Blended | Llama 4 Maverick | FP8 | Pin to be set at v1.0 launch once spot-check completes. |
| Pulse DeepSeek V3 Blended | DeepSeek V3 (US-hosted) | FP8 | Pinned to V3.2 at v1.0 launch. |
| Pulse Qwen 3 235B Blended | Qwen 3 235B | FP8 | Pin to be set at v1.0 launch. |
| Pulse Llama 3.1 8B Blended | Llama 3.1 8B | FP16 | Single active checkpoint. |
| Pulse Mixtral 8x22B Blended | Mixtral 8x22B | FP16 | Legacy MoE reference; pin set at v1.0 launch. |

Plus one supplementary Pulse Speed-Tier Inference series covering Groq, Cerebras, and SambaNova. Speed-tier endpoints are never folded into the headline.

Jurisdictional segmentation

DeepSeek-family models are hosted both on Chinese-jurisdiction endpoints (DeepSeek's own API and any other PRC-domiciled host) and on US-jurisdiction commodity hosts. These are not comparable products for Western enterprise customers — data residency, compliance posture, and export-control exposure differ materially — and procurement teams treat them as distinct SKUs.

The headline DeepSeek series is US-hosted only. Chinese-jurisdiction endpoints are published as a separate, clearly labelled jurisdictional series, never blended into the US-hosted headline. The same principle applies prospectively to any future model where jurisdictional hosting creates a material product difference.

The price spread between jurisdictions is published as its own time series — the "DeepSeek jurisdictional gap" — and is allowed to expand, contract, or cross zero without changing the methodology. Narrative copy on the series page does not state or imply a fixed multiplier.
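A minimal sketch of the gap series follows, assuming each jurisdictional series is available as a mapping from assessment date to median price. The sign convention (US-hosted minus Chinese-jurisdiction) and the function name are assumptions.

```python
def jurisdictional_gap(us_median: dict, cn_median: dict) -> dict:
    """Gap per assessment date, computed only where both series have a value.

    Sign convention (US minus CN) is assumed; the result may be negative.
    """
    common_dates = sorted(us_median.keys() & cn_median.keys())
    return {d: us_median[d] - cn_median[d] for d in common_dates}
```

Publishing a signed difference rather than a ratio keeps the series well defined if the gap crosses zero.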

Reference cost band

For each tracked model, Pulse maintains a reference hardware target (e.g. Llama 3.3 70B FP8 → 8×H100-SXM5 node; DeepSeek V3 → 8×H200 or 8×MI300X). Given the Pulse median rental rate for that target, a published throughput assumption (tokens/second per node), and a utilisation range, a reference cost band is computed as:

$/M tokens  =  (hourly GPU cost × 1,000,000) / (tokens-per-second × 3,600 × utilisation)

Three points on the band are published:

Throughput source precedence. In order: (1) Pulse's own evaluation runs where available, conducted under documented conditions; (2) named third-party benchmarks that publish reproducible test conditions; (3) provider-disclosed figures, used only where the prior two are silent and explicitly flagged as provider-reported. Every throughput reference number is published with its source so readers can substitute their own assumption.

The band is explicitly labelled a modelled reference, not an observed price. Three lines rather than one make the modelling honestly contestable. The band is recomputed when the underlying GPU rental rate changes; a change to a throughput or utilisation assumption triggers a restatement of the cost band, not of the market-observed series.
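The cost-band formula can be evaluated directly, as in the sketch below. The hourly cost is taken to be the rental rate for the whole reference target, consistent with a per-node throughput figure. The utilisation points and the sample inputs are hypothetical placeholders, not published Pulse assumptions.

```python
def cost_per_million(hourly_cost: float, tokens_per_second: float,
                     utilisation: float) -> float:
    """$/M tokens = (hourly cost * 1,000,000) / (tokens/s * 3,600 * utilisation)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    return hourly_cost * 1_000_000 / (tokens_per_second * 3_600 * utilisation)


def cost_band(hourly_cost: float, tokens_per_second: float,
              utilisations=(0.3, 0.5, 0.8)) -> dict:
    """Evaluate the formula at several utilisation points (placeholder values)."""
    return {u: cost_per_million(hourly_cost, tokens_per_second, u)
            for u in utilisations}


# Hypothetical node: $36/hour, 10,000 tokens/second.
example_band = cost_band(36.0, 10_000)
```

Cost per token falls as utilisation rises, so the lowest utilisation point gives the top of the band and the highest gives the bottom.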

Update cadence

The index updates weekly for v1.0. Intraday collection is retained internally so the cadence can move to daily once the pipeline is validated against multiple weeks of clean operation. Serverless endpoint prices step-change rather than drift: providers reprice discretely, often with press announcements, so a weekly cadence is sufficient to capture the signal at v1.0.

Data quality and restatement

Each collected data point passes the eligibility rubric above before it is admitted to a series. A published price that deviates by more than 50% from the current series median is flagged as an anomaly for review. A flagged endpoint stays in the series unless review finds a data-quality issue (API error, stale data, or a mis-attributed checkpoint), in which case it is excluded and the exclusion is logged on the public corrections page.
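The anomaly flag can be sketched as a relative-deviation check. Here the "current series median" is assumed to be computed over the same set of prices being checked, and a deviation of exactly 50% is not flagged; the function name and endpoint labels are hypothetical.

```python
import statistics


def flag_anomalies(prices: dict, threshold: float = 0.5) -> list:
    """Return endpoint names whose price deviates from the median by more
    than `threshold` (as a fraction of the median), in sorted order."""
    if not prices:
        return []
    median = statistics.median(prices.values())
    return sorted(name for name, price in prices.items()
                  if abs(price - median) / median > threshold)
```

Flagging only queues an endpoint for review; it does not remove the endpoint from the series.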

Restatements follow the Pulse restatement policy: errors are corrected promptly, the correction is logged with the original and restated values, the affected dates are listed, and the methodology version is preserved.

Source code

The implementing code lives in the public repository: https://github.com/pulsebenchmarks. Each tagged release of the repository corresponds to a methodology version; the v1.0 release implements this page.

Changelog

v1.0 — Initial published methodology. Defines scope, provider families, headline 3:1 blended metric with input/output as peer surfaces, breadth-weighted median, quantization separation, reference checkpoint pinning, eight-point eligibility rubric, published-vs-provisional series classification, basket entry/retirement/replacement, jurisdictional segmentation with the DeepSeek jurisdictional gap as its own series, reference cost band derived from the Pulse GPU rental index, and weekly update cadence.

Cite this version