Introducing Stability Arena and The Seven Metrics of Model Hosting
Why Latency, Throughput, Error Rate, and Cost Are No Longer Enough
TLDR: The evolving landscape of Large Language Model agents requires a fundamental shift in how model hosting infrastructure is measured. To help address this need, we are introducing Stability Arena, a public monitor designed to track Identity, Stability, and Fidelity — three new, essential metrics for the agent era.
The Art of the Infrastructure Stack
The LLM infrastructure stack spans hardware, software, and configuration layers—each with dials that affect what the model produces. Tuning this stack is as much art as science.
Each of these layers has implementation choices that affect the model’s output. They are often tuned independently, by different teams, under different pressures, with no single owner in control of the end-to-end quality of what comes out the other side. And with new model architectures arriving at an accelerating pace, every layer must be re-tuned for each new model, often under immediate pressure to go live.
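To make those dials concrete, here is a hypothetical serving configuration. The keys and values below are illustrative rather than any provider's real schema, but each one can change what callers see from the same model.

```python
# Hypothetical serving configuration for a single hosted model.
# Every key below is illustrative; none is a real provider's schema.
# Each dial is typically owned by a different team, and each one can
# shift the behavior of the model that callers ultimately see.
serving_config = {
    "hardware": {
        "accelerator": "H100",           # GPU generation changes numerics
        "tensor_parallelism": 8,         # sharding changes reduction order
    },
    "inference_engine": {
        "quantization": "fp8",           # weight/activation precision
        "kv_cache_dtype": "fp8",         # cache precision vs. capacity
        "speculative_decoding": True,    # draft-model acceptance thresholds
    },
    "serving_layer": {
        "max_context_tokens": 128_000,   # silent truncation point
        "default_temperature": 0.7,      # sampling defaults applied server-side
        "system_prompt_injection": None, # any wrapper text added to requests
    },
}
```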
The Need for Model Stability
A year ago, most interactions with large language models happened through chat interfaces. Behavioral variability was a minor friction: a response that felt slightly off, a summary that missed a nuance, etc. Then came improved reasoning capabilities (Opus 4.5+, GPT 5.x+, DeepSeek v3, Kimi K2), models that could leverage tools (MCP, Agent Skills), and harnesses that gave models sandboxes, memory, and the ability to iterate (Claude Code, Codex, Open Claw). These primitives are driving the shift from chat to agents: software that takes on tasks with real autonomy and real consequences.
In the agent era, the same variability that was once a minor annoyance now cascades through tool calls, data interpretations, and downstream decisions. The stakes on stability have fundamentally shifted. The stack hasn’t changed; the consequences of its variability have.
An Expanding Landscape Under Pressure
Model hosting is no longer the sole province of the labs that train the models. The landscape has spread rapidly across distinct categories, each with its own economics and incentives:
Frontier Labs – Anthropic, OpenAI, Google DeepMind, and others, hosting their own models as both research output and commercial product.
Hyperscalers – AWS (Bedrock), Google Cloud (Vertex AI), and Azure, offering model hosting as a managed service within their cloud ecosystems.
Neoclouds and Model Factories – A fast-growing category including companies like Together AI, Baseten, Fireworks, and others purpose-built for inference at scale.
Routers – Services that sit above endpoints and direct traffic across providers, adding another layer of abstraction and potential variability. Examples include Openrouter, Concentrate.ai, and Arena’s Max.
On-Prem and Edge – A burgeoning trend to host models locally—on enterprise hardware, personal Mac Minis, and dedicated inference appliances.
Every participant in this landscape, from $10B private neoclouds to individual developers running models on a Mac Mini, is being asked to change the wheels on the bus while driving it at speed. New models arrive with new architectures. Demand spikes are unpredictable. The immediate metric pressure is relentless: latency, throughput, error rate, cost. There is no end to this in sight.
Even frontier lab endpoints—the model makers themselves—struggle to keep their models running stably amid continual change and relentless demand.
Three Missing Metrics
The industry has settled on four established metrics for model hosting: latency, throughput, error rate, and cost. These are necessary but insufficient. They tell you whether the infrastructure is fast, available, and affordable. They do not tell you whether the model coming out the other side is the one you expected, behaving the way you expected, calibrated to the standard you expected.
We propose three additional metrics that the ecosystem needs:
1. Identity
Is this the model you think it is? Identity verification answers a deceptively simple question that becomes urgent across the scenarios enterprises now face:
Which version of which model is being hosted for you, per your contract?
Which model is powering a service that a partner is providing to you?
Which model underlies the agents that touch your working infrastructure?
If you are an agent, what is the engine of the agent you are talking to?
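One way to approach these questions from the outside is behavioral fingerprinting: send a fixed set of probe prompts with deterministic settings and compare the responses against a signature recorded from a known-good endpoint. The sketch below assumes the responses have already been collected; the probe set, the hashing scheme, and the exact-match criterion are illustrative assumptions, not Stability Arena’s published method.

```python
import hashlib

def fingerprint(responses: list[str]) -> str:
    """Collapse greedy (temperature-0) responses to a fixed probe set
    into a single hex digest. Any change in any response changes the digest."""
    h = hashlib.sha256()
    for text in responses:
        h.update(text.strip().encode("utf-8"))
        h.update(b"\x00")  # separator so concatenation is unambiguous
    return h.hexdigest()

# Responses to the same probe prompts, collected from two endpoints.
# (Toy data; in practice these come from live, deterministic requests.)
reference_responses = ["4", "Paris", "def add(a, b):\n    return a + b"]
endpoint_responses  = ["4", "Paris", "def add(a, b):\n    return a + b"]

if fingerprint(endpoint_responses) == fingerprint(reference_responses):
    print("Exact match: endpoint reproduces the reference probe behavior.")
else:
    print("Mismatch: investigate model version, quantization, or wrapping.")
```

An exact hash is deliberately strict for the sketch; in practice greedy decoding is not perfectly deterministic across hardware, so a real check would compare softer signals such as answer equivalence or token-level statistics.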
2. Stability
Is the model’s behavior consistent over time at a given endpoint? Stability is a measure of behavioral drift—whether the same tasks produce the same patterns of tool use, reasoning, and output quality from one day to the next. The questions it addresses:
Is the model’s behavioral center being moved at the endpoint, and is that why my application or agent behavior is deviating from expectation?
Is instability why my agent’s tool-calling patterns are changing?
Does a stability shift indicate a security incident, an infrastructure change, or a deliberate update?
Consistency is not something AI models are traditionally measured for. In a chat application, repetitive responses would feel cold and robotic. In an agentic workflow, consistency is exactly what you need. If the same task is given to an agent, it should use the same tools, data sources, and resources in a predictable manner.
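As a rough illustration of what a stability measurement could look like, the sketch below compares the distribution of tool calls an agent made on the same task suite across two days. The distance function and the alert threshold are assumptions chosen for the example.

```python
from collections import Counter

def tool_call_drift(day_a: list[str], day_b: list[str]) -> float:
    """Return a drift score in [0, 1] between two days of tool-call logs.
    0 means identical tool-usage frequencies; 1 means no overlap at all."""
    freq_a, freq_b = Counter(day_a), Counter(day_b)
    total_a, total_b = sum(freq_a.values()), sum(freq_b.values())
    tools = set(freq_a) | set(freq_b)
    # Total variation distance between the two normalized frequency profiles.
    return 0.5 * sum(
        abs(freq_a[t] / total_a - freq_b[t] / total_b) for t in tools
    )

# Tool calls recorded while running the same task suite on consecutive days.
monday  = ["search", "search", "read_file", "write_file", "search"]
tuesday = ["search", "read_file", "read_file", "shell", "shell"]

drift = tool_call_drift(monday, tuesday)
print(f"drift = {drift:.2f}")          # 0.60 for the toy data above
if drift > 0.3:                        # threshold is an illustrative choice
    print("Behavioral drift exceeds threshold: flag the endpoint for review.")
```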
3. Fidelity to Reference
Is the model at this endpoint behaving like the reference version from the maker? This is distinct from stability. A model can be perfectly stable at an endpoint—consistent day after day—and still be calibrated differently from the same model hosted elsewhere. Same name, different behavioral baseline.
This aligns with what model makers have themselves flagged. The same model, served through different endpoints, can behave meaningfully differently—not more or less stable, but calibrated to a different center of gravity. The questions here:
Is this the behavioral centerpoint I contracted for?
How far has this endpoint drifted from the maker’s reference implementation?
If I switch providers, what behavioral shift should I expect?
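To make the distinction from stability concrete, a fidelity check compares an endpoint not against its own history but against a baseline captured from the maker’s reference endpoint. The features and tolerances in the sketch below are illustrative assumptions, not a fixed methodology.

```python
# Coarse behavioral profiles built from the same probe suite, run once against
# the maker's reference endpoint and once against a third-party endpoint.
reference_profile = {"refusal_rate": 0.04, "mean_tokens": 310, "tool_call_rate": 0.62}
endpoint_profile  = {"refusal_rate": 0.09, "mean_tokens": 195, "tool_call_rate": 0.61}

# Per-feature tolerances are illustrative; in practice they would be calibrated.
tolerances = {"refusal_rate": 0.03, "mean_tokens": 60, "tool_call_rate": 0.10}

for feature, tolerance in tolerances.items():
    gap = abs(endpoint_profile[feature] - reference_profile[feature])
    status = "OK" if gap <= tolerance else "DEVIATES"
    print(f"{feature:15s} gap={gap:7.2f}  tolerance={tolerance:6.2f}  {status}")

# An endpoint can pass a stability check (it looks like itself yesterday)
# while failing this one (it does not look like the maker's reference).
```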
Zero-Trust Measurements
A critical requirement applies to all of these measurements: they must be independently verifiable. Measured from outside. Not self-reported.
The expanding number of participants, the layered motivations across the stack, and the routing complexity that now sits between a request and a response mean that no participant should grade its own homework when the result is a load-bearing system metric. The same principle applies in financial auditing, food safety, and emissions testing. Independent verification is how trust scales.
This is especially true for identity, stability, and fidelity, which are behavioral metrics—harder to measure, easier to obscure, and impossible for an end user to assess without dedicated tooling. Because monitoring must be continuous to detect changes in a timely manner, these techniques also need to be token-efficient. Moreover, as these systems scale in size and complexity, it will be important for agents themselves to have the skills to assess the health of the systems on which they operate and collaborate with each other.
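As one sketch of what continuous but token-efficient monitoring could look like, the loop below spends a small always-on canary budget and escalates to a deeper probe suite only when the cheap check disagrees with the stored baseline. Every number and function name here is an assumption for illustration.

```python
import random

# Illustrative numbers only: a fixed daily token budget, split so that a small
# always-on canary runs cheaply and deeper probes fire only on disagreement.
DAILY_TOKEN_BUDGET = 50_000
CANARY_COST = 200        # a handful of short, deterministic probes
DEEP_PROBE_COST = 5_000  # full behavioral suite, run sparingly

def monitor_endpoint(run_canary, run_deep_probe, budget=DAILY_TOKEN_BUDGET):
    """Spend the canary budget continuously; spend deep-probe budget
    only when the cheap check disagrees with the stored baseline."""
    spent = 0
    while spent + CANARY_COST <= budget:
        spent += CANARY_COST
        if not run_canary():                       # cheap check failed
            if spent + DEEP_PROBE_COST <= budget:
                spent += DEEP_PROBE_COST
                run_deep_probe()                   # confirm and characterize
    return spent

# Stand-ins for real checks: this toy canary "fails" about 2% of the time.
spent = monitor_endpoint(
    run_canary=lambda: random.random() > 0.02,
    run_deep_probe=lambda: print("deep probe triggered"),
)
print(f"tokens spent today: {spent}")
```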
Stability Arena
To provide a public view of these new metrics, we have begun monitoring leading models across leading endpoints from each category—frontier labs, hyperscalers, neoclouds, and routers.
You can view the monitor at arena.projectvail.com.
To be clear: this is not a report card. It is an early ecosystem map for all participants. Every entity in this landscape is being asked to manage rapid, continuous change in a domain where the rules are still being written. This is the new normal that everyone has to acclimatize to, including application builders and enterprise users of LLMs.
For Builders
We would love to connect with application builders who have felt what we felt: that the ground beneath your application is moving in ways you can sense but can’t yet measure. We built this because we wanted real data, not just feelings. If you are building on top of hosted models and want to understand the stability and fidelity of the infrastructure beneath you, reach out.
developers@projectvail.org







