Pilot programs often demonstrate promising results, but scaling decisions are rarely based on enthusiasm alone. Commissioners and funders must determine whether outcomes are attributable to the model itself, whether those outcomes can be sustained at scale, and whether risks are understood and controlled. This article sits within Scaling What Works and connects closely to commissioning logic in Integrated Funding Pilots, focusing on how evidence must evolve before a model is expanded.
Why pilot evidence breaks down at scale
Pilots are usually resource-intensive, closely supervised, and selectively staffed. Outcomes achieved under these conditions do not automatically translate into system-level impact. Commissioners therefore look beyond headline improvements to understand whether the model will perform when staffing ratios change, leadership attention diffuses, and client complexity increases.
Scaling decisions hinge on whether evidence demonstrates causality, not just correlation, and whether performance holds when operational conditions become less favorable.
System expectations leaders must meet
Expectation 1: Clear attribution between the model and outcomes
Commissioners expect providers to explain why outcomes improved and which elements of the model caused the change. This requires linking outcomes to specific practices—such as escalation thresholds, follow-up cadence, or coordination mechanisms—rather than presenting results as an undifferentiated success.
Expectation 2: Evidence that outcomes remain stable under pressure
Oversight bodies increasingly look for evidence that performance holds under stress: during staffing gaps and demand surges, and among high-risk subgroups. A model that performs only under ideal conditions is unlikely to be scaled.
Moving from pilot metrics to scale-ready outcomes
Scale-ready evidence reframes metrics around system value. Instead of asking “Did the pilot work?”, commissioners ask “What problem does this solve at scale, for whom, and at what cost or risk?” This requires outcome frameworks that incorporate baselines, counterfactuals, and risk adjustment.
Leaders preparing for scale should align outcomes with system priorities such as avoided acute use, improved continuity, equity of access, and reduced downstream cost volatility.
Operational example 1: Establishing outcome attribution through pathway-linked metrics
What happens in day-to-day delivery: The service maps each intended outcome to specific workflow steps. For example, reduced ED use is linked to risk stratification completion, escalation timeliness, and post-escalation follow-up. Data dashboards show not only outcomes but also whether the upstream steps occurred as designed. Where an outcome was not achieved, supervisors review the case to determine which pathway step broke down.
Why the practice exists (failure mode it addresses): Pilots often report outcomes without showing how they were achieved, making it impossible to know which parts of the model matter.
What goes wrong if it is absent: Outcomes appear positive but cannot be replicated. When performance declines at scale, leaders cannot identify which elements need reinforcement.
What observable outcome it produces: Commissioners can see a defensible chain of causality linking delivery practices to outcomes, increasing confidence that results will hold when expanded.
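As a rough illustration of pathway-linked review, the sketch below computes how often each upstream step occurred and, for cases where the outcome was not achieved, which step broke down first. The step names, the Case fields, and the sample cases are all hypothetical and chosen only to mirror the ED-use example above; a real service would substitute its own pathway definition and data.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical pathway steps for the "reduced ED use" outcome, in workflow order.
PATHWAY_STEPS = ("risk_stratified", "escalated_on_time", "followed_up")


@dataclass
class Case:
    case_id: str
    risk_stratified: bool
    escalated_on_time: bool
    followed_up: bool
    ed_visit_avoided: bool  # the outcome the pathway is meant to produce


def first_broken_step(case):
    """Return the first pathway step that did not occur, or None if all occurred."""
    for step in PATHWAY_STEPS:
        if not getattr(case, step):
            return step
    return None


def step_completion_rates(cases):
    """Share of cases in which each upstream step occurred as designed."""
    if not cases:
        return {}
    return {
        step: sum(getattr(c, step) for c in cases) / len(cases)
        for step in PATHWAY_STEPS
    }


def breakdown_of_failures(cases):
    """For cases where the outcome was not achieved, count the first broken step."""
    counts = Counter()
    for c in cases:
        if not c.ed_visit_avoided:
            counts[first_broken_step(c) or "all_steps_completed"] += 1
    return counts


# Made-up cases for illustration only.
cases = [
    Case("A", True, True, True, True),
    Case("B", True, False, True, False),
    Case("C", True, True, False, False),
    Case("D", False, False, False, False),
    Case("E", True, True, True, False),
]

rates = step_completion_rates(cases)
failures = breakdown_of_failures(cases)
```

Cases that land in the "all_steps_completed" bucket are worth separate attention: the pathway ran as designed and the outcome still was not achieved, which suggests either a gap in the pathway itself or factors outside the model.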
Operational example 2: Using comparator baselines to test scalability
What happens in day-to-day delivery: The program defines a comparator group (historical, geographic, or matched by risk profile) and tracks outcomes side by side. Analysts adjust for differences in acuity, housing instability, or comorbidities. Results are reviewed quarterly to assess whether improvements persist as volume increases.
Why the practice exists (failure mode it addresses): Without comparators, improvements may reflect external trends rather than the model itself.
What goes wrong if it is absent: Scaling decisions are made on misleading signals, leading to investment in models that do not outperform existing practice.
What observable outcome it produces: Evidence shows not only improvement, but improvement relative to what would likely have happened anyway.
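One common way to express "improvement relative to what would likely have happened anyway" is a difference-in-differences comparison, computed within each risk tier and then weighted by the program's own tier mix. The sketch below is a minimal version of that idea, not a description of any particular commissioner's method. The tier names, the dictionary layout, and all figures (ED visits per client) are invented for illustration, and the estimate is only meaningful if the comparator group would otherwise have followed a similar trend.

```python
from statistics import mean


def diff_in_diff(program_before, program_after, comparator_before, comparator_after):
    """Change in the program group minus change in the comparator group."""
    program_change = mean(program_after) - mean(program_before)
    comparator_change = mean(comparator_after) - mean(comparator_before)
    return program_change - comparator_change


def risk_weighted_diff_in_diff(tiers):
    """Average the per-tier estimates, weighted by the program's client mix.

    `tiers` maps a risk-tier name to a dict with four lists of per-client
    outcome values: program_before, program_after, comparator_before,
    comparator_after. This layout is an assumption made for the sketch.
    """
    total_clients = sum(len(t["program_after"]) for t in tiers.values())
    estimate = 0.0
    for t in tiers.values():
        weight = len(t["program_after"]) / total_clients
        estimate += weight * diff_in_diff(
            t["program_before"],
            t["program_after"],
            t["comparator_before"],
            t["comparator_after"],
        )
    return estimate


# Made-up ED visits per client, before and after, by risk tier.
tiers = {
    "lower_risk": {
        "program_before": [2, 2, 2, 2],
        "program_after": [1, 1, 1, 1],
        "comparator_before": [2, 2],
        "comparator_after": [2, 1],
    },
    "higher_risk": {
        "program_before": [6, 6],
        "program_after": [3, 5],
        "comparator_before": [6, 6, 6, 6],
        "comparator_after": [5, 6, 5, 6],
    },
}

adjusted_effect = risk_weighted_diff_in_diff(tiers)
```

A negative value here means the program group's ED use fell by more than the comparator's after accounting for risk mix. Weighting by tier keeps a comparator with a different acuity profile from flattering or penalizing the model.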
Operational example 3: Demonstrating outcome stability across risk and equity groups
What happens in day-to-day delivery: Outcomes are segmented by risk tier, language need, housing status, and rurality. Performance dashboards flag widening gaps as scale increases. When disparities emerge, leaders adjust staffing, outreach methods, or partner interfaces and track whether equity indicators recover.
Why the practice exists (failure mode it addresses): Models often perform well for lower-risk groups but degrade for those with higher complexity as volume grows.
What goes wrong if it is absent: Scaling amplifies inequities, creating political and ethical risk and undermining commissioner confidence.
What observable outcome it produces: Evidence shows that scaling does not trade equity for efficiency, supporting system-wide adoption.
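The gap-flagging logic described in this example can be sketched as follows: compute the outcome rate for each segment in a pilot period and a scale period, measure each segment's gap to the overall rate, and flag any segment whose gap widened by more than a tolerance. The segment names, period labels, the 0.05 threshold, and the sample records are all assumptions made for this sketch; the code also assumes every segment appears in both periods.

```python
from collections import defaultdict

# Hypothetical tolerance: flag a segment if its gap to the overall rate
# widens by more than 5 percentage points between periods.
GAP_WIDENING_THRESHOLD = 0.05


def rates_by_segment(records):
    """Outcome rate per (segment, period), plus an "ALL" pseudo-segment.

    `records` is an iterable of (segment, period, achieved) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [achieved, total]
    for segment, period, achieved in records:
        for key in ((segment, period), ("ALL", period)):
            counts[key][0] += int(achieved)
            counts[key][1] += 1
    return {key: achieved / total for key, (achieved, total) in counts.items()}


def widening_gaps(records, early="pilot", late="scale",
                  threshold=GAP_WIDENING_THRESHOLD):
    """Return segments whose shortfall against the overall rate grew past the threshold."""
    rates = rates_by_segment(records)
    segments = sorted({seg for seg, _ in rates if seg != "ALL"})
    flagged = []
    for seg in segments:
        gap_early = rates[("ALL", early)] - rates[(seg, early)]
        gap_late = rates[("ALL", late)] - rates[(seg, late)]
        if gap_late - gap_early > threshold:
            flagged.append(seg)
    return flagged


def make_records(segment, period, achieved, total):
    """Build illustrative records: `achieved` successes out of `total` clients."""
    return [(segment, period, i < achieved) for i in range(total)]


# Made-up data: outcomes hold for one segment and slip for the other at scale.
records = (
    make_records("stable_housing", "pilot", 8, 10)
    + make_records("stable_housing", "scale", 8, 10)
    + make_records("unstable_housing", "pilot", 7, 10)
    + make_records("unstable_housing", "scale", 5, 10)
)

flagged = widening_gaps(records)
```

Comparing each segment against the overall rate is one design choice among several; a service might instead compare against a fixed reference group or a commissioner-set target. In practice, small segments would also need a minimum sample size before a flag is treated as meaningful.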
What commissioners ultimately want to know
Before scaling, commissioners want assurance that outcomes are real, repeatable, and resilient. Providers that can demonstrate attribution, comparative performance, and stability under pressure move from “promising pilot” to “system solution.”