You Can't Prove It Worked. Now What?
February 6, 2026
You can't prove your change worked. If you're operating in a high-noise environment, you should stop trying.
I ran algorithm products for a delivery company doing 500,000 orders per day across nine countries. Routing, ETA, pricing. Every change I shipped came with the same question from leadership: "How do we know this actually improved things?"
The honest answer was: we don't. Not with certainty. Demand was stochastic. We had 10-minute delivery windows, an employed courier fleet that couldn't be surged, weather shocks that rewrote constraints mid-shift. You can't A/B test routing at city scale when every courier assignment affects every other courier assignment. Treatment contaminates control.
So we stopped trying to prove causality. We started building cases credible enough to act on.
Why Some Environments Resist Clean Measurement
This isn't a methodology failure. It's structural. Three conditions break clean experiments:
Interference. Your treatment affects your control. In delivery, assigning one courier differently changes which couriers are available for every other order. There's no isolated test group. Even companies famous for experimentation culture have teams dedicated to this problem. Lyft and DoorDash both discovered that naive A/B tests led them to wrong conclusions.
Non-stationarity. Conditions change faster than you can measure. Order volume shifts weekly. Rain hits and couriers slow down. You launch a change on Monday, but Monday's conditions don't exist on Tuesday. Comparing before and after means comparing different worlds.
Scale constraints. Even if you tried to solve this with brute force, holding out entire cities as control groups, you'd still face the same problems. Cities differ from each other just like days differ from each other. Different demand patterns, different courier density, different local conditions. You're not comparing treatment versus control. You're comparing City A on its trajectory versus City B on its trajectory, and hoping the difference is your change. It rarely is. And you've paid the cost of degraded service in the holdout city to learn something you still can't be confident about.
If you're in operations, logistics, or marketplaces, you're probably facing at least two of these.
The Skeptic's Trap
There's a stance that sounds rigorous: "If you can't prove it, you can't claim it."
But in environments where proof isn't available, this stance has a consequence: nothing changes. Demanding proof becomes a veto. And the person wielding the veto bears no accountability for the decisions that never got made.
The cost of a wrong decision is visible. You shipped a change, metrics got worse, someone's responsible. The cost of no decision is invisible. You didn't ship the change, metrics stayed flat, nobody knows what would have happened. The skeptic looks prudent. The opportunity cost never shows up in a report.
I've sat in rooms where a proposed improvement got blocked because we couldn't isolate causality. But the same standard was never applied to the status quo. Nobody asked: can you prove the current system is optimal? The burden of proof only applied to change, never to inaction.
This isn't skepticism. It's a policy choice disguised as rigor.
What Credibility Requires
If proof isn't available, what's the alternative? Not guessing. Not gut feel dressed up in charts. The alternative is building a case credible enough to decide on.
The shift: stop asking "can we prove causality?" and start asking "do multiple independent lines of evidence point the same direction?" No single method is conclusive. Convergence is the signal.
Epidemiologists call this triangulation. It's how they established that smoking causes cancer. You can't run a randomized trial where you force people to smoke for 30 years. So they built the case from multiple angles: observational studies, dose-response relationships, biological mechanism research. Each method had weaknesses. But they all pointed the same way. Convergence despite different sources of error is what made the case credible.
The Three Legs
Simulation. Build a synthetic environment where you control everything. Run old config versus new config against the same historical data.
The strength is isolation. You're measuring only what you changed. The weakness is that models aren't reality. Your simulation embeds assumptions. If those assumptions are wrong, your results are wrong in ways that look right. Simulation tells you what should happen if your model is correct. It doesn't tell you if your model is correct.
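Concretely, the comparison can be as small as a paired replay. Here's a minimal sketch in Python, assuming you already have some simulator exposed as a function like simulate(config, day) -> cost; the function name, the cost metric, and the bootstrap interval are illustrative, not a specific tool.

```python
import random
import statistics

def compare_configs(simulate, old_config, new_config, historical_days,
                    n_bootstrap=1000, seed=0):
    """Replay both configs over the same historical days and return the mean
    per-day cost delta plus a bootstrap interval for how noisy that delta is.
    `simulate` is assumed: your replay simulator, whatever it looks like."""
    old_costs = [simulate(old_config, day) for day in historical_days]
    new_costs = [simulate(new_config, day) for day in historical_days]
    deltas = [n - o for n, o in zip(new_costs, old_costs)]  # paired by day,
    # since assignments within a day interact and shouldn't be split apart

    rng = random.Random(seed)
    boot_means = sorted(
        statistics.mean(rng.choices(deltas, k=len(deltas)))
        for _ in range(n_bootstrap)
    )
    point = statistics.mean(deltas)
    low = boot_means[int(0.025 * n_bootstrap)]
    high = boot_means[int(0.975 * n_bootstrap)]
    return point, (low, high)  # negative delta = new config looks cheaper
```

The pairing is the point: both configs see exactly the same demand, so whatever difference you measure is your change plus model error, nothing else.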
Comparison groups. Find natural variation in production and exploit it. Stores that got the change versus stores that didn't. Before versus after, controlling for what you can.
The strength is that it's real. Whatever happened, happened. The weakness is confounders. Everything changes all the time. You're never comparing like with like. We built store classification systems for this: group by volume patterns, courier density, demand profiles. Compare within clusters. Never clean, but grounded in what actually occurred.
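Here's roughly what that looks like as code. This is a sketch, not our production system: the store fields, the choice of k-means, and the cluster count are stand-ins for whatever classification you'd actually build.

```python
from collections import defaultdict
from statistics import mean
from sklearn.cluster import KMeans  # any method that groups similar stores works

def within_cluster_deltas(stores, n_clusters=5):
    """stores: list of dicts with illustrative keys 'daily_orders',
    'courier_density', 'got_change' (bool), 'metric'.
    Returns one treated-minus-untreated difference per cluster, so changed
    stores are only compared against roughly similar unchanged stores."""
    # In practice you'd standardize features first; skipped here for brevity.
    features = [[s["daily_orders"], s["courier_density"]] for s in stores]
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)

    groups = defaultdict(lambda: {"treated": [], "untreated": []})
    for store, label in zip(stores, labels):
        key = "treated" if store["got_change"] else "untreated"
        groups[label][key].append(store["metric"])

    deltas = {}
    for label, g in groups.items():
        if g["treated"] and g["untreated"]:
            deltas[label] = mean(g["treated"]) - mean(g["untreated"])
    return deltas  # look for a consistent sign across clusters, not one number
```

The clustering algorithm doesn't matter much. What matters is that every comparison happens between stores that at least resemble each other.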
Case investigation. Pick specific instances and trace them end to end. What did the algorithm decide? Why? What happened in reality?
The strength is mechanism. Aggregate data tells you something changed. Case investigation tells you why. The weakness is selection bias: pick your cases after the fact and you'll find whatever story you went looking for. The discipline is to sample systematically, not chase anomalies. Pick 50 random orders, trace them, see if the mechanism you expect is the mechanism you find.
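The sampling discipline is easy to make mechanical. A sketch, assuming hypothetical trace_order and expected_mechanism helpers you'd write against your own decision logs:

```python
import random

def sample_cases(order_ids, n=50, seed=7):
    """Fix the random sample up front so there's no room to chase anomalies."""
    rng = random.Random(seed)
    ids = list(order_ids)
    return rng.sample(ids, k=min(n, len(ids)))

def audit_mechanism(order_ids, trace_order, expected_mechanism, n=50):
    """Trace each sampled order and report how often the mechanism you
    predicted in advance actually shows up in the decision log and outcome.
    `trace_order` and `expected_mechanism` are assumptions: one pulls the
    algorithm's decision plus what happened, the other is a predicate you
    wrote down before looking."""
    sample = sample_cases(order_ids, n=n)
    hits = sum(1 for oid in sample if expected_mechanism(trace_order(oid)))
    return hits / len(sample), sample  # hit rate, plus which orders were checked
```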
The principle. If all three legs point the same direction, you have a credible case. If they diverge, you have a problem to investigate. Divergence is information. Convergence is confidence.
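If you want the decision rule to be explicit rather than implied, it can be embarrassingly simple. A sketch with illustrative thresholds, following the sign convention from the replay sketch above (negative deltas mean lower cost):

```python
def legs_converge(sim_delta, comparison_delta, mechanism_hit_rate,
                  min_hit_rate=0.7):
    """Deliberately crude: the two quantitative legs must agree in direction,
    and the case audit must find the predicted mechanism often enough
    to take seriously. The threshold is a judgment call, not a statistic."""
    same_direction = (sim_delta < 0) == (comparison_delta < 0)
    return same_direction and mechanism_hit_rate >= min_hit_rate
```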
How to Present This
The wrong way is to pretend a credible case is proof. "Our analysis shows the new solver reduced costs by 12%" sounds clean, but it invites the wrong challenge.
The right way: "We tested the new config in simulation against six months of historical orders. It showed 15% improvement. In production, comparing similar stores, we saw 11% over eight weeks. That gap is within expected variance. We traced 200 routes and confirmed the mechanism. Here's an example."
This invites challenge on the right things. "Do you think our simulation assumptions are wrong?" is productive. "Can you prove causality?" is a trap that leads nowhere.
One practical note: document your reasoning before you have results. Write down what you expect each method to show. If you construct the narrative after seeing data, you're telling a story. If you predicted the pattern in advance, you're testing a hypothesis.
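One way to make that binding is to write the expectations down as data and commit them next to the analysis before running anything. A sketch, with made-up field names and a hypothetical change:

```python
from datetime import date

# Recorded before any results exist; every field here is illustrative.
EXPECTATIONS = {
    "change": "new batching config",
    "recorded_on": date.today().isoformat(),
    "simulation": {"metric": "cost_per_order", "direction": "down",
                   "rough_size": "10-20%"},
    "store_comparison": {"metric": "cost_per_order", "direction": "down",
                         "rough_size": "smaller than simulation"},
    "case_audit": {"mechanism": "more orders batched with a nearby order",
                   "expected_hit_rate": 0.7},
}
```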
Conclusion
Demanding proof in noisy environments isn't neutral. It's a policy choice to default to inaction.
The question was never "can you prove it?" The question is "is the case credible enough to act?"
Triangulation isn't a fallback for teams that can't run proper experiments. It's how knowledge works when clean experiments aren't available. It's how the best operations teams ship improvements at scale instead of waiting for certainty that never comes.
Build the case. Show the convergence. Act on it. Then watch what happens and update.
That's not a lower standard than proof. In high-noise environments, it's the only honest standard there is.