Your A/B Test Won. But Did It Actually Win?
By Jason Thompson | 33 Sticks
Here's a scenario. An airline runs a mobile checkout optimization test. The hypothesis is clean. Fewer steps, fewer form fields, more bookings. It's straight out of an "A/B Best Practices" handbook.
They simplify the payment flow. Strip out friction. Remove anxiety. All the things that produce winning variations. And what do you know, the variant wins in about two weeks. It's statistically significant, strong effect size, the whole package. Mobile revenue goes up by an estimated $6.2 million annually.
The optimization team is thrilled. The product team is thrilled. The CMO sees the dashboard and is definitely thrilled. Test gets shipped to 100%.
Everybody moves on to the next test.
The number nobody checked.
Six weeks later, someone on the loyalty team notices enrollment numbers are soft. Not catastrophically so, just consistently below forecast, month after month, by a margin that's hard to explain with seasonality.
They dig in. The culprit is the checkout simplification. One of the "friction" steps that got removed was the loyalty enrollment prompt. It wasn't a hard gate, just an inline offer during checkout. Sign up, earn miles on this booking, takes ten seconds. But in the streamlined variant, it was gone.
The math is ugly. About 41,000 fewer loyalty enrollments per year.
$6.2 million versus...what?
This is where most optimization teams hit a wall. They have an extremely precise number on one side, that $6.2M in incremental mobile bookings, and a vague, uncomfortable question on the other side, "What's a loyalty enrollment actually worth?"
It's not zero. Airlines have entire business units built on the premise that loyalty members fly more often, book directly instead of through OTAs, carry branded credit cards, and forgive operational failures that would send a casual customer to a competitor. The lifetime value gap between a loyalty member and a non-member is substantial. Everyone in the building knows this.
But "substantial" doesn't show up in a test results dashboard. What shows up is the $6.2M.
So the test stays shipped. And the loyalty team quietly adjusts their forecast downward and starts looking for other enrollment channels to make up the gap.
The measurement blind spot.
I'm not writing this to argue the test was wrong. Maybe $6.2M in immediate revenue genuinely outweighs 41,000 loyalty enrollments. That's a reasonable position to hold if you've done the math.
The problem is that almost nobody does the math. And I mean specifically this math: what is the expected downstream value of the thing you just traded away?
Most A/B testing programs are built to measure one direction, the variant's impact on the primary metric. Bookings went up. Revenue went up. Conversion rate went up. The test won. But "won" assumes the primary metric is the only thing that moved, and that everything else stays constant.
It almost never does.
When you change a checkout flow, you're not just changing conversion rate. You're changing the mix of what happens during conversion. Which upsells get seen. Which enrollment prompts get triggered. Which post-purchase flows get initiated. Every simplification removes something, and the thing you removed had a reason for being there, even if that reason belonged to a different team's KPI.
The LTV problem.
The fundamental issue is that A/B testing platforms measure transactions, not relationships. They're very good at telling you whether variant B produced more bookings than variant A over a two-week window. They are terrible at telling you whether the customers who booked through variant B will come back at the same rate, spend at the same level, and engage with the same depth as the customers who booked through variant A.
For the airline, the real question isn't "$6.2M versus 41K enrollments." It's "$6.2M this year versus some unknown amount of revenue over the next five years from people who would have become loyal customers but now won't."
That's not a number you can pull from a testing platform. It requires:
Return frequency data. Do loyalty members actually book more often? By how much? Over what time horizon?
Revenue per trip. Do members spend more per booking than non-members? Do they buy upgrades, seat selections, bags at a different rate?
Channel mix. Are members more likely to book direct versus through an OTA? What's the margin difference?
Retention curves. How long does a loyalty member stay active? What's the decay rate?
Acquisition cost offset. If the enrollment prompt goes away, what does it cost to acquire those same members through other channels — email campaigns, paid media, in-flight prompts?
Most optimization teams don't have access to this data. The data surely exists but often it lives in a different system, owned by a different team, measured on a different cadence. The testing platform says "you won." The loyalty database says something more complicated. And the two systems don't talk to each other.
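But when you can get rough versions of those inputs into one place, the combination itself isn't complicated. Here's a minimal sketch in Python; every figure, field name, and assumption in it is invented for illustration, not pulled from any real loyalty program:

```python
# A rough, illustrative way to combine the loyalty-side inputs above into a
# per-member value gap. Every number below is invented for the sketch; the
# point is the shape of the calculation, not the figures.

YEARS = 5                       # horizon for the comparison
OTA_COMMISSION = 0.03           # margin lost on bookings made through an OTA

def five_year_value(trips_per_year, revenue_per_trip, direct_share, retention):
    """Sum margin-adjusted revenue over the horizon, decayed by retention each year."""
    margin_factor = 1 - OTA_COMMISSION * (1 - direct_share)
    return sum(trips_per_year * revenue_per_trip * margin_factor * retention ** year
               for year in range(YEARS))

# Return frequency, revenue per trip, channel mix, retention (all hypothetical)
member_value = five_year_value(trips_per_year=2.1, revenue_per_trip=290,
                               direct_share=0.75, retention=0.80)
nonmember_value = five_year_value(trips_per_year=1.5, revenue_per_trip=260,
                                  direct_share=0.40, retention=0.60)

# Acquisition cost offset: what it would cost to recruit that member another way
replacement_cac = 25

incremental_value_per_enrollment = member_value - nonmember_value + replacement_cac
print(f"rough incremental value per enrollment: ${incremental_value_per_enrollment:,.0f}")
```

The output isn't the point. The point is that a rough, defensible number can exist, and "substantial" stops being the best answer anyone has.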
What this looks like in practice.
I've seen versions of this trade-off at multiple companies, across industries. The specifics change but the pattern is consistent:
A lead generation test removes a newsletter signup from the conversion path. Form completions go up 15%. Marketing loses 40,000 email subscribers a year and has to backfill with paid acquisition at $3-5 per subscriber.
An ecommerce test streamlines the cart page by removing the loyalty program callout. AOV holds steady, conversion rate ticks up. But repeat purchase rate drops by a few points over the next two quarters, and nobody connects it to the test because the test ended months ago.
A subscription service tests a simplified onboarding flow. Trial-to-paid conversion improves. But the users who converted through the simplified flow churn at a higher rate because they skipped the steps that built habit and understanding.
In every case, the test "won" on its primary metric. In every case, there was a secondary cost that was either unmeasured, measured too late, or measured by a team that wasn't in the room when the results were reviewed.
What to actually do about this.
I'm not suggesting you stop simplifying checkout flows or that every test needs a five-year LTV analysis before shipping. That would grind any program to a halt.
But there are practical things an optimization team can do to avoid trading dollars today for dollars tomorrow without knowing it.
Name what you're removing, not just what you're changing. Every simplification test should have an explicit list of what the streamlined variant takes out. Not just "fewer form fields" but which fields, and what did those fields feed? If a field populates a loyalty enrollment, a personalization signal, or a downstream trigger, someone needs to know.
Identify the losers before you ship. When a test wins, ask: is there a team in this building whose metric just got worse? If the answer is "I don't know," that's a problem. You don't need to solve it before shipping, but you need to name it.
Build a secondary metric watchlist. Most testing platforms support secondary metrics. Use them. If your test touches checkout, add loyalty enrollment rate, email opt-in rate, and post-purchase engagement as secondary metrics. You're not optimizing for them, you're watching them. If they move meaningfully in the wrong direction, you have a conversation to have before you ship to 100%.
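If your platform makes guardrail metrics awkward, the watchlist check is simple enough to run yourself on exported counts. Here's a minimal sketch, assuming you can pull per-variant visitor and enrollment counts; the figures and the threshold are placeholders:

```python
# A minimal guardrail check on a secondary metric, run on counts exported from
# your testing platform. Uses a two-proportion z-test; metric, counts, and the
# significance threshold are placeholders.
from math import sqrt
from statistics import NormalDist

def guardrail_check(successes_a, n_a, successes_b, n_b, alpha=0.05):
    """Return the relative change and whether a drop in the variant is statistically credible."""
    p_a = successes_a / n_a          # control rate (e.g., loyalty enrollment rate)
    p_b = successes_b / n_b          # variant rate
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = NormalDist().cdf(z)    # one-sided: "variant is worse than control"
    relative_change = (p_b - p_a) / p_a
    return relative_change, p_value, p_value < alpha

# Hypothetical example: enrollment prompt removed in the variant
rel, p, flagged = guardrail_check(successes_a=4_100, n_a=100_000,
                                  successes_b=2_900, n_b=100_000)
print(f"enrollment rate change: {rel:+.1%}, p={p:.4f}, guardrail breached: {flagged}")
```

The one-sided test is deliberate: for a guardrail, the question isn't "did it move," it's "did it get worse."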
Run the back-of-envelope LTV calculation. You don't need a perfect model. You need a rough one. If your test kills 41,000 loyalty enrollments and a loyalty member is worth even $175 more than a non-member over their lifetime, that's $7.2M in future value you just traded for $6.2M in current revenue. The rough number is enough to trigger the right conversation.
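That arithmetic fits in a few lines. Here's the same calculation using this post's numbers, plus the breakeven value that tells you how much a member has to be worth before the win stops paying for itself:

```python
# The trade-off arithmetic from this post. The $175 is the same conservative
# per-member assumption used above; swap in your own numbers.
incremental_revenue = 6_200_000      # annual lift from the winning variant
lost_enrollments = 41_000            # annual enrollments the variant removes
value_per_member = 175               # assumed incremental lifetime value per member

future_value_lost = lost_enrollments * value_per_member
breakeven_value = incremental_revenue / lost_enrollments

print(f"future value traded away: ${future_value_lost:,.0f}")      # $7,175,000
print(f"breakeven value per enrollment: ${breakeven_value:,.2f}")  # about $151
```

If a loyalty member is worth more than roughly $151 in future value, the trade is already underwater.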
Close the loop after shipping. Set a calendar reminder for 60 and 90 days post-ship. Check the downstream metrics. If loyalty enrollment recovered because the team found another touchpoint, great. If it didn't, the trade-off is real and somebody should be making a conscious decision about whether to keep it.
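The follow-up doesn't need to be elaborate either. Here's a sketch of the 60- and 90-day look, assuming you have monthly enrollment actuals and a pre-ship forecast you can query; the months, counts, and tolerance are all made up:

```python
# A sketch of the post-ship check: flag months where the downstream metric runs
# materially below its pre-ship forecast. Names, dates, and tolerance are placeholders.
def post_ship_check(actuals, forecast, tolerance=0.05):
    """Return months where actuals fall more than `tolerance` below forecast."""
    gaps = []
    for month, expected in forecast.items():
        actual = actuals.get(month)
        if actual is None:
            continue
        shortfall = (expected - actual) / expected
        if shortfall > tolerance:
            gaps.append((month, actual, expected, shortfall))
    return gaps

# Hypothetical figures for the two months after the 100% rollout
forecast = {"2024-07": 3_500, "2024-08": 3_400}
actuals  = {"2024-07": 2_600, "2024-08": 2_700}

for month, actual, expected, shortfall in post_ship_check(actuals, forecast):
    print(f"{month}: {actual:,} enrollments vs {expected:,} forecast ({shortfall:.0%} below)")
```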
The real test.
An optimization program's maturity isn't measured by win rate or revenue impact. It's measured by whether the team can answer a simple question: what did we trade away to get that result?
If the answer is "nothing" every single time, you're not optimizing. You're just measuring the part of the equation that makes you look good.
The $6.2M is real. The 41,000 lost enrollments are real too. The question is whether anyone in the room is willing to hold both numbers at the same time and make a decision with their eyes open.
That's the test that actually matters.
Jason Thompson is the CEO of 33 Sticks, an analytics consulting boutique that helps companies understand what their data is actually telling them and what it's leaving out. If your optimization program is measuring wins without measuring trade-offs, let's talk.