Run A/B tests you can actually trust
Most 'winners' are noise called too early. Honest testing is the whole point.
In short
- Only 20% of 28,304 real experiments ever hit 95% significance, so most 'winners' were called too early.
- Roughly 1 in 7 tests produces a real winner. If a tool says you're winning every time, it's lying.
- Every time you peek at a running test and call it, you raise your odds of a false positive. Set the finish line before you start.
Chart marker: the 95% significance line. No winner is called before it.
A "winner" you saw on day one isn't a winner. It's a sample size that hasn't argued back yet. In an analysis of 28,304 experiments run by Convert customers, only 20% ever crossed the 95% significance threshold, which means any 'winner' called in the other four out of five was a guess wearing a percentage sign. The whole job of an honest tool is to tell you when you don't know yet.
What's the problem?
You've been burned by tools that declared a 'winner' early: you shipped it, and conversion didn't actually improve. After that, you can't trust the next result the tool hands you either.
Why does this happen?
- Tools call winners before reaching significance.
- Results aren't segmented, so a mobile loss hides inside a desktop win.
- There's pressure to show a quick result over a true one.
- Early data swings hard and then regresses. The first 50 visitors split unevenly by pure chance, so a variant can show +20% on Monday and -3% by Friday. The lift you saw early wasn't real signal; it was the spread you a…
- Stopping a test the moment it 'looks significant' inflates your false-positive rate. Peeking at a running test and calling it whenever the line crosses 95% is the most common way to manufacture fake winners. Every extr…
- Most variants genuinely don't beat the control, and that's the math, not your failure. VWO's own data puts the rate of meaningful winners at roughly 1 in 7 tests. A tool that declares a winner on most of your tests is l…
- Low-traffic stores get strung along the longest. If you're doing a few hundred checkouts a week, a real 5% lift needs weeks of data to confirm, so a tool that promises a fast answer is either ignoring significance or t…
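To see why peeking manufactures fake winners, here is a minimal simulation sketch. It runs A/A tests (both variants share the same true conversion rate, so every 'winner' is false by construction) and compares calling the test at the first checkpoint that crosses 95% against checking once at the pre-set end. The function names, the 5% conversion rate, the traffic figures and the pooled z-test are all assumptions chosen for illustration, not any tool's actual method.

```python
import random
from statistics import NormalDist


def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:  # no conversions yet (or all converted): nothing to compare
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(peek, n_tests=500, visitors=2000, rate=0.05,
                        check_every=100, seed=1):
    """Share of A/A tests that get a 'winner' called at 95% significance.

    peek=True  -> stop at the first checkpoint where p < 0.05
    peek=False -> look exactly once, after the last visitor
    """
    rng = random.Random(seed)
    false_wins = 0
    for _ in range(n_tests):
        conv_a = conv_b = 0
        for i in range(1, visitors + 1):
            # Both variants convert at the same true rate (hypothetical 5%).
            conv_a += int(rng.random() < rate)
            conv_b += int(rng.random() < rate)
            if peek:
                is_checkpoint = i % check_every == 0
            else:
                is_checkpoint = i == visitors
            if is_checkpoint and z_test_p_value(conv_a, i, conv_b, i) < 0.05:
                false_wins += 1
                break
    return false_wins / n_tests


if __name__ == "__main__":
    print(f"Look once at the end: {false_positive_rate(peek=False):.1%}")
    print(f"Peek every 100 visitors: {false_positive_rate(peek=True):.1%}")
```

With one look at the end, the false-positive rate should land near the 5% you signed up for. With 20 looks per test, it typically comes out several times higher, even though nothing about the variants changed.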
What does the research show?
Independent research
Figures below are from independent studies, not StorePilot data. They're why this problem is worth testing on your own store.
- In an analysis of 28,304 experiments run by Convert customers, only 20% ever reached the 95% statistical-significance threshold, so most stores never gather enough traffic to call a clear winner. (Source: Convert)
- Only about 1 in 7 A/B tests (~14%) produces a meaningful winning variation that actually lifts conversions, and most variations never beat the original at all. (Source: VWO)
- Better checkout design alone can raise the average large ecommerce site's conversion rate by roughly 35%, so there's real upside worth testing honestly for. (Source: Baymard Institute, E-Commerce Checkout Usability research)
How does StorePilot AI fix it?
- StorePilot enforces minimum-traffic and significance thresholds and never declares early winners.
- It reports honestly, including 'Variant B +12% but not enough data yet', and segments by device and visitor type.
- Every result carries a recommended decision (publish B / keep A / split-ship per device) with one clear action.
How do you fix it, step by step?
- Set the sample size before you launch
  Decide up front how many visitors (or conversions) per variant the test needs to detect a realistic lift, based on your current conversion rate and traffic. If you can't reach that number in a few weeks, the test is too ambitious for your traffic; pick a bigger swing or a higher-traffic page.
- Pick one primary metric and commit to it
  Choose revenue per visitor or conversion rate as the single metric that decides the test, before you see any data. Tracking ten metrics and celebrating whichever one turns green is how you find a 'win' in pure noise.
- Stop peeking. Let it run to the pre-set finish
  Don't call the test the moment it crosses 95% on day two. Run it to the sample size and the minimum duration you set (at least one full business cycle, usually 1-2 weeks), so weekday/weekend and payday traffic are all represented.
- Read the result by segment before you publish
  Split the outcome by device at minimum. A +8% desktop win can hide a mobile loss, and if most of your traffic is mobile, the blended number can point you exactly the wrong way.
- Accept 'no difference' as a valid, useful outcome
  If the variants finish statistically tied, that's a real answer: this change doesn't matter, keep the simpler version and move on. Forcing a winner out of a flat result is how you ship changes that quietly do nothing.
- Ship the winner, then verify it held
  After publishing the winning variant, watch the live conversion rate for a couple of weeks to confirm the lift shows up in reality. If it evaporates, the original 'win' was noise and you've just learned to trust the number less, not more.
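The first step, setting the sample size before launch, can be sketched with the standard two-proportion approximation. This is a rough planning aid under stated assumptions, not a definitive calculator: the function names, the 80% power default, the even traffic split and the example figures are all hypothetical choices made for illustration.

```python
import math
from statistics import NormalDist


def visitors_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant to detect a relative lift.

    Uses the normal approximation for comparing two proportions:
    n = (z_alpha/2 + z_beta)^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 95% significance by default
    z_beta = NormalDist().inv_cdf(power)           # assumed 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def weeks_needed(n_per_variant, weekly_visitors, variants=2, min_weeks=1):
    """Weeks to reach the sample size, never less than the minimum duration."""
    weeks = math.ceil(n_per_variant * variants / weekly_visitors)
    return max(min_weeks, weeks)


if __name__ == "__main__":
    # Hypothetical store: 2.5% conversion rate, 10,000 visitors a week.
    for lift in (0.05, 0.10, 0.20):
        n = visitors_per_variant(baseline_rate=0.025, relative_lift=lift)
        print(f"+{lift:.0%} lift: {n:,} visitors per variant, "
              f"about {weeks_needed(n, weekly_visitors=10_000)} weeks")
```

The required sample grows roughly with the inverse square of the lift, so halving the lift you want to detect about quadruples the traffic you need. That's the arithmetic behind "pick a bigger swing or a higher-traffic page".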
An illustrative example
Demo data
- What StorePilot detects: A variant looks +12% after a day, but the sample is far too small to trust.
- The fix it builds & tests: StorePilot holds the call until significance, then recommends a clear decision.
- The projected outcome: Example: 'Variant B, 96% confidence, +8.4% revenue/visitor, recommend publish.' (Illustrative wording of an honest result.)
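Here is one way to sanity-check a day-one "+12%" yourself. The sketch below puts a 95% confidence interval around the difference in conversion rates and only names a leader when the interval excludes zero. The visitor and conversion counts are made up for illustration, `read_result` is a hypothetical helper, and this is not a description of how StorePilot computes its results.

```python
from statistics import NormalDist


def read_result(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Return (relative lift of B over A, CI on the rate difference, verdict).

    Assumes variant A has at least one conversion, so the lift is defined.
    """
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    diff = rate_b - rate_a
    # Unpooled standard error of the difference between two proportions.
    se = (rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b) ** 0.5
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    low, high = diff - z * se, diff + z * se
    if low > 0:
        verdict = "B is ahead"
    elif high < 0:
        verdict = "A is ahead"
    else:
        verdict = "not enough data yet"
    return diff / rate_a, (low, high), verdict


if __name__ == "__main__":
    # Hypothetical day one: 25 of 1,000 convert on A, 28 of 1,000 on B.
    lift, (low, high), verdict = read_result(25, 1000, 28, 1000)
    print(f"Variant B {lift:+.0%}, difference between "
          f"{low:+.2%} and {high:+.2%} points: {verdict}")
```

With these counts the interval runs from well below zero to well above it, so the honest readout is "not enough data yet". The same rates held over a hundred times the traffic would clear the bar.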
Key takeaways
- Only 20% of 28,304 real experiments ever hit 95% significance, so most 'winners' were called too early.
- Roughly 1 in 7 tests produces a real winner. If a tool says you're winning every time, it's lying.
- Every time you peek at a running test and call it, you raise your odds of a false positive. Set the finish line before you start.
- A flat result is a real answer. Don't manufacture a winner out of noise.
This guide is part of the StorePilot CRO for Shopify playbook. If this is costing you sales, look at 'Run real CRO tests on a low-traffic store' and 'Stop using one layout for two different audiences' next.