Run A/B tests you can actually trust

Most 'winners' are noise called too early. Honest testing is the whole point.

Reviewed by Misha Gavura, Senior CRO Specialist · EVDEV Top Rated Plus · Last updated June 1, 2026

In short

Only 20% of 28,304 real experiments ever hit 95% significance, so any 'winner' declared in the other 80% was called without the evidence to back it.
Roughly 1 in 7 tests produces a real winner. If a tool says you're winning every time, it's lying.
Every time you peek at a running test and call it, you raise your odds of a false positive. Set the finish line before you start.

A "winner" you saw on day one isn't a winner. It's a sample size that hasn't argued back yet. In an analysis of 28,304 experiments run by Convert customers, only 20% ever crossed the 95% significance threshold, which means that in four out of five tests, any winner that got called was a guess wearing a percentage sign. The whole job of an honest tool is to tell you when you don't know yet.

What's the problem?

You've been burned by tools that declared a 'winner' early: you shipped it, and conversion didn't actually improve. Once that happens, you stop trusting any result the tool shows you.

Why does this happen?

Tools call winners before reaching significance.
Results aren't segmented, so a mobile loss hides inside a desktop win.
There's pressure to show a quick result over a true one.
Early data swings hard and then regresses. The first 50 visitors split unevenly by pure chance, so a variant can show +20% on Monday and -3% by Friday. The lift you saw early wasn't real signal; it was the spread you a…
Stopping a test the moment it 'looks significant' inflates your false-positive rate. Peeking at a running test and calling it whenever the line crosses 95% is the most common way to manufacture fake winners. Every extr…
Most variants genuinely don't beat the control, and that's the math, not your failure. VWO's own data puts the rate of meaningful winners at roughly 1 in 7 tests. A tool that declares a winner on most of your tests is l…
Low-traffic stores get strung along the longest. If you're doing a few hundred checkouts a week, a real 5% lift needs weeks of data to confirm, so a tool that promises a fast answer is either ignoring significance or t…
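
To see how much peeking costs, here is a minimal simulation sketch. It runs A/A tests, where both arms share the same true conversion rate and any 'winner' is false by construction, and compares checking every day against checking once at the finish. The traffic numbers, the 5% conversion rate, the pooled z-test, and all function names are illustrative assumptions, not StorePilot's method.

```python
import random
from statistics import NormalDist

def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0  # no conversions (or all conversions): nothing to compare
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(peek, runs=500, days=14, visitors_per_day=100,
                        rate=0.05, seed=7):
    """Share of A/A tests (no real difference) that get declared 'significant'."""
    rng = random.Random(seed)
    false_wins = 0
    for _ in range(runs):
        conv_a = conv_b = n = 0
        for day in range(1, days + 1):
            # Both arms draw from the SAME conversion rate.
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_day))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_day))
            n += visitors_per_day
            # Peeking: test every day. Honest: test only on the last day.
            check_now = peek or day == days
            if check_now and z_test_p_value(conv_a, n, conv_b, n) < 0.05:
                false_wins += 1
                break
    return false_wins / runs

if __name__ == "__main__":
    print("call it once at the finish:", false_positive_rate(peek=False))
    print("call it on any day it crosses 95%:", false_positive_rate(peek=True))
```

Under these assumptions, the fixed-finish rate should sit near the 5% you signed up for, while the peek-every-day rate should come out well above it, even though nothing about the variants ever differed.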

What does the research show?

Independent research

Figures below are from independent studies, not StorePilot data. They're why this problem is worth testing on your own store.

In an analysis of 28,304 experiments run by Convert customers, only 20% ever reached the 95% statistical-significance threshold, so most stores never gather enough traffic to call a clear winner.
Source: Convert
Only about 1 in 7 A/B tests (~14%) produces a meaningful winning variation that actually lifts conversions, and most variations never beat the original at all.
Source: VWO
Better checkout design alone can raise the average large ecommerce site's conversion rate by roughly 35%, so there's real upside worth testing honestly for.
Source: Baymard Institute, E-Commerce Checkout Usability research

How does StorePilot AI fix it?

StorePilot enforces minimum-traffic and significance thresholds and never declares early winners.
It reports honestly, including 'Variant B +12% but not enough data yet', and segments by device and visitor type.
Every result carries a recommended decision (publish B / keep A / split-ship per device) with one clear action.

How do you fix it, step by step?

Set the sample size before you launch

Decide up front how many visitors (or conversions) per variant the test needs to detect a realistic lift, based on your current conversion rate and traffic. If you can't reach that number in a few weeks, the test is too ambitious for your traffic; pick a bigger swing or a higher-traffic page.
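
As a rough sketch of that up-front calculation, the function below uses the standard normal-approximation formula for comparing two conversion rates. The 5% significance level, 80% power, and the function name are assumptions chosen for illustration; a different calculator may give somewhat different numbers.

```python
from math import ceil, sqrt
from statistics import NormalDist

def visitors_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Visitors needed in EACH variant to detect the given relative lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

if __name__ == "__main__":
    # Hypothetical store: 3% baseline conversion rate.
    for lift in (0.05, 0.10, 0.20):
        print(f"+{lift:.0%} lift: {visitors_per_variant(0.03, lift):,} visitors per variant")
```

With a 3% baseline, detecting a 10% relative lift takes on the order of 50,000 visitors per variant under these assumptions, and halving the lift you want to detect roughly quadruples that number. That is why a small swing on a low-traffic page is often not a test you can finish.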

Pick one primary metric and commit to it

Choose revenue per visitor or conversion rate as the single metric that decides the test, before you see any data. Tracking ten metrics and celebrating whichever one turns green is how you find a 'win' in pure noise.
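
The arithmetic behind that warning is short. Assuming each metric is checked at the 5% level and the metrics are independent (a simplification, since real store metrics are usually correlated), the chance that at least one turns green by luck alone is:

```python
def chance_of_false_win(metrics, alpha=0.05):
    """Probability that at least one of `metrics` independent checks is a false positive."""
    return 1 - (1 - alpha) ** metrics

if __name__ == "__main__":
    for count in (1, 3, 10):
        print(f"{count} metric(s): {chance_of_false_win(count):.0%} chance of a false 'win'")
```

With ten metrics that works out to about 40%, against 5% for a single pre-committed metric.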

Stop peeking. Let it run to the pre-set finish

Don't call the test the moment it crosses 95% on day two. Run it to the sample size and the minimum duration you set (at least one full business cycle, usually 1-2 weeks), so weekday/weekend and payday traffic are all represented.

Read the result by segment before you publish

Split the outcome by device at minimum. A +8% desktop win can hide a mobile loss, and if most of your traffic is mobile, the blended number can point you exactly the wrong way.
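
Here is a small sketch of that breakdown. The visitor and conversion counts are made up for illustration, and the function names are hypothetical.

```python
def relative_lift(conv_a, n_a, conv_b, n_b):
    """Relative change in conversion rate from A to B."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    return (rate_b - rate_a) / rate_a

def lift_by_segment(segments):
    """Return per-segment lifts plus the blended lift across all segments."""
    report = {}
    totals = [0, 0, 0, 0]  # conversions A, visitors A, conversions B, visitors B
    for name, counts in segments.items():
        report[name] = relative_lift(*counts)
        totals = [t + c for t, c in zip(totals, counts)]
    report["blended"] = relative_lift(*totals)
    return report

# Hypothetical counts: (conversions A, visitors A, conversions B, visitors B)
segments = {
    "desktop": (40, 1000, 50, 1000),
    "mobile": (60, 3000, 54, 3000),
}
report = lift_by_segment(segments)

if __name__ == "__main__":
    for name, lift in report.items():
        print(f"{name}: {lift:+.1%}")
```

In this made-up data the blended result reads as a modest win for B, while the mobile segment, which carries three quarters of the traffic, is actually down. Each segment is a smaller sample than the whole test, so a segment-level difference needs its own significance check before you act on it.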

Accept 'no difference' as a valid, useful outcome

If the variants finish statistically tied, that's a real answer: this change doesn't matter, keep the simpler version and move on. Forcing a winner out of a flat result is how you ship changes that quietly do nothing.
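
One way to make 'statistically tied' concrete is a confidence interval on the difference in conversion rates: if the interval includes zero, the data can't tell the variants apart. This is a minimal sketch using the normal approximation; the function names and sample counts are assumptions for illustration.

```python
from statistics import NormalDist

def difference_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Confidence interval for (rate B - rate A), normal approximation."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    se = (rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b) ** 0.5
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = rate_b - rate_a
    return diff - z * se, diff + z * se

def verdict(low, high):
    """Read the interval: only call a side when zero is excluded."""
    if low > 0:
        return "B is ahead"
    if high < 0:
        return "A is ahead"
    return "no detectable difference"

if __name__ == "__main__":
    # Hypothetical finished test: 3.0% vs 3.1% on 10,000 visitors each.
    low, high = difference_interval(300, 10_000, 310, 10_000)
    print(f"difference between {low:+.2%} and {high:+.2%}: {verdict(low, high)}")
```

In that example the interval straddles zero, so the honest reading is 'no detectable difference', and the simpler version stays.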

Ship the winner, then verify it held

After publishing the winning variant, watch the live conversion rate for a couple of weeks to confirm the lift shows up in reality. If it evaporates, the original 'win' was noise and you've just learned to trust the number less, not more.

An illustrative example

Demo data

What StorePilot detects: A variant looks +12% after a day, but the sample is far too small to trust.
The fix it builds & tests: StorePilot holds the call until significance, then recommends a clear decision.
The projected outcome: Example: 'Variant B, 96% confidence, +8.4% revenue/visitor, recommend publish.' (Illustrative wording of an honest result.)

Key takeaways

Only 20% of 28,304 real experiments ever hit 95% significance, so any 'winner' declared in the other 80% was called without the evidence to back it.
Roughly 1 in 7 tests produces a real winner. If a tool says you're winning every time, it's lying.
Every time you peek at a running test and call it, you raise your odds of a false positive. Set the finish line before you start.
A flat result is a real answer. Don't manufacture a winner out of noise.

This guide is part of the StorePilot CRO for Shopify playbook. If this is costing you sales, look at 'Run real CRO tests on a low-traffic store' and 'Stop using one layout for two different audiences' next.

Related guides

Run real CRO tests on a low-traffic store

Low traffic shouldn't trap you in 'not enough data yet' forever. There's a better method.

Stop using one layout for two different audiences

Mobile and desktop shoppers behave differently. One layout can't be best for both.

Optimize for revenue per visitor, not just conversion rate

A higher conversion rate can still mean less money. Revenue per visitor is the honest north star.

Founding-merchant offer

$129/mo · Free while we're in founding launch

Fix this on your store, free right now.

Sign up now and StorePilot is free through the end of summer. We set it up on your store, run the first honest test on your real traffic, and don't ship anything without you.

Free for founding merchants through September 23, 2026.

Free through the end of summer. Everything unlocked: no card, no limits, no catch.
Done-for-you setup. We install and configure StorePilot for your store and catalog.
Expert-reviewed first tests. Misha Gavura checks your first A/B tests by hand before they ship.
A real human, in ~14 minutes. Direct support from the team, not a chatbot.

Founding price, locked for life. When paid plans turn on, you keep a permanent founding rate that never goes up.
Every new feature, included. Founding members are grandfathered into everything we ship next, at no extra cost.
Founding-member priority support. A direct line to the team for as long as you run StorePilot.

Real people, not a black box

Misha Gavura

Senior CRO · EVDEV

Top Rated Plus · Upwork

“I set StorePilot up on your store myself and review your first A/B tests by hand: the setup, the stats, the call, before anything ships. Founding merchants get me directly.”

Plus the full team behind your store

Never miss a revenue leak

We ping you the moment there's a new opportunity worth testing, with the projected dollars. No dashboard to babysit.

Claim your founding spot

No credit card
Fully reversible
Cancel anytime

Founding deal for the first stores to install.

Frequently asked questions

Why won't StorePilot just give me a fast answer?

Because a fast wrong answer costs you real money. StorePilot optimizes for decisions you can trust, with the timeline shown up front.

How long should I run a Shopify A/B test before trusting it?

Run it until it reaches the sample size you calculated AND covers at least one full week (usually one to two weeks minimum) so weekend, weekday, and payday traffic are all in the data. Duration alone isn't enough; a low-traffic store may need several weeks to gather enough conversions to call anything.
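
Both conditions can be folded into one planning number. The sketch below takes whichever is longer, the time to reach the sample size or the minimum duration, and rounds up to whole weeks so every weekday is represented equally. The rounding rule and the function name are assumptions, not a fixed standard.

```python
from math import ceil

def planned_duration_days(required_per_variant, daily_visitors, variants=2, min_days=7):
    """Days to run: enough for the sample size AND the minimum, in whole weeks."""
    days_for_sample = ceil(required_per_variant * variants / daily_visitors)
    days = max(days_for_sample, min_days)
    return ceil(days / 7) * 7  # round up to full weeks

if __name__ == "__main__":
    # Hypothetical stores needing 5,000 visitors per variant.
    print(planned_duration_days(5_000, daily_visitors=1_500))  # busy store
    print(planned_duration_days(5_000, daily_visitors=400))    # quieter store
```

With those hypothetical numbers, the busy store finishes in one week and the quieter one needs four, for the same test.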

What's a good statistical significance level for store experiments?

95% confidence is the standard bar. It means that if the variants truly performed the same, a difference this large would show up by chance only about 5% of the time. Below that you don't have a result, you have a hint, and per Convert's data, only about 20% of real experiments ever even reach 95%.

Can I run an A/B test with low traffic?

You can, but you can only detect large effects. A few hundred sessions a week means small lifts will never reach significance in a reasonable window, so test bold, high-impact changes instead of button colors, or expect to run for many weeks.
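
To put a number on 'large', you can flip the sample-size question around and ask what the smallest lift is that your traffic could detect. This sketch uses a common normal-approximation shortcut that applies the baseline variance to both variants, with 5% significance and 80% power assumed; the function name and store numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def minimum_detectable_lift(baseline_rate, visitors_per_variant, alpha=0.05, power=0.80):
    """Smallest relative lift the test can reliably detect (approximation)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    absolute = z * sqrt(2 * baseline_rate * (1 - baseline_rate) / visitors_per_variant)
    return absolute / baseline_rate

if __name__ == "__main__":
    # Hypothetical store: 3% baseline, 300 sessions a week split across two variants.
    for weeks in (4, 12, 26):
        per_variant = 300 * weeks // 2
        lift = minimum_detectable_lift(0.03, per_variant)
        print(f"{weeks} weeks ({per_variant} per variant): detects lifts of about +{lift:.0%} or more")
```

Under these assumptions, four weeks at 300 sessions a week can only confirm a change that nearly doubles conversion. That is the case for bold changes over button colors on a low-traffic store.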

Why did my A/B test winner stop working after I shipped it?

Almost always because it was never a real winner. It was called before reaching significance, so the 'lift' was random variation that regressed to the mean once it went live. This is exactly why honest testing holds the call until the numbers settle and then verifies the result post-launch.