Theme Testing · A/B Testing · 2026
Shopify theme A/B testing: how to test a theme without fooling yourself
You can split traffic between two themes with native Rollouts, an app like Shoplift or Convert, or a DIY duplicate-theme setup. Here is when a theme test is worth running, what it takes in traffic, what to measure, and the pitfalls that quietly ruin the read.
You can A/B test a Shopify theme three ways: natively with Shopify Rollouts, with a testing app like Shoplift, Convert, or Intelligems, or with a duplicate-theme split you wire up yourself. Rollouts is the sane default if you are on the Grow plan or higher, and the rest of this playbook covers the two harder questions underneath: whether a whole-theme test is the right unit at all, and whether your traffic can support one.
The splitter used to be the hard part. It no longer is. What still decides whether a theme test earns its runtime is the judgment around it: picking the right unit of test, doing the sample-size arithmetic before launch, measuring revenue instead of clicks, and resisting the dashboard until the test is actually done. That is what this guide is for.
Can you A/B test a theme on Shopify?
TL;DR Yes, three ways: native Rollouts (Grow plan or higher), a testing app, or a DIY duplicate-theme split. Every method routes visitors between your published theme and a copy, then compares what each group spent.
Yes. Shopify allows only one published theme per store, so every testing method works the same way underneath: it keeps a modified copy of your theme, assigns each visitor to the live version or the copy, holds that assignment steady, and compares the two groups on revenue.
For years the hard part was the splitter itself. Shopify had no native one, and the free fallback many small stores leaned on, Google Optimize, shut down on September 30, 2023 (Google Analytics Help) with no replacement. That gap closed in 2026 when Shopify shipped Rollouts, its own server-side splitter, with the Spring '26 Edition.
| Route | Who splits the traffic | Cost | Reach for it when |
|---|---|---|---|
| Shopify Rollouts | Shopify's servers, before the page renders | No add-on fee; the experiment needs the Grow plan or higher (about $105/mo, verify live) | You are on Grow or higher, can build the variant theme, and will judge the numbers yourself |
| A testing app | The app (Shoplift, Convert, Intelligems) | Roughly $79 to $399/mo entry tiers (verify live) | You want a significance engine, template-level tests, or money tests alongside |
| DIY duplicate theme | A cookie and redirect script you write | Developer time | You are below the Grow plan, have dev help, and accept the fragility |
When is a theme test actually worth running?
TL;DR Run a whole-theme test when the decision is switch or stay: a redesign, a purchased theme, a migration to a new default. For one section or element, test smaller and get a cleaner read.
A theme test is the largest unit of test Shopify offers. It bundles every difference between two themes into a single comparison, which is exactly right when the decision in front of you is "do we switch themes," and wrong for nearly everything else.
The bundling cuts both ways. When the new theme wins, you learn that the package won, and nothing about which of the fifty changes inside it did the work. When it loses, you learn even less, because one broken template can drag down ten good changes while the scoreboard shows a single blended number. Redesigns that drag conversion down are a familiar merchant story for exactly this reason: nobody split the traffic, the numbers dipped after launch, and there was no way to tell which part of the new design was to blame.
A theme test answers one question: keep this theme or switch. If you need to know why, you picked the wrong unit of test.
Three situations clear the bar. You bought or built a new theme and want a revenue verdict before committing the store to it. A redesign or rebrand is already decided on brand grounds, and you want to know what it costs or earns before it becomes permanent. Or you rebuilt for speed and want proof the faster theme also sells.
Below that bar, go smaller. If your actual question is about the product page, the add-to-cart area, or the collection grid, test that template or section instead: same math, cleaner answer, faster verdict, and the result tells you what to do next. The main guide ranks what to test first, and if the product page is the suspect, start with where product pages leak revenue.
The native way: Shopify Rollouts
TL;DR Rollouts splits traffic between your published theme and a copy on Shopify's servers, ramps from a slice to 100%, and rolls back automatically at the end date. Experiments need the Grow plan or higher, and it reports numbers with no verdict.
Rollouts is Shopify's own splitter, generally available since the Spring '26 Edition (June 17, 2026). It lives under Markets > Rollouts in the admin, and it turned the hardest DIY problem, a clean server-side split, into a slider.
The mechanics: start a rollout and Shopify copies your published theme, so you can edit the variant without touching the live store. A "launch reach" slider controls the split. At 100% you are simply scheduling a release; below 100% the rollout becomes an experiment with a start date, an end date, and traffic divided between the versions. You can ramp gradually, run mutually exclusive experiments so one shopper never lands in two tests at once, and let the end date trigger an automatic rollback (Shopify Help Center).
Because the split happens on Shopify's servers before the page is sent, the assigned theme is there at first paint. No script swaps content after load, so there is no flicker and no speed penalty, the two classic failures of client-side theme testing.
Two catches. The first is the plan gate: scheduling and publishing rollouts works on Basic and up, but the traffic-split experiment needs the Grow plan or higher, about $105 a month, verify live (Shopify Help Center). Guides saying you need "Advanced" predate Shopify's 2026 plan rename. The second catch matters more for this playbook: Rollouts reports conversion rate, average order value, gross sales, and sessions, and documents no statistical-significance test, no winner call, and no revenue per visitor. It splits honestly and judges nothing. You are the significance engine.
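Being the significance engine can be as small as a two-proportion z-test on the numbers Rollouts hands you. The sketch below is one common way to do that read, not something Rollouts provides: the function name and the order and session counts are hypothetical, orders are assumed to be derived from the reported sessions and conversion rate, and it treats sessions as independent visitors, which overstates certainty when shoppers return.

```python
import math

def two_proportion_z_test(orders_a, sessions_a, orders_b, sessions_b):
    """Two-sided z-test for a difference in conversion rate between two arms."""
    rate_a = orders_a / sessions_a
    rate_b = orders_b / sessions_b
    # Pooled rate under the null hypothesis that both arms convert equally.
    pooled = (orders_a + orders_b) / (sessions_a + sessions_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / sessions_a + 1 / sessions_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value

# Hypothetical export: 78,400 sessions per arm, 1,568 orders on A and 1,725 on B.
rate_a, rate_b, z, p_value = two_proportion_z_test(1568, 78400, 1725, 78400)
print(f"A {rate_a:.2%}  B {rate_b:.2%}  z = {z:.2f}  p = {p_value:.4f}")
```

A p-value under 0.05 corresponds to the 95% confidence bar used throughout this guide. This reads conversion rate only; a revenue read needs per-visitor revenue, which is noisier.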
One operational footnote worth knowing before your first run: a theme experiment collects no analytics if you launch it immediately, so schedule it, even five minutes out. For the full feature teardown, plan details, and the tool-by-tool comparison, see the complete Rollouts breakdown.
The DIY way: a duplicate theme and a sticky cookie
TL;DR Duplicate the theme, edit the copy, assign each visitor by cookie, route half to the variant, and tag orders by arm. It works in a demo and decays in production: flicker, caching, cookie loss, and analytics wiring are all yours to own.
Before Rollouts existed this was the free method, and forums still recommend it. It can work. Knowing exactly how it breaks is what lets you decide whether it deserves your developer's week.
The recipe has four steps.
- Duplicate your published theme and make the variant changes in the copy, leaving the live theme frozen.
- Get a stable way to serve the copy. The usual route is Shopify's theme preview mechanism, driven by a URL parameter that puts a visitor's session on the unpublished theme.
- Split and stick. A small script or edge function assigns each new visitor to A or B, sets a long-lived cookie recording the arm, and sends the B group through the preview route. The cookie keeps them on the same arm across pages and visits.
- Carry the assignment into your numbers. Push the arm into your analytics as a custom dimension and onto the order (cart attributes work), because Shopify's own reports do not segment by theme.
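Steps three and four can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production splitter: it swaps the "store the arm in a cookie" approach for a deterministic hash of a visitor ID (which you would still keep in a cookie), and the experiment name, the `ab_arm` key, and both function names are hypothetical.

```python
import hashlib

def assign_arm(visitor_id: str, experiment: str = "theme-test-1", share_b: float = 0.5) -> str:
    """Map a visitor ID to arm 'A' or 'B'; the same ID always gets the same arm."""
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode("utf-8")).hexdigest()
    # First 8 hex characters become an integer in [0, 2**32), scaled to [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "B" if bucket < share_b else "A"

def order_tags(visitor_id: str) -> dict:
    """Hypothetical key/value pair to copy onto the cart and into analytics."""
    return {"ab_arm": assign_arm(visitor_id)}

print(assign_arm("visitor-123"), order_tags("visitor-123"))
```

Hashing means the arm can be recomputed anywhere the visitor ID is available, so nothing breaks if the arm value itself is lost. It does nothing for a shopper whose ID cookie is cleared or who arrives on a second device.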
Now the failure modes, which is really the section you came for.
- The split runs client-side. Redirecting after the page starts loading is visible: visitors see the live theme flash before the variant appears. You have reintroduced flicker and slowed the first page of the B arm.
- Caches serve the wrong arm. Page caches and CDNs remember pages, not cookie assignments, so a B visitor can receive a cached A page. The split degrades silently and nothing errors.
- "Sticky" is a promise cookies do not keep. Cookies expire, get cleared, and never follow a shopper to a second device. Every crossover pollutes both arms, and pollution always pushes the result toward "no difference," so a real winner can read as a tie.
- The analytics wiring is the actual project. If the arm assignment does not survive all the way to the order, you have two traffic groups and no revenue comparison, which is to say no test.
- SEO has rules you now enforce yourself. Google's website-testing guidance allows A/B tests, with conditions: never cloak (Googlebot must be eligible for the same variants humans see), point rel=canonical from any variant URL to the original, use 302 redirects rather than 301s, and remove the test as soon as it concludes (Google Search Central). Rollouts and the serious apps handle this for you. Your script does not.
The verdict: reasonable as a scrappy one-off if you are below the Grow plan, have real developer help, and treat the result as directional. If the decision matters, the apps below cost less than the debugging.
There is also the sequential route: publish the new theme and compare four weeks after against four weeks before. Free, no wiring, but time confounds it, since seasonality, ad swings, and promotions land inside the comparison. Treat it as a sanity check with matched periods, not a test; the low-traffic playbook covers doing that read as honestly as it can be done.
Theme testing tools: Shoplift, Convert, and Intelligems
TL;DR Shoplift is the theme-test specialist with a Bayesian verdict engine. Convert brings a mature stats engine with more wiring. Intelligems is the money-testing tool that runs theme tests too. Prices move, verify live.
If you are below the Grow plan, or you want a tool that judges the result instead of handing you raw numbers, three apps carry most theme testing on Shopify.
Shoplift is built around this exact job: theme-level and template-level tests delivered without flicker, audience targeting, and a Bayesian significance engine that tells you when a result is real instead of leaving you to eyeball a dashboard. Entry is around $99 a month, $74 on annual, metered by visitors (verify live).
Convert is a general experimentation platform rather than a Shopify-native app. You get a mature significance engine and flexible test types, including the split-URL pattern that suits a two-theme comparison. The tradeoffs: it runs client-side with anti-flicker masking rather than a true server-side split, and the wiring is on you. Growth is around $399 a month, $299 on annual (verify live). It earns its fee for teams running a steady program.
Intelligems approaches testing from the money side: price, shipping, offers, and profit, scored on profit and revenue per visitor with real significance. It runs theme and content tests too, and its theme-testing docs sit at the top of Google for "theme testing" as of July 2026. Core is around $79 a month, $59 on annual, with price testing on the $499 tier (verify live). Pick it when money tests are on your roadmap anyway; it is a lot of app for a single theme comparison.
| Tool | Theme-test fit | Split mechanics | Verdict engine | Rough entry price | Watch out |
|---|---|---|---|---|---|
| Shopify Rollouts | Whole theme versions, plus checkout config on higher plans | Server-side, no flicker | None documented: no significance test, winner call, or RPV | $0 add-on; experiment needs Grow plan or higher | You judge the numbers yourself |
| Shoplift | Its specialty: theme, template, and page tests | Theme-native, no flicker | Bayesian significance | ~$99/mo ($74 annual), visitor-metered | Two variants per test; cost climbs with traffic |
| Convert | Split-URL and full-site experiments | Client-side with anti-flicker masking | Mature significance engine | Growth ~$399/mo ($299 annual) | Generalist: you bring the hypothesis and the wiring |
| Intelligems | Theme and content tests alongside money tests | App-managed | Profit and RPV significance | Core ~$79/mo ($59 annual); price testing on the $499 tier | Built for volume; money tests are the point |
Third-party cells reflect public positioning in mid-2026; check each pricing page before deciding. The wider comparison, including tools that do not fit theme tests, is in the full tool breakdown.
What none of them do, Rollouts included, is pick the hypothesis or build the variant theme. Every tool here splits and scores what you hand it. And a verdict engine does not lower the traffic bill, which is where we go next.
What to measure: revenue per visitor, not conversion rate
TL;DR RPV is revenue divided by visitors, the same as conversion rate times AOV. Theme changes move both numbers, so a conversion win can hide a basket loss. Decide on RPV; use everything else as diagnosis.
Theme changes touch everything at once: how products get discovered, how carts get built, how upsells surface. That is exactly why conversion rate alone misreads them, because a new theme can convert more people while shrinking what each buyer spends.
Revenue per visitor (RPV)
Total revenue divided by total visitors, which works out to conversion rate × average order value. It is the one number that catches a theme trading order size for order count.
The arithmetic makes it concrete. Say theme A converts at 2.0% with a $60 AOV: RPV is $1.20. Theme B converts at 2.2%, but its layout buries your bundles and AOV slips to $53: RPV lands around $1.17. B "wins" on conversion rate and costs you money on every thousand visitors, and nothing in a conversion-only dashboard flags it. The long version of this argument is in why RPV beats conversion rate.
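The same arithmetic as a short script, using the figures from the example above. The function name is illustrative only.

```python
def revenue_per_visitor(conversion_rate: float, average_order_value: float) -> float:
    """RPV as conversion rate times average order value."""
    return conversion_rate * average_order_value

rpv_a = revenue_per_visitor(0.020, 60.0)  # theme A: 2.0% at a $60 AOV
rpv_b = revenue_per_visitor(0.022, 53.0)  # theme B: 2.2% at a $53 AOV
# Revenue difference if 1,000 visitors see B instead of A.
diff_per_thousand = (rpv_b - rpv_a) * 1000

print(f"A ${rpv_a:.2f}  B ${rpv_b:.2f}  change per 1,000 visitors: {diff_per_thousand:+.0f} dollars")
```

On these numbers, B's conversion win costs roughly $34 for every thousand visitors.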
Two cautions come with revenue as the scoreboard. Big orders are the first: theme tests run for weeks across the whole store, so a single $2,000 wholesale order landing in one arm can fake a lift. Serious engines winsorize, meaning they cap extreme orders before scoring; if your tool does not, at least inspect the top orders in each arm before believing a result. Noise is the second: revenue varies far more per visitor than a yes-or-no conversion does, so an RPV verdict needs more traffic than a conversion verdict at the same confidence. Treat the sample-size numbers below as a floor.
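If your tool does not winsorize, a rough version is easy to run on an order export. The sketch below caps orders at a nearest-rank percentile; the 90th-percentile cutoff and the order values are made up for illustration, and real engines may choose their caps differently.

```python
def winsorize_orders(order_values, cap_percent=99):
    """Cap every order above the cap_percent-th percentile at that percentile's value."""
    if not order_values:
        return []
    ranked = sorted(order_values)
    # Nearest-rank percentile with integer arithmetic: ceil(percent * n / 100) - 1.
    cap_index = max(0, (cap_percent * len(ranked) + 99) // 100 - 1)
    cap = ranked[cap_index]
    return [min(value, cap) for value in order_values]

# Hypothetical arm with one wholesale order among ordinary baskets.
orders = [40, 55, 60, 62, 48, 75, 52, 58, 66, 2000]
capped = winsorize_orders(orders, cap_percent=90)

print(f"AOV raw: {sum(orders) / len(orders):.2f}  AOV capped: {sum(capped) / len(capped):.2f}")
```

Apply the same cap rule to both arms before comparing them, and decide the cutoff before the test starts so the choice cannot be tuned to the result.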
Everything else your dashboard offers, add-to-cart rate, reached-checkout, bounce, page speed, is diagnosis. Use those to explain why an arm won. Never use them to declare that it did.
The sample-size reality for theme-level changes
TL;DR At a 2% baseline conversion rate, a 10% relative lift takes roughly 78,000 visitors per variant to detect. Run this arithmetic before the test. If the runtime comes out in seasons, change methods.
This part is deterministic arithmetic, not opinion, and it kills more theme tests than bad design does. Before starting anything, push your own numbers through one formula.
A standard approximation for a two-arm test at 80% power and 95% confidence (Lehr's rule) is: visitors per variant ≈ 16 × p × (1 - p) ÷ d², where p is your baseline conversion rate and d is the absolute lift you want to detect. At a 2% baseline, a 10% relative lift means d = 0.002, so the arithmetic gives 16 × 0.02 × 0.98 ÷ 0.002², about 78,400 visitors per variant. Double it for both arms.
| Detectable lift (relative) | Absolute lift (d) | Visitors per variant | Runtime at 20k sessions/mo | Runtime at 100k sessions/mo |
|---|---|---|---|---|
| 5% | 0.001 | ~313,600 | ~31 months | ~6 months |
| 10% | 0.002 | ~78,400 | ~8 months | ~7 weeks |
| 20% | 0.004 | ~19,600 | ~2 months | ~12 days (run 2 full weeks anyway) |
Lehr's approximation at 80% power and 95% confidence, read on conversion rate. Both arms need the per-variant figure, so total traffic is double that column, and the runtime columns assume every session enters the test. Higher baselines need less: at a 3% baseline, a 10% lift needs about 51,700 per variant. RPV reads need more than the table shows.
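To push your own numbers through the formula, a sketch like the following is enough. It implements Lehr's rule exactly as stated above and assumes, as the table does, that every session enters the test and counts as a visitor; the function names and the 20,000-session example are placeholders.

```python
def visitors_per_variant(baseline_rate: float, relative_lift: float) -> float:
    """Lehr's rule: 16 * p * (1 - p) / d**2 at 80% power and 95% confidence."""
    d = baseline_rate * relative_lift  # absolute lift to detect
    return 16 * baseline_rate * (1 - baseline_rate) / d**2

def runtime_months(baseline_rate: float, relative_lift: float,
                   monthly_sessions: float, arms: int = 2) -> float:
    """Months needed to fill every arm, assuming all sessions enter the test."""
    return arms * visitors_per_variant(baseline_rate, relative_lift) / monthly_sessions

for lift in (0.05, 0.10, 0.20):
    needed = visitors_per_variant(0.02, lift)
    months = runtime_months(0.02, lift, monthly_sessions=20_000)
    print(f"{lift:.0%} lift: {needed:,.0f} per variant, {months:.1f} months at 20k sessions/mo")
```

Swap in your own baseline and monthly sessions. If the months figure is uncomfortable at a 10% lift, that is the signal to change the unit of test or the method before launch.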
Theme tests get two breaks that section tests do not. All store traffic counts, since the theme covers every page, where a product-page test only collects visitors who reach that template. And a full redesign is one of the few changes plausibly large enough to clear the 10% to 20% detection bar, where a button tweak realistically sits in the undetectable 1% to 5% range.
Now the weights on the other side of the scale. Across large samples, only about 1 in 7 tests produces a clear winner (VWO, corroborated by Nielsen Norman Group), and about 60% of completed tests deliver under 20% lift (Convert, 2026). So a rational plan sizes the test to detect 10%, hopes for 20%, and accepts that "no detectable difference" is the most likely single outcome. For a switch-or-stay decision that is still an answer: keep whichever theme costs less to keep.
If the table put your store in the months column, do not start a split you will abandon. Test a smaller unit, or use a method built for thin traffic. How significance actually works explains the mechanics behind the formula.
Seven pitfalls that quietly ruin theme tests
Most theme tests fail on mechanics rather than on the design being tested. These seven do most of the damage, and every one is preventable before launch.
1. Flicker
Client-side testers swap content after the page starts rendering, so visitors see the old theme flash before the variant loads. That looks broken to the shopper and biases the B arm with a worse experience than the theme itself would deliver. Server-side splitting avoids it entirely. DebugBear's measurements show how large the penalty can get: one site improved from a 6.0-second to a 2.7-second largest contentful paint after moving off script-based testing.
2. Cache serving the wrong arm
Page caches and CDNs remember pages, not cookie assignments. In a DIY setup, a visitor assigned to B can still be handed a cached A page, and the split degrades without a single error being logged. QA with fresh sessions on several devices before trusting any homemade split.
3. Sessions crossing variants
Cookies expire, shoppers switch devices, browsers clear storage. Each crossover puts one person's behavior in both arms, and contamination always pulls the measured difference toward zero, so real winners read as ties. You cannot eliminate crossover, only cap it, which is one more reason not to let a test drag on for half a year.
4. Seasonality and a shifting traffic mix
A theme test long enough to reach sample size is long enough to collide with a sale, a new ad campaign, or an email blast that floods one week with loyal buyers. Randomization protects the comparison between arms, but a shifted mix changes what the result generalizes to. Run full weeks, keep acquisition steady, and never span an event like Black Friday and Cyber Monday (BFCM).
5. Peeking
Watching the dashboard and stopping the moment your favorite arm pulls ahead feels diligent and quietly rigs the test. Evan Miller's classic write-up showed repeated peeking can push a real false-positive rate from the 5% you think you are running to about 26%. Precommit the sample size and the end date, then let the test finish.
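You can watch the effect in a simulation. The sketch below runs A/A tests, where both arms are identical and any "winner" is a false positive, and compares a single read at the end against stopping at the first significant peek. It is a rough illustration with made-up parameters (a 10% conversion rate, 2,000 visitors per arm, a look every 100 visitors), not a reproduction of Miller's figures.

```python
import math
import random

def z_significant(conv_a, conv_b, n, z_crit=1.96):
    """True if a pooled two-proportion z-test on n visitors per arm clears z_crit."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return False  # no variation yet, nothing to test
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n)
    return abs(conv_a - conv_b) / n / std_err > z_crit

def false_positive_rates(runs=300, n_per_arm=2000, look_every=100, rate=0.10, seed=7):
    """Simulate A/A tests; return (fixed-horizon, peeking) false-positive rates."""
    rng = random.Random(seed)
    fixed_hits = peeking_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        flagged_early = False
        for visitor in range(1, n_per_arm + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if visitor % look_every == 0 and z_significant(conv_a, conv_b, visitor):
                flagged_early = True  # a peeker would have stopped and declared a winner
        peeking_hits += flagged_early
        fixed_hits += z_significant(conv_a, conv_b, n_per_arm)
    return fixed_hits / runs, peeking_hits / runs

fixed_rate, peeking_rate = false_positive_rates()
print(f"one read at the end: {fixed_rate:.1%}   stop at first significant peek: {peeking_rate:.1%}")
```

Exact figures depend on the seed and the settings. The peeking rate can never be lower than the fixed-horizon rate here, because the final look is one of the peeks, and in practice it tends to sit well above it.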
6. Editing either theme mid-test
Every edit to either arm changes what the test is comparing, which resets the meaning of all data collected so far. Freeze both themes for the duration and keep a list of the fixes you are itching to make. The urge to tweak is the strongest evidence you should have tested a smaller unit.
7. Mishandling SEO
Merchants make two opposite mistakes here: refusing to test at all out of ranking fear, and building DIY redirects that drift toward cloaking. Google's published position is that testing is fine within the rules listed in the DIY section above: no cloaking, canonical from variants, 302s, end it promptly (Google Search Central). Rollouts and the serious apps comply by design.
The step-by-step theme test playbook
The whole method in nine steps. Nothing here is exotic; the discipline is the point.
- Write the decision down first. One sentence: "If theme B beats theme A on RPV at 95% confidence, we publish B." A test without a precommitted decision becomes a debate afterwards.
- Confirm the unit. Switch-or-stay decision: whole theme. Anything narrower: test the section or template instead and bank a cleaner read.
- Run the sample-size arithmetic from the section above at your real traffic and baseline. If the runtime exceeds roughly two months, stop and change methods now, not at week nine.
- Pick the route. Rollouts on Grow or higher, an app when you want a verdict engine or template-level control, DIY only with developer support and modest stakes.
- Build the variant and freeze both arms. QA the copy on mobile first, and check both themes load at comparable speed; a slower variant tests your hosting, not your design.
- Schedule the launch, never start it immediately (in Rollouts an immediate launch collects no analytics).
- Run full weeks, at least two business cycles, with acquisition held steady. No mid-test theme edits, no mid-test budget swings you can avoid.
- Read RPV with significance. Check mobile and desktop separately as a diagnostic, and inspect top orders in each arm so one whale is not writing your conclusion.
- Ship, clean up, log. Publish the winner, remove test scaffolding promptly per Google's guidance, and write down the result even when it is "no difference," because that settles the decision too.
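For the "read RPV with significance" step, one simple option when your tool gives no verdict is a Welch-style z-test on per-visitor revenue, with a zero recorded for every visitor who did not buy. This is a sketch of one reasonable method rather than what any tool named here actually runs, and the data is invented to mirror the 2.0% at $60 versus 2.2% at $53 example, with 1,000 visitors per arm.

```python
import math
from statistics import mean, variance

def rpv_z_test(revenue_a, revenue_b):
    """Welch-style z-test on per-visitor revenue (0.0 for visitors who did not buy)."""
    rpv_a, rpv_b = mean(revenue_a), mean(revenue_b)
    # Each arm keeps its own variance; revenue is far noisier than a yes/no conversion.
    std_err = math.sqrt(variance(revenue_a) / len(revenue_a)
                        + variance(revenue_b) / len(revenue_b))
    z = (rpv_b - rpv_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal approximation
    return rpv_a, rpv_b, z, p_value

# Hypothetical arms: 20 orders of $60 on A, 22 orders of $53 on B.
arm_a = [0.0] * 980 + [60.0] * 20
arm_b = [0.0] * 978 + [53.0] * 22

rpv_a, rpv_b, z, p_value = rpv_z_test(arm_a, arm_b)
print(f"RPV A ${rpv_a:.2f}  RPV B ${rpv_b:.2f}  z = {z:.2f}  p = {p_value:.2f}")
```

At a thousand visitors per arm the difference is nowhere near significant, which is the sample-size point again: a real RPV gap of a few cents needs far more traffic before it separates from noise. Cap extreme orders first if a single large order sits in either arm.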
Where StorePilot fits
StorePilot does a different job than a theme splitter: a whole-theme test usually asks a different question than the one StorePilot answers.
A theme test settles switch-or-stay. Most of the revenue a store loses, though, leaks in smaller places: a product page that buries shipping costs, a cart that hides the total, a mobile layout that makes add-to-cart a hunt. StorePilot watches real shopper behavior, ranks that friction, builds section-level variants as theme-safe app blocks with a preview before anything goes live, and scores results on RPV with significance thresholds and no early winners. The winner call is always deterministic statistics, never the AI, a division of labor we argue for in Never let an AI decide your A/B test.
If you are weighing a redesign against fixing the theme you have, the wider playbook in CRO for Shopify makes the case for finding the leaks before repainting the ship. And when a theme test genuinely is the right call, run it the way this guide lays out: pick the splitter, do the arithmetic first, score on revenue per visitor, and let the test finish before you believe it.
Questions merchants keep asking
Can you A/B test themes on Shopify?
Yes, three ways. Shopify Rollouts splits traffic between theme versions natively on the Grow plan or higher, apps like Shoplift or Convert add significance engines and finer targeting, and a DIY duplicate-theme split works with developer help. Every method assigns visitors to your published theme or a modified copy and compares what each group spent.
How do I test a new Shopify theme before publishing it?
Preview answers quality questions; only an A/B test answers revenue questions. Theme preview lets you QA the unpublished theme on real pages, but no traffic is split, so it proves nothing about conversion. To learn whether the new theme sells better, split real visitors with Rollouts or a testing app and compare revenue per visitor.
Is Shopify theme A/B testing free?
Partly. Rollouts has no add-on fee, but the traffic-split experiment is gated to the Grow plan or higher, about $105 a month, verify live (Shopify Help Center). Scheduling and publishing theme changes works on Basic and up. Third-party theme testing apps start around $79 to $99 a month.
Can you run two themes at the same time on Shopify?
Not as two published themes. Shopify allows one published theme per store, so testing tools keep a copy of your theme and route a share of visitors to it, either on Shopify's servers (Rollouts) or through an app or script. Each visitor sees one consistent version for the whole test.
How long should a Shopify theme A/B test run?
Until it reaches the sample size you computed before starting, in full weeks, across at least two business cycles. At a 2% conversion baseline, detecting a 10% relative lift takes roughly 78,000 visitors per variant, which means weeks on a high-traffic store and many months on a small one.
Does A/B testing a theme hurt SEO?
No, if you follow Google's published testing rules: never cloak, point rel=canonical from any variant URL to the original, use 302 redirects rather than 301s, and remove the test promptly once it concludes (Google Search Central, website testing guidance). Server-side splits like Rollouts fit those rules by design.
Should I test the whole theme or one section at a time?
Test the whole theme only when the decision is whether to switch themes. For any narrower question, test the section or template instead. Smaller tests give cleaner reads and tell you exactly what worked, while a whole-theme result can never isolate which of its many changes did the lifting.
Do theme testing apps slow down my store?
Client-side testers can. They swap content with a script after the page starts rendering, which causes flicker and can drag Core Web Vitals; DebugBear measured one site improving from a 6.0-second to a 2.7-second largest contentful paint after moving off script-based testing. Server-side splits avoid the penalty entirely.