Most CRO programmes fail not because the team cannot run an A/B test, but because they are running the wrong tests, in the wrong order, without a framework for learning from the results. The mechanics of splitting traffic are simple. The harder questions are what to test, how many changes to make per experiment, which testing approach suits your traffic level and business model, and how to sequence experiments so each one builds on the last. This article covers all of it.
Test Types vs. Testing Strategy
These are not the same thing, and conflating them leads to poor decisions.
A test type is the technical mechanism: how traffic is split and how variants are compared. A testing strategy is the system around those experiments: the frameworks you operate within, the scope of your programme, how teams are organised, how hypotheses are generated, and what sequence of experiments makes sense given your goals.
The four core test types are A/B (or A/B/N for multiple variants), multivariate, bandit, and split URL. Each is covered in detail below. But choosing the right test type is a downstream decision. The upstream questions, which framework, which problems to address, which metrics matter, are where strategy lives.
It is also worth stating that A/B testing is one narrow slice of experimentation as a discipline. The Strategyzer book Testing Business Ideas spends more than 350 pages cataloguing methods for testing assumptions before committing resources. Smoke tests (announcing a feature and measuring sign-up intent before building it), five-second usability tests, landing pages for unreleased products, and ad campaign message tests all share the same underlying logic: test the assumption before paying the cost of building something irreversibly.
What to Test
The question “what should we test?” is answered by your research, not by opinion. A/B tests are not ideas. They are proposed solutions to documented problems. Running a test without a documented problem is guessing.
Given that, there is a natural hierarchy of test candidates.
Tier 1: Urgent problems with clear solutions. These come directly from research. Usability studies show users cannot find the CTA. On-site polls reveal confusion about pricing. Analytics shows a sharp drop-off at the shipping step in checkout. The problem is documented; the solution is relatively obvious. Test these first. They produce the fastest wins and the clearest learning.
Tier 2: Creative solutions to non-obvious problems. Once Tier 1 problems are solved, programmes often plateau. No usability issue is detectable, the copy is clear, everything works, but conversion is stuck. Now you need hypotheses that come from understanding psychology and user behaviour rather than from direct observation of friction. A sign-up flow with no obvious problems might still benefit dramatically from adding social login. That hypothesis comes not from the data but from understanding what friction feels like to users even when it is invisible in heatmaps.
Tier 3: Escaping the local maximum. When Tier 1 and Tier 2 testing stop moving the needle, the current page structure has probably reached the ceiling of what iterative improvement can deliver. No individual change will meaningfully shift performance. The only remaining lever is rethinking the layout, flow, or information architecture entirely. This is what Tier 3 tests are designed to address.
How Many Changes to Make Per Test
There is no universally correct answer. The right number of changes per test depends on traffic volume, learning goals, and risk tolerance.
Testing one change at a time gives clean attribution: you know exactly what caused the result. The problem is that small changes produce small effect sizes, and detecting a 0.2% lift with statistical significance can require hundreds of thousands of visitors. If you do not have that traffic, single-variable testing is not viable. It produces tests that never reach significance and programmes that stall.
Testing many changes at once produces larger effect sizes detectable at lower traffic volumes. The trade-off is that you cannot attribute the result to any single change.
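To make the traffic trade-off concrete, here is a rough sketch of the standard sample-size approximation for comparing two conversion rates. The 3% baseline, the 5% significance level, and 80% power are assumptions chosen for illustration, and the "0.2% lift" is read here as 0.2 percentage points; substitute your own figures.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_variant(baseline: float, lift: float,
                         alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate visitors needed per variant (two-sided test of two proportions).

    baseline: control conversion rate, e.g. 0.03 for 3%
    lift:     absolute difference to detect, e.g. 0.002 for 0.2 points
    """
    p1, p2 = baseline, baseline + lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical scenarios: a small isolated change vs. a larger grouped change
small_change = visitors_per_variant(baseline=0.03, lift=0.002)
grouped_change = visitors_per_variant(baseline=0.03, lift=0.010)
print(small_change, grouped_change)
```

Under these assumptions the small change needs more than a hundred thousand visitors per variant, while the larger effect needs only a few thousand. That gap is the practical argument for grouping changes when traffic is limited.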
The right balance is to make multiple changes per test but constrain them so the test still produces learning. Two approaches work well.
The first is grouping all changes around the same diagnosed problem. If research identifies cognitive overload on the pricing page, the variant simplifies the feature list, removes distracting elements, and rewrites the pricing breakdown. Every change addresses the same thing. A win tells you that reducing cognitive load worked. A loss tells you it did not. Attribution is imprecise but directionally sound.
The second is grouping all changes around the same hypothesis. If the hypothesis is that clearer value proposition copy drives sign-ups, you update the homepage headline, rewrite the product page benefit statements, and tighten the CTA copy across three pages. All of these changes test one idea. A win confirms the hypothesis; a loss challenges it.
If you run many changes and get a large lift, you can always retroactively isolate what drove it by reverting subsets in follow-up tests. Whether to do that depends on your goals. Sometimes accepting the win and moving to a different problem is the right call.
A/B Testing vs. Multivariate Testing
A/B (and A/B/N) testing is the default choice for most situations. Multivariate testing (MVT) is designed for one specific question: do combinations of elements interact in ways that matter?
A testimonial image paired with “Start Free Trial” may outperform the same image paired with “Get Started.” That interaction effect is invisible in any A/B test. MVT surfaces it. But MVT has a significant cost: testing all combinations of several elements splits traffic across many more variants, requiring substantially more visitors to reach significance. A rough minimum for MVT is 100,000 visitors per month. Below that threshold, tests take so long to run that the results are often unreliable or irrelevant by the time they arrive.
The practical rule is simple: default to A/B testing. Reach for MVT only when you have specific reason to believe interaction effects are important and the traffic to support it.
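The traffic cost of MVT follows directly from counting combinations. The sketch below uses hypothetical element names and variants, and assumes a full-factorial design with traffic split evenly across every combination.

```python
from itertools import product

# Hypothetical page elements, each with two variants
elements = {
    "headline": ["control", "benefit-led"],
    "hero_image": ["product shot", "testimonial"],
    "cta": ["Get Started", "Start Free Trial"],
}

# Every combination of variants becomes its own cell in a full-factorial MVT
cells = list(product(*elements.values()))

monthly_visitors = 100_000  # assumed traffic level
visitors_per_cell = monthly_visitors // len(cells)
visitors_per_ab_arm = monthly_visitors // 2

print(len(cells), visitors_per_cell, visitors_per_ab_arm)  # 8 12500 50000
```

Three elements with two variants each already produce eight cells, so each cell sees a quarter of the traffic an A/B arm would. Adding a third variant to each element raises the count to 27.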
Bandit Testing
Standard A/B tests hold traffic splits constant (50/50 or 33/33/33) until statistical significance is reached. Bandit tests do not. They dynamically reallocate traffic toward better-performing variants in real time, so if Variant B is outperforming Variant A, more users get Variant B automatically.
The benefit is that you earn revenue during the learning period rather than sacrificing it for clean data. The cost is that you sacrifice clean data.
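One common way to implement this reallocation is Thompson sampling. It is shown here as a minimal simulation and not as the method any particular testing tool uses. The true conversion rates are invented for the simulation; in a live test they are unknown.

```python
import random


def thompson_sampling(true_rates, visitors, seed=0):
    """Simulate a bandit test and return how many visitors each variant received."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(visitors):
        # Draw a plausible conversion rate for each variant from its Beta posterior
        draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
        # Serve the variant whose draw is highest
        arm = draws.index(max(draws))
        # Simulate whether this visitor converts (uses the hypothetical true rate)
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


# Hypothetical variants: A converts at 5%, B at 10%
allocation = thompson_sampling(true_rates=[0.05, 0.10], visitors=5000)
print(allocation)
```

The allocation ends up heavily skewed toward the better variant. That is good for revenue, but the weaker variant receives too little traffic for a clean comparison.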
Bandit testing is appropriate for short-term campaigns where the learning does not carry over. A Black Friday campaign running for seven days needs the best-performing variant to be serving as much traffic as possible by day three. Whether the headline outperforms the urgency banner is not a question with long-term strategic value. You just want the best version running while the campaign is live.
Bandit testing is also well-suited to automation at scale: testing dozens of email subject lines or newsletter CTA variations where individual tests have no strategic significance.
What bandit testing is not suited for is testing fundamental hypotheses about your audience. If you want to know whether social proof outperforms urgency messaging for your product category, you need clean data. A bandit test will skew traffic before significance is reached and may give you a misleading result. Use A/B testing whenever the learning itself is the goal.
Existence Testing
Existence testing is deceptively simple: remove a page element, run an A/B test against the original, and measure the result.
The outcome has three possibilities. Removing the element increases conversion, which means it was a distraction and should be permanently cut. Removing it decreases conversion, which means it was contributing positively and should stay, possibly with more prominence. A result with no significant difference means the element has no measurable impact. It is clutter.
The last outcome is the most common and the most useful. Every piece of content on a page competes for user attention. An element that does nothing for conversion still consumes cognitive bandwidth and often consumes political capital: someone fought to put it there. Existence testing produces the objective evidence needed to remove legacy content without internal conflict.
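The three outcomes can be sketched as a simple decision rule on top of a two-proportion z-test. The function name, the labels, and the 5% threshold are illustrative assumptions. Note that an underpowered test will also land in the "no measurable impact" bucket, so check sample size before treating that result as evidence of clutter.

```python
from math import sqrt
from statistics import NormalDist


def existence_verdict(control_conv, control_n, removed_conv, removed_n, alpha=0.05):
    """Classify an existence test: control keeps the element, the variant removes it."""
    p_control = control_conv / control_n
    p_removed = removed_conv / removed_n
    # Pooled standard error for the difference between two proportions
    pooled = (control_conv + removed_conv) / (control_n + removed_n)
    se = sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / removed_n))
    z = (p_removed - p_control) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    if p_value >= alpha:
        return "no measurable impact"
    return "removal helps" if p_removed > p_control else "removal hurts"


# Hypothetical results, 10,000 visitors per arm
print(existence_verdict(500, 10_000, 600, 10_000))
print(existence_verdict(500, 10_000, 505, 10_000))
print(existence_verdict(600, 10_000, 500, 10_000))
```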
Good candidates for existence testing include banners that have always been there, sections added by departments to guarantee their message appears on the homepage, award logos or certification badges, introductory copy above the fold, and navigation items that analytics shows receive very little traffic.
Iterative vs. Innovative Testing
Iterative testing is a programme of sequential experiments on the same page or flow, each building on the one before. It is not a single experiment. It is a structured approach to progressively improving a specific part of the experience.
Iterative testing works well when specific problems are identified and specific solutions can be designed for them. Each iteration tests a new approach to the same documented problem. It also creates organisational momentum: accessible, well-researched tests produce wins that build confidence in the testing culture.
There is one concept in iterative testing that matters more than any other: even when you have strong evidence that a problem exists, the solution space is infinite. If research confirms that users do not trust the checkout page, there are dozens of ways to address trust perception: different badge placement, testimonial format, copy changes, seal design, guarantee messaging, operator photos, live chat. The first treatment you test is one solution out of many. If it fails, the problem is still real. You have only learned that this particular solution did not work. Keep iterating on treatments until one works.
Iterative testing also has natural political value. When a stakeholder proposes a specific change, “great idea, let’s test it” converts an untested assumption into an evidence-based decision without creating conflict. When a previous agency made changes that have become untouchable, testing whether reverting them improves performance is always legitimate.
Innovative testing is what comes after iterative testing stops working. When a page has reached the ceiling of what small changes can deliver, no iterative refinement will move the needle. The design, flow, or information architecture needs to be rethought. Innovative testing is not a cosmetic update. It is a substantive redesign of a specific part of the experience, built on deep research into what users actually need and in what order.
The research required for innovative testing goes beyond identifying individual friction points. You need to understand what users fundamentally want from the page, what information they need before they are ready to act, which benefits actually matter to them versus which ones your team assumes matter, and whether the current journey structure reflects the ideal customer journey at all.
The cost if an innovative test fails is significant: time, money, and development effort. That cost is why it requires deeper research justification than iterative testing, and why it should only be reached after iterative options have been exhausted.
| | Iterative | Innovative |
|---|---|---|
| Scope | One or several specific elements | Significant portion of page or flow |
| Research depth | Moderate | High |
| Build cost | Low | High |
| Risk | Low | High |
| Best when | Problems are identified and specific | Iterative testing has stopped producing results |
Split Path Testing
Split path testing sends different user segments through entirely different journeys, not just different versions of the same page, but different sequences of pages.
One path might go Homepage to Pricing to Sign Up. Another might go Homepage to Tour to Pricing to Sign Up. The question is not which version of the pricing page is better, but which customer journey produces more conversions.
Analytics can reveal correlations between journeys and outcomes. If users who visit a Tour page before Pricing convert at four times the rate of users who go straight to Pricing, that pattern is worth investigating. But it cannot establish that the tour caused the higher conversion. High-intent users may naturally seek out more information before committing, which means they would have converted anyway regardless of the page sequence. The only way to establish causality is to experimentally route users into each path and measure the result.
Identify candidate paths by using sequence segmentation in your analytics platform: build segments of users who followed different page sequences and compare their downstream conversion rates. Use those correlations to generate hypotheses, then test them.
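If your analytics platform can export session-level data, the sequence segmentation described above can be approximated in a few lines. The session format below (a tuple of page names plus a conversion flag) and the sample data are assumptions for illustration.

```python
from collections import defaultdict


def conversion_by_path(sessions):
    """Group sessions by page sequence and return the conversion rate for each path."""
    totals = defaultdict(lambda: [0, 0])  # path -> [session count, conversion count]
    for pages, converted in sessions:
        key = " > ".join(pages)
        totals[key][0] += 1
        totals[key][1] += int(converted)
    return {path: conversions / count for path, (count, conversions) in totals.items()}


# Hypothetical export: 10 sessions per path
direct = ("Homepage", "Pricing")
via_tour = ("Homepage", "Tour", "Pricing")
sessions = (
    [(direct, True)] * 1 + [(direct, False)] * 9
    + [(via_tour, True)] * 4 + [(via_tour, False)] * 6
)

rates = conversion_by_path(sessions)
print(rates)
```

The output is a correlation only. It tells you which path is worth testing, not which path causes conversions.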
Server-Side Testing
Client-side tests modify a page after it loads in the browser using JavaScript. Server-side tests serve the correct variant from the server before the page reaches the browser.
Client-side testing is accessible to most teams and covers the majority of standard CRO work: copy changes, layout adjustments, CTA variants, imagery. The trade-off is flicker risk, the brief moment where the original page renders before the variant is applied, and the fact that certain changes cannot be implemented on the client side.
Server-side testing eliminates flicker and opens up experiments that require backend logic: pricing changes, algorithmic variations, personalisation at the data layer, and full product feature experiments. The cost is developer involvement for every test, which in teams without sufficient engineering resource creates a bottleneck. Fewer experiments run per quarter means slower learning and lower overall programme impact.
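A common building block for server-side tests is deterministic assignment: hash a stable user identifier so the same user gets the same variant on every request, with no state to store. This is a minimal sketch. The function name, the experiment key, and the variant labels are hypothetical, and production tools add features such as traffic weighting and exclusion rules.

```python
import hashlib


def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically map a user to a variant for a given experiment."""
    # Including the experiment name keeps assignments independent across experiments
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % len(variants)
    return variants[bucket]


# The same user always lands in the same variant
first = assign_variant("user-123", "pricing-page-v2")
second = assign_variant("user-123", "pricing-page-v2")
print(first, first == second)

# Across many users the split is close to even
counts = {"control": 0, "treatment": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", "pricing-page-v2")] += 1
print(counts)
```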
Server-side testing requires what is sometimes called experimentation maturity: a culture and resourcing model where experimentation is treated as a core engineering activity. If you are building your CRO practice, master client-side testing and analytics fundamentals first. Server-side testing is an advanced capability to grow into, not a prerequisite for starting.
Frameworks and Scope
Before running experiments, decide which framework you are operating within. The Double Diamond (Discover, Define, Develop, Deliver) is design-led. The Triple Diamond adds a post-launch learning loop. Dual Track Agile runs product discovery and delivery in parallel. A standard CRO framework cycles through Research, Hypothesis, Test, Analyse, and Iterate. The framework determines how research feeds into testing, how teams are structured, and how results inform future decisions.
Scope matters enormously and differs by business model.
In e-commerce, the testing surface is the website: product pages, category pages, checkout flows, and homepage. The primary conversion event is purchase.
In SaaS, the testing surface extends into the product itself. Every onboarding step, every feature introduction, every upgrade prompt is testable. The primary conversion event is not sign-up but the moment a free trial user becomes a paying customer. Testing in SaaS requires product managers, developers, and data scientists alongside the CRO team, and the strategy must address the full lifecycle: acquisition, activation, retention, revenue expansion, and referral.
For any business, being explicit about which part of the funnel you are optimising determines which metrics matter, which research methods are appropriate, and which teams need to be involved.
Lead Generation Sites
Lead generation sites are simpler in structure than e-commerce but have their own specific challenges. The funnel is often a single page with one goal: form completion. Because the entire user experience often distils to one moment of judgment, first impressions carry more weight than in longer, multi-page flows.
Five-second testing is particularly valuable here. Show the page to participants for exactly five seconds, then ask what they remember, what the company does, and what they are being asked to do. If those three questions cannot be answered accurately after a brief exposure, the page has a clarity problem. No amount of CTA optimisation will recover a page that fails to communicate its purpose before a user forms their first impression.
For the form itself, standard analytics are insufficient. Tools like Mouseflow and Zuko provide field-level data: which fields cause hesitation, where users return to correct themselves, and which field users are interacting with when they abandon. Mouseflow gives a broader picture of on-page behaviour including heatmaps and session recordings. Zuko specialises in granular form analytics, showing completion rates, abandonment rates, and the full user path through fields. Use Zuko when field-level analysis is the primary need; use Mouseflow when form behaviour is one input among several.
Product Discovery and SaaS
Product discovery is the research and validation that happens before building a new feature. Its question is not “how do we improve this existing thing?” but “are we building the right thing at all?”
Teresa Torres’ Continuous Product Discovery framework structures this as identifying opportunities (where do users get stuck, or want something they cannot do?), generating possible solutions, and validating those solutions with experiments before committing development resources. The insight is that validating via a clickable prototype costs a fraction of validating by shipping the feature and measuring adoption.
The prototype-first principle: before building anything, create a clickable mockup and run usability tests on it. Do users understand it? Can they navigate it? Do they want it? Iterate on the prototype until confidence is high, then commit to building. This does not eliminate the need for A/B testing after launch, but it substantially reduces the risk of building something that fails.
Product discovery is increasingly where CRO specialists overlap with product managers, particularly in more mature markets. CRO research methods (user research, analytics analysis, usability studies) apply directly to product decisions. Embedding CRO capability within product teams gives access to development resources for server-side experiments and brings quantitative rigour to decisions that have historically been made by intuition or by the HiPPO (the highest-paid person’s opinion).
SaaS Purchase Flow Testing
In SaaS, the most commercially important experiments are typically not on the marketing site but within the product itself, at the moment a free trial user decides whether to become a paying customer.
The sign-up page tests form fields, social login options, headline clarity, and trust signals. The onboarding flow tests step sequence, progress indicators, and the first action prompted. The pricing page tests plan structure, feature clarity, and psychological pricing effects. The cancellation flow tests exit surveys, pause options, and discount interventions.
The activation trigger concept is central to free trial optimisation. An activation trigger is the specific action or experience that most strongly correlates with a trial user converting to paid. Facebook’s frequently cited finding is that users who added at least seven friends within their first ten days were significantly more likely to retain. Facebook then restructured onboarding to push users toward that threshold faster.
The principle generalises: analyse user behaviour data to find which early actions correlate with long-term conversion. Then design experiments that help more users reach those actions sooner.
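As a starting point, you can compare conversion rates between trial users who did and did not perform each early action. The user records and action names below are invented for illustration, and the comparison is correlational: it suggests candidate triggers to test, not proven causes.

```python
def conversion_rate(group):
    """Share of users in the group who converted to paid."""
    return sum(u["converted"] for u in group) / len(group) if group else 0.0


def activation_rates(users, action):
    """Return (rate for users who did the action, rate for users who did not)."""
    did = [u for u in users if action in u["early_actions"]]
    did_not = [u for u in users if action not in u["early_actions"]]
    return conversion_rate(did), conversion_rate(did_not)


# Hypothetical trial users with the actions taken in their first week
users = [
    {"early_actions": {"created_project", "invited_teammate"}, "converted": True},
    {"early_actions": {"invited_teammate"}, "converted": True},
    {"early_actions": {"created_project", "invited_teammate"}, "converted": True},
    {"early_actions": {"invited_teammate"}, "converted": False},
    {"early_actions": {"created_project"}, "converted": True},
    {"early_actions": {"created_project"}, "converted": False},
    {"early_actions": set(), "converted": False},
    {"early_actions": {"created_project"}, "converted": False},
]

for action in ("invited_teammate", "created_project"):
    print(action, activation_rates(users, action))
```

In this invented data set, inviting a teammate separates converters from non-converters more sharply than creating a project, which would make it the stronger candidate for an onboarding experiment.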
One measurement consideration specific to SaaS: a test that increases free trial sign-ups is not automatically a win. If those sign-ups have lower activation rates or churn faster than the control group, the upstream metric improved while the downstream one worsened. Define your measurement horizon before the test starts, and agree with stakeholders whether the KPI is sign-up volume, paid conversion, or retention at 90 days.
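A small worked example shows how the upstream and downstream metrics can disagree. All figures below are invented for illustration.

```python
def funnel_summary(visitors, signups, paid):
    """Compute the upstream and downstream metrics for one variant."""
    return {
        "signup_rate": signups / visitors,    # upstream metric
        "trial_to_paid": paid / signups,      # quality of the sign-ups
        "paid_per_visitor": paid / visitors,  # downstream metric
    }


# Hypothetical results after the full trial period has elapsed
control = funnel_summary(visitors=10_000, signups=500, paid=100)
variant = funnel_summary(visitors=10_000, signups=650, paid=91)

print(control)
print(variant)
```

Here the variant wins clearly on sign-up rate but produces fewer paying customers per visitor. Which result counts as the outcome depends on the KPI agreed before the test started.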
Pricing Page Testing
The pricing page is where purchase intent is made or broken. Every page before it is setting expectations. If users arrive at pricing having been shown aspirational or vague messaging and the price is higher than they expected, the mismatch creates resistance regardless of how compelling the pricing page itself is. Pricing pages cannot compensate for poor pre-selling upstream.
Beyond layout and copy, three psychological principles are worth understanding and testing.
Decoy Pricing
Adding a clearly inferior option near a preferred option changes how users perceive the value of the better option. Dan Ariely documented the classic case in Predictably Irrational: a newspaper offered subscriptions at $59 for digital only, $125 for print only, and $125 for print plus digital. The print-only option was objectively bad. Nobody chose it. But its presence made print-plus-digital look like exceptional value. Remove it, and many users chose digital only instead.
Shopify’s pricing uses the same principle. The Advanced plan at $399 per month makes the Shopify plan at $92 look reasonable by comparison. The Advanced plan exists partly as a reference point.
Price Anchoring
The first price a user encounters becomes the reference frame against which all subsequent prices are judged. Show the most expensive plan first, and cheaper plans feel like bargains. LeadPages leads with its Advanced plan at $239 per month; Standard at $27 then feels almost free. Sprint displayed Verizon and AT&T prices prominently to position its own pricing as competitive.
The anchor does not have to be the option you want users to choose. It just needs to set the right reference frame for the option you do want them to choose.
Centre Stage Effect
When presented with multiple options, users gravitate toward the middle. It reads as the safe choice: not the cheapest (which implies low quality) and not the most expensive (which feels excessive). DocuSign highlights its middle tier and labels it “Most Popular.” Placing your highest-margin plan in that position, and emphasising it visually with a different colour or border, directs attention without overt pressure.
Adding a premium tier can also shift perception of the tier below it. Introducing a top-end option pushes the next option down toward feeling like the sensible middle ground, even if it is actually the second-most-expensive plan. Coca-Cola demonstrated this effect with cup sizing: adding a very large option caused more customers to choose the second-largest size rather than the medium.
All three of these principles are well-documented and plausible. None of them is guaranteed to work for your product and audience. Test them.
| Strategy | What it does | Test when |
|---|---|---|
| Decoy pricing | Makes the preferred plan look better next to an inferior option | You have three or more plans and want to push users away from the cheapest |
| Anchoring | Sets a reference price that makes the target plan feel reasonable | Your target plan is not the cheapest and you want to shift value perception |
| Centre stage | Draws attention to the middle option | You have three or more plans and want users on the middle tier |
| Competitor anchoring | Uses rivals’ pricing to make yours look more attractive | You are price-competitive and users are likely to comparison shop |
Churn Testing
Before discussing churn reduction tactics, the ethical baseline: do not obstruct cancellation. Hiding the cancel button, routing users through an unnecessarily complex flow to prevent them from leaving, or using confusion to delay a decision they have already made is both unethical and increasingly illegal. It produces chargebacks, reputational damage, and a support load that negates any short-term revenue benefit.
The goal of churn testing is to help users who have a fixable reason for leaving, not to trap users who genuinely want out.
The most effective churn interventions are personalised. When a user initiates cancellation, ask why, and respond to the stated reason specifically. A user who says the product is too expensive is a different problem from a user who says they are not using it enough. The first might respond to a lower tier or a temporary discount. The second might respond to a pause option that lets them suspend the subscription for 30 or 60 days. A user who wants a feature that is on the roadmap might respond to a beta invite.
Loss aversion messaging can be effective when it is genuine. Reminding users what they will lose when they cancel activates the psychological asymmetry where losses feel roughly twice as painful as equivalent gains. Facebook surfaced messages about friend updates users would miss at the point of account deletion. If a user will genuinely lose saved data, integrations, or history that cannot be recovered, making that concrete is both honest and persuasive. Apply this principle only where it is true.
Measurement in churn testing requires thought. If the experiment covers an online cancellation flow, direct measurement is straightforward. If some users cancel by phone, you need trackable phone numbers per variant. Define in advance whether success means fewer cancellations in the test window or fewer cancellations over 90 days. Users retained short-term by a discount or pause option may still churn at the next billing cycle. The right measurement horizon depends on your subscription structure.
Quick Reference
Testing approach by traffic level
| Monthly visitors | Recommended approach |
|---|---|
| 1M+ | Iterative, isolated variables; small effect sizes detectable |
| 100k to 1M | Iterative with grouped changes; MVT possible |
| 10k to 100k | Innovative tests requiring large effect sizes; avoid MVT |
| Under 10k | Qualitative research, prototype testing, smoke tests; defer A/B until traffic grows |