Most research describes. It tells you what is happening: how many customers churned, which page converts best, what people say they want. Causal research is more ambitious. It tries to prove why something happens, and the primary tool for doing that is the experiment. Yet experimentation comes with a discipline that surprises newcomers, because a well-run experiment never proves causation in the absolute sense people expect. It does something more honest and more useful: it lets you infer cause with confidence by ruling out the alternatives. This article is about how that inference works, what threatens it, and how the different experimental designs are built to defend against those threats.
What causality actually means
The everyday idea of cause is absolute. If X causes Y, then X is the cause and X always produces Y, and a good experiment ought to prove it. The scientific meaning is more modest on every count. X is treated as one of several possible causes of Y, not the only one. X is said to make Y more probable, not to guarantee it. And crucially, you can never prove the causal link; you can only infer it from evidence. That humility is not a weakness of the method. It is the method, and accepting it is the first step to designing experiments well.
To infer that X causes Y, three conditions must all hold. The first is concomitant variation, meaning X and Y vary together in the predicted way, so that when X changes Y changes in the expected direction. The second is time order, meaning X occurs before or at the same time as Y, because an effect cannot precede its cause. The third, and the hardest, is the absence of other causal factors, meaning every other plausible explanation for the change in Y has been controlled or eliminated. Concomitant variation and time order are often easy to establish. It is the third condition that makes experimental design difficult, and it is precisely the work of ruling out rival explanations that gives a good experiment its value.
The vocabulary of experiments
A handful of terms recur throughout. The independent variable is the thing the researcher deliberately manipulates, such as price level, advertising spend, or packaging design. The dependent variable is the outcome being measured, such as sales, market share, or brand attitude. The test units are the entities whose responses are recorded, which might be individual consumers, stores, cities, or websites. And the extraneous variables are everything else that could affect the outcome, such as store size, location, the season, or what competitors happen to be doing. The experimental design is the full plan that ties these together: who is assigned where, what is manipulated, what is measured, and how the extraneous variables are kept from contaminating the result.
Experimental designs are often sketched with a compact notation. An X marks exposure to a treatment, an O marks an observation or measurement of the outcome, and an R marks random assignment of test units to groups, with time reading from left to right. So a line reading R, O, X, O describes a group that was randomly formed, measured, treated, and measured again. Learning to read these diagrams makes the logic of each design immediately visible.
Two kinds of validity
Before looking at the designs, it helps to separate two questions that are easy to conflate. Internal validity asks whether X actually caused the observed change in Y within this experiment, and it is threatened by extraneous variables that were not controlled. External validity asks whether the findings generalise beyond this particular experiment, and it is threatened by artificial conditions that do not reflect the real world. The two often pull against each other. A tightly controlled laboratory experiment has high internal validity but low external validity, because its sterile conditions are nothing like a real shopping environment. A field experiment in actual stores has the reverse profile, high external validity but lower internal validity, because the messy real world keeps intruding. Good design is largely about managing that trade-off deliberately rather than stumbling into it.
What can go wrong
The threats to internal validity are the extraneous variables, and naming them is the first step to defending against them. History refers to external events that happen during the experiment, such as a competitor launching a major promotion while you test new packaging. Maturation refers to changes in the test units over time that have nothing to do with the treatment, such as participants growing tired or bored during a long study. There are two testing effects: the main testing effect, where simply taking a pre-test changes how people respond later, and the interactive testing effect, where the pre-test changes how people react to the treatment itself, for instance by making them unusually attentive to a brand. Instrumentation refers to changes in the measuring tool, observer, or scoring method between measurements. Statistical regression is the tendency of extreme scorers to drift back toward the average regardless of any treatment, which is why choosing unusually low-performing stores for a test is dangerous, since they would likely have improved anyway. Selection bias arises when test units are not properly balanced across conditions, such as putting the younger, more tech-savvy participants in the digital condition. And mortality refers to test units dropping out before the end, which biases the sample if the dropouts differ systematically from those who remain.
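Statistical regression is the least intuitive of these threats, so a small simulation may help. The sketch below assumes hypothetical stores whose sales are a stable level plus random period-to-period noise, with no treatment applied at all. The store counts, means, and noise levels are invented for illustration.

```python
import random
from statistics import mean

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Each hypothetical store has a stable "true" sales level plus
# independent noise in each period. No treatment is applied.
true_level = [rng.gauss(100, 10) for _ in range(1000)]
period_1 = [t + rng.gauss(0, 10) for t in true_level]
period_2 = [t + rng.gauss(0, 10) for t in true_level]

# Pick the 100 stores that looked worst in period 1, as a researcher
# might when choosing "low performers" for a test.
worst = sorted(range(1000), key=lambda i: period_1[i])[:100]

worst_before = mean(period_1[i] for i in worst)
worst_after = mean(period_2[i] for i in worst)
overall_before = mean(period_1)

print(f"worst stores, period 1: {worst_before:.1f}")
print(f"same stores, period 2:  {worst_after:.1f}")
print(f"all stores, period 1:   {overall_before:.1f}")
```

The selected stores score higher in the second period even though nothing was done to them, because part of their poor first-period showing was noise. A treatment applied to them would appear to work for the same reason.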
Keeping the threats out
There are four main ways to control extraneous variables. Randomisation assigns test units to groups by chance, which, given enough units, tends to balance every extraneous variable across the groups at once, and it is the single most powerful tool available. Matching compares test units on key background variables and balances the groups on those before assignment, which works when you know which variables matter and can measure them. Statistical control measures the extraneous variables and adjusts for their effect during analysis, using techniques such as analysis of covariance, and it helps when full randomisation is not possible. Design control builds the structure of the experiment so that a specific known variable is isolated, for example by blocking on store size. Randomisation is the default because it handles even the variables you did not think of, while the other three are targeted tools for known, measurable threats.
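As a minimal sketch of randomisation, assuming Python and hypothetical store identifiers, the following shuffles the test units and deals them out to groups. The function name `randomise`, the group labels, and the seed are illustrative choices, not part of any standard procedure.

```python
import random

def randomise(units, groups, seed=None):
    """Randomly assign each test unit to one of the named groups.

    Units are shuffled and then dealt out in turn, so group sizes
    differ by at most one.
    """
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    assignment = {g: [] for g in groups}
    for i, unit in enumerate(shuffled):
        assignment[groups[i % len(groups)]].append(unit)
    return assignment

# Hypothetical test units: twenty stores.
stores = [f"store_{i:02d}" for i in range(1, 21)]
result = randomise(stores, ["treatment", "control"], seed=42)

for group, members in result.items():
    print(group, len(members), members[:3])
```

Because chance alone decides membership, no characteristic of a store, measured or not, can systematically steer it into one group. With only a handful of units the balance can still be poor, which is why the text stresses "given enough units".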
The designs, from weak to strong
Experimental designs fall into four families, and they form a clear ladder of rigour.
The pre-experimental designs use no randomisation and are weak on internal validity, so their findings should be treated as exploratory at best. The one-shot case study exposes a single group to a treatment and measures it once, giving you a number with no baseline and no comparison, so you cannot know what caused it. The one-group pre-test-post-test design measures before and after the treatment, which feels more rigorous, but because there is no control group, the threats of history, maturation, testing effects, and instrumentation all remain uncontrolled, so any change might have nothing to do with the treatment. The static group design compares a treated group against an untreated one, but because the units are not randomly assigned, selection bias is a serious threat and the two groups may simply have differed to begin with.
The true experimental designs introduce random assignment, which is what eliminates selection bias and spreads the other extraneous variables evenly across groups. The pre-test-post-test control group design randomly forms two groups, measures both, treats only one, and measures both again, with the treatment effect estimated as the change in the treated group minus the change in the control group. The control group is the key, because it absorbs the effects of history, maturation, the main testing effect, instrumentation, regression, and mortality over the same period, so subtracting its change leaves the treatment effect almost cleanly isolated. The one threat that survives is the interactive testing effect, because the pre-test conditions the treated group in a way the design cannot subtract away. The post-test-only control group design solves that by dropping the pre-test entirely, randomly forming two groups, treating one, and measuring both only at the end. With no pre-test there are no testing effects at all, randomisation still handles selection bias, and the result is often cleaner than the pre-test version when a pre-test would risk sensitising participants. The Solomon four-group design combines both approaches, with two groups receiving a pre-test and two not, which lets you explicitly measure the interactive testing effect, at the cost of needing four groups and roughly double the resources.
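The arithmetic behind the two main true experimental designs can be sketched as follows. The function names and the sales figures are made up for illustration, and a real analysis would also test whether the difference is statistically significant.

```python
from statistics import mean

def pretest_posttest_effect(treated_pre, treated_post,
                            control_pre, control_post):
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

def posttest_only_effect(treated_post, control_post):
    """Difference in final outcomes; relies on randomisation for balance."""
    return mean(treated_post) - mean(control_post)

# Made-up weekly unit sales per store, for illustration only.
treated_pre = [100, 110, 90, 100]    # mean 100
treated_post = [120, 130, 110, 120]  # mean 120
control_pre = [100, 95, 105, 100]    # mean 100
control_post = [105, 100, 110, 105]  # mean 105

effect = pretest_posttest_effect(treated_pre, treated_post,
                                 control_pre, control_post)
print(effect)  # treated change of 20 minus control change of 5
print(posttest_only_effect(treated_post, control_post))
```

In this invented example the control group rose by 5 with no treatment, so only 15 of the treated group's 20-unit rise is attributed to the treatment. A one-group before-and-after comparison would have credited the treatment with the full 20.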
The quasi-experimental designs are for situations where you cannot fully randomise or control the timing of the treatment. The time series design measures a single group repeatedly before and after the treatment, using the pre-treatment measurements to establish a baseline trend, so that an abrupt shift at the treatment point counts as evidence of an effect. Its weakness is that history remains uncontrolled, since some external event could have caused the shift instead. The multiple time series design strengthens this by adding a carefully chosen control group measured over the same period, which lets you check the treatment effect twice, once against the experimental group’s own prior trend and once against the control group’s trajectory.
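One simple way to read a time series design, sketched below under stated assumptions, is to fit a straight line to the pre-treatment measurements, extend it past the treatment point, and see how far the later observations sit from that projection. The linear trend, the function names, and the figures are all illustrative assumptions; real series often need seasonality and autocorrelation handled as well.

```python
def fit_trend(values):
    """Least-squares line through the points (0, v0), (1, v1), ..."""
    n = len(values)
    xs = range(n)
    x_bar = sum(xs) / n
    y_bar = sum(values) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def shift_from_trend(pre, post):
    """Average gap between post-treatment observations and the values
    projected by extending the pre-treatment trend."""
    slope, intercept = fit_trend(pre)
    start = len(pre)
    projected = [intercept + slope * (start + i) for i in range(len(post))]
    gaps = [obs - proj for obs, proj in zip(post, projected)]
    return sum(gaps) / len(gaps)

# Made-up monthly sales, already rising by 2 per period before treatment.
pre = [100, 102, 104, 106, 108]
post_same_trend = [110, 112, 114]   # higher than before, but on trend
post_with_shift = [120, 122, 124]   # clearly above the projection

print(shift_from_trend(pre, post_same_trend))
print(shift_from_trend(pre, post_with_shift))
```

The first post-treatment series is higher than every pre-treatment value yet shows no shift at all, because it merely continues the existing trend. Only the second departs from the baseline, and even then history remains a rival explanation.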
The statistical designs go beyond a simple treatment-versus-control structure and let you test several independent variables at once, control specific extraneous variables statistically, and reuse test units for economy. A randomised block design controls for one major known extraneous variable, such as store size, by grouping the test units into blocks on that variable and randomly assigning treatments within each block. A Latin square design controls for two such variables at once, provided you can assume their interaction is negligible. A factorial design measures the effects of two or more independent variables at multiple levels, and its great advantage is that it reveals interaction effects, not just the separate effect of each variable. It can answer a question that testing variables one at a time never can, such as whether price sensitivity depends on the advertising format, because the answer lives in how the two variables combine.
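To make the idea of an interaction concrete, here is a sketch for the smallest factorial, two variables at two levels each. The variable levels ("low"/"high" price, "print"/"video" format), the function name, and the cell means are invented for illustration, and the code shows only the arithmetic, not the significance testing a real analysis would need.

```python
def analyse_2x2(cells):
    """Summarise a 2x2 factorial from its four cell means.

    `cells` maps (price, ad_format) to the mean outcome in that cell.
    """
    # Effect of raising the price, within each ad format.
    price_effect_print = cells[("high", "print")] - cells[("low", "print")]
    price_effect_video = cells[("high", "video")] - cells[("low", "video")]
    # Effect of switching format, within each price level.
    format_effect_low = cells[("low", "video")] - cells[("low", "print")]
    format_effect_high = cells[("high", "video")] - cells[("high", "print")]
    return {
        "price_main_effect": (price_effect_print + price_effect_video) / 2,
        "format_main_effect": (format_effect_low + format_effect_high) / 2,
        # How much the price effect changes between the two formats.
        "interaction": price_effect_video - price_effect_print,
    }

# Made-up mean purchase-intent scores for each combination.
cells = {
    ("low", "print"): 50, ("high", "print"): 40,
    ("low", "video"): 60, ("high", "video"): 58,
}
summary = analyse_2x2(cells)
print(summary)
```

In these invented numbers a higher price costs 10 points under print advertising but only 2 under video, so the interaction of 8 says price sensitivity depends on the format. Testing price alone, under one format, could not have revealed that.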
Choosing a design
The choice follows a short chain of questions. If you cannot randomly assign test units to groups, ask whether you can at least take repeated measurements over time. If you can, a time series design, or better a multiple time series design when a control group is available, is your route. If you cannot, you are limited to the pre-experimental designs and should treat the findings as exploratory. If you can randomise, then ask whether you need a pre-test measurement. If you do, use the pre-test-post-test control group design, adding the Solomon fourth group when the interactive testing effect must be isolated. If you do not, the post-test-only control group design is cleaner because it removes the testing effects entirely, and if you also need to test several independent variables together, step up to a factorial design.
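The chain of questions above can be written out as a small function. This is only a restatement of the paragraph in code, with hypothetical parameter names, and it is not a substitute for judgement about costs and context.

```python
def choose_design(can_randomise,
                  can_measure_repeatedly=False,
                  control_group_available=False,
                  needs_pretest=False,
                  must_isolate_interactive_testing=False,
                  several_independent_variables=False):
    """Return the design family suggested by the decision chain."""
    if not can_randomise:
        if can_measure_repeatedly:
            if control_group_available:
                return "multiple time series"
            return "time series"
        return "pre-experimental (treat findings as exploratory)"
    if needs_pretest:
        if must_isolate_interactive_testing:
            return "Solomon four-group"
        return "pre-test-post-test control group"
    if several_independent_variables:
        return "factorial"
    return "post-test-only control group"

# The beverage packaging study: randomisation possible, no pre-test needed.
print(choose_design(can_randomise=True))
# A policy change already rolled out everywhere, with monthly sales history.
print(choose_design(can_randomise=False, can_measure_repeatedly=True))
```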
The limits of experiments
Experiments are powerful but costly in several ways worth weighing in advance. They take time, and long-term effects require long studies during which market conditions may shift before the results arrive. They are expensive, since control groups, repeated measurements, and multiple markets all multiply the budget. They are hard to administer, especially in the field, where extraneous variables intrude no matter how carefully you plan. And they are vulnerable to competitive contamination, because rivals may deliberately change their pricing or promotions to muddy your field test once they notice it.
Two applied examples
Consider a premium watch brand wanting to know whether higher advertising spend lifts sales. A pre-test-post-test control group design run as a field experiment fits well. You select three sets of test markets matched on demographics, competitive environment, and current sales, then randomly assign each set to increased, decreased, or maintained spend, track sales for three months, and compare the before-and-after change across conditions. The difficulties are real and instructive: finding three genuinely equivalent sets of markets, obtaining reliable sales data across them, the risk that competitors change behaviour mid-test, and the chance that retailers in the reduced-spend markets object or compensate on their own. Management support matters here, for securing data access, allocating budget, and managing the retailer relationships.
Now consider a beverage company testing four package designs, the current one plus three new candidates. A post-test-only control group design run online is the natural fit. You place each design at a separate URL, randomly assign recruited participants to one of them, have each person view a single package and answer attitude and purchase-intent questions, then compare scores across the four conditions. This is a clean design precisely because there is no pre-test to sensitise anyone, randomisation handles selection bias, and the controlled online setting limits outside interference.
Pitfalls to avoid
A few mistakes recur often enough to name. The first is confusing statistical significance with proof of causation, since even a significant result from a sound experiment only supports an inference, and replication, not a single study, is what builds real confidence. The second is ignoring the interactive testing effect, which the popular pre-test-post-test control group design leaves uncontrolled, so when a pre-test might change how people respond to the treatment, switch to the post-test-only or Solomon design. The third is choosing test markets for convenience rather than equivalence, which quietly introduces selection bias at the level of the market. The fourth is attributing a time series change to the treatment when an upward trend after the intervention may simply continue a trend that began before it, which is why you must always examine the pre-treatment trend line first. The fifth is underestimating mortality, because dropouts are rarely random and a biased remainder distorts the result. And the last is running a factorial design without enough statistical power, since testing multiple variables and their interactions needs a substantially larger sample than a simple two-group test, and an underpowered factorial produces unreliable interaction estimates.
Experimentation, in the end, is less about a single clever test than about systematically closing off every explanation but the one you are testing. Get that discipline right and you earn the right to say not just what happened, but why.