Dynamic Causal Effects in Time Series

Forecasting predicts the future; dynamic causal effects measure how a change in X moves Y over time. A guide to distributed lag models, exogeneity, multipliers, HAC standard errors, and GLS.

Most of what people do with time series is forecasting, using the past to guess the future value of some variable. Estimating a dynamic causal effect is a different and more demanding goal. Here the question is not what the future value of Y will be, but how much a change in some variable X actually causes Y to move, and crucially, how that effect plays out not just today but over the months and quarters that follow. The effect is dynamic because a single push on X today ripples forward through time.

The classic illustration comes from the market for frozen concentrated orange juice. Let Y be the percent change in juice prices in a given month and let X be the number of freezing-degree days, a measure of how harsh the cold was in Florida that month. A cold snap today damages the orange crop, but the consequences do not land all at once. Supply is squeezed in the months that follow, and prices keep responding well after the freeze itself. A regression that uses only the current month’s weather captures the first jolt and misses everything that comes after. To recover the full price response you have to include lagged values of the weather variable, letting the model trace the effect as it unfolds.

Identification with a single entity

In cross-sectional work the gold standard for a causal claim is a randomised controlled trial: many subjects, randomly sorted into a treatment group and a control group, with the random assignment guaranteeing that the groups differ only in the treatment. Time series rarely offers anything like this. Often there is just one entity, the US economy, say, observed repeatedly over time. There is no separate control group to compare against.

The resolution is to let the single entity serve as both treatment and control, with the comparison happening across time rather than across subjects. Some periods experience a change in X and play the role of treatment, while other periods do not and play the role of control. The randomisation, when it exists, occurs over time. Weather is a clean case, since the timing of a freeze is effectively random with respect to the economics of the juice market, which is exactly what makes the orange juice setting such a useful teaching example.

The distributed lag model

The workhorse specification writes Y as a function of the current value of X and a series of its lags.

$$Y_t = \beta_0 + \beta_1 X_t + \beta_2 X_{t-1} + \cdots + \beta_{r+1} X_{t-r} + u_t$$

The first slope coefficient measures the contemporaneous effect, the impact of X on Y within the same period. Each subsequent coefficient carries a dual reading that is worth pausing on. It can be seen as the effect of a past value of X on today’s Y, or equivalently as the effect of today’s X on a future Y. These are two ways of describing the same number, and the equivalence is what lets the model speak about effects that stretch into the future. The error term collects measurement error and whatever relevant variables have been left out.
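To make the dual reading concrete, here is a minimal plain-Python sketch. The coefficient values and the helper name `distributed_lag_response` are invented for illustration and are not estimates from the orange juice data. It pushes a one-unit, one-period pulse in X through the model with the error switched off, and the resulting path of Y traces out the slope coefficients one period at a time.

```python
def distributed_lag_response(betas, x_path, intercept=0.0):
    """Path of Y implied by Y_t = intercept + sum_j betas[j] * X_{t-j}.

    betas[0] multiplies X_t, betas[1] multiplies X_{t-1}, and so on.
    Values of X before the start of x_path are treated as zero and the
    error term is set to zero, so only the systematic part is shown.
    """
    y_path = []
    for t in range(len(x_path)):
        total = intercept
        for j, beta in enumerate(betas):
            if t - j >= 0:
                total += beta * x_path[t - j]
        y_path.append(total)
    return y_path


# Hypothetical slopes beta_1..beta_4, i.e. the current X plus r = 3 lags.
betas = [0.5, 0.3, 0.15, 0.05]

# A one-unit pulse in X at t = 0 and nothing afterwards.
pulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

response = distributed_lag_response(betas, pulse)
print(response)
```

The response in period h equals the coefficient on the h-th lag, and it is zero once the pulse is more than r periods old, which is the sense in which each coefficient describes the effect of today's X on a future Y.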

Two conditions are built into this model. The first is that the dynamic causal effects are constant, meaning the coefficients do not drift over time, so that a one-unit increase in X has the same consequences whenever it happens. This requires Y and X to be jointly stationary, a property worth checking with unit root and structural break tests such as the augmented Dickey-Fuller and Quandt likelihood ratio tests. The second condition is exogeneity, which requires the current and lagged values of X to be uncorrelated with the error term, and it is this requirement that does the heavy lifting in turning a regression into a causal statement.

Two flavours of exogeneity

In ordinary regression we ask that the regressors be uncorrelated with the error. Time series forces us to be more precise, because there are past, present, and future values of X to worry about.

The weaker condition, usually called past-and-present exogeneity or simply exogeneity, says that the error in a given period is uncorrelated with current and all past values of X.

$$E[u_t \mid X_t, X_{t-1}, X_{t-2}, \ldots] = 0$$

This is enough for ordinary least squares to deliver consistent estimates of the dynamic effects, and it carries the implication that any causal effect beyond the last included lag is zero.

The stronger condition, strict exogeneity, additionally rules out any correlation between today’s error and future values of X.

$$E[u_t \mid \ldots, X_{t+1}, X_t, X_{t-1}, \ldots] = 0$$

Strict exogeneity breaks down whenever today’s error is correlated with future values of X, which happens when movements in Y feed back into future X or when the error carries information about future X. In the orange juice example the weather is exogenous but not strictly so. Prices cannot cause a freeze, but traders forecast future weather and act on those forecasts, so today’s price surprise, which sits in the error term, is correlated with future freezing-degree days. By contrast, a field experiment that assigns fertiliser to tomato plots at random each season gives a strictly exogenous X, since next season’s assignment is decided by a coin and cannot be correlated with anything driving this season’s yield. Strict exogeneity always implies ordinary exogeneity, but the reverse does not hold.

Estimating the model with OLS

When X is exogenous, though not necessarily strictly so, ordinary least squares estimates the distributed lag model under a familiar set of conditions adapted to the time dimension. The conditional mean of the error given current and past X must be zero. Y and X must share a stationary joint distribution, so that the relationship inside the sample also holds outside it. They must be weakly dependent, meaning observations far apart in time become effectively independent, which keeps the law of large numbers and central limit theorem working. Large outliers must be unlikely, and here the moment requirement is strengthened to more than eight finite nonzero moments, a stronger demand than usual because the variance estimator introduced below needs it to be provably consistent. Finally there must be no perfect multicollinearity.

The subtle problem in this setting is that the error term is very often serially correlated. The omitted variables hiding inside it, aggregate income being a natural example for the juice market, tend to move slowly and persistently over the business cycle, and that persistence transmits into the errors. Importantly, this is not the same as omitted variable bias. Aggregate income can be uncorrelated with the weather even while being correlated with itself over time, so exogeneity survives and the point estimates remain consistent. What does not survive is the usual formula for the standard errors, which becomes inconsistent. Left uncorrected, every hypothesis test and confidence interval built on those standard errors is misleading.

Dynamic multipliers

The coefficients of the model have a direct interpretation as multipliers. The contemporaneous dynamic multiplier is the first slope, the immediate effect of a one-unit change in X on Y in the same period. The h-period dynamic multiplier is the coefficient sitting on the h-th lag, capturing the effect of today’s change in X on Y h periods later. Adding all of them together gives the long-run cumulative dynamic multiplier, the total effect of a permanent one-unit increase accumulated across every period it touches.

$$\theta = \sum_{j=1}^{r+1} \beta_j$$

Getting a correct standard error for this sum directly is awkward, since it is a function of several estimated coefficients. There is a clean trick. Rewrite the model in terms of the changes in X rather than its levels.

$$Y_t = \beta_0 + \delta_1 \Delta X_t + \delta_2 \Delta X_{t-1} + \cdots + \delta_{r} \Delta X_{t-r+1} + \theta X_{t-r} + u_t$$

After the algebra settles, the coefficient on the final level term in this reparametrised regression is exactly the cumulative multiplier. Estimate this equation by least squares and the cumulative effect appears as a single coefficient, with its correct standard error read straight off the output rather than assembled by hand.
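The algebra can be checked numerically. The plain-Python sketch below uses invented coefficients and a simulated X, and the function names are hypothetical. It assumes that each coefficient on a change in X is a running sum of the level coefficients (the first is the first slope, the second is the sum of the first two slopes, and so on), with the coefficient on the final level term equal to the full sum. Under that mapping the levels form and the changes form should produce the same systematic part of Y in every period.

```python
import random
from itertools import accumulate


def levels_form(beta0, betas, x, t):
    """beta0 + beta_1 X_t + beta_2 X_{t-1} + ... + beta_{r+1} X_{t-r}."""
    return beta0 + sum(b * x[t - j] for j, b in enumerate(betas))


def changes_form(beta0, betas, x, t):
    """Same model written in changes of X plus one final level term."""
    running = list(accumulate(betas))  # delta_1, ..., delta_r, theta
    r = len(betas) - 1
    total = beta0
    for j in range(r):
        total += running[j] * (x[t - j] - x[t - j - 1])  # delta_{j+1} * dX_{t-j}
    return total + running[-1] * x[t - r]  # theta * X_{t-r}


random.seed(0)
x = [random.gauss(0, 1) for _ in range(50)]  # simulated regressor

beta0 = 1.0
betas = [0.5, 0.3, 0.15, 0.05]  # hypothetical beta_1..beta_4, so r = 3
r = len(betas) - 1

# Largest disagreement between the two forms over all usable periods.
max_gap = max(
    abs(levels_form(beta0, betas, x, t) - changes_form(beta0, betas, x, t))
    for t in range(r, len(x))
)
theta = sum(betas)  # long-run cumulative multiplier
print(max_gap, theta)
```

Because the two forms are algebraically identical, least squares on either one fits the data equally well. The changes form simply places the cumulative multiplier in the output as a coefficient of its own.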

Inference with HAC standard errors

To see why serial correlation matters for inference, define the product of the demeaned regressor and the error as a single quantity. When the error is serially correlated, that product inherits the correlation, and the large-sample variance of the slope estimator picks up an extra factor relative to the textbook formula.

$$\text{Var}(\hat{\beta}_1) \approx \frac{1}{T} \cdot \frac{\sigma_u^2}{\sigma_X^2} \cdot f_T$$

When the product term behaves like independent noise the correction factor f_T equals one and the expression collapses back to the ordinary formula. When the product is serially correlated the factor departs from one and encodes the autocorrelation structure, which is precisely the piece the ordinary standard error ignores.

The standard fix is the heteroskedasticity and autocorrelation consistent, or HAC, standard error, with the Newey-West estimator being the most common version. It estimates the correction factor from weighted sample autocovariances of the product term, summing them up to a chosen number of lags. That truncation point grows slowly with the sample size by a simple rule of thumb.

$$m = 0.75 \times T^{1/3}$$

The resulting HAC standard error then replaces the ordinary one in every test statistic and confidence interval, and it is consistent under the assumptions above, including the eight-moments condition that justified the strengthened outlier requirement.
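A rough plain-Python sketch of the two ingredients follows, for the single-regressor case only. The weighting scheme is an assumption: it uses one common textbook convention in which the sample autocorrelation at lag j gets weight (m - j)/m for j = 1, ..., m - 1, and statistical packages may define the lag count and weights slightly differently. The data are simulated and every function name is hypothetical.

```python
import random


def truncation_lag(T):
    """Rule-of-thumb truncation parameter, rounded to the nearest integer."""
    return round(0.75 * T ** (1 / 3))


def ols_residuals(y, x):
    """Residuals from a least squares fit of y on a constant and x."""
    T = len(y)
    x_bar, y_bar = sum(x) / T, sum(y) / T
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    return [yi - intercept - slope * xi for xi, yi in zip(x, y)]


def hac_correction_factor(x, resid, m):
    """Estimate f_T from weighted autocorrelations of v_t = (X_t - mean) * resid_t."""
    T = len(x)
    x_bar = sum(x) / T
    v = [(xi - x_bar) * ui for xi, ui in zip(x, resid)]
    denom = sum(vt * vt for vt in v)
    f = 1.0
    for j in range(1, m):
        rho_j = sum(v[t] * v[t - j] for t in range(j, T)) / denom
        f += 2 * (m - j) / m * rho_j  # assumed weight: (m - j) / m
    return f


def ar1(T, phi, rng):
    """Simulate a zero-mean AR(1) series with standard normal innovations."""
    series, prev = [], 0.0
    for _ in range(T):
        prev = phi * prev + rng.gauss(0, 1)
        series.append(prev)
    return series


rng = random.Random(42)
T = 2000
x = ar1(T, 0.7, rng)  # persistent regressor
u = ar1(T, 0.7, rng)  # persistent error, independent of x
y = [1.0 + 2.0 * xi + ui for xi, ui in zip(x, u)]

resid = ols_residuals(y, x)
m = truncation_lag(T)
f_hat = hac_correction_factor(x, resid, m)
print(m, f_hat)
```

With both the regressor and the error persistent, the estimated factor lands well above one, so the HAC standard error, which scales with the square root of the factor, is larger than the uncorrected one. Setting m to 1 switches the correction off and returns a factor of exactly one.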

Estimating under strict exogeneity

If X is strictly exogenous, and if the error follows a first-order autoregressive process, we can do better than simply patching the standard errors, because we can model the serial correlation away entirely.

$$u_t = \rho u_{t-1} + \tilde{u}_t$$

Here the new innovation is serially uncorrelated. Two equivalent routes exploit this structure.

The first transforms the distributed lag model into an autoregressive distributed lag model by multiplying through with the lag polynomial that matches the error’s persistence. The transformed equation includes a lag of Y among its regressors and, by construction, carries an error that is now white noise.

$$Y_t = \beta_0^{*} + \rho Y_{t-1} + \beta_1^{*} X_t + \beta_2^{*} X_{t-1} + \cdots + \tilde{u}_t$$

Because the transformed error is serially uncorrelated, ordinary least squares is valid here and HAC standard errors are not needed. The original dynamic multipliers can be recovered afterwards from the autoregressive distributed lag coefficients by recursive substitution.
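The recursive substitution can be sketched in a few lines of plain Python. The coefficient values and function names are hypothetical. The sketch assumes the transformed equation comes from subtracting the autoregressive coefficient times the lagged model from the original model, so that each transformed coefficient on X equals the original coefficient minus that autoregressive coefficient times the one before it. Reversing the mapping then recovers the multipliers one horizon at a time.

```python
def dl_to_adl(betas, rho):
    """Coefficients on X_t, X_{t-1}, ... after subtracting rho * (lagged model).

    betas holds beta_1..beta_{r+1}; the result has one extra entry because
    the transformed equation reaches one lag further back in X.
    """
    padded = [0.0] + list(betas) + [0.0]
    return [padded[j + 1] - rho * padded[j] for j in range(len(betas) + 1)]


def adl_to_multipliers(adl_coefs, rho, horizons):
    """Recover dynamic multipliers by recursive substitution.

    multiplier_h = adl_coef_h + rho * multiplier_{h-1}, where coefficients
    beyond the last included lag are taken to be zero.
    """
    multipliers, prev = [], 0.0
    for h in range(horizons):
        coef = adl_coefs[h] if h < len(adl_coefs) else 0.0
        prev = coef + rho * prev
        multipliers.append(prev)
    return multipliers


betas = [0.5, 0.3, 0.15, 0.05]  # hypothetical multipliers at horizons 0..3
rho = 0.6                       # hypothetical error autocorrelation

adl_coefs = dl_to_adl(betas, rho)
recovered = adl_to_multipliers(adl_coefs, rho, horizons=6)
print(adl_coefs)
print(recovered)
```

In this exact setting the recovered multipliers match the originals and are zero beyond the last lag. With estimated coefficients the restrictions hold only approximately, so recovered multipliers at long horizons would typically decay towards zero rather than cut off sharply.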

The second route is generalised least squares, implemented as the Cochrane-Orcutt procedure, which works with quasi-differences of the variables.

$$\tilde{Y}_t = Y_t - \rho Y_{t-1}, \qquad \tilde{X}_t = X_t - \rho X_{t-1}$$

In quasi-differenced form the model again has a serially uncorrelated error.

$$\tilde{Y}_t = \beta_0 (1 - \rho) + \beta_1 \tilde{X}_t + \beta_2 \tilde{X}_{t-1} + \cdots + \tilde{u}_t$$

Since the autocorrelation parameter is unknown, the feasible version estimates it in two passes. First, fit the original model by least squares and keep the residuals. Second, regress those residuals on their own lag to get an estimate of the autocorrelation. Form the quasi-differences using that estimate and run least squares on the quasi-differenced equation to obtain the feasible generalised least squares results. The serial correlation lurking in the first-pass residuals does not spoil this, because consistency of those first-pass estimates was never threatened by serial correlation in the first place. With the true autocorrelation known, generalised least squares would be the best linear unbiased estimator under strict exogeneity. The feasible version approaches that ideal in large samples, and either way generalised least squares is more efficient than ordinary least squares, delivering smaller standard errors.
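The feasible procedure can be sketched in plain Python for the simplest case, a model with only the current value of X and no lags. Everything here is illustrative: the data are simulated with known parameters, the function names are hypothetical, and the sketch does a single pass rather than iterating until the autocorrelation estimate settles, as some implementations do.

```python
import random


def ols(y, x):
    """Intercept and slope from least squares of y on a constant and x."""
    T = len(y)
    x_bar, y_bar = sum(x) / T, sum(y) / T
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - slope * x_bar, slope


def cochrane_orcutt(y, x):
    """One-pass feasible GLS for Y_t = b0 + b1 X_t + u_t with AR(1) errors."""
    T = len(y)
    # Step 1: least squares on the original model, keep the residuals.
    a, b = ols(y, x)
    resid = [yi - a - b * xi for xi, yi in zip(x, y)]
    # Step 2: regress the residuals on their own lag (no constant).
    num = sum(resid[t] * resid[t - 1] for t in range(1, T))
    den = sum(resid[t - 1] ** 2 for t in range(1, T))
    rho_hat = num / den
    # Step 3: quasi-difference both variables and rerun least squares.
    y_q = [y[t] - rho_hat * y[t - 1] for t in range(1, T)]
    x_q = [x[t] - rho_hat * x[t - 1] for t in range(1, T)]
    a_q, b_q = ols(y_q, x_q)
    # The quasi-differenced intercept estimates b0 * (1 - rho).
    return rho_hat, a_q / (1 - rho_hat), b_q


# Simulated data: X is i.i.d. and independent of the error at all leads
# and lags, so it is strictly exogenous by construction.
rng = random.Random(7)
T, rho, beta0, beta1 = 2000, 0.6, 1.0, 2.0
x = [rng.gauss(0, 1) for _ in range(T)]
u, prev = [], 0.0
for _ in range(T):
    prev = rho * prev + rng.gauss(0, 1)
    u.append(prev)
y = [beta0 + beta1 * xi + ui for xi, ui in zip(x, u)]

rho_hat, b0_hat, b1_hat = cochrane_orcutt(y, x)
print(rho_hat, b0_hat, b1_hat)
```

The quasi-differenced regression drops the first observation, and its intercept has to be divided by one minus the estimated autocorrelation to get back to the original intercept.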

An application: predicting stock returns

A well-known application puts these tools to work forecasting excess returns on the stock market using a single predictor known as CAY, a variable built by Lettau and Ludvigson from the long-run relationship between consumption, labour income, and asset wealth. The exercise constructs direct multi-period forecasts and leans on HAC standard errors throughout, because forecasting several periods ahead with overlapping windows mechanically introduces serial correlation into the forecast errors.

The data work is mostly preparation. Monthly market returns are compounded up to quarterly frequency by summing log returns within each quarter and converting back. The data are then collapsed to one observation per quarter, merged with the CAY series, and declared as a time series so that lead and lag operators become available. The forecasting itself is then a handful of lines.

* One-quarter-ahead forecast, heteroskedasticity-robust SEs (run to read off R-squared)
regress f.qret cay, robust
* Same forecast with Newey-West (HAC) SEs, truncation lag 4
newey f.qret cay, lag(4)
* Two-quarter-ahead direct forecast with the same HAC correction
newey f2.qret cay, lag(4)

The lead operator on the dependent variable is what makes these direct forecasts: regressing next quarter’s return on today’s CAY estimates a one-step-ahead relationship, and using the two-period lead estimates a two-step-ahead one. The truncation lag of four follows the rule of thumb for a sample of around 159 quarters.
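The quarterly compounding step from the data preparation can also be illustrated on its own. This is a plain-Python sketch with made-up monthly returns, not the original data or the original code, and the function name is hypothetical. It sums log returns within each block of three months and converts the total back to a simple return.

```python
import math


def compound_quarterly(monthly_returns):
    """Compound simple monthly returns into simple quarterly returns.

    Each block of three months is combined by summing log(1 + r) and
    converting back with exp(.) - 1, which equals the product of the
    three gross returns minus one.
    """
    if len(monthly_returns) % 3 != 0:
        raise ValueError("expected a whole number of quarters")
    quarterly = []
    for start in range(0, len(monthly_returns), 3):
        block = monthly_returns[start:start + 3]
        log_sum = sum(math.log1p(r) for r in block)
        quarterly.append(math.expm1(log_sum))
    return quarterly


# Made-up monthly returns covering two quarters.
monthly = [0.01, 0.02, -0.01, 0.00, 0.03, 0.01]
quarterly = compound_quarterly(monthly)
print(quarterly)
```

Working in logs turns compounding into addition, which is why the aggregation reduces to a within-quarter sum.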

Three points stand out in the results. CAY shows real predictive content, explaining roughly eight percent of the variation in quarterly excess returns, which is notable given how hard returns are to forecast. The HAC standard errors come out larger than the ordinary ones, reflecting the serial correlation that the multi-period forecast structure injects, so honest inference is more conservative than the naive output would suggest. And the fit deteriorates as the horizon lengthens, with the two-quarter-ahead forecast explaining less than the one-quarter-ahead version, a plain reminder that the further out you reach, the harder prediction becomes. One practical note: the Newey-West command does not report an R-squared, so the ordinary regression is used to read the fit while the HAC command is used for trustworthy standard errors.

Conclusion

Estimating a dynamic causal effect comes down to three decisions made in sequence. Specify a distributed lag model rich enough to capture how the effect spreads forward in time. Decide which form of exogeneity is credible, since that governs whether ordinary least squares with HAC standard errors is enough or whether the stronger machinery of the autoregressive distributed lag and generalised least squares is available. Then handle the serial correlation that almost always infects the errors, either by correcting the standard errors after the fact or by transforming it out of the model. Handle those three steps with care and a single noisy time series, with no control group in sight, can still yield a defensible statement about cause and effect.

See you soon.
