Unit Roots¶

by Professor Throckmorton
for Time Series Econometrics
W&M ECON 408
Slides

Summary¶

  • The presence of unit roots makes a time series non-stationary.
  • Trend Stationarity: A trend-stationary time series has a deterministic time trend and is stationary after the trend is removed.
  • Difference Stationarity: A difference-stationary time series has a stochastic time trend and becomes stationary after differencing.
  • Unit root tests, e.g., the augmented Dickey-Fuller (ADF) test, are used to detect the presence of a unit root in a time series.
  • Unit root tests help determine if differencing is required to achieve stationarity, suggesting a stochastic time trend.
  • If an AR model has a unit root, it is called a unit root process, e.g., the random walk model.

AR($p$) Polynomial¶

  • Consider the AR($p$) model: $y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \varepsilon_t$

  • Whether an AR($p$) produces a stationary time series depends on the parameters $\phi_j$, e.g., an AR($1$) is stationary if $|\phi_1|<1$

  • This can be written in terms of an AR polynomial with the lag operator, $L$, defined as $L^jy_t \equiv y_{t-j}$

    $$ \varepsilon_t = y_t - \phi_1 y_{t-1} - \phi_2 y_{t-2} - \dots - \phi_p y_{t-p} = (1 - \phi_1 L - \phi_2 L^2 - \dots - \phi_p L^p)y_t $$

  • Define the AR polynomial as $\Phi(L) \equiv 1 - \phi_1 L - \phi_2 L^2 - \dots - \phi_p L^p$

Unit Roots of AR($p$)¶

  • To find the roots of an AR($p$), set $\Phi(r) = 0$ and solve for $r$ (the roots may be complex); the number of roots equals the order of the polynomial.
  • An AR($p$) has a unit root if the modulus of any of the roots equals $1$.
  • For each unit root, the data must be differenced once to make it stationary, e.g., if there are $2$ unit roots then the data must be differenced twice.
  • If the moduli of all the roots exceed $1$, then the AR($p$) is stationary.
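This root-finding recipe is easy to sketch numerically. The example below checks stationarity for a hypothetical AR($2$) with $\phi_1 = 0.5$ and $\phi_2 = 0.3$ (coefficients chosen for illustration); `np.roots` expects coefficients ordered from the highest power down.

```python
import numpy as np

# Hypothetical AR(2) coefficients: Phi(r) = 1 - 0.5 r - 0.3 r^2
phi = np.array([0.5, 0.3])

# np.roots takes coefficients from the highest power down:
# [-phi_p, ..., -phi_1, 1]
coefs = np.r_[-phi[::-1], 1.0]
roots = np.roots(coefs)

print(np.abs(roots))               # moduli of the roots
print((np.abs(roots) > 1).all())   # True -> stationary
```

Since every root here has modulus greater than $1$, this AR($2$) is stationary.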

Examples¶

  • AR($1$): $y_t = \phi_1 y_{t-1} + \varepsilon_t$
    • $\Phi(r) = 0 \rightarrow 1-\phi_1 r = 0 \rightarrow r = 1/\phi_1$
    • Thus, an AR($1$) has a unit root, $r = 1$, if $\phi_1 = 1$.
    • If $\phi_1 = 1$, then $y_t = y_{t-1} + \varepsilon_t$ is a random walk model that produces non-stationary time series.
    • But the first difference $\Delta y_t = y_t - y_{t-1} = \varepsilon_t$ is stationary (since $\varepsilon_t$ is white noise).
    • Note if $|\phi_1| < 1$, then $|r|>1$ and the AR($1$) is stationary.
  • AR($2$): $y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \varepsilon_t$ or $\varepsilon_t = (1 - \phi_1 L - \phi_2 L^2) y_t$
    • The AR($2$) polynomial has a factorization such that $1 - \phi_1 L - \phi_2 L^2 = (1-\lambda_1 L)(1-\lambda_2 L)$ with roots $r_1 = 1/\lambda_1$ and $r_2 = 1/\lambda_2$
    • Thus, an AR($2$) model is stationary if $|\lambda_1| < 1$ and $|\lambda_2| < 1$.
    • Given the mapping $\phi_1 = \lambda_1 + \lambda_2$ and $\phi_2 = -\lambda_1 \lambda_2$, we can prove that the moduli of both roots exceed $1$ if
      • $|\phi_2| < 1$
      • $\phi_2 + \phi_1 < 1$
      • $\phi_2 - \phi_1 < 1$
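As a quick numerical check of these conditions, take the hypothetical factorization $\lambda_1 = 0.7$, $\lambda_2 = 0.8$ (so $\phi_1 = 1.5$ and $\phi_2 = -0.56$): the three inequalities hold, and both roots of the AR polynomial indeed have modulus greater than $1$.

```python
import numpy as np

# Hypothetical AR(2): lambda_1 = 0.7, lambda_2 = 0.8
phi1, phi2 = 1.5, -0.56   # phi1 = 0.7 + 0.8, phi2 = -0.7 * 0.8

# The three stationarity conditions
conditions = (abs(phi2) < 1) and (phi2 + phi1 < 1) and (phi2 - phi1 < 1)

# Roots of Phi(r) = 1 - phi1 r - phi2 r^2 (coefficients highest power first)
roots = np.roots([-phi2, -phi1, 1.0])

print(conditions)        # True
print(np.abs(roots))     # both moduli exceed 1 (1/0.7 and 1/0.8)
```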

Unit Root Test¶

  • The augmented Dickey-Fuller test (ADF test) has hypotheses

    • $h_0$: The time series has a unit root, indicating it is non-stationary.
    • $h_A$: The time series does not have a unit root, suggesting it is stationary.
  • Test Procedure

    • Write an AR($p$) model with the AR polynomial, $\Phi(L)y_t = \varepsilon_t$.
    • If the AR polynomial has a unit root, it can be written as $\Phi(L) = \varphi(L)(1 - L)$ where $\varphi(L) = 1 - \varphi_1 L - \cdots - \varphi_{p-1} L^{p-1}$.
    • Define $\Delta y_t = (1 - L)y_t = y_t - y_{t-1}$.
    • An AR($p$) model with a unit root is then $\Phi(L)y_t = \varphi(L)(1 - L)y_t = \varphi(L)\Delta y_t = \varepsilon_t$.
    • Thus, testing $\Phi(L)$ for a unit root is equivalent to the test $h_0: \rho = 0$ vs. $h_A: \rho < 0$ in

    $$ \Delta y_t = \rho y_{t-1} + \varphi_1 \Delta y_{t-1} + \cdots + \varphi_{p-1} \Delta y_{t-p+1} + \varepsilon_t $$

  • Intuition: A stationary process reverts to its mean, so $\Delta y_t$ should be negatively related to $y_{t-1}$.

adfuller¶

  • In statsmodels, the function adfuller() conducts an ADF unit root test.
  • If the p-value is smaller than $0.05$, then the series is probably stationary.
In [ ]:
# ADF Test
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Function to organize ADF test results in a one-column DataFrame
def adf_test(data):
    keys = ['Test Statistic','p-value','# of Lags','# of Obs']
    values = adfuller(data)
    test = pd.DataFrame.from_dict(dict(zip(keys,values[0:4])),
                                  orient='index',columns=[data.name])
    return test

ADF test results¶

In [ ]:
test = adf_test(df['d1loggdp'].dropna())
#print(test.to_markdown())
|                |  d1loggdp |
|:---------------|----------:|
| Test Statistic |   -3.1799 |
| p-value        | 0.0211781 |
| # of Lags      |         4 |
| # of Obs       |       118 |
  • The p-value is less than $0.05$, so we reject the null hypothesis that the process is non-stationary.
  • The number of lags is chosen by minimizing the Akaike Information Criterion (AIC) (more on that soon).
In [ ]:
dl = []
for column in df.columns:
    test = adf_test(df[column].dropna())
    dl.append(test)
results = pd.concat(dl, axis=1)
#print(results.to_markdown())
|                |   gdp |   loggdp |  d1loggdp |    d2loggdp |
|:---------------|------:|---------:|----------:|------------:|
| Test Statistic | 4.294 | -1.27023 |   -3.1799 |    -6.54055 |
| p-value        |     1 | 0.642682 | 0.0211781 | 9.37063e-09 |
| # of Lags      |     7 |        8 |         4 |           3 |
| # of Obs       |   119 |      118 |       118 |         118 |
  • China GDP has a unit root after taking log.
  • China GDP is probably stationary after removing seasonality and converting to growth rate (p-value < $0.05$).
  • Second difference of China GDP is almost certainly stationary.

U.S. Unemployment Rate¶

In [ ]:
# Read Data
import pandas as pd
import matplotlib.pyplot as plt
from fredapi import Fred

fred_api_key = pd.read_csv('fred_api_key.txt', header=None)
fred = Fred(api_key=fred_api_key.iloc[0,0])
data = fred.get_series('UNRATE').to_frame(name='UR')
# Plot
fig, ax = plt.subplots(figsize=(6.5,2));
ax.plot(data.UR); ax.set_title('U.S. Unemployment Rate');
ax.yaxis.set_major_formatter('{x:,.1f}%')
ax.grid(); ax.autoscale(tight=True)
[Figure: U.S. Unemployment Rate]
In [ ]:
adf_UR = adf_test(data.UR)
print(adf_UR)
                        UR
Test Statistic   -3.918852
p-value           0.001900
# of Lags         1.000000
# of Obs        928.000000
  • ADF test suggests that U.S. unemployment rate is stationary (p-value < 0.05).
  • This doesn't necessarily contradict the ACF, which suggested the UR might be non-stationary, i.e., we couldn't tell with certainty.
  • This result may not be surprising since modeling the UR as AR($1$) led to an estimate of the AC parameter that was less than $1$, i.e., it was not close to a random walk.

Order Determination¶

  • There are a lot of possible choices for the orders of an ARIMA($p,d,q$) model.
  • Even if the data is not seasonal and already stationary, there are plenty of possibilities for an ARMA($p,q$) model.
  • In the ADF tests above, adfuller selected the lag order $p$ of the AR model for the first-differenced time series used to test for a unit root.
  • How do we select the best orders for an ARMA($p,q$) model?

Information Criterion¶

  • Information criteria are used for order determination (i.e., model selection) of ARMA($p,q$) models.
  • In adfuller, the default method to select lags is the Akaike Information Criterion (AIC). Another choice is the Bayesian Information Criterion (BIC) (a.k.a. Schwarz Information Criterion (SIC)).
  • The AIC and BIC are commonly used in economics research using time series data.
  • But there are other choices as well, e.g., the Hannan-Quinn Information Criterion (HQIC).
  • The goal of AIC, BIC, and HQIC is to estimate the out-of-sample mean square prediction error, so a smaller value indicates a better model.

AIC vs. BIC¶

  • Let $\mathcal{L}$ be the maximized log-likelihood of the model, $K$ be the number of estimated parameters, and $n$ be the sample size of the data.
    • $AIC = -2 \mathcal{L} + 2 K$
    • $BIC = -2 \mathcal{L} + \log(n) K$
  • The key difference is that
    • AIC has a tendency to overestimate the model order.
    • BIC places a bigger penalty on model complexity, hence the model it selects is more parsimonious than the AIC's choice.
  • Consistency: as the sample size increases, does the information criterion select the true model?
    • AIC is not consistent, but it is asymptotically efficient, which is relevant for forecasting (more on that much later).
    • BIC is consistent.
  • When AIC and BIC suggest different choices, most people will choose the parsimonious model using the BIC.

Likelihood Function¶

  • Likelihood function: Probability of the observed time series data given a model and its parameterization.
  • Maximum Likelihood Estimation: For a given model, find the parameterization that maximizes the likelihood function, i.e., such that the observed time series has a higher probability than with any other parameterization.
  • The likelihood function is the joint probability distribution of all observations.
  • In ARMA/ARIMA models, the likelihood function is just a product of Gaussian/normal distributions since that is the distribution of the errors/innovations.
  • Order determination/model selection:
    • Given some time series data and a menu of models, which model has the highest likelihood and is relatively parsimonious?
    • e.g., for each AR model we can compute the ordinary least squares estimates and then the value of the likelihood function for that parameterization.
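The last bullet can be sketched for an AR($1$) without an intercept: estimate $\phi_1$ by OLS, then plug the residuals into the Gaussian log-likelihood formula. The simulated data and true value $\phi_1 = 0.7$ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)          # illustrative seed
y = np.zeros(300)
for t in range(1, 300):
    y[t] = 0.7 * y[t-1] + rng.standard_normal()   # true phi_1 = 0.7

# OLS estimate of phi_1: regress y_t on y_{t-1} (no intercept)
x, z = y[:-1], y[1:]
phi_hat = (x @ z) / (x @ x)
resid = z - phi_hat * x
sigma2_hat = resid @ resid / len(resid)

# Gaussian log-likelihood evaluated at the OLS estimates
loglik = (-0.5 * len(resid) * np.log(2 * np.pi * sigma2_hat)
          - (resid @ resid) / (2 * sigma2_hat))
print(phi_hat, loglik)
```

Repeating this for each candidate AR order would give the likelihood values that feed into the AIC/BIC comparison.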

Example¶

  • Assume an MA(2) model: $y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2}$, where $\varepsilon_t \sim \mathcal{N}(0, \sigma^2)$ are i.i.d. normal errors. Note: by construction, $y_t$ is uncorrelated with $y_{t-s}$ for all $s \geq 3$

  • The likelihood function is the joint probability distribution of the observations $y_1, y_2, \dots, y_T$, given parameters of the model $\Theta = \{\mu, \theta_1, \theta_2, \sigma^2\}$

    $$ \mathcal{L}(\Theta| y_1, \dots, y_T) = \prod_{t=1}^{T} f(y_t | y_{t-1}, y_{t-2}; \Theta) $$ where $f(y_t | y_{t-1}, y_{t-2}; \Theta)$ is the conditional normal density

    $$ f(y_t | y_{t-1}, y_{t-2};\Theta) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{\varepsilon_t^2}{2\sigma^2} \right) $$

  • Combining the previous expressions and substituting out $\varepsilon_t$ for $y_t - \mu - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2}$ yields

    $$ \mathcal{L}(\Theta) = \prod_{t=1}^{T} \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(y_t - \mu - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2})^2}{2\sigma^2} \right) $$

  • The log-likelihood function (i.e., taking natural log converts a product to a sum) is

    $$ \log \mathcal{L}(\Theta) = -\frac{T}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{t=1}^{T} (y_t - \mu - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2})^2 $$

  • To estimate $\mu, \theta_1, \theta_2, \sigma^2$, maximize this log-likelihood function, i.e., $\hat{\Theta} = \textrm{arg max}_\Theta \log \mathcal{L}(\Theta)$.

  • Because $\log \mathcal{L}(\Theta)$ is nonlinear in the parameters, numerical methods such as nonlinear optimization or the conditional likelihood approach are used in practice.