Summary¶
- The presence of unit roots makes a time series non-stationary.
- Trend Stationarity: A trend-stationary time series has a deterministic time trend and is stationary after the trend is removed.
- Difference Stationarity: A difference-stationary time series has a stochastic time trend and becomes stationary after differencing.
- Unit root tests, e.g., the augmented Dickey-Fuller test (ADF test), are used to detect the presence of a unit root in a time series.
- A unit root suggests a stochastic time trend, so unit root tests help determine whether differencing is required to achieve stationarity.
- If an AR model has a unit root, it is called a unit root process, e.g., the random walk model.
AR($p$) Polynomial¶
Consider the AR($p$) model: $y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \varepsilon_t$
Whether an AR($p$) produces stationary time series depends on the parameters $\phi_j$, e.g., for AR($1$), $|\phi_1|<1$
This can be written in terms of an AR polynomial with the lag operator, $L$, defined as $L^jy_t \equiv y_{t-j}$
$$ \varepsilon_t = y_t - \phi_1 y_{t-1} - \phi_2 y_{t-2} - \dots - \phi_p y_{t-p} = (1 - \phi_1 L - \phi_2 L^2 - \dots - \phi_p L^p)y_t $$
Define the AR polynomial as $\Phi(L) \equiv 1 - \phi_1 L - \phi_2 L^2 - \dots - \phi_p L^p$
Unit Roots of AR($p$)¶
- To find the roots of an AR($p$), set $\Phi(r) = 0$ and solve for $r$ (the roots may be complex numbers); the number of roots equals the order of the polynomial.
- An AR($p$) has a unit root if the modulus of any of the roots equals $1$.
- For each unit root, the data must be differenced once to make it stationary, e.g., if there are $2$ unit roots then the data must be differenced twice.
- If the modulus of all the roots exceeds $1$, then the AR($p$) is stationary.
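For a numerical check (a sketch assuming `numpy`; the helper `ar_roots` is hypothetical, not from this lecture), the roots of $\Phi(r)$ can be computed directly:

# Sketch: roots of the AR polynomial Phi(r) = 1 - phi_1 r - ... - phi_p r^p
import numpy as np

def ar_roots(phi):
    # polyroots expects coefficients ordered from the constant up to the highest power
    return np.polynomial.polynomial.polyroots(np.concatenate(([1.0], -np.asarray(phi))))

print(np.abs(ar_roots([0.5])))       # [2.] -> |r| > 1, stationary AR(1)
print(np.abs(ar_roots([1.0])))       # [1.] -> unit root (random walk)
print(np.abs(ar_roots([0.5, 0.3])))  # moduli ~1.17 and ~2.84 -> stationary AR(2)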
Examples¶
- AR($1$): $y_t = \phi_1 y_{t-1} + \varepsilon_t$
- $\Phi(r) = 0 \rightarrow 1-\phi_1 r = 0 \rightarrow r = 1/\phi_1$
- Thus, an AR($1$) has a unit root, $r = 1$, if $\phi_1 = 1$.
- If $\phi_1 = 1$, then $y_t = y_{t-1} + \varepsilon_t$ is a random walk model that produces non-stationary time series.
- But the first difference $\Delta y_t = y_t - y_{t-1} = \varepsilon_t$ is stationary (since $\varepsilon_t$ is white noise).
- Note if $|\phi_1| < 1$, then $|r|>1$ and the AR($1$) is stationary.
- AR($2$): $y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \varepsilon_t$ or $\varepsilon_t = (1 - \phi_1 L - \phi_2 L^2) y_t$
- The AR($2$) polynomial has a factorization such that $1 - \phi_1 L - \phi_2 L^2 = (1-\lambda_1 L)(1-\lambda_2 L)$ with roots $r_1 = 1/\lambda_1$ and $r_2 = 1/\lambda_2$
- Thus, an AR($2$) model is stationary if $|\lambda_1| < 1$ and $|\lambda_2| < 1$.
- Given the mapping $\phi_1 = \lambda_1 + \lambda_2$ and $\phi_2 = -\lambda_1 \lambda_2$, we can prove that the moduli of both roots exceed $1$ if
- $|\phi_2| < 1$
- $\phi_2 + \phi_1 < 1$
- $\phi_2 - \phi_1 < 1$
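As a quick numerical check of these conditions (a sketch assuming `numpy`; `ar2_stationary` is a hypothetical helper):

# Sketch: the three AR(2) conditions agree with checking the root moduli directly
import numpy as np

def ar2_stationary(phi1, phi2):
    moduli = np.abs(np.polynomial.polynomial.polyroots([1.0, -phi1, -phi2]))
    conditions = (abs(phi2) < 1) and (phi2 + phi1 < 1) and (phi2 - phi1 < 1)
    return moduli, conditions

print(ar2_stationary(0.5, 0.3))   # moduli ~1.17 and ~2.84; conditions True -> stationary
print(ar2_stationary(1.5, -0.7))  # complex roots, both moduli ~1.20; True -> stationary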
Unit Root Test¶
The augmented Dickey-Fuller test (ADF test) has hypotheses
- $h_0$: The time series has a unit root, indicating it is non-stationary.
- $h_A$: The time series does not have a unit root, suggesting it is stationary.
Test Procedure
- Write an AR($p$) model with the AR polynomial, $\Phi(L)y_t = \varepsilon_t$.
- If the AR polynomial has a unit root, it can be written as $\Phi(L) = \varphi(L)(1 - L)$ where $\varphi(L) = 1 - \varphi_1 L - \cdots - \varphi_{p-1} L^{p-1}$.
- Define $\Delta y_t = (1 - L)y_t = y_t - y_{t-1}$.
- An AR($p$) model with a unit root is then $\Phi(L)y_t = \varphi(L)(1 - L)y_t = \varphi(L)\Delta y_t = \varepsilon_t$.
- Thus, testing $\Phi(L)$ for a unit root is equivalent to the test $h_0: \rho = 0$ vs. $h_A: \rho < 0$ (where $\rho = -\Phi(1)$, which is zero exactly when $r = 1$ is a root of $\Phi$) in
$$ \Delta y_t = \rho y_{t-1} + \varphi_1 \Delta y_{t-1} + \cdots + \varphi_{p-1} \Delta y_{t-p+1} + \varepsilon_t $$
Intuition: A stationary process reverts to its mean, so $\Delta y_t$ should be negatively related to $y_{t-1}$.
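Before applying the test to real data, a simulation illustrates both outcomes (a sketch assuming `numpy` and `statsmodels`):

# Sketch: ADF test on simulated series with and without a unit root
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
eps = rng.standard_normal(500)
random_walk = np.cumsum(eps)          # y_t = y_{t-1} + eps_t (unit root)
ar1 = np.zeros(500)
for t in range(1, 500):
    ar1[t] = 0.5*ar1[t-1] + eps[t]    # y_t = 0.5 y_{t-1} + eps_t (stationary)

# The second element of the returned tuple is the p-value
print(adfuller(random_walk)[1])  # typically large -> cannot reject h_0 (unit root)
print(adfuller(ar1)[1])          # typically tiny -> reject h_0 (stationary)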
adfuller¶
- In `statsmodels`, the function `adfuller()` conducts an ADF unit root test.
- If the p-value is smaller than $0.05$, then the series is probably stationary.
# ADF Test
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Function to organize ADF test results in a one-column DataFrame
# (the column is named after the input Series)
def adf_test(data):
    keys = ['Test Statistic','p-value','# of Lags','# of Obs']
    values = adfuller(data)   # returns (stat, p-value, lags, nobs, ...)
    test = pd.DataFrame.from_dict(dict(zip(keys, values[0:4])),
                                  orient='index', columns=[data.name])
    return test
ADF test results¶
test = adf_test(df['d1loggdp'].dropna())
#print(test.to_markdown())
| | d1loggdp |
|---|---|
| Test Statistic | -3.1799 |
| p-value | 0.0211781 |
| # of Lags | 4 |
| # of Obs | 118 |
- Since the p-value < $0.05$, we reject the null hypothesis that the process is non-stationary.
- The number of lags is chosen by minimizing the Akaike Information Criterion (AIC) (more on that soon).
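The selection criterion can be changed with the `autolag` argument of `adfuller`, e.g., to select the number of lags by BIC instead:

# Same test, but choose the number of lags by BIC rather than the default AIC
test_bic = adfuller(df['d1loggdp'].dropna(), autolag='BIC')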
# Run the ADF test on every column of df and collect the results
dl = []
for column in df.columns:
    test = adf_test(df[column].dropna())
    dl.append(test)
results = pd.concat(dl, axis=1)
#print(results.to_markdown())
| | gdp | loggdp | d1loggdp | d2loggdp |
|---|---|---|---|---|
| Test Statistic | 4.294 | -1.27023 | -3.1799 | -6.54055 |
| p-value | 1 | 0.642682 | 0.0211781 | 9.37063e-09 |
| # of Lags | 7 | 8 | 4 | 3 |
| # of Obs | 119 | 118 | 118 | 118 |
- China GDP still has a unit root after taking the log.
- China GDP is probably stationary after removing seasonality and converting to a growth rate (p-value < $0.05$).
- Second difference of China GDP is almost certainly stationary.
U.S. Unemployment Rate¶
# Read Data
import pandas as pd
import matplotlib.pyplot as plt
from fredapi import Fred

fred_api_key = pd.read_csv('fred_api_key.txt', header=None)
fred = Fred(api_key=fred_api_key.iloc[0,0])
data = fred.get_series('UNRATE').to_frame(name='UR')

# Plot the unemployment rate
fig, ax = plt.subplots(figsize=(6.5,2));
ax.plot(data.UR); ax.set_title('U.S. Unemployment Rate');
ax.yaxis.set_major_formatter('{x:,.1f}%')
ax.grid(); ax.autoscale(tight=True)
adf_UR = adf_test(data.UR)
print(adf_UR)
                        UR
Test Statistic   -3.918852
p-value           0.001900
# of Lags         1.000000
# of Obs        928.000000
- ADF test suggests that U.S. unemployment rate is stationary (p-value < 0.05).
- This doesn't necessarily contradict the ACF, which suggested the UR might be non-stationary, i.e., we couldn't tell with certainty.
- This result may not be surprising since modeling the UR as an AR($1$) led to an estimate of the autoregressive parameter that was less than $1$, i.e., it was not close to a random walk.
Order Determination¶
- There are a lot of possible choices for the orders of an ARIMA($p,d,q$) model.
- Even if the data is not seasonal and already stationary, there are plenty of possibilities for an ARMA($p,q$) model.
- In the ADF tests above, `adfuller` selected the $p$ lags for an AR($p$) model of the first-differenced time series to test for a unit root.
- How do we select the best orders for an ARMA($p,q$) model?
Information Criterion¶
- Information criteria are used for order determination (i.e., model selection) of ARMA($p,q$) models.
- In `adfuller`, the default method to select lags is the Akaike Information Criterion (AIC). Another choice is the Bayesian Information Criterion (BIC) (a.k.a. the Schwarz Information Criterion (SIC)).
- The AIC and BIC are commonly used in economics research using time series data.
- But there are other choices as well, e.g., the Hannan-Quinn Information Criterion (HQIC).
- The goal of AIC, BIC, and HQIC is to estimate the out-of-sample mean square prediction error, so a smaller value indicates a better model.
AIC vs. BIC¶
- Let $\mathcal{L}$ be the maximized log-likelihood of the model, $K$ be the number of estimated parameters, and $n$ be the sample size of the data.
- $AIC = -2 \mathcal{L} + 2 K$
- $BIC = -2 \mathcal{L} + \log(n) K$
- The key difference is that
  - AIC has a tendency to overestimate the model order.
  - BIC places a bigger penalty on model complexity (since $\log(n) > 2$ once $n \geq 8$), hence the selected model is more parsimonious than under AIC.
- Consistency: as the sample size increases, does the information criterion select the true model?
- AIC is not consistent, but it is asymptotically efficient, which is relevant for forecasting (more on that much later).
- BIC is consistent.
- When AIC and BIC suggest different choices, most people will choose the more parsimonious model selected by the BIC (see the sketch below).
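As a sketch of this in practice (assuming `statsmodels`, and reusing the stationary growth-rate series from above), fit a grid of ARMA($p,q$) models and compare the criteria:

# Sketch: grid-search ARMA(p,q) orders by AIC and BIC
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

def ic_table(y, max_p=3, max_q=3):
    rows = []
    for p in range(max_p + 1):
        for q in range(max_q + 1):
            res = ARIMA(y, order=(p, 0, q)).fit()
            rows.append({'p': p, 'q': q, 'AIC': res.aic, 'BIC': res.bic})
    return pd.DataFrame(rows)

ics = ic_table(df['d1loggdp'].dropna())
print(ics.sort_values('AIC').head(3))  # best three orders by AIC
print(ics.sort_values('BIC').head(3))  # BIC often prefers a smaller model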
Likelihood Function¶
- Likelihood function: Probability of the observed time series data given a model and its parameterization.
- Maximum Likelihood Estimation: For a given model, find the parameterization that maximizes the likelihood function, i.e., such that the observed time series has a higher probability than with any other parameterization.
- The likelihood function is the joint probability distribution of all observations.
- In ARMA/ARIMA models, the likelihood function is just a product of Gaussian/normal densities since that is the distribution of the errors/innovations.
- Order determination/model selection:
- Given some time series data and a menu of models, which model has the highest likelihood and is relatively parsimonious?
- e.g., for each AR model we can compute the ordinary least squares estimates and then the value of the likelihood function for that parameterization.
Example¶
Assume an MA(2) model: $y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2}$, where $\varepsilon_t \sim \mathcal{N}(0, \sigma^2)$ are i.i.d. normal errors. Note: by construction, $y_t$ is uncorrelated with $y_{t-s}$ for all $s \geq 3$.
The likelihood function is the joint probability distribution of the observations $y_1, y_2, \dots, y_T$, given parameters of the model $\Theta = \{\mu, \theta_1, \theta_2, \sigma^2\}$
$$ \mathcal{L}(\Theta| y_1, \dots, y_T) = \prod_{t=1}^{T} f(y_t | y_{t-1}, y_{t-2}; \Theta) $$ where $f(y_t | y_{t-1}, y_{t-2}; \Theta)$ is the conditional normal density
$$ f(y_t | y_{t-1}, y_{t-2};\Theta) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{\varepsilon_t^2}{2\sigma^2} \right) $$
Combining the previous expressions and substituting out $\varepsilon_t$ for $y_t - \mu - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2}$ yields
$$ \mathcal{L}(\Theta) = \prod_{t=1}^{T} \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(y_t - \mu - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2})^2}{2\sigma^2} \right) $$
The log-likelihood function (i.e., taking natural log converts a product to a sum) is
$$ \log \mathcal{L}(\Theta) = -\frac{T}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{t=1}^{T} (y_t - \mu - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2})^2 $$
To estimate $\mu, \theta_1, \theta_2, \sigma^2$, maximize this log-likelihood function, i.e., $\hat{\Theta} = \textrm{arg max}_\Theta \log \mathcal{L}(\Theta)$.
Because $\log \mathcal{L}(\Theta)$ is nonlinear in the parameters, numerical methods such as nonlinear optimization or the conditional likelihood approach are used in practice.
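To make this concrete, here is a sketch of the conditional likelihood approach for the MA($2$) above (assuming `numpy` and `scipy`; the presample errors are set to zero, $\varepsilon_0 = \varepsilon_{-1} = 0$, so this is not exact maximum likelihood):

# Sketch: conditional maximum likelihood estimation of the MA(2) model
import numpy as np
from scipy.optimize import minimize

def neg_loglik(params, y):
    mu, theta1, theta2, log_sigma2 = params
    sigma2 = np.exp(log_sigma2)            # reparameterize so sigma^2 > 0
    eps = np.zeros(len(y))
    for t in range(len(y)):                # recover the errors recursively,
        e1 = eps[t-1] if t >= 1 else 0.0   # conditioning on zero presample errors
        e2 = eps[t-2] if t >= 2 else 0.0
        eps[t] = y[t] - mu - theta1*e1 - theta2*e2
    T = len(y)
    return 0.5*T*np.log(2*np.pi*sigma2) + np.sum(eps**2)/(2*sigma2)

# Simulate from a known MA(2), then recover the parameters by minimizing -log L
rng = np.random.default_rng(1)
e = rng.standard_normal(1000)
y = 1.0 + e + 0.4*np.concatenate(([0.0], e[:-1])) + 0.2*np.concatenate(([0.0, 0.0], e[:-2]))

result = minimize(neg_loglik, x0=[0.0, 0.0, 0.0, 0.0], args=(y,), method='BFGS')
print(result.x[:3], np.exp(result.x[3]))   # estimates of mu, theta_1, theta_2, sigma^2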