Summary¶
- The presence of unit roots makes a time series non-stationary.
- Trend Stationarity: A trend-stationary time series has a deterministic time trend and is stationary after the trend is removed.
- Difference Stationarity: A difference-stationary time series has a stochastic time trend and becomes stationary after differencing.
- Unit root tests, e.g., the augmented Dickey-Fuller test (ADF test), are used to detect the presence of a unit root in a time series.
- A unit root suggests a stochastic time trend, so unit root tests help determine whether differencing is required to achieve stationarity.
- If an AR model has a unit root, it is called a unit root process, e.g., the random walk model.
AR($p$) Polynomial¶
Consider the AR($p$) model: $y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \varepsilon_t$
Whether an AR($p$) produces stationary time series depends on the parameters $\phi_j$, e.g., for AR($1$), $|\phi_1|<1$
This can be written in terms of an AR polynomial with the lag operator, $L$, defined as $L^jy_t \equiv y_{t-j}$
$$ \varepsilon_t = y_t - \phi_1 y_{t-1} - \phi_2 y_{t-2} - \dots - \phi_p y_{t-p} = (1 - \phi_1 L - \phi_2 L^2 - \dots - \phi_p L^p)y_t $$
Define the AR polynomial as $\Phi(L) \equiv 1 - \phi_1 L - \phi_2 L^2 - \dots - \phi_p L^p$
Unit Roots of AR($p$)¶
- To find the roots of an AR($p$), set $\Phi(r) = 0$ and solve for $r$ (the roots may be complex numbers); the number of roots equals the order of the polynomial.
- An AR($p$) has a unit root if the modulus of any of the roots equals $1$.
- For each unit root, the data must be differenced once to make it stationary, e.g., if there are $2$ unit roots then the data must be differenced twice.
- If the modulus of all the roots exceeds $1$, then the AR($p$) is stationary.
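For a numerical check (a sketch assuming `numpy`; the helper `ar_roots` is hypothetical, not from this lecture), the roots of $\Phi(r)$ can be computed directly:

# Sketch: roots of the AR polynomial Phi(r) = 1 - phi_1 r - ... - phi_p r^p
import numpy as np

def ar_roots(phi):
    # polyroots expects coefficients ordered from the constant up to the highest power
    return np.polynomial.polynomial.polyroots(np.concatenate(([1.0], -np.asarray(phi))))

print(np.abs(ar_roots([0.5])))       # [2.] -> |r| > 1, stationary AR(1)
print(np.abs(ar_roots([1.0])))       # [1.] -> unit root (random walk)
print(np.abs(ar_roots([0.5, 0.3])))  # moduli ~1.17 and ~2.84 -> stationary AR(2)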
Examples¶
- AR($1$): $y_t = \phi_1 y_{t-1} + \varepsilon_t$
- $\Phi(r) = 0 \rightarrow 1-\phi_1 r = 0 \rightarrow r = 1/\phi_1$
- Thus, an AR($1$) has a unit root, $r = 1$, if $\phi_1 = 1$.
- If $\phi_1 = 1$, then $y_t = y_{t-1} + \varepsilon_t$ is a random walk model that produces non-stationary time series.
- But the first difference $\Delta y_t = y_t - y_{t-1} = \varepsilon_t$ is stationary (since $\varepsilon_t$ is white noise).
- Note if $|\phi_1| < 1$, then $|r|>1$ and the AR($1$) is stationary.
- AR($2$): $y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \varepsilon_t$ or $\varepsilon_t = (1 - \phi_1 L - \phi_2 L^2) y_t$
- The AR($2$) polynomial has a factorization such that $1 - \phi_1 L - \phi_2 L^2 = (1-\lambda_1 L)(1-\lambda_2 L)$ with roots $r_1 = 1/\lambda_1$ and $r_2 = 1/\lambda_2$
- Thus, an AR($2$) model is stationary if $|\lambda_1| < 1$ and $|\lambda_2| < 1$.
- Given the mapping $\phi_1 = \lambda_1 + \lambda_2$ and $\phi_2 = -\lambda_1 \lambda_2$, we can prove that the moduli of both roots exceed $1$ if
- $|\phi_2| < 1$
- $\phi_2 + \phi_1 < 1$
- $\phi_2 - \phi_1 < 1$
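As a quick numerical check of these conditions (a sketch assuming `numpy`; `ar2_stationary` is a hypothetical helper):

# Sketch: the three AR(2) conditions agree with checking the root moduli directly
import numpy as np

def ar2_stationary(phi1, phi2):
    moduli = np.abs(np.polynomial.polynomial.polyroots([1.0, -phi1, -phi2]))
    conditions = (abs(phi2) < 1) and (phi2 + phi1 < 1) and (phi2 - phi1 < 1)
    return moduli, conditions

print(ar2_stationary(0.5, 0.3))   # moduli ~1.17 and ~2.84; conditions True -> stationary
print(ar2_stationary(1.5, -0.7))  # complex roots, both moduli ~1.20; True -> stationary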
Unit Root Test¶
The augmented Dickey-Fuller test (ADF test) has hypotheses
- $h_0$: The time series has a unit root, indicating it is non-stationary.
- $h_A$: The time series does not have a unit root, suggesting it is stationary.
Test Procedure
- Write an AR($p$) model with the AR polynomial, $\Phi(L)y_t = \varepsilon_t$.
- If the AR polynomial has a unit root, it can be written as $\Phi(L) = \varphi(L)(1 - L)$ where $\varphi(L) = 1 - \varphi_1 L - \cdots - \varphi_{p-1} L^{p-1}$.
- Define $\Delta y_t = (1 - L)y_t = y_t - y_{t-1}$.
- An AR($p$) model with a unit root is then $\Phi(L)y_t = \varphi(L)(1 - L)y_t = \varphi(L)\Delta y_t = \varepsilon_t$.
- Thus, testing $\Phi(L)$ for a unit root is equivalent to the test $h_0: \rho = 0$ vs. $h_A: \rho < 0$ (where $\rho = -\Phi(1)$, which is zero exactly when $r = 1$ is a root of $\Phi$) in
$$ \Delta y_t = \rho y_{t-1} + \varphi_1 \Delta y_{t-1} + \cdots + \varphi_{p-1} \Delta y_{t-p+1} + \varepsilon_t $$
Intuition: A stationary process reverts to its mean, so $\Delta y_t$ should be negatively related to $y_{t-1}$.
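Before applying the test to real data, a simulation illustrates both outcomes (a sketch assuming `numpy` and `statsmodels`):

# Sketch: ADF test on simulated series with and without a unit root
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
eps = rng.standard_normal(500)
random_walk = np.cumsum(eps)          # y_t = y_{t-1} + eps_t (unit root)
ar1 = np.zeros(500)
for t in range(1, 500):
    ar1[t] = 0.5*ar1[t-1] + eps[t]    # y_t = 0.5 y_{t-1} + eps_t (stationary)

# The second element of the returned tuple is the p-value
print(adfuller(random_walk)[1])  # typically large -> cannot reject h_0 (unit root)
print(adfuller(ar1)[1])          # typically tiny -> reject h_0 (stationary)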
adfuller¶
- In `statsmodels`, the function `adfuller()` conducts an ADF unit root test.
- If the p-value is smaller than $0.05$, then the series is probably stationary.
# ADF Test
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Function to organize ADF test results in a one-column DataFrame
# (the column is named after the input Series)
def adf_test(data):
    keys = ['Test Statistic','p-value','# of Lags','# of Obs']
    values = adfuller(data)   # returns (stat, p-value, lags, nobs, ...)
    test = pd.DataFrame.from_dict(dict(zip(keys, values[0:4])),
                                  orient='index', columns=[data.name])
    return test
ADF test results¶
test = adf_test(df['d1loggdp'].dropna())
#print(test.to_markdown())
| | d1loggdp |
|---|---|
| Test Statistic | -3.1799 |
| p-value | 0.0211781 |
| # of Lags | 4 |
| # of Obs | 118 |
- Since the p-value < $0.05$, we reject the null hypothesis that the process is non-stationary.
- The number of lags is chosen by minimizing the Akaike Information Criterion (AIC) (more on that soon).
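The selection criterion can be changed with the `autolag` argument of `adfuller`, e.g., to select the number of lags by BIC instead:

# Same test, but choose the number of lags by BIC rather than the default AIC
test_bic = adfuller(df['d1loggdp'].dropna(), autolag='BIC')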
# Run the ADF test on every column of df and collect the results
dl = []
for column in df.columns:
    test = adf_test(df[column].dropna())
    dl.append(test)
results = pd.concat(dl, axis=1)
#print(results.to_markdown())
| | gdp | loggdp | d1loggdp | d2loggdp |
|---|---|---|---|---|
| Test Statistic | 4.294 | -1.27023 | -3.1799 | -6.54055 |
| p-value | 1 | 0.642682 | 0.0211781 | 9.37063e-09 |
| # of Lags | 7 | 8 | 4 | 3 |
| # of Obs | 119 | 118 | 118 | 118 |
- China GDP still has a unit root after taking the log.
- China GDP is probably stationary after removing seasonality and converting to a growth rate (p-value < $0.05$).
- Second difference of China GDP is almost certainly stationary.
U.S. Unemployment Rate¶
# Read Data
import pandas as pd
import matplotlib.pyplot as plt
from fredapi import Fred

fred_api_key = pd.read_csv('fred_api_key.txt', header=None)
fred = Fred(api_key=fred_api_key.iloc[0,0])
data = fred.get_series('UNRATE').to_frame(name='UR')

# Plot the unemployment rate
fig, ax = plt.subplots(figsize=(6.5,2));
ax.plot(data.UR); ax.set_title('U.S. Unemployment Rate');
ax.yaxis.set_major_formatter('{x:,.1f}%')
ax.grid(); ax.autoscale(tight=True)
adf_UR = adf_test(data.UR)
print(adf_UR)
                        UR
Test Statistic   -3.918852
p-value           0.001900
# of Lags         1.000000
# of Obs        928.000000
- ADF test suggests that U.S. unemployment rate is stationary (p-value < 0.05).
- This doesn't necessarily contradict the ACF, which suggested the UR might be non-stationary, i.e., we couldn't tell with certainty.
- This result may not be surprising since modeling the UR as an AR($1$) led to an estimate of the autoregressive parameter that was less than $1$, i.e., it was not close to a random walk.
Order Determination¶
- There are a lot of possible choices for the orders of an ARIMA($p,d,q$) model.
- Even if the data is not seasonal and already stationary, there are plenty of possibilities for an ARMA($p,q$) model.
- In the ADF tests above, `adfuller` selected the $p$ lags for an AR($p$) model of the first-differenced time series to test for a unit root.
- How do we select the best orders for an ARMA($p,q$) model?
Information Criterion¶
- Information criteria are used for order determination (i.e., model selection) of ARMA($p,q$) models.
- In `adfuller`, the default method to select lags is the Akaike Information Criterion (AIC). Another choice is the Bayesian Information Criterion (BIC) (a.k.a. the Schwarz Information Criterion (SIC)).
- The AIC and BIC are commonly used in economics research using time series data.
- But there are other choices as well, e.g., the Hannan-Quinn Information Criterion (HQIC).
- The goal of AIC, BIC, and HQIC is to estimate the out-of-sample mean square prediction error, so a smaller value indicates a better model.
AIC vs. BIC¶
- Let $\mathcal{L}$ be the maximized log-likelihood of the model, $K$ be the number of estimated parameters, and $n$ be the sample size of the data.
- $AIC = -2 \mathcal{L} + 2 K$
- $BIC = -2 \mathcal{L} + \log(n) K$
- The key difference is that
  - AIC has a tendency to overestimate the model order.
  - BIC places a bigger penalty on model complexity (since $\log(n) > 2$ once $n \geq 8$), hence the selected model is more parsimonious than under AIC.
- Consistency: as the sample size increases, does the information criterion select the true model?
- AIC is not consistent, but it is asymptotically efficient, which is relevant for forecasting (more on that much later).
- BIC is consistent.
- When AIC and BIC suggest different choices, most people will choose the more parsimonious model selected by the BIC (see the sketch below).
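As a sketch of this in practice (assuming `statsmodels`, and reusing the stationary growth-rate series from above), fit a grid of ARMA($p,q$) models and compare the criteria:

# Sketch: grid-search ARMA(p,q) orders by AIC and BIC
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

def ic_table(y, max_p=3, max_q=3):
    rows = []
    for p in range(max_p + 1):
        for q in range(max_q + 1):
            res = ARIMA(y, order=(p, 0, q)).fit()
            rows.append({'p': p, 'q': q, 'AIC': res.aic, 'BIC': res.bic})
    return pd.DataFrame(rows)

ics = ic_table(df['d1loggdp'].dropna())
print(ics.sort_values('AIC').head(3))  # best three orders by AIC
print(ics.sort_values('BIC').head(3))  # BIC often prefers a smaller model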
Likelihood Function¶
- Likelihood function: Probability of the observed time series data given a model and its parameterization.
- Maximum Likelihood Estimation: For a given model, find the parameterization that maximizes the likelihood function, i.e., such that the observed time series has a higher probability than with any other parameterization.
- The likelihood function is the joint probability distribution of all observations.
- In ARMA/ARIMA models, the likelihood function is just a product of Gaussian/normal densities since that is the distribution of the errors/innovations.
- Order determination/model selection:
- Given some time series data and a menu of models, which model has the highest likelihood and is relatively parsimonious?
- e.g., for each AR model we can compute the ordinary least squares estimates and then the value of the likelihood function for that parameterization.
Example¶
Assume an MA(2) model: $y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2}$, where $\varepsilon_t \sim \mathcal{N}(0, \sigma^2)$ are i.i.d. normal errors. Note: by construction, $y_t$ is uncorrelated with $y_{t-s}$ for all $s \geq 3$.
The likelihood function is the joint probability distribution of the observations $y_1, y_2, \dots, y_T$, given parameters of the model $\Theta = \{\mu, \theta_1, \theta_2, \sigma^2\}$
$$ \mathcal{L}(\Theta| y_1, \dots, y_T) = \prod_{t=1}^{T} f(y_t | y_{t-1}, y_{t-2}; \Theta) $$ where $f(y_t | y_{t-1}, y_{t-2}; \Theta)$ is the conditional normal density
$$ f(y_t | y_{t-1}, y_{t-2};\Theta) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{\varepsilon_t^2}{2\sigma^2} \right) $$
Combining the previous expressions and substituting out $\varepsilon_t$ for $y_t - \mu - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2}$ yields
$$ \mathcal{L}(\Theta) = \prod_{t=1}^{T} \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(y_t - \mu - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2})^2}{2\sigma^2} \right) $$
The log-likelihood function (i.e., taking natural log converts a product to a sum) is
$$ \log \mathcal{L}(\Theta) = -\frac{T}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{t=1}^{T} (y_t - \mu - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2})^2 $$
To estimate $\mu, \theta_1, \theta_2, \sigma^2$, maximize this log-likelihood function, i.e., $\hat{\Theta} = \textrm{arg max}_\Theta \log \mathcal{L}(\Theta)$.
Because $\log \mathcal{L}(\Theta)$ is nonlinear in the parameters, numerical methods such as nonlinear optimization or the conditional likelihood approach are used in practice.
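To make this concrete, here is a sketch of the conditional likelihood approach for the MA($2$) above (assuming `numpy` and `scipy`; the presample errors are set to zero, $\varepsilon_0 = \varepsilon_{-1} = 0$, so this is not exact maximum likelihood):

# Sketch: conditional maximum likelihood estimation of the MA(2) model
import numpy as np
from scipy.optimize import minimize

def neg_loglik(params, y):
    mu, theta1, theta2, log_sigma2 = params
    sigma2 = np.exp(log_sigma2)            # reparameterize so sigma^2 > 0
    eps = np.zeros(len(y))
    for t in range(len(y)):                # recover the errors recursively,
        e1 = eps[t-1] if t >= 1 else 0.0   # conditioning on zero presample errors
        e2 = eps[t-2] if t >= 2 else 0.0
        eps[t] = y[t] - mu - theta1*e1 - theta2*e2
    T = len(y)
    return 0.5*T*np.log(2*np.pi*sigma2) + np.sum(eps**2)/(2*sigma2)

# Simulate from a known MA(2), then recover the parameters by minimizing -log L
rng = np.random.default_rng(1)
e = rng.standard_normal(1000)
y = 1.0 + e + 0.4*np.concatenate(([0.0], e[:-1])) + 0.2*np.concatenate(([0.0, 0.0], e[:-2]))

result = minimize(neg_loglik, x0=[0.0, 0.0, 0.0, 0.0], args=(y,), method='BFGS')
print(result.x[:3], np.exp(result.x[3]))   # estimates of mu, theta_1, theta_2, sigma^2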