Wednesday, August 31, 2011

Story of the Ljung-Box Blues: Progress Not Perfection

In the last post we determined that our ARIMA(2,2,2) model failed to pass the Ljung-Box test.  In today's post we seek to completely discredit the last post's claim and finally arrive at some needed closure.

The Ljung-Box test is first performed on the series at hand. Rejecting its null hypothesis means that at least one of the autocorrelations is nonzero. What does that mean?  Well, it means that we can forecast, because the values in the series can be used to predict each other.  It helps us conclude numerically that the series itself is not a white noise process, so its movements are not completely random.

When we perform the Ljung-Box in R on GDP we get the following results:

> Box.test(GDP,lag=20,type="Ljung-Box")

Box-Ljung test

data:  GDP 
X-squared = 4086.741, df = 20, p-value < 2.2e-16

What this output is telling us is to reject the null hypothesis that all of the autocorrelations out to lag 20 are zero.  At least one of them is nonzero.  This gives us the green light to use AR, MA or ARMA in our approach towards modeling and forecasting.
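To see where the X-squared number comes from, here is a minimal sketch that computes the Ljung-Box statistic by hand and compares it with Box.test. The series is simulated (an assumption for illustration, not the GDP data), and the formula used is Q = n(n+2) * sum over k = 1..h of rho_k^2 / (n - k), compared against a chi-squared distribution with h degrees of freedom.

```r
# Simulated data: an assumption for illustration, not the GDP series from the post.
set.seed(1)
n <- 200
x <- as.numeric(arima.sim(model = list(ar = 0.7), n = n))  # autocorrelated AR(1) series
h <- 20                                                    # number of lags tested

# Sample autocorrelations at lags 1..h (drop lag 0, which is always 1)
rho <- acf(x, lag.max = h, plot = FALSE)$acf[-1]

# Ljung-Box statistic: Q = n (n + 2) * sum_{k=1}^{h} rho_k^2 / (n - k)
Q <- n * (n + 2) * sum(rho^2 / (n - seq_len(h)))
p_value <- pchisq(Q, df = h, lower.tail = FALSE)

# Built-in version for comparison
bt <- Box.test(x, lag = h, type = "Ljung-Box")
print(c(by_hand = Q, box_test = unname(bt$statistic)))
print(c(by_hand = p_value, box_test = bt$p.value))
```

The two statistics should agree, and for a series this strongly autocorrelated the p-value ends up far below 0.05, which is the same kind of rejection we got for GDP.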

The second time the Ljung-Box shows up is when we want to test to see if the error terms or residuals are white noise.  A good forecasting model will have to have zero correlation between its residuals or else you could forecast them.  It naturally follows that if you can forecast the error terms then a better model must exist.  

Here is the Ljung-Box Q test out to the 26th Lag:

> LjungBoxTest(res,k=2,StartLag=1)

  m     Qm     p-value:
  1   0.05  0.82118640
  2   0.05  0.81838128
  3   0.72  0.39541957
  4   0.75  0.68684256
  5   2.00  0.57224678
  6   2.41  0.66164894
  7   3.24  0.66255593
  8   9.05  0.17070965
  9  15.14  0.03429650
 10  15.54  0.04946816
 11  15.64  0.07487629
 12  22.14  0.01442010
 13  22.51  0.02073827
 14  22.72  0.03020402
 15  23.24  0.03889525
 16  23.24  0.05648292
 17  23.29  0.07809501
 18  26.81  0.04367819
 19  30.20  0.02494375
 20  30.20  0.03554725
 21  31.56  0.03500150
 22  32.46  0.03868275
 23  32.47  0.05241222
 24  34.14  0.04748629
 25  35.47  0.04672181
 26  36.28  0.05151986

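As you can see with your very special eyes, we fail to reject the null hypothesis at the 5% level out to the 8th lag.  So at those lags we have no evidence of residual autocorrelation, and hence no evidence to contradict the assumption that the errors are white noise.  Do note that the p-values dip below 0.05 at several lags from the 9th onward, so the picture is less clean further out.  Still, our model checks out, people!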
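If you don't have the package that provides LjungBoxTest, a similar table can be sketched with base R alone. The example below is only an illustration: the data are simulated, the AR(2) model is an assumption, and res_sim, n_params and result are names made up for the sketch. The fitdf argument of Box.test subtracts the number of fitted ARMA coefficients from the degrees of freedom, which is why the lags start above n_params.

```r
# Simulated data and model order are assumptions for illustration only.
set.seed(42)
y <- arima.sim(model = list(ar = c(0.5, -0.3)), n = 300)
fit <- arima(y, order = c(2, 0, 0), include.mean = FALSE)
res_sim <- residuals(fit)

n_params <- 2                # p + q for the fitted model
lags <- (n_params + 1):26    # df = lag - fitdf must stay positive

# Ljung-Box p-value at each lag, adjusted for the fitted coefficients
p_values <- sapply(lags, function(h)
  Box.test(res_sim, lag = h, type = "Ljung-Box", fitdf = n_params)$p.value)

result <- data.frame(lag = lags, p_value = round(p_values, 4))
print(result)

# Lags at which white noise would be rejected at the 5% level
print(result$lag[result$p_value < 0.05])
```

Printing the lags below 0.05 makes it harder to overlook rejections further down the table than it is when scanning the p-value column by eye.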

Now if you want to plot the Ljung-Box just type in the following:

> x<-LjungBoxTest(res,k=2,StartLag=1)
> plot(x[,3],main="Ljung-Box Q Test",ylab="P-values",xlab="Lag")

The white noise process should also have a normal distribution with a mean of 0.  To do a rough check of normality we can draw a simple Q-Q plot in R.  The residuals are approximately normal if the points fall close to the reference line rather than scattering away from it, especially at the ends.

The following command gives us this plot:

qqnorm(res)
qqline(res)



The Q-Q plot seems to suggest normality; however, there are some formal tests we can run in R to verify this assumption.  Two formal tests are the Jarque-Bera test (jarque.bera.test, from the tseries package) and the Shapiro-Wilk normality test (shapiro.test, in base R).  Both have a null hypothesis that the series follows a normal distribution, so a rejection of the null suggests that the series does not follow a normal distribution.

> jarque.bera.test(res)

Jarque Bera Test

data:  res 
X-squared = 9660.355, df = 2, p-value < 2.2e-16

> shapiro.test(res)

Shapiro-Wilk normality test

data:  res 
W = 0.7513, p-value < 2.2e-16
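To get a feel for what these two tests react to, here is a rough sketch on simulated data (an assumption, not the model residuals). jb_stat is a hypothetical helper that computes the Jarque-Bera statistic by hand as n/6 * (S^2 + (K - 3)^2 / 4), where S is the sample skewness and K the sample kurtosis, with a chi-squared reference distribution on 2 degrees of freedom. As far as I know this is the same statistic jarque.bera.test reports, but treat that as an assumption.

```r
# Hand-rolled Jarque-Bera statistic; jb_stat is a hypothetical helper name.
jb_stat <- function(x) {
  n <- length(x)
  d <- x - mean(x)
  m2 <- mean(d^2)                # second central moment (divisor n)
  skew <- mean(d^3) / m2^1.5     # sample skewness
  kurt <- mean(d^4) / m2^2       # sample kurtosis (equals 3 for a normal)
  stat <- n / 6 * (skew^2 + (kurt - 3)^2 / 4)
  c(statistic = stat, p.value = pchisq(stat, df = 2, lower.tail = FALSE))
}

set.seed(7)
normal_draws <- rnorm(500)       # should usually pass
fat_tailed <- rt(500, df = 2)    # heavy tails, should fail badly

print(jb_stat(normal_draws))
print(jb_stat(fat_tailed))
print(shapiro.test(normal_draws)$p.value)
print(shapiro.test(fat_tailed)$p.value)
```

A huge X-squared like the one above is what heavy tails tend to produce, since the kurtosis term is squared and scaled by the sample size.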

Wow! Both of these tests strongly reject the possibility that the residuals follow a normal distribution.

We can still see if the mean of the residuals is zero by simply typing the following into R:

> mean(model$res)
[1] 3.754682
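Whether a mean of this size is far from zero depends on the scale of the residuals, so a rough check is to compare it with its standard error. The sketch below uses simulated residuals (an assumption; swap in your own residual vector) and a one-sample t-test. It treats the residuals as roughly independent, so take it as a sanity check rather than a formal verdict.

```r
# Simulated residuals are an assumption; substitute your own residual vector.
set.seed(3)
res_sim <- rnorm(250, mean = 0, sd = 50)

m <- mean(res_sim)
se <- sd(res_sim) / sqrt(length(res_sim))   # standard error of the mean

# A ratio well beyond about 2 in absolute value hints at a nonzero mean
print(c(mean = m, std_error = se, ratio = m / se))

# The same comparison as a one-sample t-test against zero
print(t.test(res_sim, mu = 0)$p.value)
```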

The mean is clearly not zero, which implies we have some sort of a problem. In fact, it means that the Ljung-Box was not the proper test because it requires that:

A. The time series is stationary.
B. The white noise process has a normal distribution with mean zero.

Given that we just determined that the mean is definitely not zero and that both of our formal tests rejected the possibility of our white noise process following a normal distribution, we do indeed face a serious problem.  This is an evolving and growing period for us forecasting-in-R novices.  I don't have all the answers (clearly), but strides are made in the right direction every day. The greatest thing about making mistakes and tripping in the forest is getting back up and getting the hell out of there.

Please keep posted and keep dancin',

Steven J.  

4 comments:

  1. Hello Mr. Dancer,

    I tried to follow your example but I found that when I type "LjungBoxTest(res,k=2,StartLag=1)" R gives an error. I guessed that there was an error and I changed it to Box.test(res,k=2,StartLag=1) but the error is still there. Where does the variable res come from?
    Regards,
    Juan

  2. res is the vector that holds all of your residual values (or error terms) for your model. It's the difference between your model's values and the actual values. In R just type something like res=resid(y)
    with y being your fitted model. With respect to the rest of your issues, you may not have the packages necessary to run LjungBoxTest. Just go to the install directory and search these terms to find the necessary packages.

    good luck

    steven

  3. Sorry, I know this post is seven years old, but I think it's pretty clear from the Q-Q plot that the residuals are not normally distributed but have fat tails instead.

    Here's my favorite "cheat sheet" on Q-Q plots:

    https://stats.stackexchange.com/questions/101274/how-to-interpret-a-qq-plot/101290#101290

    As I recall, however, excess kurtosis but no skew means the point forecast from ARIMA is okay, but that the prediction intervals may not be accurate.
