The Dancing Economist

Saturday, August 27, 2011

Forecasting In R: The Greatest Shortcut That Failed The Ljung-Box

Okay so you want to forecast in R, but don't want to manually find the best model and go through the drudgery of plotting and so on. I have recently found the perfect function for you. Its called auto.arima and it automatically fits the best arima model to your time series. In a word- it is "brilliant". Lets take a look at its brilliance with the following examples.

So in our last post the last thing we plotted was de-trended GDP and were hoping to forecast it. R makes this not only really easy, but also hilariously fun.

Just type in the following code with dt being your de-trended series.

fitdt<-auto.arima(dt)

plot(forecast(fitdt,h=25))

Here is the marvelous plot:

Here are the summary statistics:

Series: dt
ARIMA(2,0,2)(0,0,1)[4] with zero mean

Coefficients:
ar1 ar2 ma1 ma2 sma1
1.4753 -0.5169 0.0519 0.1909 0.1539
s.e. 0.1201 0.1166 0.1236 0.0909 0.0726

sigma^2 estimated as 1528: log likelihood = -1314.11
AIC = 2640.22 AICc = 2640.55 BIC = 2661.53

In-sample error measures:
ME RMSE MAE MPE MAPE MASE
-0.02875014 39.08953811 22.33132382 -30.62351536 75.51060679 0.83683534

So it fitted the de-trended series to a ARMA(2,2). Lets take a look at what it fits regular ol' GDP to.

> fitx<-auto.arima(GDP)
> plot(forecast(fitx,h=25))
> summary(fitx)

Series: GDP
ARIMA(2,2,2)

Coefficients:
ar1 ar2 ma1 ma2
-0.078 0.4673 -0.3520 -0.5813
s.e. 0.158 0.0928 0.1611 0.1525

sigma^2 estimated as 1653: log likelihood = -1312.34
AIC = 2634.69 AICc = 2634.93 BIC = 2652.42

In-sample error measures:
ME RMSE MAE MPE MAPE MASE
3.7546824 40.4951109 22.4820107 0.1552383 0.7111309 0.3613070

So it fit it also to an ARMA(2,2), but we have that extra 2 there in the middle. That 2 was clearly not there when we fit for the trendless de-trended GDP and shows up to account for the quadratic trend of the regular ol' GDP series.

Located above is the plot of the h=25 step ahead forecast and below is some code for the actual values in list form

> fit7=forecast(fitx,h=20)
> list(fit7)

Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
65 Q3 15124.35 15072.25 15176.45 15044.67 15204.03
65 Q4 15245.80 15148.82 15342.78 15097.48 15394.11
66 Q1 15359.95 15215.32 15504.58 15138.76 15581.15
66 Q2 15475.10 15285.46 15664.74 15185.07 15765.12
66 Q3 15586.75 15352.89 15820.62 15229.09 15944.42
66 Q4 15699.15 15423.24 15975.05 15277.19 16121.11
67 Q1 15809.85 15493.02 16126.69 15325.30 16294.41
67 Q2 15921.03 15564.75 16277.32 15376.14 16465.92
67 Q3 16031.39 15636.52 16426.26 15427.49 16635.29
67 Q4 16142.03 15709.51 16574.56 15480.54 16803.52
68 Q1 16252.27 15782.66 16721.87 15534.07 16970.47
68 Q2 16362.67 15856.53 16868.81 15588.59 17136.74
68 Q3 16472.86 15930.53 17015.20 15643.44 17302.29
68 Q4 16583.15 16004.92 17161.39 15698.82 17467.48
69 Q1 16693.34 16079.38 17307.30 15754.37 17632.31
69 Q2 16803.58 16154.01 17453.15 15810.15 17797.01
69 Q3 16913.77 16228.64 17598.89 15865.96 17961.57
69 Q4 17023.98 16303.31 17744.65 15921.81 18126.15
70 Q1 17134.17 16377.91 17890.43 15977.57 18290.77
70 Q2 17244.37 16452.46 18036.29 16033.25 18455.50

Let's see if this model passes the Ljung-Box test for white noise.

> Box.test(fitx$resid,type="Ljung-Box")

Box-Ljung test

data: fitx$resid
X-squared = 0.0511, df = 1, p-value = 0.8212

No! It doesn't pass the test! AHHHHHH!!! WHAT ARE WE GOING TO DO!??!?! Find out in the next post or look it up, but whatever you do-

Keep Dancin'

Steven J.

Update: This model does actually pass the "Ljung-Box" test. Please read the next post for details.

Thursday, August 25, 2011

Forecasting in R: Modeling GDP and dealing with trend.

Okay so we want to forecast GDP. How do we even begin such a burdensome ordeal?

Well each time series has 4 components that we wish to deal with and those are seasonality, trend, cyclicality and error. If we deal with seasonally adjusted data we don't have to worry about seasonality which leaves us with only three worries. If we think hard we can also forget about forecasting the error component of the time series because if we can forecast the error we probably can come up with an even better trend or cyclical model. So that leaves just two things: the trend and cyclical components.

The following graph is of real GDP and the data for this comes from FRED.

Notice the absolute rise in the series. Doesn't that strangely look similar to a quadratic equation in the form of:

y=B0+ B1*t +B2*t^2 ????

I think it does! Lets find out what the coefficients are in R and see if this is actually an appropriate model.

First get your data into R as we discussed in the previous posts:

> GDP=scan("/Users/stevensabol/Desktop/R/gdp.csv")

Then turn it into a time series so R can read it:

> GDP=ts(GDP,start=1,frequency=4)

then you have make sure that the number of observations you have in your time series will match up with the "t's"

to do this simply type

> length(GDP)
[1] 258

Then set "t" to that number:

> t=(1:258)

Then you have to regress time and time^2 on GDP. To do this type the following:

t2=t^2

and run the regression:

reg=lm(y~t+t2)

to find out the dirty details, use the summary function:

summary(reg)

Call:
lm(formula = y ~ t + t2)

Residuals:
Min 1Q Median 3Q Max
-625.43 -165.76 -36.73 163.04 796.33

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 892.656210 47.259202 18.89 <2e-16 ***
t -30.365580 0.842581 -36.04 <2e-16 ***
t2 0.335586 0.003151 106.51 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 251.1 on 255 degrees of freedom
Multiple R-squared: 0.997, Adjusted R-squared: 0.9969
F-statistic: 4.197e+04 on 2 and 255 DF, p-value: < 2.2e-16

Okay so it appears we have a pretty great fit. With our R^2 almost close to one I think we have a winner!

Lets plot the trend and GDP values against each other so we get the picture of what we just accomplished.

First you have to write out the equation for trend using the coefficient estimates that R gave us:

> trendy= 892.656210 + -30.365580*t + 0.335586*t2

And for the Plot:

> ts.plot(GDP,trendy,col=1:2,ylab="Trend vs. Actual GDP")

Here is what it should look like:

Hey! Congratulations as you have officially tackled and defeated the trend component of this time series. Now we have to deal with the cyclicality of it all..

In order to do this we have to only look at the what's left when you remove trend from the equation. we do this by simply subtracting trend from our original GDP series to get what's left- cyclicality.

De-trended (or Cyclical component) = GDP - TREND

In R we want to write the following and plot it!

> dt=GDP-trendy
> plot(dt)

Now we want to figure out if this de-trended series is an AR, MA or ARMA process. We do this by looking at the autocorrelation and partial autocorrelation functions. Let's take a look shall we?

par(mfrow=c(2,1))
> acf(dt,100) *notice that i lag it out to the 100 observation
> pacf(dt,100)

This gives us the following graph:

Next time we will learn how to select a proper model and hopefully get some actual forecasting in there as well!

Keep dancin'

Steven J.

Tuesday, August 16, 2011

Forecasting in R: Starting From Square One

Okay in the past few posts I jumped the gun a little bit. Errors I made include rushing everything, not explaining anything and not giving my blog readers the love and respect they deserve. What am I talking about? Well before we do anything with a series we have to fit it to some sort of trend (which I did very haphazardly) and preferably the best trend we can find. What kind of criteria determined the best trend? R^2, MSE, F-test, P-values and on and on. Only then, can we go into describing cyclicality and only then can we look at whether our series passes the covariance stationary criteria. In addition to that we have to test for white noise using either the Box-Pierce or the Ljung-Box test.

So there is work to be done, but this will not discourage us! On the contrary this will motivate us to push ahead and conquer new ground.

The forecasting posts will attempt to follow the outline below.

1. Define and talk about trend. Fit the GDP series to several different trends select the best one according to various criteria. Explain the criteria by which we judge the models.

2. Produce and plot point and interval forecasts of our chosen trend model.

3. Discuss seasonality and how to easily deal with it.

4. Characterize the cycle. Discuss white noise, ACF, PACF & more. Model the cycle and select the best one based on some intuition and criteria.

5. Forecast and plot.

6. Discuss & clearly identify the implications and results.

-Keep Dancin'

Steven J.

Sunday, August 14, 2011

Breaking it up into trend and seasonal and error components

GDP=scan("/Users/stevensabol/Desktop/R/gdp.csv")
Read 258 items
> GDP=ts(GDP,start=1,frequency=4)
> dlGDP=diff(log(GDP))
> plot(stl(log(GDP),"per"))

This allows us to do a structural decomposition
log(GDP) = trend + season + error

Here is the business.

Get those plots

Type in the following to get a Q-Q plot and a histogram on top of each other

par(mfrow=c(2,1))
> hist(dlGDP,prob=T,12)
> lines(density(dlGDP))
> qqnorm(dlGDP)
> qqline(dlGDP)

the top graph says that the errors are pretty nicely distributed around the mean
and the bottom says that they are normal. Ficcissimo!

We can take a look at the correlations between the lags with the following code:

lag.plot(dlGDP,9,do.lines=F)

Holy crap- there is strong correlation! This will help give us an idea as to what model we end up choosing.

My guide for this has been the following website:
http://www.mirrorservice.org/sites/lib.stat.cmu.edu/general/tsa2/R_time_series_quick_fix.htm

Please people keep dancin'

Steven J.

Playing In R: Getting Down to Business

okay so now we are up and plotting. Let's dive into some analysis.

First we want to see if we can use the series so we have to see if its covariance stationary and that means that its mean is constant and also we can't be able to predict the errors.

we do this by plotting the difference of the terms to get the errors and if we have crazy up and downlines with no distinguishable pattern then we can can fit the errors to a line of some sort. thats a great and beautiful thing.
1. To do this plot type the following
plot.ts(diff(log(GDP)),main="Logged and Diffed")

Mission accomplished. We have an up and down picasso beauty.

We can now get fancy and smooth the series with a 2-way moving average.

fGDP(t) = ⅛GDP(t-2) + ¼GDP(t-1) + ¼GDP(t) + ¼GDP(t+1) + ⅛GDP(t+2)

To do this fancy maneuvering type the following:

k=c(.5,1,1,1,.5)
k=k/sum(k)

fGDP=filter(GDP,sides=2,k)
plot(GDP)

lines(fGDP,col="red")
lines(lowess(GDP),col="blue",lty="dashed")

that was beautiful right?

Now to test the normality via plot type in:

dlGDP=diff(log(GDP))

plot(dlGDP)

To do a formal test for normality type in:

shapiro.test(dlGDP)

This is called the Shapiro-Wilk normality test

data: dlGDP
W = 0.9642, p-value = 5.015e-06

the closer the W is to one the better and with an extremely small p-value we can reject the null hypothesis that the series is not normal and accept the alternative that it is!

Keep dancin' ladies and gents we have more on the way!

Steven J.

Playing Around With R

okay so today we will start playing around with R and will use GDP as our ginny pig.
okay so first do the following:

1. DOWNLOAD R
2. Create a folder and call it "R" on your desktop. Then type in
getwd()
this will produce the current place where R finds stuff (you know the data you need to manipulate)
3. Copy whatever getwd() produced and paste it into the following function setwd()
and type /Desktop/R after it. It should look as follows:

setwd("/Users/stevensabol/Desktop/R")

4. Go to FRED and download a data set into excel, convert the file to a CSV file. Make sure to name it something simple and straight to the point, then save it in your R folder.

5. Then type the following into R:

GDP<-read.csv("gdp.csv")

I have GDP because that is the series I am using. If you are using the CPI then type CPI where I typed GDP. Its that simple. Its easy to get confused so make sure you keep it simple.

Okay this next part is confusing because it very well might be redundant, but do it anyway.

6. Type in,

GDP=scan("/Users/stevensabol/Desktop/R/gdp.csv")

this seems to actually get the data in right.

7. Then you have to convert the series into a time series so R can understand:

GDP=ts(GDP,start=1,frequency=1)

"Start" denotes where you want your observations to start. for me it is one, because the first observation makes no sense (not that you can see it anyway). Frequency denotes the number of records in the data per observation.

8. At this point it may make sense to plot what you have. So go for it!

plot(GDP,ylab="REAL GDP",main="U.S.")

ylab denotes the y axis label
main denotes the main label

Here is what my plot looks like:

More lessons will come shortly!

Keep Dancin'

Steven J.