Project author: veeveetran

Project description:
Building models to forecast the number of cases of AIDS in a monthly time series from 1993 to 2002
Language: R
Repository: git://github.com/veeveetran/forecasting-AIDS.git
Created: 2018-03-13T18:17:16Z
Community: https://github.com/veeveetran/forecasting-AIDS

License:

A Look into the Future: Forecasting Cases of AIDS

Vivian Tran
3/12/2018

Abstract

My project addresses how to find a linear model to forecast future values of a monthly time series. I accomplished this by identifying candidate ARIMA models through the autocorrelation and partial autocorrelation functions of the stationarized data, and by comparing AICc values. After testing the candidate models' residuals for independence, I began the forecasting process: I refit each model to the original time series with the last five data points removed, used the forecast() function to predict five points ahead, and plotted the predictions against the full series. Through this method, I concluded that the ARIMA(9,3,12) model produced the closest fit to the five held-out points.

Introduction

My data set contains the monthly number of adults ages 30-34 diagnosed with AIDS in the United States from 1993 to 2002. By forecasting the future number of cases, medical suppliers and healthcare providers can plan their resources and better care for patients in later years. I retrieved my data from CDC WONDER, the CDC's database for public-use health data; a full description of the dataset can be found on their website. I used R to analyze the data.

Packages Used

    library(MASS)
    library(qpcR)
    library(forecast)
    library(ggplot2)

    # read data
    AIDS <- read.csv("AIDS.csv", header = TRUE, sep = ",")
    AIDS_ts <- ts(AIDS$Cases, start = c(1993, 1), freq = 12)

Looking at the raw time series, we can see a clear trend component in the data: the number of cases decreases gradually over the years, falling from about 2,000 cases per month around 1994 to roughly 500-800 cases per month by 2002. There is also non-constant variance, since the amplitude of the fluctuations changes over time. However, there does not appear to be any strong seasonal component or sharp changes in the plot.

    # plot data
    ts.plot(AIDS_ts, main = "Adults ages 30-34 diagnosed with AIDS in the US, 1993-2002, monthly",
            ylab = "Number of cases", xlab = "time")

Creating a stationary time series

Since there is non-constant variance, I decided to transform my data using the Box-Cox method. From the Box-Cox plot, I chose lambda = 0, which corresponds to a log-transformation.
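The transformation step itself is not shown in this rendering. A minimal sketch, assuming MASS's boxcox() is run against a simple time-index regression (the name y.log is chosen to match the differencing code below):

    # Hedged sketch: Box-Cox profile likelihood; lambda = 0 motivates the log transform
    bc <- boxcox(AIDS_ts ~ as.numeric(1:length(AIDS_ts)))  # from MASS
    bc$x[which.max(bc$y)]   # lambda with the highest log-likelihood
    y.log <- log(AIDS_ts)   # apply the log transform (lambda = 0)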

Removing trend component

After log-transforming the data, I had constant variance. However, the data still had a trend component. To de-trend, I differenced the transformed data three times at lag 1. I chose to difference three times because the variance of the data kept decreasing through the third difference; when I differenced a fourth time, the variance went back up.

    # difference three times at lag 1 to remove trend
    # (note: diff(y.log, 3) would instead difference once at lag 3)
    y.log.diff3 <- diff(y.log, lag = 1, differences = 3)
    ts.plot(y.log.diff3, main = "Data Differenced 3 Times", ylab = expression(paste(nabla, y)))

    # differencing at lag 1 three times has the lowest variance
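The variance comparison behind this choice is not shown; a one-line sketch of the check:

    # Hedged sketch: variance after k = 1..4 differences at lag 1;
    # it should fall through k = 3 and rise again at k = 4
    sapply(1:4, function(k) var(diff(y.log, lag = 1, differences = k)))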

The data is now stationary: there is no trend component, no seasonal component, and no apparent change in behavior over time (constant variance). This fulfills the assumptions needed to model the data using autoregressive models, moving-average models, or a combination of the two.

Determining orders p and q in ARIMA(p,d,q)

Recall that d = 3 because I differenced three times. To determine the orders q and p, I examined the autocorrelation and partial autocorrelation functions (ACF and PACF) of the data, respectively.
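The ACF/PACF plots are not reproduced in this rendering; a minimal sketch of the calls that would generate them (lag.max = 40 is an assumption, chosen to make the cut-offs visible):

    # Hedged sketch: sample ACF and PACF of the differenced, log-transformed series
    op <- par(mfrow = c(1, 2))
    acf(y.log.diff3, lag.max = 40, main = "ACF")
    pacf(y.log.diff3, lag.max = 40, main = "PACF")
    par(op)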

Based on these plots, I chose p = 9 because the PACF plot cuts off after lag 9, and q = 12 because the ACF cuts off after lag 12. I also decided to consider q = 10 because it is near the cut-off.

Computing the AICc values, we see that the suggested model is ARIMA(9,3,4), since it has the lowest AICc value:

    ##       q
    ##  p            0           1           2           3           4           5
    ##  0    -45.26117   -44.13300   -49.20528   -98.17205   -97.02301   -98.24525
    ##  1    -44.51876   -58.12204   -86.42122   -97.74570  -100.45116   -98.17965
    ##  2    -48.81665   -61.17443   -90.04957   -99.02602   -96.90065   -96.19086
    ##  3    -74.87107   -72.72797   -70.52182   -96.81354   -94.47301   -94.63523
    ##  4    -72.72975   -70.50051   -91.92119   -94.77582   -94.71976   -98.80855
    ##  5    -70.50369   -68.22881   -81.04330   -92.93250   -94.70753   -96.76475
    ##  6    -69.07522   -66.83608   -90.92387   -90.97813   -88.53458   -95.27095
    ##  7    -66.93241   -64.57819   -83.60420   -88.56724   -86.08579   -93.86304
    ##  8    -64.72652   -71.19911   -96.18601   -86.77014   -98.28442   -93.05992
    ##  9   -105.95888  -103.57372  -103.83999  -107.78245  -115.12486  -112.49422
    ## 10   -103.62112  -102.93389  -109.03763  -108.45496  -113.21813  -109.76424
    ##       q
    ##  p            6           7           8           9          10          11
    ##  0    -95.97585   -95.00469   -94.60843   -95.77398   -96.86195   -97.37413
    ##  1    -93.66192   -92.78347   -93.48173   -93.56059   -95.32118   -99.41840
    ##  2    -94.75303  -102.07764   -96.78960  -103.25862   -97.39758   -97.96232
    ##  3   -102.05958   -99.65943   -97.20928  -101.13086   -98.84381  -100.59925
    ##  4    -94.64313   -97.44794   -95.02486   -98.75350   -97.59833   -98.54881
    ##  5    -94.27261   -93.23994   -93.98237   -92.44347   -96.78693   -96.56560
    ##  6    -97.71835   -92.29330   -97.58880   -93.67505   -95.32422   -95.15804
    ##  7    -92.79064   -90.81093   -99.44669   -90.84134   -96.50648   -97.65932
    ##  8    -92.08319   -92.18627  -103.41549   -99.30991   -92.62578   -92.86949
    ##  9   -111.35722  -110.30131  -107.63040  -105.69607  -104.03337  -102.92212
    ## 10   -109.56074  -107.50499  -104.68475  -102.94884  -100.11352   -99.96113
    ##       q
    ##  p           12
    ##  0   -103.50696
    ##  1   -101.14378
    ##  2    -98.49308
    ##  3   -101.15244
    ##  4    -95.22089
    ##  5   -101.34034
    ##  6    -97.79322
    ##  7    -94.85595
    ##  8    -99.02252
    ##  9   -100.87824
    ## 10    -98.19247
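The loop that produced this grid is not shown in the rendering. A sketch of how it could be generated, assuming qpcR's AICc() and the same arima() call used for the model fits below; the tryCatch guards against orders that fail to converge:

    # Hedged sketch: AICc grid over candidate ARIMA(p, 3, q) models
    aiccs <- matrix(NA, nrow = 11, ncol = 13, dimnames = list(p = 0:10, q = 0:12))
    for (p in 0:10) {
      for (q in 0:12) {
        aiccs[p + 1, q + 1] <- tryCatch(
          AICc(arima(y.log.diff3, order = c(p, 3, q), method = "ML")),
          error = function(e) NA)
      }
    }
    aiccs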

Models and coefficients

To summarize, the potential models were ARIMA(9,3,4), ARIMA(9,3,10), and ARIMA(9,3,12).
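The fitting calls are not displayed in this rendering, but they can be read off the Call: lines in the output below:

    # fit each candidate model by maximum likelihood (reconstructed from the Call: lines)
    fit1 <- arima(y.log.diff3, order = c(9, 3, 4), method = "ML")
    fit2 <- arima(y.log.diff3, order = c(9, 3, 10), method = "ML")
    fit3 <- arima(y.log.diff3, order = c(9, 3, 12), method = "ML")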

Coefficients for ARIMA(9,3,4)

    fit1

    ##
    ## Call:
    ## arima(x = y.log.diff3, order = c(9, 3, 4), method = "ML")
    ##
    ## Coefficients:
    ##           ar1      ar2      ar3      ar4      ar5      ar6      ar7
    ##       -1.6026  -1.6294  -1.5331  -1.2930  -1.0476  -0.7993  -0.3455
    ## s.e.   0.0984   0.1880   0.2483   0.2838   0.2957   0.2859   0.2554
    ##          ar8     ar9      ma1      ma2      ma3     ma4
    ##       0.1316  0.0648  -0.9587  -0.0733  -0.9587  1.0000
    ## s.e.  0.1971  0.1053   0.0517   0.0416   0.0860  0.0676
    ##
    ## sigma^2 estimated as 0.02125:  log likelihood = 41.35,  aic = -54.71

Coefficients for ARIMA(9,3,10)

    fit2

    ##
    ## Call:
    ## arima(x = y.log.diff3, order = c(9, 3, 10), method = "ML")
    ##
    ## Coefficients:
    ## Warning in sqrt(diag(x$var.coef)): NaNs produced
    ##           ar1      ar2      ar3     ar4     ar5      ar6     ar7     ar8
    ##       -0.7650  -0.3291  -0.3175  0.1795  0.1020  -0.1336  0.0769  0.1861
    ## s.e.   0.1954   0.1938      NaN  0.1553  0.1413   0.1641  0.1964  0.1495
    ##           ar9      ma1     ma2      ma3     ma4      ma5      ma6     ma7
    ##       -0.1146  -1.9193  0.8356  -0.8011  1.3540  -0.0620  -0.3721  0.4405
    ## s.e.   0.1131   0.1525  0.3674   0.3674  0.1781   0.4545   0.2293  0.3538
    ##           ma8     ma9    ma10
    ##       -0.8922  0.3514  0.0657
    ## s.e.   0.3590     NaN     NaN
    ##
    ## sigma^2 estimated as 0.01843:  log likelihood = 47.81,  aic = -55.62

Coefficients for ARIMA(9,3,12)

    fit3

    ##
    ## Call:
    ## arima(x = y.log.diff3, order = c(9, 3, 12), method = "ML")
    ##
    ## Coefficients:
    ## Warning in sqrt(diag(x$var.coef)): NaNs produced
    ##           ar1      ar2      ar3      ar4      ar5      ar6     ar7     ar8
    ##       -1.3845  -1.2976  -1.6356  -1.2685  -1.0397  -0.7891  0.0031  0.3462
    ## s.e.      NaN      NaN      NaN      NaN      NaN      NaN     NaN  0.4382
    ##          ar9      ma1     ma2      ma3     ma4     ma5      ma6     ma7
    ##       0.0956  -1.4021  0.3699  -0.4667  0.3423  0.2777  -0.8754  0.9266
    ## s.e.  0.1640      NaN     NaN   0.9555  0.3471  0.9686   0.8964     NaN
    ##           ma8     ma9    ma10     ma11    ma12
    ##       -0.0574  0.2760  0.0593  -0.7391  0.2888
    ## s.e.      NaN  0.7273  0.3270   0.8854  0.4425
    ##
    ## sigma^2 estimated as 0.01563:  log likelihood = 53.07,  aic = -62.13

Residual Diagnostics

ARIMA(9,3,4)

Test for Independence

The model residuals passed the Ljung-Box and Box-Pierce tests for independence; we fail to reject the null hypothesis that the residuals are independently distributed.

    # test for independence of residuals
    Box.test(residuals(fit1), type = "Ljung")

    ##
    ##  Box-Ljung test
    ##
    ## data:  residuals(fit1)
    ## X-squared = 4.1158e-09, df = 1, p-value = 0.9999

    Box.test(residuals(fit1), type = "Box-Pierce")

    ##
    ##  Box-Pierce test
    ##
    ## data:  residuals(fit1)
    ## X-squared = 4.0093e-09, df = 1, p-value = 0.9999

Tests for Normality

The model residuals passed the Shapiro-Wilk test for normality; we fail to reject the null hypothesis that the residuals are normally distributed.

    # test normality of residuals
    shapiro.test(residuals(fit1))

    ##
    ##  Shapiro-Wilk normality test
    ##
    ## data:  residuals(fit1)
    ## W = 0.98701, p-value = 0.3436

    ts.plot(residuals(fit1), main = "Fitted Residuals")

    # plot diagnostics of residuals in a 2x2 grid, leaving room for an outer title
    op <- par(mfrow = c(2, 2), oma = c(0, 0, 2, 0))
    # ACF of the residuals and of the squared residuals (the latter reveals
    # dependence between the squares, which would call for non-linear models)
    acf(residuals(fit1), main = "Autocorrelation")
    acf((residuals(fit1))^2, main = "Autocorrelation of Squares")
    # PACF
    pacf(residuals(fit1), main = "Partial Autocorrelation")
    # histogram
    hist(residuals(fit1), main = "Histogram")

    # normal Q-Q plot of the residuals
    qqnorm(residuals(fit1))
    qqline(residuals(fit1), col = "blue")
    # add overall title and restore the plotting parameters
    title("Fitted Residuals Diagnostics", outer = TRUE)
    par(op)

ARIMA(9,3,10)

Test for Independence

The model residuals passed the Ljung-Box and Box-Pierce tests for independence; we fail to reject the null hypothesis that the residuals are independently distributed.
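The test calls are not shown in this rendering; they mirror the ones used for fit1 (reconstructed from the data: lines in the output):

    Box.test(residuals(fit2), type = "Ljung")
    Box.test(residuals(fit2), type = "Box-Pierce")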

    ##
    ##  Box-Ljung test
    ##
    ## data:  residuals(fit2)
    ## X-squared = 0.081005, df = 1, p-value = 0.7759

    ##
    ##  Box-Pierce test
    ##
    ## data:  residuals(fit2)
    ## X-squared = 0.07891, df = 1, p-value = 0.7788

Tests for Normality

The model residuals passed the Shapiro-Wilk test for normality; we fail to reject the null hypothesis that the residuals are normally distributed.
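As before, the call mirrors the one used for fit1:

    shapiro.test(residuals(fit2))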

    ##
    ##  Shapiro-Wilk normality test
    ##
    ## data:  residuals(fit2)
    ## W = 0.98743, p-value = 0.3714

ARIMA(9,3,12)

Test for Independence

The model residuals passed the Ljung-Box and Box-Pierce tests for independence; we fail to reject the null hypothesis that the residuals are independently distributed.
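Again, the calls mirror those used for fit1:

    Box.test(residuals(fit3), type = "Ljung")
    Box.test(residuals(fit3), type = "Box-Pierce")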

    ##
    ##  Box-Ljung test
    ##
    ## data:  residuals(fit3)
    ## X-squared = 0.0059557, df = 1, p-value = 0.9385

    ##
    ##  Box-Pierce test
    ##
    ## data:  residuals(fit3)
    ## X-squared = 0.0058017, df = 1, p-value = 0.9393

Tests for Normality

The model residuals passed the Shapiro-Wilk test for normality; we fail to reject the null hypothesis that the residuals are normally distributed.
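And the corresponding normality check:

    shapiro.test(residuals(fit3))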

    ##
    ##  Shapiro-Wilk normality test
    ##
    ## data:  residuals(fit3)
    ## W = 0.99343, p-value = 0.869

Forecasting

To test the models, I refit each one to the series with the last five data points removed and then forecast five steps ahead.

    # data forecasting: truncate the series, removing the last five points
    AIDS_ts_mod <- ts(AIDS$Cases, start = c(1993, 1), freq = 12, end = c(2002, 4))

    # plot data with five fewer points
    mod <- window(AIDS_ts_mod)
    # fit the modified data using our models
    fit1_AIDS <- arima(mod, order = c(9, 3, 4), method = "ML")
    fit2_AIDS <- arima(mod, order = c(9, 3, 10), method = "ML")
    fit3_AIDS <- arima(mod, order = c(9, 3, 12), method = "ML")
    # use the models to forecast five values ahead
    fcast1_AIDS <- forecast(fit1_AIDS, h = 5)
    fcast2_AIDS <- forecast(fit2_AIDS, h = 5)
    fcast3_AIDS <- forecast(fit3_AIDS, h = 5)
    plot(fcast1_AIDS, xlab = "year", ylab = "Number of cases")
  12. plot(fcast1_AIDS, xlab = "year", ylab="Number of cases")

    plot(fcast2_AIDS, xlab = "year", ylab = "Number of cases")

    plot(fcast3_AIDS, xlab = "year", ylab = "Number of cases", col = "seagreen")

    # compare our model fits visually
    ts.plot(AIDS_ts_mod, main = "Fitted using ARIMA(9,3,4)", ylab = "Number of cases", xlab = "year")
    lines(fitted(fcast1_AIDS), col = "goldenrod")

    ts.plot(AIDS_ts_mod, main = "Fitted using ARIMA(9,3,10)", ylab = "Number of cases", xlab = "year")
    lines(fitted(fcast2_AIDS), col = "skyblue")

    ts.plot(AIDS_ts_mod, main = "Fitted using ARIMA(9,3,12)", ylab = "Number of cases", xlab = "year")
    lines(fitted(fcast3_AIDS), col = "seagreen")  # this one seems to be the best

    # compare the forecasted values with the five values that we removed
    AIDS <- read.csv("AIDS.csv", header = TRUE, sep = ",")
    AIDS_ts_orig <- ts(AIDS$Cases, start = c(1993, 1), freq = 12)  # original data
    plot(fcast1_AIDS, xlab = "year", ylab = "Number of cases")
    lines(AIDS_ts_orig)

    plot(fcast2_AIDS, xlab = "year", ylab = "Number of cases")  # this one seems to do better overall
    lines(AIDS_ts_orig)

    plot(fcast3_AIDS, xlab = "year", ylab = "Number of cases")
    lines(AIDS_ts_orig)
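Beyond the visual comparison, the forecasts can be checked numerically against the held-out points. A hedged sketch using forecast's accuracy(); the hold-out window starting in May 2002 is an assumption based on the truncation at April 2002 above:

    # Hedged sketch: test-set error measures for each model's five-step forecast
    held_out <- window(AIDS_ts_orig, start = c(2002, 5))  # assumed hold-out window
    accuracy(fcast1_AIDS, held_out)
    accuracy(fcast2_AIDS, held_out)
    accuracy(fcast3_AIDS, held_out)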