# A tibble: 6 × 4
year month total_cases date
<dbl> <ord> <dbl> <date>
1 2016 January 32047 2016-01-01
2 2016 February 26018 2016-02-01
3 2016 March 33362 2016-03-01
4 2016 April 24993 2016-04-01
5 2016 May 45026 2016-05-01
6 2016 June 68702 2016-06-01
Notes:
Columns: year, month, total_cases, date (monthly start dates).
We standardize month ordering for consistent plotting.
Visual Inspection
Before modeling, always visualize the raw data to identify obvious patterns.
Do you see a trend? (Is it generally increasing or decreasing?)
Do you see seasonality? (Is there a repeating pattern within each year?)
📦 Handling Time Series Data
Time Series Objects in R
ts: The classic base R time series object. Required for the forecast package.
zoo / xts: Robust options for irregular time series or finance.
tsibble: Modern “tidy” time series frames (used with fable).
For this module, we focus on ts objects for ARIMA modeling.
Creating a ts Object
Use ts() for the forecast package functions like auto.arima.
Code
# We need:# 1. The data vector ($total_cases)# 2. The start date (Year, Month)# 3. The frequency (12 for monthly)malaria_ts <-ts(malaria_data$total_cases, start =c(2016, 1), frequency =12)
Code
# Plot it with the generic `autoplot`autoplot(malaria_ts) +geom_line(color ="steelblue", linewidth =0.7) +labs(title ="Malaria Cases as a 'ts' Object") +theme_minimal()
Seasonal plot to inspect monthly patterns
Code
# Seasonal plot to inspect monthly patternslibrary(forecast)ggseasonplot(malaria_ts, linewidth =0.7) +scale_y_continuous(labels =label_number(scale_cut =cut_short_scale())) +labs(title ="Seasonal Plot of Malaria Cases") +theme_minimal()
🔍 Diagnostics & Decomposition
What is Decomposition?
Decomposition means splitting the time series into its components: \(Y(t) = Trend(t) + Seasonal(t) + Residual(t)\) (Additive)
A stationary series is one whose statistical properties (mean, variance, covariance) do not change over time.
It has no trend and no changing seasonality.
Why care? Most time series models (like ARIMA) assume the data is stationary.
How? We often difference the data (e.g., \(Y_t - Y_{t-1}\)) to make it stationary. This is the “I” (for Integrated) in ARIMA.
Types:
Strict (all moments invariant) vs. weak/second-order (mean/variance/ACF invariant)
Trend-stationary (deterministic trend removable) vs. difference-stationary (unit root)
Exercise:
Inspect the malaria plot: does level or variance change over time? What transformation might help?
Checking for Stationarity: ACF/PACF
ACF (Autocorrelation Function): Measures the correlation between \(Y_t\) and \(Y_{t-k}\) (past values).
PACF (Partial Autocorrelation Function): Measures the correlation between \(Y_t\) and \(Y_{t-k}\)after removing the effects of the lags in between.
. . .
ACF/PACF Rules of Thumb:
Non-stationary (Trend): ACF plot shows a very slow, linear decay.
Seasonal: ACF plot shows strong spikes at the seasonal lags (12, 24, 36…).
Stationary AR(p): ACF decays, PACF cuts off after lag p.
Stationary MA(q): ACF cuts off after lag q, PACF decays.
ACF/PACF Plots in R
Code
# Create training and test sets# Training: 2016-2022 (7 years, 84 obs)# Test: 2023 (1 year, 12 obs)train_ts <-window(malaria_ts, end =c(2022, 12))test_ts <-window(malaria_ts, start =c(2023, 1))
Let’s look at the ACF for our training data.
Code
# We use ggAcf and ggPacf from the 'forecast' packageggAcf(train_ts, lag.max =36) +labs(title ="ACF for Malaria Training Cases") +theme_bw()
Observation: Very slow decay and strong spikes at 12, 24, 36. This data is not stationary and is seasonal.
Stationarity Tests
We use statistical tests to be more formal.
ADF (Augmented Dickey-Fuller) Test:
\(H_0\) (Null): The series is non-stationary (has a unit root).
\(H_a\) (Alternative): The series is stationary.
Goal: We want a small p-value (< 0.05) to reject \(H_0\) and conclude the series is stationary.
. . .
Code
library(tseries) # For adf.test# Run ADF test on the training dataadf.test(train_ts)
Augmented Dickey-Fuller Test
data: train_ts
Dickey-Fuller = -2.3861, Lag order = 4, p-value = 0.4175
alternative hypothesis: stationary
Result: The p-value is large (e.g., > 0.1), so we fail to reject \(H_0\). We conclude the data is non-stationary (as we saw in the plot).
Differencing
Code
# Create a first-differenced series:train_ts_diff1 =diff(train_ts, lag =1)# create a seasonal differenced seriestrain_ts_seas_diff <-diff(train_ts, lag =12)# create combined differences (non seasonal and seasonal differencing)train_ts_comb_diff <-diff(diff(train_ts, lag =12))
ARIMA (Autoregressive Integrated Moving Average) is a popular statistical method used for analyzing and forecasting time series data. It predicts future points in a series by relying on its own past values, errors, and trends.
ARIMA(p, d, q) combines three key components to model:
AR (p): Autoregressive part
\(X_t\) is a weighted sum of p past values.
“The value today is a function of yesterday’s value.”
I (d): Integrated part
d is the number of differences needed to make the series stationary.
“We model the change from yesterday, not the raw value.”
MA (q): Moving Average part
\(X_t\) is a weighted sum of q past forecast errors.
“The value today is a function of yesterday’s error.”
For a stationary series (after appropriate differencing), we typically model it with ARMA/ARIMA.
What is SARIMA?
SARIMA = Seasonal ARIMA. It adds seasonal components.
SARIMA(p, d, q) (P, D, Q)[s]
(p, d, q): The non-seasonal parts (from the previous slide).
(P, D, Q): The seasonal AR, I, and MA parts.
[s]: The seasonal period (s=4,12, etc for us).
Manually finding these 6 numbers is not easy!
ARMA and SARMA Equations
For a stationary series (\(X_t\)) (after differencing), an ARMA((p,q)) model is
\(s\): seasonal period (e.g., \(s = 12\) for monthly data).
\(d\): order of nonseasonal differencing.
\(D\): order of seasonal differencing.
\(P\) and \(Q\): order of seasonal AR and MA components, respectively.
\(\Phi_1, \dots, \Phi_P\): seasonal AR parameters.
\(\Theta_1, \dots, \Theta_Q\): seasonal MA parameters.
Sometimes a constant or drift term is included (handled via include.constant or include.drift in forecast::Arima).
ACF/PACF Patterns for ARMA Models
(for stationary series, after appropriate differencing)
Model
ACF behavior
PACF behavior
AR(p)
Tails off
Cuts off after lag (p)
MA(q)
Cuts off after lag (q)
Tails off
ARMA(p,q)
Tails off
Tails off
Notes:
“Cuts off” = near-zero after that lag.
“Tails off” = decays gradually, often geometrically or in a damped pattern.
Use these patterns on the differenced (stationary) series to guide choice of (p) and (q).
4. Building ARIMA/SARIMA Model
We’ll fit the model to our training data (train_ts).
Use auto.arima() for a strong baseline; validate via residual diagnostics and Ljung–
Forecast with intervals; compare models by AIC, BIC, and AICc and accuracy
Code
# Create training and test sets: last 12 months as test# Training: 2016-2022 (7 years, 84 obs)# Test: 2023 (1 year, 12 obs)train_ts <-window(malaria_ts, end =c(2022, 12))test_ts <-window(malaria_ts, start =c(2023, 1))
Code
# Find the best ARIMA model automaticallylibrary(forecast)fit_auto <-auto.arima( train_ts,seasonal =TRUE,stepwise =FALSE, approximation =FALSE)fit_auto
# Example manual model: seasonal difference often D=1 for monthly# Use ACF/PACF patterns to pick p,q,P,Q; here we try SARIMA(4,1,3)(0,1,0)[12]fit_manual <- forecast::Arima(train_ts,order =c(4,1,3),seasonal =c(0,1,0),include.constant =FALSE)fit_manual
Our model (fit_auto) was built only on the training set.
We validated it on the test set.
Now, to forecast the real future, we should re-fit the model on ALL available data (malaria_ts).
Code
# Example solution for the wrap-up exercise:fit_best <- forecast::auto.arima(malaria_ts, stepwise =FALSE, approximation =FALSE)fc_best <- forecast::forecast(fit_best, h =12)autoplot(fc_best) +labs(title ="12-Month Forecast on full series",y ="Total cases") +scale_y_continuous(labels =label_number(scale_cut =cut_short_scale())) +theme_minimal(base_size =14)