Content

Install statsmodels in Python

Installing the statsmodels library directly may cause some issues as it will install the latest version of the library. You can install the one used in this cheat sheet by specifying the package version. In this way, the syntax used here will match the one in your code.

Python
!pip install statsmodels

# Install specific version of the package
!pip install statsmodels==0.13.5

Importing libraries

Import some of the basic python libraries in addition to the ones specific for Time Series forecasting with ARIMA.

Python
# Basic python libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# ARIMA-related libraries
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller

Data Preprocessing

Read your data and do basic preprocessing and visualization.

Python
# load data
data = pd.read_csv('data.csv', index_col='date', parse_dates=True)

# check for missing values
data.isnull().sum()

# fill missing values (if any)
data.fillna(method='ffill', inplace=True)

# plot the data
plt.plot(data)
plt.title('Time Series Data')
plt.xlabel('Time')
plt.ylabel('Value')
plt.show()

Stationarity Testing

Before applying an ARIMA model you should make sure it is stationary. ACF and PACF plots will allow you to see correlation and autocorrelation with previous lags, as well as infer the most suitable parameters for your model.

Python
# Augmented Dickey-Fuller Test for Stationarity
result = adfuller(data['value'])
print('ADF Statistic:', result[0])
print('p-value:', result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))
    
# plot ACF and PACF to determine order of differencing
plot_acf(data, lags=15)
plot_pacf(data, lags=15)

You can remove or increase/decrease the number of lags in the plot_acf and plot_pacf functions depending on your data and needs. You won’t need to specify this argument unless your data is too small, in that case, you will need to choose a number of lags lower than the length of your data.

Subscribe to our newsletter for free to receive more interesting content weekly!

Model Fitting

Once you have estimated the ARIMA parameters you can fit the model. It is helpful to see the summary of your model as well as the distribution of the residuals.

Python
# fit the ARIMA model
model = ARIMA(data, order=(p, d, q))
model_fit = model.fit()

# print summary
print(model_fit.summary())

# plot residual errors
residuals = pd.DataFrame(model_fit.resid)
residuals.plot()
plt.show()
residuals.plot(kind='kde')
plt.show()
print(residuals.describe())

Making Predictions

You can now forecast “n” periods in the future and visualize them.

Python
# forecast future values
future_values = model_fit.forecast(steps=n)

# plot future values
plt.plot(future_values)
plt.title('Forecasted Values')
plt.xlabel('Time')
plt.ylabel('Value')
plt.show()

Rolling Forecast

The most adequate way of forecasting is through the rolling forecast technique.

Python
from sklearn.metrics import mean_squared_error

# split data into training and testing sets
train_size = int(len(data) * 0.8)
train_data, test_data = data[0:train_size], data[train_size:]

# fit the model on the training set
model = ARIMA(train_data, order=(p, d, q))
model_fit = model.fit()

# make predictions on the testing set
predictions = []
for i in range(len(test_data)):
    # predict one step ahead
    yhat = model_fit.forecast()[0]
    # append prediction to list
    predictions.append(yhat)
    # add the actual observation to 
    # the training set for the next iteration
    obs = test_data[i]
    train_data = np.append(train_data, obs)
    # re-fit the model on the updated training set
    model = ARIMA(train_data, order=(p, d, q))
    model_fit = model.fit()

# convert the predictions to dataframe
predictions = pd.DataFrame(predictions, 
                           columns=['prediction'], 
                           index=test_data.index)

# calculate RMSE
rmse = np.sqrt(mean_squared_error(test_data, predictions))
print('Test RMSE: %.3f' % rmse)

# plot predictions and actual values
plt.plot(test_data)
plt.plot(predictions, color='red')
plt.title('Rolling Forecast Predictions')
plt.xlabel('Time')
plt.ylabel('Value')
plt.show()

Choosing Model Order

Autocorrelation and Partial Autocorrelation Plots

ACF and PACF plots will allow you to determine the order of your model.

Python
# plot autocorrelation and partial autocorrelation plots
plot_acf(data, lags=50)
plot_pacf(data, lags=50)
plt.show()

You won’t need to specify this argument unless your data is too small, in that case, you will need to choose a number of lags lower than the length of your data.

Grid Search for Best Parameters

Selecting the optimal parameters is not always possible by using the ACF and PACF plots. That’s why we should use a grid search to find them.

Python
# define range of p, d, q values to try
p_values = range(0, 5)
d_values = range(0, 2)
q_values = range(0, 5)

# grid search ARIMA parameters
best_aic, best_order = float("inf"), None
for p in p_values:
    for d in d_values:
        for q in q_values:
            try:
                model = ARIMA(data, order=(p, d, q))
                model_fit = model.fit()
                if model_fit.aic < best_aic:
                    best_aic, best_order = model_fit.aic, (p, d, q)
                print('ARIMA%s AIC=%.3f' % ((p, d, q), model_fit.aic))
            except:
                continue
print('Best ARIMA%s AIC=%.3f' % (best_order, best_aic))

p, d, q = best_order

Additional Techniques

Seasonal ARIMA

If our model has seasonality we can use a SARIMA model.

Python
from statsmodels.tsa.statespace.sarimax import SARIMAX

# fit the seasonal ARIMA model
model = SARIMAX(data, order=(p, d, q), seasonal_order=(P, D, Q, m))
model_fit = model.fit()

# print summary
print(model_fit.summary())

ARIMA / SARIMA with exogenous variables

You can also add exogenous variables. Bear in mind that you must have these exogenous variables available for making the forecasting.

Python
# Build and fit the ARIMAX model
model = sm.tsa.statespace.SARIMAX(
  data, 
  order=(p, d, q), 
  seasonal_order=(P,D,Q,s),
  exog=train_exog_data
  )  
model_fit = model.fit()

# Make predictions
predictions = model_fit.predict(
  start=len(train_data), 
  end=len(train_data)+len(test_data)-1, 
  exog=test_exog_data
  )

Auto ARIMA

Finding the best parameters can get difficult, especially if you are dealing with a SARIMA model. The Auto ARIMA library can help.

Python
# Install the pmdarima if you don't have it
!pip install pmdarima==2.0.3

# Import the library
from pmdarima.arima import auto_arima

# Build and fit the AutoARIMA model
model = auto_arima(data, seasonal=True, m=4, suppress_warnings=True)
model.fit(data)

# Check the model summary
model.summary()

# Make predictions
predictions = model.predict(n_periods=5)

# Format as dataframe
predictions = predictions.to_frame(name='predictions')


0 Comments

Leave a Reply

Avatar placeholder

Your email address will not be published. Required fields are marked *