Using Statsmodels for Seasonal ARIMA modeling

Zach Zazueta
6 min read · Jan 26, 2020

Oftentimes in life and in business it is helpful to understand how something you are interested in will change over time. If you oversaw the MTA, you might be curious to understand the length of delays on a given train line over a series of months. If you are an investor, you might investigate weekly performance of stocks and want to predict where prices will fall next quarter. Or if you are a brand developer for a software startup, you might want to grasp the potential daily churn rate of active users on the platform.

All of this data would be represented as a time series, which is simply a series of data points indexed in time order. The observations are recorded at equally spaced points in time, be that by the month, week, day, or even microsecond. Normally we could use basic regression techniques to understand a pattern and build a predictive model. However, when we want to model an outcome against time, many of these techniques fail to capture the more complex patterns, such as trend, seasonality, and autocorrelation, that are commonly present in time series data. This is where the Seasonal ARIMA model comes in.

ARIMA is an acronym for Autoregressive Integrated Moving Average, a class of models that both helps explain time series data and forecasts future points. The AR portion of the model regresses a value from the time series on a certain number of lagged values from the same series. “I” stands for “Integrated” and indicates that the data was non-stationary and was differenced, a process that replaces each data value with the difference between that value and a previous value (the immediately preceding one for ordinary differencing, or the value one season earlier for seasonal differencing). Finally, the MA portion of the model specifies that the output variable depends linearly on the current and various past values of a random error term.
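For reference, one common way to write the non-seasonal part of the model is sketched here. The symbols are assumptions chosen for illustration: B is the backshift operator (B y_t = y_{t-1}), y'_t is the series after differencing d times, c is a constant, the phi terms are the AR coefficients, the theta terms are the MA coefficients, and epsilon_t is the random error term.

```latex
y'_t = (1 - B)^d \, y_t
\qquad
y'_t = c + \sum_{i=1}^{p} \phi_i \, y'_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j}
```

The first sum is the AR(p) part, the second sum is the MA(q) part, and the differencing in the first equation is the "I" part.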

Seasonal ARIMA models are denoted ARIMA(p,d,q)(P,D,Q)m. The lowercase p,d,q represent the terms for the non-seasonal portion of the model while the uppercase P,D,Q refer to the terms for the seasonal part of the model. The p and P terms refer to the order, or number of lags, in the autoregressive portion of the model. This order can be tentatively selected with the partial autocorrelation function (PACF) plot. The PACF is the amount of correlation between a variable and its lag not explained by correlations at all lower order lags. To plot the next few charts, the following code should be run:

In the above plot, we see the partial autocorrelation on the y-axis and the lag periods on the x-axis. The blue shading denotes the significance threshold. Notice that the PACF has a significant spike at the first lag. The second lag, although negative, is also significant, but its partial autocorrelation is much smaller in magnitude than that of the first lag. The AR order p could be either 1 or 2 for this data. The seasonal AR order P would likely be 0 because there is no spike at either the 12 or 24 lag point, indicating no significant seasonal terms.

The q and Q terms refer to the order of the MA term and can be tentatively selected using the autocorrelation function (ACF) plot. The ACF plot is essentially a bar chart of the coefficients of correlation between a time series and lags of itself.

As with the PACF plot, the blue shading is the confidence band, the x-axis shows the lag, and the y-axis denotes the autocorrelation of the series with its lagged values. Here, the autocorrelations are significant for many lags (6–7); however, the autocorrelations at lag 3 and above are largely due to the significant autocorrelation at the first and second lags (as seen in the PACF). The MA order q could be estimated at up to 3 to be safe. The seasonal MA order Q would likely be 0, as there are no significant spikes at the 12, 24, or 36 lag periods.

Lastly, the I term refers to the d and D terms, indicating the number of times the data should be differenced. For time series data to be considered stationary, it must meet three criteria.

· First, the mean of the series should be constant, not a function of time.

· Second, the variance of the series should not be a function of time; the series should be homoscedastic.

· Third, the covariance between the series and its lagged values should not be a function of time; it should depend only on the lag.

When any of these criteria is not met, the data is non-stationary. As mentioned above, differencing is a method that can be used to transform the data into a stationary series. By computing the difference between consecutive observations, you remove changes in the level of a time series, which reduces or eliminates trend. Differenced data can still show a trend, in which case you would need to difference the data a second time, otherwise referred to as second order differencing. There are also cases where you would need to difference the data on a seasonal period to remove seasonality; an example might be differencing January prices in one year against January prices from the prior year. Below is an example of initial data that shows a trend, the first seasonal difference of that data which also shows a trend, and the second order seasonal difference of the data.
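To make the differencing step concrete, here is a small sketch using pandas' `diff`. The data is synthetic (a quadratic trend plus noise, an assumption for illustration) so that the first difference still trends and the second does not; `trend_slope` is a hypothetical helper that fits a straight line to gauge any remaining trend.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption): quadratic trend plus noise, monthly.
rng = np.random.default_rng(7)
t = np.arange(120)
values = 0.05 * t**2 + rng.normal(scale=1.0, size=120)
series = pd.Series(values, index=pd.date_range("2010-01-01", periods=120, freq="MS"))

# First order differencing: y_t - y_{t-1}
first_diff = series.diff().dropna()

# Second order differencing: difference the first difference again
second_diff = series.diff().diff().dropna()

# Seasonal differencing on a 12-month period: y_t - y_{t-12}
seasonal_diff = series.diff(12).dropna()


def trend_slope(s):
    """Slope of a least-squares line through the series; near 0 suggests no linear trend."""
    return np.polyfit(np.arange(len(s)), s.to_numpy(), 1)[0]


print("slope of original series:  ", trend_slope(series))
print("slope of first difference: ", trend_slope(first_diff))
print("slope of second difference:", trend_slope(second_diff))
```

Each round of differencing drops observations from the start of the series (one per ordinary difference, twelve for the seasonal difference here), which is why `dropna()` is applied.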

Clear indication of a linear upward trend
Indication of a positive linear trend remains
Finally, in the second differenced data we see data that is more stationary

Given the above graphs, I would estimate the I(d) term to be 2 and the I(D) term to be 0 due to the lack of noticeable seasonality in the original data.

Using the above graphs, we can form a baseline understanding of what the orders for the Seasonal ARIMA model should be, and can generate combinations of those candidate orders using the following code:
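The original snippet is not reproduced here; a minimal sketch of one way to build the combinations with `itertools.product` might look like this. The function name `sarima_order_grid` and its parameter names are assumptions for illustration.

```python
import itertools


def sarima_order_grid(p_values, d_values, q_values, P_values, D_values, Q_values, m):
    """Return every (order, seasonal_order) pair for the candidate values given.

    Hypothetical helper: order is (p, d, q) and seasonal_order is (P, D, Q, m).
    """
    orders = list(itertools.product(p_values, d_values, q_values))
    seasonal_orders = [
        (P, D, Q, m) for P, D, Q in itertools.product(P_values, D_values, Q_values)
    ]
    return list(itertools.product(orders, seasonal_orders))


# Example: try 0 or 1 for every term, with a 12-period season.
grid = sarima_order_grid([0, 1], [0, 1], [0, 1], [0, 1], [0, 1], [0, 1], m=12)
print(len(grid), "combinations; first:", grid[0])
```

The number of combinations grows multiplicatively with each candidate value, so it pays to keep the ranges close to what the plots suggest.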

I would then plug in my above estimated terms like so:
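The exact ranges used originally are not shown, so the values below are assumptions that simply bracket the estimates read off the plots (p of 1 or 2, q up to 3, d up to 2), with a small allowance for seasonal terms on a 12-period season in case the visual read missed something.

```python
import itertools

# Candidate non-seasonal orders, bracketing the estimates from the plots (assumed ranges).
p_values = [1, 2]
d_values = [1, 2]
q_values = [0, 1, 2, 3]

# Candidate seasonal orders; the plots suggested 0, so only 0 and 1 are tried (assumed ranges).
P_values = [0, 1]
D_values = [0, 1]
Q_values = [0, 1]
m = 12  # seasonal period: 12 observations per cycle

pdq = list(itertools.product(p_values, d_values, q_values))
seasonal_pdq = [
    (P, D, Q, m) for P, D, Q in itertools.product(P_values, D_values, Q_values)
]
param_grid = list(itertools.product(pdq, seasonal_pdq))

print(len(pdq), "non-seasonal orders x", len(seasonal_pdq), "seasonal orders =", len(param_grid))
```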

Then, using the above parameters, run the function below to iterate through the SARIMA model fits and examine the table of AIC scores for each combination. The Akaike Information Criterion (AIC) measures how well a model fits the data while penalizing the number of parameters, which discourages over-fitting; it is only meaningful for comparing models fit to the same data, and lower is better. We want to balance finding the lowest AIC score with selecting the simplest solution.

In this case, it is worth selecting the model parameters at index 250, (2, 1, 0)(2, 1, 0, 12), even though that model has a slightly higher AIC score than the options listed above it, because it is a simpler model.
