`autostm`

(Automatic Structural Time Series Model) is designed to automatically detect the appropriate decomposition for a univariate time series into trend, seasonal, and cycle components using a state space approach. The package also has the functionality perform structural interpolation: i.e. interpolated a quarterly series into a monthly series using either end of period, period average, or period sum methods. The unobserved components are estimated with the Kalman filter and all model parameters are estimated using maximum likelihood. The Kalman filter and smoother are implemented in `Rcpp`

so it is reasonably fast. This package is designed to handle time series of any frequency (standard or not), series with multiple seasonalities, seasonalities with fractional periodicities, missing data imputation, and irregularly spaced dates (i.e. daily data with missing data due to weekends and holidays, etc.).

With ease-of-use in mind, all the user has to do is provide a two column data frame: one column with dates and one column with the univariate time series. Everything else is handled by the algorithm including data frequency detection (observations per year), trend specification, cycle specification, seasonal specification, missing data imputation, anomaly detection, structural break detection, and whether or not to log transform the data. The user can manually set the preferred decomposition if desired, however. All components are assumed to be stochastic, which allows for time-varying trend, cycle, and seasonalities, unless the user specifies otherwise using the `det_trend`

, `det_drift`

, `det_seas`

, `det_cycle`

, and `det_obs`

arguments in `stsm_estimate`

. The algorithm can handle secondly, minutely, hourly, daily, monthly, and quarterly frequencies as well as daily data that is weekday only. It is also capable of handling non-standard frequencies (i.e. frequencies that are non of the above). To build an intepolation model, all the user has to do is set `interpolate`

and `interpolate_method`

. The user only needs to call `stsm_estimate`

to fit the model and `stsm_forecast`

to filter and forecast the data. Additionally, the user can call `stsm_detect_anomalies`

and `stsm_detect_breaks`

to perform anomaly and structural break detection respectively.

The model is based on the decomposition

\[ Y_t = T_t + C_t + S_t + B X_t + e_t \]

where \(Y_t\) is a univariate time series, \(T_t\) is the trend component, \(C_t\) is the cycle component, \(S_t\) is the seasonal component, \(X_t\) is optional exogenous data, and \(e_t \sim N(0, \sigma_e^2)\) is the observation error. The seasonal and cycle components are assumed to be of a trigonometric form, while the trend is assumed to be one of three types: a random walk, a random walk with an AR(1) drift, or a double random walk (random walk with a random walk drift).

For state space model is written as \[ Y_t = A + H \beta + BX + e_t, e_t \sim N(0, R)\\ \beta_t = D + F \beta_{t-1} + u_t, u_t \sim N(0, Q) \]

where the first equation is the observation equation and the second equation is the transition equation. \(H\) is the observation matrix and \(R\) is the observation error-covariance matrix. \(\beta_t\) is the vector of the unobserved components (trend, seasonal, and cycle), \(F\) is the transition matrix, and \(Q\) is the transition error-covariance matrix.

If trend is “random-walk” the trend model is specified as the I(1) process \[ T_t = T_{t-1} + e_t, e_t \sim N(0, \sigma_t^2) \]

If trend is “random-walk-drift”, the trend model is specified as the I(1) process \[ T_t = D_{t-1} + T_{t-1} + e_t, e_t \sim N(0, \sigma_t^2) \\ D_t = d + \phi_d D_{t-1} + n_t, n_t \sim N(0, \sigma_d^2) \] where \(D_t\) is the drift.

If trend is “double-random-walk”, the trend model is specified as the I(2) process \[ T_t = D_{t-1} + T_{t-1} + e_t, e_t \sim N(0, \sigma_t^2) \\ D_t = D_{t-1} + n_t, n_t \sim N(0, \sigma_d^2) \]

The cycle is modeled using either the trigonometric process

\[ \begin{bmatrix} c_t \\ c_t^{*} \end{bmatrix} = \phi_c \begin{bmatrix} cos(\lambda) & sin(\lambda) \\ -sin(\lambda) & cos(\lambda) \end{bmatrix} \begin{bmatrix} c_{t-1} \\ c_{t-1}^{*} \end{bmatrix} + \begin{bmatrix} u_t \\ u_t^{*} \end{bmatrix}, u_t, u_t^{*} \sim N(0, \sigma_c^2) \] where \(\lambda\) is the frequency of the cycle and \(\phi_c \in{(0,1)}\) to make the cycle stationary for forecasting purposes, or the ARMA(2,1) process

\[ \begin{bmatrix} c_t \\ c_{t-1} \\ \vdots \\ c_{t-p+2} \\ c_{t-p+1} \\ e_t \\ e_{t-1} \\ \vdots \\ e_{t-q+2} \\ e_{t-q+1} \end{bmatrix} = \begin{bmatrix} \phi_{c,1} & \phi_{c,2} & \ldots & \phi_{c,p-1} & \phi_{c,p} & \theta_{c,1} & \theta_{c,2} & \ldots & \theta_{c,q-1} & \theta_{c,q} \\ 1 & 0 & \ldots & 0 & 0 & \ldots & \ldots & \ldots & \ldots & 0 \\ 0 & \ddots & & \vdots & \vdots & \ddots & & \ddots & & \vdots \\ \vdots & & \ddots & \vdots & \vdots & & \ddots& & \ddots & \vdots \\ 0 & \ldots & \dots & 1 & 0 & \ldots & \ldots & \ldots & \dots & 0 \\ 0 & \ldots & \ldots & \ldots & 0 & \ldots & \ldots & \ldots & \ldots & 0 \\ \vdots & \ddots & & & \vdots & 1 & 0 & \ldots & 0 & \vdots \\ \vdots & & \ddots & & \vdots & 0 & \ddots & & \vdots & \vdots \\ \vdots & & & \ddots & \vdots & \vdots & & \ddots & \vdots & \vdots \\ 0 & \ldots & \ldots & \ldots & 0 & 0 & \ldots & \ldots & 1 & 0\\ \end{bmatrix} \begin{bmatrix} c_{t-1} \\ c_{t-2} \\ \vdots \\ c_{t-p+1} \\ c_{t-p} \\ e_{t-1} \\ e_{t-2} \\ \vdots \\ e_{t-q+1} \\e_{t-q} \end{bmatrix} + \begin{bmatrix} e_t \\ 0 \\ \vdots \\ \vdots \\ 0 \\ e_t \\ 0 \\ \vdots \\ \vdots \\0 \end{bmatrix}, e_t \sim N(0, \sigma_c^2) \] where \(p\) is the autoregressive lag length and \(q\) is the moving average lag length. The ARMA(p,q) specification can have real or complex roots, while the trigonometric specification will only have complex roots. The trigonometric specification is a special case of an ARMA(2,1) specification.

If the roots are complex, the dynamics will be damped oscillations about the trend, which may not always be desirable. If the user would like to remove oscillations from the forecast that are caused by the complex roots, the user can specify, `dampen_cycle = TRUE`

in `stsm_forecast`

. This option will smooth the forecast of the cycle into the trend using a fitted sigmoid curve.

Seasonality is modeled using a Fourier series, which is the sum of sin-cos waves with each wave representing a seasonal pattern. Total seasonality is \[ s_t = \sum_{j = 1}^{h} s_t^j \]

where each season \(s_t^j\) is modeled using the trigonometric specification below

\[ \begin{bmatrix} s_t^j \\ s_t^{j,*} \end{bmatrix} = \begin{bmatrix} cos(\lambda_j) & sin(\lambda_j) \\ -sin(\lambda_j) & cos(\lambda_j) \end{bmatrix} \begin{bmatrix} s_{t-1}^j \\ s_{t-1}^{j,*} \end{bmatrix} + \begin{bmatrix} w_t^j \\ w_t^{j,*} \end{bmatrix}, w_t^j \sim N(0, \sigma_j^2) \] where \(s_t^{j,*}\) exists by construction and is just an auxiliary variable, \(\lambda_j = \frac{2\pi j}{f}\) is the frequency of season \(j\), \(j\) represents the number of periods per year (i.e. \(j = 52\) represents weekly seasonality at 52 periods per year), and \(f\) is the frequency of the data (i.e. the number of observations per year). The seasons to include in the model are automatically detected in the algorithm using wavelet analysis. Each \(\lambda_j\) is predetermined from wavelet analysis and are not estimated parameters.

The automatic model specification algorithm follows the procedure below. Rather than using brute force selection by checking every possible model combination and selecting the model that minimizes a selection criteria like AIC, which can be computationally intensive and time consuming, the procedure uses a series a statistical tests to determine model specification. However, the user is free to write their own brute force model selection routine from the functions provided in this package as `stsm_estimate`

returns a table with the model’s log likelihood, AIC, AICc, and BIC.

The first step is to detect the frequency of the data by examining the sequence of dates provided. If the day difference in subsequent dates is closest to

- \(1\), then a frequency of 365.25 is used for daily data
- \(7\), then a frequency of 52.17857 is used for weekly data
- \(30\), then a frequency of 12 is used for monthly data
- \(90\), then a frequency of 4 is used for quarterly data
- \(365\), then a frequency of 1 used for yearly data

It can also detect sub-daily data if the day difference in subsequent dates is closest to - \(\frac{1}{24}\), then a frequency of 8760 is used for hourly data - \(\frac{1}{60 \cdot 24}\), then a frequency of 525600 is used for minutely data - \(\frac{1}{60 \cdot 60 \cdot 24}\), then a frequency of 31536000 is used for secondly data

The numeric frequency reflects the number of periods in one year. Once the frequency is detected, a sequence of continuous dates is constructed to ensure evenly spaced data. If an observation is missing for any of the dates in the constructed sequence, it is treated as a missing value. However, if the dates provided contain only weekdays and the data is of daily frequency for example, the frequency is changed to \(365.25 \cdot \frac{5}{7} = 260.8929\) to reflect the number of weekdays per year. The same concept is applied to sub-daily data with weekday observations only as well. If none of the above frequencies are detected, then the frequency is set to the number of rows in the data to allow the algorithm to search for seasonal patterns. A flag `standard_freq`

will be set to `FALSE`

in the final output.

Before any of the model specification tests can be performed, the data must be seasonally and cycle adjusted. This procedure is performed by decomposing the time series into a loess trend and a seasonal-cycle remainder. The `loess`

function in the `stats`

package requires that there are no missing values in the data, so any missing values are computed using the Kalman filter with a local trend using the `stats`

package’s `StructTS`

and `KalmanSmooth`

functions. If you know the missing data pattern in your data (i.e missing data should be 0), it’s best to fill in the missing data using your knowledge before allowing this method to fill in the rest. A smoother cycle is also estimated from the seasonal-cycle component using a smoothing spline from the `smooth.spline`

function of the `stats`

package. Once the seasonal and cycle periods are known, the seasonal-cycle is more accurately decomposed into the seasonal and cycle components by first decomposing the seasonal-cycle using the `mstl`

function from the `forecast`

package then regressing the seasonal component on a Fourier series composed of the seasonal periods.

The first step is wavelet analysis for seasonal detection, which is done using a series of harmonic regressions in parallel on the detrended data. The detrended data may also be rescaled by the trend if the Cox-Stuart deviation test suggests that the detrended data does not have a constant amplitude. The regression is performed individually for each seasonal harmonic using the equation

\[ \tilde{y_t} = a_j sin(2\pi\frac{j}{f}) + b_j cos(2\pi\frac{j}{f}) \]

for harmonics \(j\) that correspond to the spectrum of minutely, hourly, workday hours, half daily, daily, weekend, weekday, weekly, monthly, quarterly, semi-yearly, and yearly seasonal frequencies where applicable. However, since this limits the number of frequencies tested, a linear space of 100 items is created between each harmonic so that the space captures 1% of the distance between the original two harmonics. If a nonstandard frequency is detected, the harmonics are set to \(j = \lfloor \frac{f}{2} \rfloor\), but only taking 1000 equally spaced harmonics for harmonics that have at least 3 cycles in the data. This strategy limits the number of harmonics tested for speed, while also trying to capture some a priori important seasonalities based on the frequency of the data.

At each iteration, an F-test is performed using the `waldtest`

function from the `lmtest`

package to measure the significance of the harmonic in explaining the seasonality, understanding that this will not be robust to heteroskedasticity or autocorrelation. This accuracy is sacrificed for speed at this step. The frequencies (defined as \(\frac{f}{j}\)) are then matched to the spectrum of seasonalities that correspond to the a priori important seasonal spectrum if the frequency of the harmonic is within one delta of the frequency of this predefined spectrum. This delta is equal to the unique difference between the frequencies of the harmonics.

Lastly, backward stepwise regression procedure using an F-test for the significance of each seasonality (i.e. joint significance of \(a_j\) and \(b_j\) is used to build the final seasonality specification starting with only the harmonics that were statistically significant from the first step. At each stage of the of the stepwise procedure, an F-test for the statistical significance of each seasonality is performed and any harmonics that are not statistically significant are removed. The process is performed iteratively until all remaining harmonics are statistically significant. The F-tests for this procedure are calculated using HAC standard errors to be robust to heteroskedasticity and autocorrelation to reduce the chance of selecting spurious seasonalities.

Next, wavelet analysis for cycle detection is done using a series of harmonic regressions in parallel on the detrended data. The detrended data may also be rescaled by the trend if the Cox-Stuart deviation test suggests that the detrended data does not have a constant amplitude. The regression is performed individually for each harmonic using the equation

\[ \tilde{y_t} = a_j sin(2\pi\frac{j}{f}) + b_j cos(2\pi\frac{j}{f}) \]

for harmonics \(j = 0.01\) to \(j = 0.99\) for every 0.01 and truncated for harmonics that correspond to periods \(\in [ 2.5f, T]\) and harmonics < \(1/2\) where \(T\) is the length of the data.

At each iteration, an F-test is performed using the `waldtest`

function from the `lmtest`

package to measure the significance of the harmonic to explain the seasonality, understanding that this will not be robust to heteroskedasticity or autocorrelation. This accuracy is sacrificed for speed at this step. Once all harmonics have been tested, the algorithm looks for a global maximum from the power spectrum measured with the F-statistic that is statistically significant.

Lastly, a final regression is performed with only the harmonic that was selected from the first step. From this regression, the harmonic is kept only if the cycle is statistically significant (i.e. \(a_j\) and \(b_j\) are jointly significant) as measured by the F-test for the final harmonic regression. The F-test for this step is calculated using HAC standard errors to be robust to heteroskedasticity and autocorrelation to reduce the chance of selecting spurious seasonalities.

The cycle is set to the trigonometric form if a cycle is detected using wavelet analysis. Otherwise, it is set to the ARMA(p,q) specification.

The third step in the model construction is to determine if the model should be based on a multiplicative of additive model. However, after the seasonal and cycle steps are completed, a new model prior is created based on the seasonal and cycle periods detected. For a multiplicative model, the data must all be positive and satisfy one of two tests. The model is restricted to multiplicative of additive decomposition, rather than allowing the full range of Box-Cox transformations, so that the decomposition is easy to compare to the original time series. Additive makes the decomposition comparative in levels, multiplicative makes the decomposition comparative in percentages relative to the original time series, but Box-Cox transformations don’t have an easy interpretation like this. However, the user can Box-Cox transform the data prior to estimation and specify `multilicative = FALSE`

if that method is preferred.

First, is a test of a linear vs exponential trend. This test compares the log likelihood of 1) a linear regression of the seasonal and cycle adjusted data vs a linear trend, and 2) a linear regression of the logged seasonal and cycle adjusted data vs a linear trend using the `petest`

from the `lmtest`

package with heteroskedastic and autocorrelated standard errors using the `sandwich`

package’s `vcovHAC`

function. A log trend model is selected if and only if the linear model is rejected and the log model is not rejected.

Next, is a test for constant seasonal and cycle amplitudes using the Cox-Stuart deviation test on the detrended (and de-seasonalized or de-cycled) data. Then, if the seasonal component is determined to have changing amplitude, a multiplicative model is used. However, since this test is sensitive to outliers, outliers in the seasonal and cycle series are detected and replaced using the `forecast`

’s package `tsclean`

function without replacing missing values before the test is performed.

The final step is to determine the specification of the trend after adjusting for the cycle and seasonality detected from the prior steps. This mainly boils down to determining if the data is integrated of order 0, 1, or 2 (I(0), I(1), or I(2)). To determine the number of differences to make the data stationary, the ADF, PP, and KPSS tests are performed on the seasonal and cycle adjusted data using the `forecast`

package’s `ndiffs`

functions on the seasonal adjusted time series. However, since this test is sensitive to outliers, outliers in the seasonal and cycle adjusted series are detected and replaced using the `forecast`

’s package `tsclean`

function without replacing missing values before the test is performed.

The number of differences selected for the trend specification is the average number of differences across the test, rounded to the nearest digit. Next, the Cox-Stuart trend test is preformed on the seasonal and cycle adjusted data to determine if the data is trending. If the data is trending, the order of integration is set to at least 1.

If the data is determined to be

- I(2), then the “double-random-walk” specification is used
- I(1) with a trend, then the “random-walk-drift” specification is used
- I(1) without a trend, then the “random-walk” specification is used
- I(0) and no trend, then the “random-walk” trend specification with a deterministic trend is used to model a constant mean with an autoregressive component

To estimate the unobserved components (\(T_t, C_t, S_t\), and \(D_t\)), the Kalman filter is utilized. The Kalman filter is a forward two-stage procedure that is the optimal linear filter based on the multivariate normal distribution. The “forwardness” of the filter means that it uses only data up to time \(t\) to make an inference on the unobserved components at time \(t\) and does peak into the future to make that inference.

The first stage is the prediction stage, which makes predictions on the unobserved components based on information up to time \(t-1\). This stage is made up of four equations, the first is the prediction of the unobserved components based on its time series properties

\[ B_{t|t-1} = \bar{D} + F \beta_{t-1|t-1} \]

Second, is the prediction of the covariance matrix of the unobserved components

\[ P_{t|t-1} = F P_{t_1|t-1} F^{\prime} + Q \] Third, is the prediction error of the time series of interest \[ \eta_{t|t-1} + Y_t - (A + H \beta_{t|t-1} + BX) \]

And finally, we have the variance of the prediction error

\[ f_{t|t-1} = H P_{t|t-1} H^{\prime} + R \]

The second stage is the updating stage, which makes predictions based on all information up to time \(t\). It consists of three equations. The first equation is the prediction of the unobserved components based on the the full information set

\[ \beta_{t|t} = B_{t|t-1} + K_t \eta_{t} \] where \(K_t\) is the Kalman gain, which determines the optimal weight to give new information in making predictions about \(\beta_t\) and is the second equation

\[ K_t = P_{t|t-1} H^{\prime} f_{t|t-1}^{-1} \] The last equation is the updating of the covariance matrix of the unobserved components

\[ P_{t|t} = P_{t|t-1} + K_t H^{\prime} P_{t|t-1} \] The seven equations above, make up the full Kalman filter routine. If \(Y_t\) is missing for any observation, then

\[ B_{t|t} = B_{t|t-1} \\ P_{t|t} = P_{t|t-1} \\ K_t = 0 \\ f_{t|t-1} = \infty \]

Once the Kalman filter is applied to the data, a smoothing procedure can applied in the backward direction to make a better inference of the unobserved components based on the entire data set. Unlike the filter, the smoother does peak into the future to make an inference on the unobserved components at time \(t\). This procedure consists of only two equations.

The first equation updates the prediction of the unobserved components based on all the available information

\[ \beta_{t|T} = \beta_{t|t} + P_{t|t} F^{\prime} P_{t|t}^{-1} (\beta_{t+1|T} - \beta_{t+1|t}) \] The second equation updates the covariance matrix of the unobserved components based on all the available information

\[ P_{t|T} = P_{t|t} + P_{t|t} F^{\prime} P_{t+1|t}^{-1} (P_{t+1|T} - P_{t+1|t}) P_{t+1|t}^{-1 \prime} F P_{t|t}^{\prime} \]

The model above is estimated using MLE with box constraints. The log likelihood is given by

\[ ln(L(\theta)) = -\frac{1}{2} \sum_{t=1}^T \ln(|f_{t|t-1}|) - \frac{1}{2}\sum_{t=1}^T \eta_{t|t-1}^{\prime} f_{t|t-1}^{-1} \eta_{t|t-1} \] for all values that are not missing.

Initial parameter values are essential for finding a good model when using maximum likelihood estimation. To this end, the algorithm uses initial parameter values derived from a decomposition using the `mstl`

function, which acts like the model prior and is represented by

\[ Y_t^* = T_t^* + C_t^* + S_t^* + e_t^* \]

This function will decompose the time series into trend and seasonality. The seasonality is then further split into the seasonal harmonics by fitted a linear regression on the seasonal Fourier series and extracting the fitted components. Finally, the trend is further split into a smoother trend and the cycle by fitting a loess trend to the `mstl`

trend using the `loess`

function. The drift is then derived by taking the first difference of the trend.

These parameters are set initial to match to the theoretical variance of the model depending on the trend type. If the trend is set to “random-walk” then \(\sigma_t = sd(\Delta T_t^{*})\). If the trend is set to “double-random-walk” then \(\sigma_t = \sigma_d = sd(\Delta T_t^{*})/sqrt(2)\). If the trend is set to “random-walk-drift”, an AR(1) model is fit the drift prior, \(\sigma_d\) is set to the standard deviation of the model errors, \(\phi_d\) is the set to the estimated AR(1) parameter, and \(\sigma_t\) is set to \(\sqrt{var(\Delta T_t^*) - \frac{\sigma_d^2}{1 - \phi_d^2}}\) the theoretical variance if the trend is I(1).

If the cycle model is set to the ARMA(p,q) specification or a cycle is not detected, \(p\) and \(q\) are determined using the `forecast`

package’s `auto.arima`

function on the prior cycle. The estimated AR and MA components are then used as the initial parameters. If the cycle is model is set to the trigonometric specification, \(\lambda\) is set to be \(\frac{2\pi}{c}\) where \(c\) is the estimated cycle period. \(\phi_c\) is set to the maximum eigenvalue of an ARMA(2,0,1) process fit to the cycle prior as this is related to the speed of convergence towards the trend as \(\phi_c\) is in an AR(1) process. Finally, \(\sigma_c\) is set to the standard deviation of the model errors.

If seasonality is included, each \(\sigma_j\) is set to \(sd(\Delta S_t^*)/sqrt(h)\) where h is the number of seasonal harmonics so \(var(\Delta S_t^*) = \sum_j \sigma_j^2\)

Lastly, \(\sigma_e\) is set the the standard deviation of the remainder of the model prior, \(sd(e_t^*)\).

The initial values of the unobserved components must also be initialized. These values are not optimized but set in advance to aid in speed. The only value set to optimize is one value for the diagonal of the covariance matrix \(P_{0|0}\), which is set to the identity matrix. The initial values for the unobserved components are taken to be the first non-NA value for each series from the prior.

The box constraints act like priors and provide bounds on parameter estimates to ensure a reasonable model. The constraints include stationarity constraints for the trigonometric cycle, \(0 < \phi_c < 1\), autoregressive cycle \(0 < \sum \phi_{c,i} < 1\), moving average component \(0 < \sum \theta_{c,i} < 1\), drift \(-1 < \phi_d < 1\), as well as \(\sigma_x > 0\) for all variance parameters. Trend smoothness constraints

\[ \sigma_t + \sigma_d < \sigma_e \\ \sigma_t + \sigma_d < \sigma_c \\ \sigma_t + \sigma_d < \sum_j \sigma_j \]

are also included to ensure that the trend accounts for the least variability among all the components. It can be relaxed by setting `unconstrained = TRUE`

in `stsm_estimate`

.

If an interpolation model is specified, the model is redefined based on the values set in `interpolate`

and `interpolate_method`

. For all interpolation models,

\[ I_t = T_{t-1} + S_{t-1} + C_{t-1} \] where \(I_t\) is the interpolated series and a deterministic observation equation is used by setting \(\sigma_e = 0\). The only other that changes is the specification of the observation equation, which is determined by the interpolation method

- End of period, then \(Y_t = I_t\)
- Period average, then \(Y_t = \frac{1}{N}(I_t + \ldots + I_{t-N+1})\)
- Period sum, then \(Y_t = I_t + \ldots + I_{t-N+1}\)

where \(N\) is the period to interpolate over. For example, if the frequency of the data is quarterly and the user would like to interpolate the data to a monthly frequency, then \(N = 3\).

The interpolation frequency is set by specifying `interpolate`

to “quarterly”, “monthly”, “weekly”, or “daily” as long as that frequency is higher than the frequency of the data. Although, it is recommended to only interpolate one frequency higher than the frequency of the data.

Interpolation methods are defined by setting `interpolate_method`

to “eop” for end of period, “avg” for period average, and “sum” for period sum.

Interpolation errors will depend on the volatility of the time series as well as model misspecification. However, given the model, the end of period method will be more susceptible to errors due to volatility than either the average or sum methods.

Anomaly detection uses the estimated structural model and the recursive nature of the Kalman filter to detect observations that are outside a given threshold of the model’s predicted values for time \(t\) given the data and estimates of the unobserved components at time \(t-1\). The threshold is determined by the forecast error variance of the estimated structural model. Anomalies that are detected can then be replaced with the modeled predicted values or threshold if the user wants to replace outliers.

It is important to note that anomalies are detected by comparison to the model, which is based on normally distributed innovations. Thus, the interpretation of an anomaly here is a value that has an “x” percent chance of occurring based on the data being generated from normal innovations, where the “x” percent is determined by the threshold. Normal distributions may not be appropriate for all time series, but this method of anomaly detection could be used to tell the user about the validity of the normality assumption: i.e. if more than 1% of the sample is detected as an anomaly with a 99% threshold, then normality may not be the best assumption for innovations for that particular series.

`stsm_detect_anomalies`

will only plot if anomalies are detected and `plot = TRUE`

.

The model makes recursive predictions for the j-th step ahead using

\[ \hat{Y_{t+j}} = H F^j \beta_{t} \] where the j-step ahead forecast error variance is given by

\[ FEV = \left(\sum_{i=0}^{j-1} H F^i Q F^{i\prime} H^{\prime}\right) + R \]

Structural breaks are detected using the `breakpoints`

function from the `strucchange`

package. It attempts to detect structural breaks in the trend, cycle, and seasonalities based on the estimated structural model. These tests can show how well the stochastic nature of the model picks up on the time varying nature of the time series.

If the data has an upward (or downward trend), the algorithm will look for breaks in trend growth. Otherwise, it will look for breaks in the level of the trend. If the growth (or level) is not constant across the entire time series, then a break should be detected.

Structural breaks in the seasonality and cycle are detected by looking for changes in the mean absolute value of the seasonal series (an analog for the amplitude). To check from breaks in the amplitude, a simple test on the mean absolute value is performed as this is related to the amplitude via the relationship \(Avg(|x_t|) = \frac{2}{\pi}Amp\). If the amplitude is not stable across the entire series, then a break should be detected.

Seasonality is defined by a constant periodicity, but economic cycles are rarely constant in their periodicity. So, a second structural break test is used on the cycle to check for changes in the periodicity. To check for breaks in the periodicity, the cycle is modeled using the trigonometric structure described above (\(C_t = \phi_c cos(\lambda_t) C_{t-1} + \phi_c sin(\lambda_t) C_{t-1}^{*} = a C_{t-1} + b C_{t-1}^{*}\)) and looking for instabilities in the coefficients over the time sample. The periodicity is identified from the linear regression using the relationship \(period = \frac{2\pi}{tan^{-1}(\frac{b}{a})}\). If the coefficients of this model are not stable across the entire series, then a break should be detected.

This procedure can be time consuming, even with high performance parallel computing, so if the user is only interested in structural breaks for a particular component, the user can specify the `components`

arguments of `stsm_detect_breaks`

to be any combination of “trend”, “cycle”, and “seasonal”. `stsm_detect_breaks`

will only plot the breaks that are detected if `plot = TRUE`

.

Below is an example of how to use this package:

```
library(ggplot2)
library(gridExtra)
library(autostsm)
set.seed(1024)
#Daily data
= 365.25
freq
#Build the trend and drift
= c()
t = c()
m 1] = 100
t[1] = 1
m[= 0.1
sig_e = 1
sig_t = 0.1
sig_m = 0.01
sig_s for(i in 2:3000){
= 0.05 + 0.75*m[i-1] + rnorm(1, 0, sig_m)
m[i] = t[i-1] + m[i-1] + rnorm(1, 0, sig_t)
t[i]
}
#Build the seasonality including yearly and weekly
= sin(2*pi/freq*(1:length(t))) + rnorm(length(t), 0, sig_s)
s365 = (s365 - min(s365))/diff(range(s365))*(1.125 - 0.865) + 0.865
s365 = sin(2*pi/7*(1:length(t))) + rnorm(length(t), 0, sig_s)
s7 = (s7 - min(s7))/diff(range(s7))*(1.125 - 0.865) + 0.865
s7 = s365 + s7
s = (s - min(s))/diff(range(s))*(1.25 - 0.75) + 0.75
s
#Build the cyclicality using every 3 years periodicity
= sin(2*pi*0.33/freq*(1:length(t))) + rnorm(length(t), 0, sig_s)
c = (c - min(c))/diff(range(c))*(1.25 - 0.75) + 0.75
c
#Build the data using a multiplicative model
= data.table(date = as.Date("2016-01-01") + 1:length(t),
ts y = t*c*s*exp(rnorm(length(t), 0, sig_e)),
trend = t, seasonal = s, seasonal7 = s7,
seasonal365 = s365, cycle = c)
#Create some missing values
sample(1:nrow(ts), round(0.05*nrow(ts))), "y" := NA]
ts[
#View the data
= ggplot(melt(ts, id.vars = "date", measure.vars = c("y", "trend"))) +
g1 labs(title = "Observed vs Trend") +
geom_line(aes(x = date, y = value, group = variable, color = variable)) +
scale_color_viridis_d() +
theme_minimal() + guides(color = guide_legend(title = NULL)) +
theme(legend.position = "bottom")
= ggplot(melt(ts, id.vars = "date", measure.vars = c("cycle"))) +
g2 labs(title = "Cycle") +
geom_line(aes(x = date, y = value, group = variable, color = variable)) +
scale_color_viridis_d() +
theme_minimal() + guides(color = guide_legend(title = NULL)) +
theme(legend.position = "bottom")
= ggplot(melt(ts, id.vars = "date", measure.vars = colnames(ts)[grepl("seasonal", colnames(ts))])) +
g3 labs(title = "Seasonal") +
geom_line(aes(x = date, y = value, group = variable, color = variable)) +
scale_color_viridis_d() +
theme_minimal() + guides(color = guide_legend(title = NULL)) +
theme(legend.position = "bottom")
grid.arrange(g1, g2, g3, layout_matrix = rbind(c(1, 1), c(2, 3)))
#Estimate the model
= stsm_estimate(ts[, c("date", "y"), with = FALSE], verbose = TRUE)
stsm
#Forecast and plot the results
= stsm_forecast(stsm, y = ts[, c("date", "y"), with = FALSE], n.ahead = floor(stsm$freq)*3, plot = TRUE)
stsm_fc
#Detect anomalies
= merge(stsm_fc,
stsm_fc stsm_detect_anomalies(stsm, y = ts[, c("date", "y"), with = FALSE], plot = TRUE),
by = "date", all = TRUE)
#Detect structural breaks
= merge(stsm_fc,
stsm_fc stsm_detect_breaks(stsm, y = ts[, c("date", "y"), with = FALSE], plot = TRUE, show_progress = TRUE),
by = "date", all = TRUE)
```

```
library(autostsm)
##### Unemployment rate examples #####
#Not seasonally adjusted
data("UNRATENSA", package = "autostsm") #From FRED
= data.table(UNRATENSA, keep.rownames = TRUE)
UNRATENSA colnames(UNRATENSA) = c("date", "y")
"date" := as.Date(date)]
UNRATENSA[, "y" := as.numeric(y)]
UNRATENSA[, = stsm_estimate(UNRATENSA, verbose = TRUE)
stsm = stsm_forecast(stsm, y = UNRATENSA, n.ahead = floor(stsm$freq)*3, plot = TRUE)
stsm_fc = merge(stsm_fc,
stsm_fc stsm_detect_anomalies(stsm, y = UNRATENSA, plot = TRUE),
by = "date", all = TRUE)
= merge(stsm_fc,
stsm_fc stsm_detect_breaks(stsm, y = UNRATENSA, plot = TRUE, show_progress = TRUE),
by = "date", all = TRUE)
#Seasonally adjusted
data("UNRATE", package = "autostsm") #From FRED
= data.table(UNRATE, keep.rownames = TRUE)
UNRATE colnames(UNRATE) = c("date", "y")
"date" := as.Date(date)]
UNRATE[, "y" := as.numeric(y)]
UNRATE[, = stsm_estimate(UNRATE, verbose = TRUE)
stsm = stsm_forecast(stsm, y = UNRATE, n.ahead = floor(stsm$freq)*3, plot = TRUE)
stsm_fc = merge(stsm_fc,
stsm_fc stsm_detect_anomalies(stsm, y = UNRATE, plot = TRUE),
by = "date", all = TRUE)
= merge(stsm_fc,
stsm_fc stsm_detect_breaks(stsm, y = UNRATE, plot = TRUE, show_progress = TRUE),
by = "date", all = TRUE)
##### GDP examples #####
#Not seasonally adjusted
data("NA000334Q", package = "autostsm") #From FRED
= data.table(NA000334Q, keep.rownames = TRUE)
NA000334Q colnames(NA000334Q) = c("date", "y")
"date" := as.Date(date)]
NA000334Q[, "y" := as.numeric(y)]
NA000334Q[, = NA000334Q[date >= "1990-01-01", ]
NA000334Q = stsm_estimate(NA000334Q, verbose = TRUE)
stsm = stsm_forecast(stsm, y = NA000334Q, n.ahead = floor(stsm$freq)*3, plot = TRUE)
stsm_fc = merge(stsm_fc,
stsm_fc stsm_detect_anomalies(stsm, y = NA000334Q, plot = TRUE),
by = "date", all = TRUE)
= merge(stsm_fc,
stsm_fc stsm_detect_breaks(stsm, y = NA000334Q, plot = TRUE, show_progress = TRUE),
by = "date", all = TRUE)
#Seasonally adjusted
data("GDP", package = "autostsm") #From FRED
= data.table(GDP, keep.rownames = TRUE)
GDP colnames(GDP) = c("date", "y")
"date" := as.Date(date)]
GDP[, "y" := as.numeric(y)]
GDP[, = GDP[date >= "1990-01-01", ]
GDP = stsm_estimate(GDP, verbose = TRUE)
stsm = stsm_forecast(stsm, y = GDP, n.ahead = floor(stsm$freq)*3, plot = TRUE)
stsm_fc = merge(stsm_fc,
stsm_fc stsm_detect_anomalies(stsm, y = GDP, plot = TRUE),
by = "date", all = TRUE)
= merge(stsm_fc,
stsm_fc stsm_detect_breaks(stsm, y = GDP, plot = TRUE, show_progress = TRUE),
by = "date", all = TRUE)
##### S&P 500 example #####
data("SP500", package = "autostsm") #From FRED
= data.table(SP500, keep.rownames = TRUE)
SP500 colnames(SP500) = c("date", "y")
"date" := as.Date(date)]
SP500[, "y" := as.numeric(y)]
SP500[, = stsm_estimate(SP500, verbose = TRUE)
stsm = stsm_forecast(stsm, y = SP500, n.ahead = floor(stsm$freq)*3, plot = TRUE)
stsm_fc = merge(stsm_fc,
stsm_fc stsm_detect_anomalies(stsm, y = SP500, plot = TRUE),
by = "date", all = TRUE)
= merge(stsm_fc,
stsm_fc stsm_detect_breaks(stsm, y = SP500, plot = TRUE, show_progress = TRUE),
by = "date", all = TRUE)
```

```
library(autostsm)
library(lubridate)
#Rebuild the data
set.seed(1024)
= data.table(date = as.Date("2016-01-01") + 1:length(t),
ts y = t*c*s*exp(rnorm(length(t), 0, sig_e)))
#Convert the data to monthly by using the end of period
"mnth" := floor_date(date, "month")]
ts[, "rn" := .N:1, by = "mnth"]
ts[, = ts[rn == 1, c("mnth", "y"), with = FALSE]
ts colnames(ts)[colnames(ts) == "mnth"] = "date"
#Get the quarter and set the date to be the end of the quarter
"qtr" := ceiling_date(date, "quarter") %m-% months(1)]
ts[, "y_avg" := frollmean(y, n = 3, align = "right")]
ts[, "y_sum" := frollsum(y, n = 3, align = "right")]
ts[,
#We already know the data has a multiplicative structure with yearly seasonality and
#the frequency below will be set to quarterly,
#so we will set seasons = 4 and multiplicative to TRUE to help the model with identification
#since the data set goes from 3000 daily observations to 99 monthly observations and then to
#33 quarterly observations
#End of period
"rn" := .N:1, by = "qtr"]
ts[, = ts[rn == 1, ][, "rn" := NULL]
ts.eop = stsm_estimate(y = ts.eop[, c("qtr", "y"), with = FALSE], seasons = 4, multiplicative = TRUE,
stsm interpolate = "monthly", interpolate_method = "eop", verbose = TRUE)
= stsm_filter(stsm, y = ts.eop[, c("qtr", "y"), with = FALSE], plot = TRUE)
stsm_fc = merge(stsm_fc[, c("date", "observed", "interpolated")],
stsm_fc c("date", "y")], by = "date", all.x = TRUE, all.y = TRUE)
ts[, = cbind(method = "eop", stsm_fc[is.na(observed), .(mape = mean(abs(y - interpolated)/(1 + y)*100, na.rm = TRUE))])
eop.err
#Period average
= ts[, .(y = mean(y, na.rm = TRUE)), by = "qtr"]
ts.avg = stsm_estimate(y = ts.avg[, c("qtr", "y"), with = FALSE], seasons = 4, multiplicative = TRUE,
stsm interpolate = "monthly", interpolate_method = "avg", verbose = TRUE)
= stsm_filter(stsm, y = ts.avg[, c("qtr", "y"), with = FALSE], plot = TRUE)
stsm_fc = merge(stsm_fc[, c("date", "observed", "interpolated")],
stsm_fc c("date", "y_avg")], by = "date", all.x = TRUE, all.y = TRUE)
ts[, if(stsm$multiplicative == TRUE){
"interpolated" := exp(frollmean(log(interpolated), n = 3, align = "right"))]
stsm_fc[, else{
}"interpolated" := frollmean(interpolated, n = 3, align = "right")]
stsm_fc[,
}= cbind(method = "avg", stsm_fc[is.na(observed), .(mape = mean(abs(y_avg - interpolated)/(1 + y_avg)*100, na.rm = TRUE))])
avg.err
#Period sum
= ts[, .(y = sum(y, na.rm = TRUE)), by = "qtr"]
ts.sum = stsm_estimate(y = ts.sum[, c("qtr", "y"), with = FALSE], seasons = 4, multiplicative = TRUE,
stsm interpolate = "monthly", interpolate_method = "sum", verbose = TRUE)
= stsm_filter(stsm, y = ts.sum[, c("qtr", "y"), with = FALSE], plot = TRUE)
stsm_fc = merge(stsm_fc[, c("date", "observed", "interpolated")],
stsm_fc c("date", "y_sum")], by = "date", all.x = TRUE, all.y = TRUE)
ts[, if(stsm$multiplicative == TRUE){
"interpolated" := exp(frollsum(log(interpolated), n = 3, align = "right"))]
stsm_fc[, else{
}"interpolated" := frollsum(interpolated, n = 3, align = "right")]
stsm_fc[,
}= cbind(method = "sum", stsm_fc[is.na(observed), .(mape = mean(abs(y_sum - interpolated)/(1 + y_sum)*100, na.rm = TRUE))])
sum.err
#Errors depend on the volatility of the series
#the end of period method is more susceptible to the volatility of the series than either
#the average or sum methods
rbind(eop.err, avg.err, sum.err)
```