Deep Learning Is Becoming Overused

Understanding the data is the first port of call

Source: Photo by PublicDomainPictures from Pixabay

There is always a danger when any model is used in a black-box fashion to analyse data, and models of the deep learning family are no exception.

Don’t get me wrong — there are certainly occasions where a model such as a neural network can outperform more simplistic models — but there are plenty of examples where this is not the case.

To use an analogy — suppose you need to buy a vehicle of some sort for transportation purposes. Buying a truck is a worthwhile investment if you regularly need to transport large items across long distances. However, it is a blatant waste of money if you simply need to go to the local supermarket to pick up some milk. A car (or even a bicycle if you are climate-conscious) is sufficient to carry out the task in question.

Deep learning is starting to be used in the same way. We are starting to simply feed these models with the relevant data, assuming that performance will surpass that of simpler models. Moreover, this is often done without properly understanding the data in question; i.e. recognising that deep learning would not be necessary if one had an intuitive grasp of the data.

I work most often with time series analysis, so let’s consider an example in this regard.

Suppose that a hotel is looking to forecast the average daily rate (ADR), i.e. the average rate per day that it charges across its customer base. The average daily rates for each customer are averaged on a weekly basis.

An LSTM model is configured as follows:

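import tensorflow as tf
from tensorflow.keras.layers import LSTM, Dense

# One LSTM layer with 4 units; each sample is a single time step of `lookback` lagged values
model = tf.keras.Sequential()
model.add(LSTM(4, input_shape=(1, lookback)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
history = model.fit(X_train, Y_train, validation_split=0.2, epochs=100, batch_size=1, verbose=2)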
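For context, input_shape=(1, lookback) implies that each training sample is a single time step holding the previous `lookback` observations. The article does not show how X_train and Y_train were built, so the helper below (create_dataset, a hypothetical name) is only a sketch of one common way to do it, with a toy series standing in for the real ADR data.

```python
import numpy as np

def create_dataset(series, lookback=1):
    """Slice a 1-D series into lagged inputs and next-step targets."""
    X, y = [], []
    for i in range(len(series) - lookback):
        X.append(series[i:i + lookback])   # the previous `lookback` values
        y.append(series[i + lookback])     # the value to be predicted
    return np.array(X), np.array(y)

# Toy series (assumption): 0, 1, ..., 9 in place of weekly ADR
series = np.arange(10, dtype=float)
lookback = 3
X, y = create_dataset(series, lookback)

# Keras LSTM layers expect (samples, time steps, features)
X_lstm = X.reshape((X.shape[0], 1, X.shape[1]))
```

Each row of X_lstm is then paired with the single value that follows it, which is why the model can only forecast one step ahead.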

Here is the predicted vs. actual weekly ADR:

Source: Jupyter Notebook Output

An RMSE (root mean squared error) of 31 is obtained relative to a mean ADR of 160, so the error is roughly 20% of the mean. While the error is not excessively high, it is admittedly a little disappointing given that the point of using a neural network is to outperform other models on accuracy.

Moreover, this particular LSTM model produces one-step-ahead forecasts, meaning that it cannot make long-range forecasts without having all data up to time t available.

That said, have we gotten a bit ahead of ourselves in applying an LSTM model to the data right away?

Let’s bring the horse back before the cart and get an overall view of the data first.

Here is a 7-week moving average of the ADR fluctuations:

Source: Jupyter Notebook Output

We can see clear evidence of a seasonal pattern when the data is smoothed out over a 7-week moving average.
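As a rough illustration of the smoothing step, the sketch below computes a 7-point moving average with NumPy. The series is synthetic (an assumed yearly sine pattern plus noise, with an arbitrary length of 115 weeks), not the hotel data, and moving_average is a hypothetical helper name.

```python
import numpy as np

def moving_average(values, window=7):
    """Moving average over `window` points; returns len(values) - window + 1 values."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

# Synthetic weekly series (assumption): yearly cycle around 160 plus noise
rng = np.random.default_rng(0)
weeks = np.arange(115)
weekly_adr = 160 + 40 * np.sin(2 * np.pi * weeks / 52) + rng.normal(0, 15, weeks.size)

smoothed = moving_average(weekly_adr, window=7)
```

Averaging over seven weeks damps the week-to-week noise, which is what lets the slower seasonal swing stand out in the plot.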

Let’s take a closer look at the autocorrelation function for the data.

Source: Jupyter Notebook Output

We can see that the peak correlation (after the series of negative correlations) is at lag 52, indicating that yearly seasonality is present in the data.
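To make the lag-52 reading concrete, here is a minimal sketch of a sample autocorrelation computed by hand. The series is a noise-free sine with a 52-week period (an assumption chosen for illustration), and autocorr is a hypothetical helper; in practice a library ACF plot would normally be used.

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of x at a positive lag."""
    x = np.asarray(x, dtype=float)
    centred = x - x.mean()
    return np.sum(centred[lag:] * centred[:-lag]) / np.sum(centred ** 2)

# Four "years" of a purely seasonal weekly series (assumption)
weeks = np.arange(208)
series = np.sin(2 * np.pi * weeks / 52)

acf_26 = autocorr(series, 26)   # half a year apart: opposite phase
acf_52 = autocorr(series, 52)   # a full year apart: same phase
```

On a series like this the correlation is negative around the half-year mark and positive again at lag 52, the same shape described above.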

Using this information, an ARIMA model is configured using pmdarima to forecast the last 15 weeks of ADR fluctuations. The differencing orders are fixed at d=1 and D=1, while the remaining orders (p, q, P, Q) are selected automatically to minimise the Akaike Information Criterion.

>>> import pmdarima as pm
>>> Arima_model=pm.auto_arima(train_df, start_p=0, start_q=0, max_p=10, max_q=10, start_P=0, start_Q=0, max_P=10, max_Q=10, m=52, stepwise=True, seasonal=True, information_criterion='aic', trace=True, d=1, D=1, error_action='warn', suppress_warnings=True, random_state = 20, n_fits=30)
Performing stepwise search to minimize aic
ARIMA(0,1,0)(0,1,0)[52]             : AIC=422.399, Time=0.27 sec
ARIMA(1,1,0)(1,1,0)[52]             : AIC=inf, Time=16.12 sec
ARIMA(0,1,1)(0,1,1)[52]             : AIC=inf, Time=19.08 sec
ARIMA(0,1,0)(1,1,0)[52]             : AIC=inf, Time=14.55 sec
ARIMA(0,1,0)(0,1,1)[52]             : AIC=inf, Time=11.94 sec
ARIMA(0,1,0)(1,1,1)[52]             : AIC=inf, Time=16.47 sec
ARIMA(1,1,0)(0,1,0)[52]             : AIC=414.708, Time=0.56 sec
ARIMA(1,1,0)(0,1,1)[52]             : AIC=inf, Time=15.98 sec
ARIMA(1,1,0)(1,1,1)[52]             : AIC=inf, Time=20.41 sec
ARIMA(2,1,0)(0,1,0)[52]             : AIC=413.878, Time=1.01 sec
ARIMA(2,1,0)(1,1,0)[52]             : AIC=inf, Time=22.19 sec
ARIMA(2,1,0)(0,1,1)[52]             : AIC=inf, Time=25.80 sec
ARIMA(2,1,0)(1,1,1)[52]             : AIC=inf, Time=28.23 sec
ARIMA(3,1,0)(0,1,0)[52]             : AIC=414.514, Time=1.13 sec
ARIMA(2,1,1)(0,1,0)[52]             : AIC=415.165, Time=2.18 sec
ARIMA(1,1,1)(0,1,0)[52]             : AIC=413.365, Time=1.11 sec
ARIMA(1,1,1)(1,1,0)[52]             : AIC=415.351, Time=24.93 sec
ARIMA(1,1,1)(0,1,1)[52]             : AIC=inf, Time=21.92 sec
ARIMA(1,1,1)(1,1,1)[52]             : AIC=inf, Time=30.36 sec
ARIMA(0,1,1)(0,1,0)[52]             : AIC=411.433, Time=0.59 sec
ARIMA(0,1,1)(1,1,0)[52]             : AIC=413.422, Time=11.57 sec
ARIMA(0,1,1)(1,1,1)[52]             : AIC=inf, Time=23.39 sec
ARIMA(0,1,2)(0,1,0)[52]             : AIC=413.343, Time=0.82 sec
ARIMA(1,1,2)(0,1,0)[52]             : AIC=415.196, Time=1.63 sec
ARIMA(0,1,1)(0,1,0)[52] intercept   : AIC=413.377, Time=1.04 sec

Best model:  ARIMA(0,1,1)(0,1,0)[52]
Total fit time: 313.326 seconds

According to the output above, ARIMA(0,1,1)(0,1,0)[52] is the configuration with the lowest AIC (411.433).
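The orders in this model mean that the series is differenced once at lag 52 (D=1) and once at lag 1 (d=1), with a single moving-average term fitted to what remains. The sketch below shows only the differencing part on a synthetic series (an assumed linear trend plus a yearly cycle); seasonal_diff is a hypothetical helper and no model is fitted here.

```python
import numpy as np

def seasonal_diff(x, m):
    """Subtract the value observed m steps earlier."""
    x = np.asarray(x, dtype=float)
    return x[m:] - x[:-m]

# Synthetic series (assumption): level + linear trend + yearly cycle
weeks = np.arange(156)
seasonal = 40 * np.sin(2 * np.pi * weeks / 52)
trend = 0.5 * weeks
series = 160 + trend + seasonal

after_D = seasonal_diff(series, 52)   # removes the repeating yearly pattern
after_d = np.diff(after_D)            # removes what is left of the trend
```

Once the yearly pattern and trend are differenced away there is very little left to model, which is consistent with such a small ARIMA specification being selected.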

Using this model, an RMSE of 10 is obtained relative to the mean ADR of 160.

This is a lot lower than the RMSE achieved by the LSTM (which is a good thing) and accounts for just over 6% of the size of the mean.
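The comparison is simple arithmetic, sketched below using the RMSE and mean figures quoted in this article. The rmse helper and the three-point toy example are illustrative additions, not the notebook's code.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

mean_adr = 160.0
lstm_ratio = 31.0 / mean_adr    # LSTM RMSE as a share of the mean
arima_ratio = 10.0 / mean_adr   # ARIMA RMSE as a share of the mean

# Toy check of the helper on made-up values
toy_rmse = rmse([150, 160, 170], [153, 156, 170])
```

Expressing the error as a share of the mean makes the two models comparable at a glance: a little under a fifth for the LSTM versus about a sixteenth for ARIMA.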

Through proper analysis of the data, one would recognise that the presence of a yearly seasonal component in the data makes the time series more predictable — and use of a deep learning model to try to forecast such a component would be largely redundant.

Let’s take a different spin on the above problem.

Instead of trying to forecast the average weekly ADR, let’s now try and predict an ADR value for each customer.

Two regression-based models are used for this purpose:

Linear SVM (support vector machine)
Regression-based neural network

The following features are used in both models to predict an ADR value for each customer:

IsCanceled: whether the customer cancels their booking or not
country: the customer’s country of origin
marketsegment: market segment of the customer
deposittype: whether the customer has paid a deposit or not
customertype: type of customer
rcps: required car parking spaces
arrivaldateweekno: week of arrival

Using the mean absolute error as the performance measure, let’s compare the obtained MAE relative to the mean across both models.

Linear SVM

A LinearSVR with an epsilon of 0.5 is defined and trained across the training data:

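from sklearn.svm import LinearSVR
# Linear support vector regressor; errors smaller than epsilon are not penalised
svm_reg_05 = LinearSVR(epsilon=0.5)
svm_reg_05.fit(X_train, y_train)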
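The epsilon argument sets the width of the band inside which errors cost nothing. With the default epsilon-insensitive loss of LinearSVR, a prediction within 0.5 of the true ADR contributes zero loss. The sketch below shows that loss on made-up values; it illustrates the penalty only, not the fitting procedure, and the function name is hypothetical.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5):
    """Per-sample loss: zero inside the epsilon band, linear outside it."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.maximum(0.0, residual - epsilon)

# Made-up targets and predictions (assumption)
losses = epsilon_insensitive_loss([100.0, 100.0, 100.0], [100.3, 101.0, 97.5])
```

Given that ADR values here are around 100, a band of 0.5 is narrow, so nearly every training error is penalised.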

Predictions are now made using the feature values in the test set:

>>> svm_reg_05.predict(atest)
array([ 81.7431138 , 107.46098525, 107.46098525, ...,  94.50144931,
        94.202052  ,  94.50144931])

Here is the mean absolute error relative to the mean:

>>> mean_absolute_error(btest, bpred)
30.332614341027753
>>> np.mean(btest)
105.30446539770578

The MAE is about 29% of the size of the mean.

Let’s see if a regression-based neural network can do any better.

Regression-based neural network

The neural network is defined as follows:

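from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Two hidden ELU layers (8 and 2670 units) feeding a single linear output for ADR
model = Sequential()
model.add(Dense(8, input_dim=8, kernel_initializer='normal', activation='elu'))
model.add(Dense(2670, activation='elu'))
model.add(Dense(1, activation='linear'))
model.summary()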

The model is then trained across 30 epochs using a batch size of 150:

model.compile(loss='mse', optimizer='adam', metrics=['mse','mae'])
history=model.fit(xtrain_scale, ytrain_scale, epochs=30, batch_size=150, verbose=1, validation_split=0.2)
predictions = model.predict(xval_scale)
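The names xtrain_scale and ytrain_scale suggest that the inputs and target were rescaled before training, although that step is not shown here. If so, the network's predictions are in scaled units and need to be mapped back before an MAE can be compared with ADR. The sketch below assumes min-max scaling (the actual scaler may differ) and uses hypothetical helper names and made-up values.

```python
import numpy as np

def minmax_fit(values):
    """Return the minimum and maximum used for scaling."""
    values = np.asarray(values, dtype=float)
    return values.min(), values.max()

def minmax_transform(values, lo, hi):
    """Map values from [lo, hi] onto [0, 1]."""
    return (np.asarray(values, dtype=float) - lo) / (hi - lo)

def minmax_inverse(scaled, lo, hi):
    """Map scaled values back to the original units."""
    return np.asarray(scaled, dtype=float) * (hi - lo) + lo

# Made-up training targets (assumption)
y_train = np.array([60.0, 90.0, 120.0, 210.0])
lo, hi = minmax_fit(y_train)
y_scaled = minmax_transform(y_train, lo, hi)

# Pretend the network returned these scaled predictions
scaled_predictions = np.array([0.2, 0.5])
adr_predictions = minmax_inverse(scaled_predictions, lo, hi)
```

The scaling parameters are taken from the training targets only, so the same mapping can be reused on validation and test predictions.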

With the features from the test set now fed into the model, here are the MAE and mean values:

>>> mean_absolute_error(btest, bpred)
28.908454264679218
>>> np.mean(btest)
105.30446539770578

We see that the MAE is only slightly lower than that achieved using the SVM. In this regard, it is hard to justify the use of a neural network in predicting customer ADR when the linear SVM model showed virtually the same level of accuracy.
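To put numbers on "slightly lower", the arithmetic below uses the MAE and mean values printed above. Only the variable names are additions.

```python
# Figures copied from the outputs above
svm_mae = 30.332614341027753
nn_mae = 28.908454264679218
mean_adr = 105.30446539770578

svm_ratio = svm_mae / mean_adr               # SVM MAE as a share of the mean
nn_ratio = nn_mae / mean_adr                 # neural network MAE as a share of the mean
improvement = (svm_mae - nn_mae) / svm_mae   # relative reduction in MAE
```

That works out to roughly 28.8% versus 27.5% of the mean, a relative improvement of under 5% for a far larger model.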

In any event, factors such as the choice of features used to “explain” ADR are of more relevance than the model itself. As the saying goes, “garbage in, garbage out”. If feature selection is poor, then the output of the model will also be poor.

In this case, while both regression models have shown a degree of predictive power, it is quite possible that either 1) selection of other features in the dataset could improve accuracy further, or 2) there is simply too much variation in ADR that cannot be accounted for by the features in the dataset. For instance, the dataset tells us nothing about factors such as income level for each customer, which would be expected to significantly influence their average spend per day.

In the two examples above, we have seen that “lighter” models have been able to match (or surpass) the accuracy achieved by deep learning models.

While there are cases where the data is complex enough to require an algorithm that learns patterns in the data “from scratch”, this tends to be the exception rather than the rule.

As with any data science problem, the key is first to understand the data one is working with. The choice of model is secondary.

Many thanks for your time, and any questions or feedback are greatly appreciated!

The datasets and Jupyter notebooks for the above examples can be found here.

Disclaimer: This article is written on an “as is” basis and without warranty. It was written with the intention of providing an overview of data science concepts, and should not be interpreted as professional advice in any way.
