In my last post I discussed how linear regression can be used to make predictions on a time series, and illustrated the point with a dataset of shampoo sales covering a three year period in the early 20th century. In fact, before Facebook Prophet became open source in 2017, I believe this model was used as an alternative to Python’s statsmodels library.
The linear regression model can be written from scratch in Python or taken from the sklearn or statsmodels library. In my most recent post I elected to use the LinearRegression() function from Python’s sklearn library. That post can be found here:- https://medium.com/ai-in-plain-english/linear-regression-can-be-used-to-predict-sales-of-shampoo-548d4a1e1efd
Another model that is said to be useful for making predictions on a time series dataset is Random Forest, so I decided to try it out to see whether I would get better results than with linear regression. I chose RandomForestRegressor() from the sklearn library because sales are a continuous value, which makes forecasting them a regression problem.
I created my program in Google Colab, which is a free online Jupyter Notebook that enables me to code without having to install Linux, Python or any libraries on the computer I used for this project.
Once I had created the notebook, I only needed to import the libraries I would use, which in this instance were pandas, matplotlib and sklearn.
Once pandas in particular had been imported, I read the shampoo csv file into the program:-
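A minimal sketch of this step is shown here. The file name `shampoo.csv` and the column names `Month` and `Sales` are assumptions, and an in-memory string with placeholder values stands in for the real file so that the snippet runs anywhere:

```python
import io

import pandas as pd

# Placeholder rows standing in for the real file; these are NOT the actual
# shampoo figures, and the column names are assumed.
csv_text = """Month,Sales
1901-01-01,100.0
1901-02-01,120.0
1901-03-01,90.0
1901-04-01,150.0
"""

# In the notebook this would be something like pd.read_csv("shampoo.csv").
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
```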
I then checked for any null values, and on this occasion there were none that needed to be imputed.
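One common way to run this check in pandas is `isnull().sum()`, which counts the missing values in each column. The sketch below uses a small placeholder dataframe with assumed column names:

```python
import pandas as pd

# Placeholder data with assumed column names, not the real dataset.
df = pd.DataFrame(
    {
        "Month": ["1901-01-01", "1901-02-01", "1901-03-01"],
        "Sales": [100.0, 120.0, 90.0],
    }
)

# Count missing values per column, then in total.
null_counts = df.isnull().sum()
total_nulls = int(null_counts.sum())
print(null_counts)
```

A total of zero means there is nothing to impute.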
Because the model will not accept a date, I used a lambda to create a small function that converts each date to a number from the Gregorian calendar and stores it in a new column created for that purpose:-
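A sketch of one way to do this conversion, assuming the `Month` column can be parsed by `pd.to_datetime` and that the new column is called `Date` (both names are assumptions). `toordinal()` returns the number of days since 1 January of year 1 in the Gregorian calendar:

```python
import pandas as pd

# Placeholder data with assumed column names, not the real dataset.
df = pd.DataFrame(
    {
        "Month": ["1901-01-01", "1901-02-01", "1901-03-01"],
        "Sales": [100.0, 120.0, 90.0],
    }
)

# Parse the strings as dates, then map each date to its Gregorian ordinal.
df["Date"] = pd.to_datetime(df["Month"]).apply(lambda d: d.toordinal())
print(df)
```

Consecutive months differ by the number of days in the month, so the resulting feature still increases steadily with time.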
Once the date had been converted to a number, I had no further need for the Month column, so dropped it from the dataframe:-
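Dropping a column is a one-line call to `DataFrame.drop`. The sketch below again uses placeholder values and assumed column names:

```python
from datetime import date

import pandas as pd

# Placeholder data with assumed column names, not the real dataset.
df = pd.DataFrame(
    {
        "Month": ["1901-01-01", "1901-02-01"],
        "Sales": [100.0, 120.0],
        "Date": [date(1901, 1, 1).toordinal(), date(1901, 2, 1).toordinal()],
    }
)

# Remove the original string column now that the numeric date exists.
df = df.drop(columns=["Month"])
print(df.columns.tolist())
```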
I then plotted the numeric date and sales on a graph using the matplotlib library, and it can be visually seen that the sales were increasing with time:-
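A minimal plotting sketch with matplotlib, using placeholder sales values and assumed column names (`Date`, `Sales`):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Six monthly dates converted to Gregorian ordinals; sales are placeholders.
months = pd.date_range("1901-01-01", periods=6, freq="MS")
df = pd.DataFrame(
    {
        "Date": [d.toordinal() for d in months],
        "Sales": [100.0, 120.0, 90.0, 150.0, 170.0, 160.0],
    }
)

# Line plot of sales against the numeric date.
fig, ax = plt.subplots()
ax.plot(df["Date"], df["Sales"])
ax.set_xlabel("Date (ordinal)")
ax.set_ylabel("Sales")
ax.set_title("Shampoo sales over time")
```

In a notebook the figure renders when the cell finishes; in a plain script, `plt.show()` would be needed at the end.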
I defined the X and y variables, which are needed to prepare the data for forecasting. The X variable consists of the numeric Gregorian date and the y variable is the shampoo sales.
Once the X variable had been defined, I normalised the values so that all of them fall between 0 and 1:-
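One way to scale values into the range 0 to 1 is sklearn's `MinMaxScaler`. Whether the notebook used this class or a manual formula is not stated, so treat the sketch below as an assumption:

```python
from datetime import date

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Six monthly ordinal dates as a single-feature matrix (n_samples, 1).
X = np.array([[date(1901, m, 1).toordinal()] for m in range(1, 7)], dtype=float)

# Rescale so the smallest value maps to 0 and the largest to 1.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.ravel())
```

Fitting the scaler on the whole series before splitting lets the validation dates influence the scaling. Tree-based models such as Random Forest are also largely insensitive to this kind of monotonic rescaling, so the step mostly keeps the pipeline comparable with the linear regression post.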
When the data preprocessing was complete, I split the data up for training and validation, with the validation set being 10% of the dataset.
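A sketch of a 10% validation split with `train_test_split`. The arrays are placeholders for 36 monthly observations, and `shuffle=False` is an assumption (the post does not say whether the rows were shuffled); it keeps the final months as the validation set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature and target for 36 monthly observations.
X = np.linspace(0.0, 1.0, 36)
y = np.arange(36, dtype=float)

# Hold out the last 10% of rows; shuffle=False preserves time order.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.1, shuffle=False
)
print(len(X_train), len(X_val))
```

With 36 rows, sklearn rounds the validation size up, giving 32 training rows and 4 validation rows.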
I then had to reshape the train and val dataframes in order to make them compatible with the model:-
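sklearn estimators expect the feature matrix to be two-dimensional, with shape `(n_samples, n_features)`, so a single feature held in a one-dimensional Series has to be reshaped into one column. A minimal sketch with placeholder values:

```python
import pandas as pd

# A single feature stored as a 1-D Series (placeholder values).
X_train = pd.Series([0.0, 0.2, 0.4, 0.6])

# Convert to NumPy and reshape to one column; -1 lets NumPy infer the row count.
X_train_2d = X_train.to_numpy().reshape(-1, 1)
print(X_train_2d.shape)
```

The target `y` can stay one-dimensional.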
I then defined the model, which in this instance was RandomForestRegressor() from the sklearn library. I set verbose=True because I wanted to see what actions the model was taking during training.
Once the model had been fitted to the training data, I predicted on the validation set. I achieved a mean squared error of 28,495 using this model:-
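A sketch of defining and fitting the model on placeholder data. Apart from `verbose=True`, the hyperparameters are library defaults, and `random_state=0` is an addition made here only so the sketch is repeatable:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder training data: 32 scaled dates and a made-up upward trend.
X_train = np.linspace(0.0, 0.9, 32).reshape(-1, 1)
y_train = 100.0 + 300.0 * X_train.ravel()

# verbose=True makes sklearn log progress while the trees are built.
model = RandomForestRegressor(verbose=True, random_state=0)
model.fit(X_train, y_train)
print(len(model.estimators_), "trees fitted")
```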
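The prediction and scoring step can be sketched as follows. The data is a made-up linear trend, so the error printed here has nothing to do with the 28,495 reported for the real dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Placeholder series: 36 scaled dates with a made-up upward trend.
X = np.linspace(0.0, 1.0, 36).reshape(-1, 1)
y = 100.0 + 300.0 * X.ravel()

# Last 4 rows held out for validation (assumes an unshuffled 10% split).
X_train, X_val = X[:32], X[32:]
y_train, y_val = y[:32], y[32:]

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Predict the held-out months and score with mean squared error.
y_pred = model.predict(X_val)
mse = mean_squared_error(y_val, y_pred)
print(f"Validation MSE: {mse:.1f}")
```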
As can be seen from the table below, the error achieved using Random Forest is higher than the error achieved using linear regression:-
Once the model had predicted on the validation set, I put the values on a graph. It can be seen from the illustration below the prediction was nothing more than a straight line:-
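A sketch of how such a plot might be produced, again on a made-up trend and assuming the validation months come after all the training months. Under that assumption every validation point falls past the last split of every tree and lands in the same leaf, so the forest returns one constant value:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder series: 36 scaled dates with a made-up upward trend.
X = np.linspace(0.0, 1.0, 36).reshape(-1, 1)
y = 100.0 + 300.0 * X.ravel()
X_train, X_val = X[:32], X[32:]
y_train, y_val = y[:32], y[32:]

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_val)

# Plot actual against predicted values for the validation months.
fig, ax = plt.subplots()
ax.plot(X_val.ravel(), y_val, label="Actual")
ax.plot(X_val.ravel(), y_pred, label="Predicted")
ax.set_xlabel("Date (scaled)")
ax.set_ylabel("Sales")
ax.legend()
```

The predicted line is flat here because a tree can only return averages of target values it saw in training, so it cannot continue a trend beyond the training range.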
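In summary, linear regression produced less error than Random Forest when predicting on this univariate time series dataset. A likely reason is that sales were trending upwards, and a Random Forest cannot extrapolate beyond the range of values it saw during training, whereas a fitted line can.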
The code for this post can be found in its entirety in my personal GitHub account: Shampoo/Random_Forest_Time_Series_Shampoo.ipynb at main · TracyRenee61/Shampoo (github.com)