- Introduction
- Model Comparison
- Summary
- References
Predicting the stock market has a reputation for being hard, and the more we repeat that claim, the more convinced we become that it will always be hard. But technology keeps advancing, and Machine Learning is no exception. When Data Science models are used to predict future stock prices, it is important to compare the error across individual stocks, and to examine whether certain days, weeks, months, or years are harder to predict than others. Just as we dig into the trends that make prediction possible, we should also examine how our predictions stack up against one another. More data is generally better in a time series problem: the more often we see these anomalies, the more they become patterns the algorithm can recognize.
Although certain events can feel like the first of their kind, there are usually many smaller, similar events that can be predicted from the features of our data. Some algorithms perform adequately and some perform better, so instead of testing every algorithm by hand, I am going to use an outstanding Machine Learning library named PyCaret. It automatically ranks the error metrics of the most common algorithms so that you can decide which one to save, deploy, and predict with.
With just a few lines of code, I will be able to predict the stock market with surprisingly low error, and so can you. To be clear, this is a tutorial, not financial advice.
I will be discussing the Python library that I used, along with the data, parameters, models compared, results, and code below.
Library
Using the power of PyCaret [3], you can now test every popular Machine Learning algorithm against the others. For this problem, I will be comparing MAE, MSE, RMSE, R2, RMSLE, MAPE, and TT (Sec), the time it takes for the model to train. Among the benefits of PyCaret, as stated by its developers, are increased productivity, ease of use, and business readiness, all of which I can personally attest to.
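To make these metrics concrete, here is a small, PyCaret-independent sketch of how MAE, MSE, RMSE, and MAPE are computed by hand on toy closing prices (the values are illustrative, not from the dataset):

```python
import math

def error_metrics(actual, predicted):
    """Compute a few of the error metrics PyCaret reports, by hand."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n                    # mean absolute error
    mse = sum(e * e for e in errors) / n                     # mean squared error
    rmse = math.sqrt(mse)                                    # root mean squared error
    mape = sum(abs(e / a) for a, e in zip(actual, errors)) / n  # mean absolute % error
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape}

# toy closing prices in dollars
actual = [100.0, 101.5, 99.0, 102.0]
predicted = [100.5, 101.0, 99.5, 101.0]
print(error_metrics(actual, predicted))  # MAE here works out to $0.625
```

Because the target is in dollars, MAE reads directly as an average dollar error, which is why it is such a convenient business-facing metric.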
Data
The dataset [4] that I am using is from Kaggle, and you can download it quickly and easily. It is a 19MB file containing five years of S&P 500 stock data, with individual data for each stock: 619,040 rows and 7 columns.
Columns:
- date
- open
- high
- low
- close
- volume
- Name
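A quick way to confirm you have the right file is to load it with pandas and check the shape and columns. The rows below are a tiny illustrative stand-in (the values are made up); the real CSV would be read with `pd.read_csv`:

```python
import pandas as pd

# a tiny stand-in for the Kaggle file; the real CSV has 619,040 rows
stocks_df = pd.DataFrame({
    "date": ["2013-02-08", "2013-02-11"],
    "open": [15.07, 14.89],
    "high": [15.12, 15.01],
    "low": [14.63, 14.26],
    "close": [14.75, 14.46],
    "volume": [8407500, 8882000],
    "Name": ["AAL", "AAL"],
})
stocks_df["date"] = pd.to_datetime(stocks_df["date"])  # dates parse as strings by default
print(stocks_df.shape)            # the real data is (619040, 7)
print(list(stocks_df.columns))
```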
Parameters
These are the parameters that I used in the `setup()` of PyCaret. The Machine Learning problem is a regression one on stock market data, with the target variable being the `close` field. For reproducibility, you can set a `session_id`. There are many more parameters, but these are the ones that I used, and PyCaret does a great job of automatically detecting information from your data, like which features are categorical, and it will confirm that with you in the `setup()`. Since this is a time series problem, meaning the data is ordered by date, the `fold_strategy` needs to be set to `timeseries`.
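The reason the `timeseries` fold strategy matters is that ordinary cross-validation shuffles rows, which would let the model "see the future". A rough, PyCaret-independent sketch of what expanding-window time-series folds look like (this is an illustration of the idea, not PyCaret's exact implementation):

```python
def timeseries_folds(n_rows, n_folds):
    """Expanding-window splits: training rows always precede test rows in time."""
    fold_size = n_rows // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_idx = list(range(0, k * fold_size))                 # everything up to the cut
        test_idx = list(range(k * fold_size, (k + 1) * fold_size))  # the next block in time
        yield train_idx, test_idx

for train, test in timeseries_folds(n_rows=12, n_folds=3):
    print(f"train rows 0-{train[-1]}, test rows {test[0]}-{test[-1]}")
```

Each fold trains on a longer prefix of history and validates on the block that follows it, so no fold ever trains on data from after its test period.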
I will be comparing 19 Machine Learning algorithms; some are incredibly popular, while others I had not even heard of, so it will be interesting to see which one wins on this dataset. It is important to keep in mind what winning means to you or your company; most likely this will be your error metric. For this business use case, I will be using MAE, or mean absolute error, as the metric for deciding which model ultimately wins. Another important factor is the time it takes to train the model. Some of these took a long time, as you will see.
Here are all of the models that I compared:
- Linear Regression
- Lasso Regression
- Ridge Regression
- Elastic Net
- Orthogonal Matching Pursuit
- Bayesian Ridge
- Gradient Boosting Regressor
- Extreme Gradient Boosting
- Random Forest Regressor
- Decision Tree Regressor
- CatBoost Regressor
- Light Gradient Boosting Machine
- Extra Trees Regressor
- AdaBoost Regressor
- K Neighbors Regressor
- Lasso Least Angle Regression
- Huber Regressor
- Passive Aggressive Regressor
- Least Angle Regression
Results
Alright, the fun part: comparing the models. As you can see in the screenshot below, there are pros and cons to each model, as always. Some look outright inaccurate, while some have an incredibly low error. Training time varied widely as well. So, which one won in our case? Orthogonal Matching Pursuit. To be honest, I had never heard of this algorithm before playing around with PyCaret; it is essentially a linear model, which is why it is important to test more than just the algorithms you already know, and to keep researching. It beat all models on MAE, MSE, RMSE, and R2 (tied for best). It is important to note that it did not win RMSLE or MAPE. It also had one of the fastest training times, which can be beneficial in production (and in testing).
Because the units of the target variable were dollars, we can say that the mean absolute error was about 50 cents. The `close` value is what we were trying to predict (the actual), and the `Label` value is the prediction. Overall, this model performed very well on this data!
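You can verify that dollar figure yourself from the predictions frame. Assuming, as in this article's PyCaret version, that the output has the actual `close` column and a `Label` column for the prediction (the toy values below are illustrative):

```python
import pandas as pd

# toy frame shaped like PyCaret's predict_model() output:
# the actual target ("close") plus the model's prediction ("Label")
predictions = pd.DataFrame({
    "close": [14.75, 14.46, 14.27],
    "Label": [14.30, 14.90, 14.80],
})

# MAE is just the mean of the absolute prediction errors, in dollars
mae_dollars = (predictions["Label"] - predictions["close"]).abs().mean()
print(f"MAE: ${mae_dollars:.2f}")
```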
Code
Here is the Python code that you can try yourself, from importing libraries and reading in your data, to setting up your regression, comparing models, and making predictions [8]:
# import libraries
from pycaret.regression import *
import pandas as pd

# read in your stock data
stocks_df = pd.read_csv('file location of your data on your computer.csv')

# set up your regression parameters
regression = setup(data=stocks_df,
                   target='close',
                   fold_strategy='timeseries',
                   session_id=100,
                   )

# compare models
compare_models()

# create the winning model (Orthogonal Matching Pursuit)
omp = create_model('omp')

# predict on the test set
predictions = predict_model(omp)
With stocks previously viewed as difficult to predict, we have shown that with very little code we can compare nearly 20 models, rank them on our success criteria, and pick a final, best model with low error. With these five years of data, we were able to predict the close price within about 50 cents. A next step would be to analyze how we did on data that would be considered anomalous; we might be good at predicting normal data but inaccurate on anomalies, which is arguably the most valuable thing to know about the stock market. It may also be that this model performs better over the long term than the short term, but that is yet to be analyzed. So I hand the duty off to you: predict the stock market, and dive deeper into which stock `Name`s did not perform as well, and which dominated in terms of error loss.
To summarize, we used PyCaret to train several models and ultimately chose a linear one, Orthogonal Matching Pursuit, using a time-series fold strategy within a regression setup, to predict the S&P 500.
I want to give thanks and admiration to Moez Ali for developing this awesome Data Science library.
I hope you found my article both interesting and useful. Please feel free to comment down below if you have applied this library to predict the stock market, or if you have used a different technique. Do you prefer one over the other? What kind of blockers do you run into when predicting on time series data?
These are my opinions, and it is not financial advice — this is a common dataset and Data Science project that most Data Scientists complete in college. I am not affiliated with any of these programs.
Please feel free to check out my profile and other articles, as well as reach out to me on LinkedIn.
[1] Photo by Austin Distel on Unsplash, (2019)
[2] Photo by Austin Distel on Unsplash, (2019)
[3] Moez Ali, PyCaret, (2021)
[4] Cam Nugent on Kaggle, S&P 500 stock data, (2021)
[5] M. Przybyla, Dataframe Screenshot, (2021)
[6] M. Przybyla, Model Comparison Screenshot, (2021)
[7] M. Przybyla, Predictions Screenshot, (2021)
[8] M. Przybyla, Python Code, (2021)