Use Correlation to predict Market Index

A market index is built from the stock prices of a list of major companies, so the index and the prices of those stocks should be correlated. Here I use a machine learning model (LSTM) to predict the market index from the historical data of selected stocks.

I use the Python package yfinance to get daily stock prices. I downloaded three years of data, including “Open”, “Close”, “High”, “Low”, and “Volume”.
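A minimal sketch of the download step, assuming the Ticker.history call from yfinance and a window of roughly three years ending on a chosen date. The helper names three_year_window and download_prices are hypothetical and are not taken from my repository; the yfinance import is deferred so the date helper can be tried without network access.

```python
from datetime import date, timedelta

COLUMNS = ["Open", "Close", "High", "Low", "Volume"]


def three_year_window(end: date) -> tuple[str, str]:
    """Return (start, end) ISO date strings spanning roughly three years."""
    start = end - timedelta(days=3 * 365)
    return start.isoformat(), end.isoformat()


def download_prices(ticker: str, end: date):
    """Fetch daily bars for one ticker (needs network access and yfinance)."""
    import yfinance as yf  # imported lazily so the helper above works offline

    start, stop = three_year_window(end)
    history = yf.Ticker(ticker).history(start=start, end=stop, interval="1d")
    return history[COLUMNS]
```

Calling download_prices with a ticker symbol and date.today() should give one row per trading day with the five columns listed above.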

The target variable of the prediction is the next-day rate of return of the index ETF close price, which is defined as

target(t) = ( Close(t+1) - Close(t) )/Close(t) * 100%
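The target formula can be sketched in pandas as follows. This assumes the close prices are held in a Series ordered by date; the function name next_day_return and the sample prices are only for illustration.

```python
import pandas as pd


def next_day_return(close: pd.Series) -> pd.Series:
    """Percentage return from day t to day t+1, aligned to day t."""
    return (close.shift(-1) - close) / close * 100.0


# Illustrative prices only, not real market data.
close = pd.Series([100.0, 102.0, 99.96])
target = next_day_return(close)
```

The last row has no next day, so its target is NaN and that row has to be dropped before training.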

Since the raw trading volume is orders of magnitude larger than the prices, I transformed it into:

log-volume = log(volume+1)

P.S. The +1 is added to avoid taking the log of zero.
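A short sketch of this transform, assuming the volume sits in a pandas Series. NumPy's log1p computes log(1 + x) directly, so a zero-volume day maps to 0. The sample volumes are made up.

```python
import numpy as np
import pandas as pd


def log_volume(volume: pd.Series) -> pd.Series:
    """Compress raw volume with log(volume + 1)."""
    return np.log1p(volume)


# Illustrative volumes only.
volume = pd.Series([0, 1_000_000, 25_000_000])
scaled = log_volume(volume)
```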

Assuming that the price of the index ETF depends on historical stock prices, I used the figures “Open”, “Close”, “High”, “Low”, and “Volume” to construct the features:

1. For each column in [“Open”, “Close”, “High”, “Low”, “Volume”], I computed 5 lags, where lag i is the figure from i days earlier (i ranges from 1 to 5).

2. Compute the lag-i-return by:

lag-i-return(t) = ( value(t) - value(t-i) )/ value(t-i) * 100%

3. Apply a PCA transform to all the lag-i features of [“Open”, “Close”, “High”, “Low”, “Volume”], which gives 5*5 = 25 features in total.

4. Use the first 3 PCA components as the final features, because together they already explain over 80% of the total variance.
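The four steps above can be sketched as follows. This is an illustration under several assumptions: the PCA input is taken to be the 25 lag-i-returns, the features are only centered (not standardized), and the PCA is done with a plain NumPy SVD rather than whichever library the actual code uses. The function names and the random-walk data are hypothetical.

```python
import numpy as np
import pandas as pd


def lag_returns(frame: pd.DataFrame, max_lag: int = 5) -> pd.DataFrame:
    """Build lag-i percentage returns for every column, i = 1..max_lag."""
    features = {}
    for column in frame.columns:
        for lag in range(1, max_lag + 1):
            previous = frame[column].shift(lag)
            name = f"{column}_lag{lag}_return"
            features[name] = (frame[column] - previous) / previous * 100.0
    # The first max_lag rows have no full history, so drop them.
    return pd.DataFrame(features).dropna()


def pca_project(features: pd.DataFrame, n_components: int = 3):
    """Project centered features onto the leading principal components."""
    values = features.to_numpy()
    centered = values - values.mean(axis=0)
    _, singular_values, components = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    scores = centered @ components[:n_components].T
    return scores, explained[:n_components]


# Synthetic random-walk data standing in for real prices.
rng = np.random.default_rng(0)
prices = pd.DataFrame(
    100.0 + rng.normal(size=(60, 5)).cumsum(axis=0),
    columns=["Open", "Close", "High", "Low", "Volume"],
)
features = lag_returns(prices)
scores, explained = pca_project(features)
```

Summing explained shows how much of the total variance the first 3 components keep; on real data this is the figure to compare against the 80% mentioned in step 4. Note that a zero in any column would make the lag return divide by zero, so such rows need handling first.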

I feed the 3 PCA features and the target variable into the LSTM model, a recurrent neural network commonly used for time series. Of the 3 years of data, about 80% is used for model training and the remaining 20% is held out as unseen data for model testing. I coded the LSTM model with PyTorch Lightning. You can find the code in my GitHub repository.
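A sketch of how the data could be shaped for the LSTM, assuming each sample is a window of seq_len consecutive feature rows labelled with the target of its last row, and that the 80/20 split is chronological so the test set stays unseen. Both choices are assumptions for illustration; the function names are hypothetical and the arrays are dummy values.

```python
import numpy as np


def make_windows(features: np.ndarray, target: np.ndarray, seq_len: int):
    """Stack seq_len consecutive rows; label each window with its last row's target."""
    count = len(features) - seq_len + 1
    windows = np.stack([features[i : i + seq_len] for i in range(count)])
    labels = target[seq_len - 1 :]
    return windows, labels


def chronological_split(windows: np.ndarray, labels: np.ndarray, train_fraction: float = 0.8):
    """Keep the earliest windows for training and the latest for testing."""
    cut = int(len(windows) * train_fraction)
    return (windows[:cut], labels[:cut]), (windows[cut:], labels[cut:])


# Dummy data: 100 days, 3 PCA features per day.
features = np.arange(300, dtype=float).reshape(100, 3)
target = np.arange(100, dtype=float)
windows, labels = make_windows(features, target, seq_len=5)
(train_x, train_y), (test_x, test_y) = chronological_split(windows, labels)
```

The resulting arrays have shape (samples, seq_len, features), which matches the batch_first layout accepted by an LSTM layer in PyTorch.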

I used the package Ray Tune for hyper-parameter tuning of the PyTorch model. The hyper-parameters include:

sequence length of the time series
no. of hidden states in the LSTM layer
batch size for the model training
dropout rate for the LSTM output
learning rate (lr) for model training
no. of LSTM layers

from ray import tune

config = {
    "seq_len": tune.choice([5, 10]),
    "hidden_size": tune.choice([10, 50, 100]),
    "batch_size": tune.choice([30, 60]),
    "dropout": tune.choice([0.1, 0.2]),
    "lr": tune.loguniform(1e-4, 1e-1),
    "num_layers": tune.choice([2, 3, 4]),
}

As I have limited computing resources, I chose only 10 stocks to predict the market index. For each of the 10 stocks, I joined its historical prices with the market index, and then ran the model training, testing, and hyper-parameter tuning.
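The join of one stock with the index could look like the sketch below. It assumes both tables are indexed by trading date and that only dates present in both are kept; the prefixes, the function name, and the sample values are hypothetical.

```python
import pandas as pd


def join_with_index(stock: pd.DataFrame, index: pd.DataFrame) -> pd.DataFrame:
    """Inner-join stock and index rows on their shared trading dates."""
    return stock.add_prefix("stock_").join(index.add_prefix("index_"), how="inner")


# Illustrative values only.
stock = pd.DataFrame(
    {"Close": [10.0, 11.0, 12.0]},
    index=pd.to_datetime(["2023-01-02", "2023-01-03", "2023-01-04"]),
)
index = pd.DataFrame(
    {"Close": [200.0, 201.0, 202.0]},
    index=pd.to_datetime(["2023-01-03", "2023-01-04", "2023-01-05"]),
)
joined = join_with_index(stock, index)
```

An inner join drops days on which only one of the two series traded, which keeps the stock features and the index target aligned row by row.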

I compared the HK market (HSI) and the US market (NASDX).

For HSI, the stocks chosen are:

https://github.com/iwasnothing/IndexCorTrade/blob/main/hsi.csv

Here is the final result.
