
In Section 3 of the previous post, I used the following three factors (each the z-score of the cross-sectional rank) and simply took their average to calculate the final alpha signal (sketched just after the list):
- Momentum 1 Year Factor
- Mean Reversion 5 Day Sector Neutral Smoothed Factor
- Overnight Sentiment Smoothed Factor
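As a quick recap, that combination can be sketched in pandas as follows; the three variable names are hypothetical stand-ins for the factors above, each assumed to be a DataFrame of dates by assets:

```python
import pandas as pd

def zscore_rank(factor: pd.DataFrame) -> pd.DataFrame:
    """Cross-sectional rank per date, then z-score (rows = dates, columns = assets)."""
    ranked = factor.rank(axis=1)
    return ranked.sub(ranked.mean(axis=1), axis=0).div(ranked.std(axis=1), axis=0)

# Hypothetical inputs: the three factors from the previous post
alpha = (zscore_rank(momentum_1yr)
         + zscore_rank(mean_reversion_5d)
         + zscore_rank(overnight_sentiment)) / 3
```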
Now, taking these as part of the feature set for machine learning, let's consider some more features.
Universal Quant Features
Add some features to capture characteristics of each stock in the universe (see the sketch after the list):
- Annualized volatility over 20 and 120 days
- Average dollar volume over 20 and 120 days
- Sector
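A minimal pandas sketch of the first two, assuming returns, close and volume are dates-by-assets DataFrames (the names are illustrative):

```python
import numpy as np
import pandas as pd

def annualized_volatility(returns: pd.DataFrame, window: int) -> pd.DataFrame:
    """Rolling standard deviation of daily returns, annualised with sqrt(252)."""
    return returns.rolling(window).std() * np.sqrt(252)

def average_dollar_volume(close: pd.DataFrame, volume: pd.DataFrame,
                          window: int) -> pd.DataFrame:
    """Rolling mean of price times volume."""
    return (close * volume).rolling(window).mean()

vol_20d = annualized_volatility(returns, 20)
vol_120d = annualized_volatility(returns, 120)
adv_20d = average_dollar_volume(close, volume, 20)
adv_120d = average_dollar_volume(close, volume, 120)
```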
Regime Features
Add some features to capture market-wide regimes (see the sketch after the list):
- High and low volatility over 20 and 120 days
- High and low dispersion over 20 and 120 days
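One way to sketch these, reusing the returns DataFrame from above; note that the exact definitions of market volatility and dispersion here are assumptions, not taken from the original post:

```python
import numpy as np
import pandas as pd

def market_volatility(returns: pd.DataFrame, window: int) -> pd.Series:
    """Rolling annualised volatility of the equal-weighted market return."""
    market_return = returns.mean(axis=1)
    return market_return.rolling(window).std() * np.sqrt(252)

def market_dispersion(returns: pd.DataFrame, window: int) -> pd.Series:
    """Rolling mean of the cross-sectional standard deviation of returns."""
    return returns.std(axis=1).rolling(window).mean()

def high_low_flag(series: pd.Series) -> pd.Series:
    """Binary high/low regime flag: above or below the expanding median
    (the expanding window avoids look-ahead; the threshold is an assumption)."""
    return (series > series.expanding().median()).astype(int)

vol_regime_20d = high_low_flag(market_volatility(returns, 20))
disp_regime_120d = high_low_flag(market_dispersion(returns, 120))
```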
Date Features
Make columns for the trees to split on that might capture trader/investor behavior driven by calendar anomalies, such as the start and end of a month or quarter.
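For example, assuming all_factors is the feature DataFrame with a datetime level named 'date' in its index (the column names are illustrative):

```python
# Assumed layout: all_factors has a MultiIndex with a datetime level named 'date'
dates = all_factors.index.get_level_values('date')
all_factors['month'] = dates.month
all_factors['weekday'] = dates.weekday
all_factors['quarter'] = dates.quarter
all_factors['month_start'] = dates.is_month_start.astype(int)
all_factors['month_end'] = dates.is_month_end.astype(int)
all_factors['quarter_end'] = dates.is_quarter_end.astype(int)
```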
One Hot Encoding
The sector value runs from 0 to 10, but these are just labels; the numbers themselves hold no meaning. If we used them directly as a feature, the model would treat them as numerical input, e.g. 3 is larger than 2 and smaller than 5, or 1 + 6 = 7, which is not true here. In this case we can apply one-hot encoding. The pandas function get_dummies or OneHotEncoder from sklearn.preprocessing is useful; both basically do this:
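For example, with get_dummies:

```python
import pandas as pd

# Sector codes are categorical labels, not quantities
sector = pd.Series([0, 3, 7, 3], name='sector')
print(pd.get_dummies(sector, prefix='sector', dtype=int))
#    sector_0  sector_3  sector_7
# 0         1         0         0
# 1         0         1         0
# 2         0         0         1
# 3         0         1         0
```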
Let the model predict the forward 1-week return based on the above features. Although this could be framed as either regression or classification, fitting a model to the exact return value sounds like overkill. Instead, quantise the return into buckets, fit a classifier to predict the bucket, and then apply a weight to each bucket to synthesise the alpha factor.
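A sketch of the quantisation, assuming the forward returns form a Series with a (date, asset) MultiIndex; the choice of 5 buckets is an assumption:

```python
import pandas as pd

def quantize_returns(fwd_returns: pd.Series, bins: int = 5) -> pd.Series:
    """Label each stock's 1-week forward return with its cross-sectional
    quantile bucket per date (0 = worst, bins - 1 = best)."""
    return fwd_returns.groupby(level='date').transform(
        lambda x: pd.qcut(x, bins, labels=False, duplicates='drop'))
```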
Now that the features and the target are ready, a standard machine learning model can be applied.
Split the data
The data should not be shuffled: the train, validation and test sets must be taken in chronological order, oldest first, because we use past data to predict the future. The following piece of code performs the split into a 60% train, 20% validation and 20% test set, analogous to train_test_split() from sklearn but without shuffling:
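A minimal sketch of such a split, assuming X and y hold the features and quantised labels (in practice it is safer to split on unique dates so that a single date does not straddle two sets):

```python
def non_shuffled_split(X, y, train_size=0.6, valid_size=0.2):
    """Chronological 60/20/20 split with no shuffling; assumes the rows of
    X and y are already sorted from oldest to newest."""
    n = len(X)
    i_train = int(n * train_size)
    i_valid = int(n * (train_size + valid_size))
    return (X[:i_train], X[i_train:i_valid], X[i_valid:],
            y[:i_train], y[i_train:i_valid], y[i_valid:])

X_train, X_valid, X_test, y_train, y_valid, y_test = non_shuffled_split(X, y)
```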
Training
Use a tree-based classifier such as a random forest, as it offers better explainability. Tune the hyperparameters based on the validation score, keeping min_samples_leaf at or above the number of stocks in the universe.
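An illustrative setup; the hyperparameter values are placeholders to be tuned on the validation score, and the universe size is assumed:

```python
from sklearn.ensemble import RandomForestClassifier

n_stocks = 500  # assumed size of the stock universe
clf = RandomForestClassifier(
    n_estimators=100,
    min_samples_leaf=n_stocks,  # >= number of stocks in the universe
    n_jobs=-1,
    random_state=0,
)
clf.fit(X_train, y_train)
print('train accuracy:     ', clf.score(X_train, y_train))
print('validation accuracy:', clf.score(X_valid, y_valid))
```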
Although the accuracy is relatively low, it is still better than averaging the factors with equal or random weights. You can visualise the decision trees and the feature importances to understand the logic behind the model:
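For instance:

```python
import pandas as pd
from sklearn.tree import export_graphviz

# Feature importances, averaged over all trees in the forest
importances = pd.Series(clf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))

# Export a single tree of the forest for inspection (render tree.dot with graphviz)
export_graphviz(clf.estimators_[0], out_file='tree.dot',
                feature_names=list(X_train.columns), filled=True)
```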
Once the prediction is made on the quantised labels, apply a weight to each bucket to generate a synthesised alpha vector, on which the performance is measured.
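One way to sketch this weighting; the bucket weights below are an assumption, chosen so the bottom bucket is shorted and the top bucket is held long:

```python
import numpy as np
import pandas as pd

# Hypothetical weights for the 5 quantile buckets, symmetric around zero
bucket_weights = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])

# predict_proba returns one probability per bucket; the columns are ordered
# by clf.classes_, so index the weights accordingly before the weighted sum
prob = clf.predict_proba(X_valid)  # shape: (n_samples, n_buckets)
alpha_vector = pd.Series(prob.dot(bucket_weights[clf.classes_.astype(int)]),
                         index=X_valid.index)
```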
The performance is measured with alphalens in the same way as explained in the previous post.
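Sketched with the usual alphalens workflow, assuming alpha_vector has a (date, asset) MultiIndex and prices is a dates-by-assets DataFrame of close prices:

```python
import alphalens as al

factor_data = al.utils.get_clean_factor_and_forward_returns(
    alpha_vector, prices, quantiles=5, periods=(5,))
al.tears.create_full_tear_sheet(factor_data)
```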
Ensemble of non-overlapping trees
Overlapping samples tend to cause overfitting and poor performance in production. Because the 1-week forward-return labels overlap from one trading day to the next, the samples are non-IID (i.e. not Independent and Identically Distributed); training each tree on non-overlapping samples mitigates the likely overfitting.
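One possible sketch of this idea (an assumption, not necessarily the exact recipe): with a 5-trading-day label horizon, rows less than a week apart share overlapping label windows, so train one forest per weekly offset on every 5th trading day and average the ensemble's predictions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_non_overlapping_ensemble(X, y, trading_days, horizon=5, **rf_kwargs):
    """Train one forest per offset so each forest only sees samples whose
    1-week labels do not overlap. trading_days is the sorted array of
    unique dates covering X's 'date' index level."""
    forests = []
    for offset in range(horizon):
        keep = trading_days[offset::horizon]  # every 5th trading day
        mask = X.index.get_level_values('date').isin(keep)
        forests.append(RandomForestClassifier(**rf_kwargs).fit(X[mask], y[mask]))
    return forests

def ensemble_predict_proba(forests, X):
    """Average the bucket probabilities across the non-overlapping forests."""
    return np.mean([f.predict_proba(X) for f in forests], axis=0)
```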
Test
Finally, once all the parameters have been fixed, test the performance on the unseen test set. As is always the case for any machine learning, the parameters should not be changed based on the result of this test, to avoid leakage.
First, roll the training forward to the "current day" as it would be in production, i.e. re-train the model on the combined train and validation datasets. Then use X_test to see the performance on the test set.
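A minimal sketch of this final step, reusing the tuned classifier's hyperparameters via sklearn's clone:

```python
import pandas as pd
from sklearn.base import clone

# Roll forward: re-train with the fixed hyperparameters on train + validation
clf_final = clone(clf)
clf_final.fit(pd.concat([X_train, X_valid]), pd.concat([y_train, y_valid]))

print('test accuracy:', clf_final.score(X_test, y_test))
# The synthesised alpha on X_test is then evaluated with alphalens as above.
```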