This is the first part of a two-part series on how I built a Twitter bot that predicts who is going to win a T20 match. In this part, we will discuss the machine learning side of the bot: getting the data, transforming it, and training the model. In part two, we will discuss the Twitter bot itself: how it gets the match data and tweets the prediction in real time.
My cricket fever is still at an all-time high; there is something about the Pakistan Super League that brings out the crazy cricket fan in me. After scraping ball-by-ball data for all previous PSL seasons (you can read that here) and then running some basic data analysis and creating visualizations on that dataset (you can read that here), I wanted to do more with the dataset. Then it hit me: why not train a model that would predict the likelihood of a team winning the match?
Those who watch cricket will know that most modern-day broadcasts have a win predictor that shows what percentage chance each team has of winning. Seeing such a predictor on the PSL broadcast made me think of making my own.
The predictor I have made is VERY BASIC. It does not account for a lot of factors that make a huge impact on the match, such as the venue, pitch, weather, form of players, and teams. All it looks at is how many runs are needed, how many balls are in hand, and how many wickets have fallen. Owing to my busy schedule, making a more detailed predictor is not really feasible and I was more interested in the whole process of making one than the actual results.
What I wanted to teach my model is, given a certain scenario, what are the chances that the team chasing will win. For this, I needed ball by ball data for matches, such that I would know at any given ball of the second inning, what the score was, and whether the team chasing won or not.
I started with the PSL dataset that I scraped myself (link to the post). It contains ball-by-ball details for both innings of every PSL match. I trained the first model on this dataset, but since PSL is relatively new and each season has a maximum of 34 matches, this dataset was too small (33962 rows in total).
Next, I decided to look for another dataset. I came across this dataset on Kaggle which contained ball by ball detail for each match of the Indian Premier League. IPL has been running for over 10 years, has far more teams and far more matches per season than PSL. This gave me a somewhat adequate dataset (179080 rows).
For solving a machine learning problem, the first step is to gather a dataset, the next is to clean it and transform it into a form that you can feed to your machine learning algorithm. Now with the data downloaded, I needed to transform it.
The predictor will only make predictions in the second inning of the match when the target has been set and the chasing team has started batting. My dataset contains ball by ball data for both innings. It also does not contain the target. Moreover, the IPL dataset did not contain values for who won the match in their ball by ball dataset.
The first step was to calculate the first-inning total for each match, which gives us the target that the second team had to chase. With pandas, we can easily achieve this.
We first generate a data frame that consists only of first-innings data. We then group by the match id and aggregate the runs column by applying a sum.
# Sum the runs of each first innings to get one total per match
first_inning_total = first_inning.groupby("match_id").agg(
    target=("total_runs", "sum")
)
This sums the first-inning runs for each match, giving us the total scored in the first inning, which we use as the target the second team has to chase. (Strictly speaking, the chasing side needs one run more than this total to win.)
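To make the step above concrete, here is a minimal, self-contained sketch on toy data. The `inning` column name and the toy values are assumptions for illustration; only `match_id` and `total_runs` appear in the snippet above.

```python
import pandas as pd

# Toy ball-by-ball data. The "inning" column name is an assumption;
# "match_id" and "total_runs" come from the snippet above.
balls = pd.DataFrame({
    "match_id":   [1, 1, 1, 1, 2, 2, 2],
    "inning":     [1, 1, 2, 2, 1, 1, 2],
    "total_runs": [4, 1, 6, 0, 2, 2, 1],
})

# Keep only first-innings deliveries
first_inning = balls[balls["inning"] == 1]

# One row per match, holding the first-innings total as "target"
first_inning_total = first_inning.groupby("match_id").agg(
    target=("total_runs", "sum")
)
print(first_inning_total)
```

The result is indexed by `match_id`, so it can be joined back onto the second-innings rows.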
The IPL dataset came with two tables: a ball-by-ball details table and a match summary table. The winner of each match was given in the match summary table, so by joining both tables on match_id we were able to get the name of the winner of each match along with its ball-by-ball detail.
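A join like the one described above might look roughly like this. It is only a sketch: the `batting_team`, `winner`, and `chasing_won` column names are assumptions, and the real tables may need a key column renamed before merging.

```python
import pandas as pd

# Toy second-innings deliveries (column names other than match_id are assumed)
deliveries = pd.DataFrame({
    "match_id":     [1, 1, 2],
    "batting_team": ["A", "A", "C"],
    "total_runs":   [1, 4, 0],
})

# Toy match summary table with one row per match
matches = pd.DataFrame({
    "match_id": [1, 2],
    "winner":   ["A", "D"],
})

# Attach the winner to every delivery; validate guards against
# duplicate match rows silently multiplying the deliveries.
merged = deliveries.merge(
    matches, on="match_id", how="left", validate="many_to_one"
)

# Label: 1 if the chasing (batting) team won the match, else 0
merged["chasing_won"] = (merged["batting_team"] == merged["winner"]).astype(int)
print(merged)
```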
Now we have the target, the ball-by-ball data for the second inning, and who won the match. We simply apply a few more dataset-specific transformations to calculate things such as the number of fallen wickets, runs scored, target, overs, and balls left, and we have our dataset ready. (All of the code is on GitHub and the link is shared at the end of this article.)
By combining the IPL and PSL datasets and keeping second-inning data only, we end up with 102560 rows that we can train our model on.
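The dataset-specific transformations are in the GitHub repository; the sketch below only illustrates the general idea with running totals per match. The column names `over`, `ball`, and `is_wicket` are assumptions, and for simplicity every row is treated as a legal delivery (real data would need wides and no-balls excluded from the ball count).

```python
import pandas as pd

BALLS_PER_INNINGS = 120  # 20 overs of 6 balls in a T20 innings

# Toy second-innings data for two matches (column names are assumed)
second = pd.DataFrame({
    "match_id":   [1, 1, 1, 1, 2, 2],
    "over":       [1, 1, 1, 1, 1, 1],
    "ball":       [1, 2, 3, 4, 1, 2],
    "total_runs": [4, 0, 1, 6, 1, 2],
    "is_wicket":  [0, 1, 0, 0, 1, 0],
    "target":     [150, 150, 150, 150, 120, 120],
})

# Make sure deliveries are in playing order before taking running totals
second = second.sort_values(["match_id", "over", "ball"]).reset_index(drop=True)

# Running score and running wickets within each match
second["runs_scored"] = second.groupby("match_id")["total_runs"].cumsum()
second["wickets"] = second.groupby("match_id")["is_wicket"].cumsum()

# Number of deliveries bowled so far in each match (1 for the first ball)
second["balls_bowled"] = second.groupby("match_id").cumcount() + 1

second["runs_left"] = second["target"] - second["runs_scored"]
second["balls_left"] = BALLS_PER_INNINGS - second["balls_bowled"]

print(second[["match_id", "wickets", "balls_left", "runs_left"]])
```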
As you can see in the image above, we are just using wickets, balls_left, and runs_left as our features for the model. A proper model would have a lot more features: who is at the crease, who is bowling, the type of pitch, the weather, the form of both teams, and so on. The range of possible features is quite large; this is a very simple bot based on simple features.
The first model I trained used more features. The feature set included runs_scored, overs, ball, and target. The model trained on these features did not perform that well in real life. Even though its accuracy on the test set was over 80%, its predictions on real matches were not good. The model was swayed a lot by wickets: even with 5 runs required in the last over and only 5 wickets down, it predicted that the bowling side would win. If you changed the wickets from 5 to 4, it predicted that the batting side would win. It seemed that the model had learned that the higher the wicket count, the lower the chance of winning, while it did not pick up the relationship between target and score. So I decided to simplify the features. Instead of target and score as separate features, I used runs left (the target minus the score), and instead of over and ball, I used the feature balls_left.
With these new features, the model seemed to perform much better in real-life conditions. It was able to learn that the fewer runs left, the higher the chance of the batting team winning; the more balls left, the higher the chance; and the fewer wickets lost, the higher the chance. It also picked up the relationship between these features.
So by using fewer and simpler features, our model was able to perform better.
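One way to run the kind of sanity check described above (hold the match situation fixed and vary only the wickets) is sketched below. The training data here is synthetic and purely illustrative, with a made-up labeling rule, and `probe_wickets` is a hypothetical helper name; none of it comes from the original code.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["wickets", "balls_left", "runs_left"]

# Synthetic stand-in for the real training data (illustration only)
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "wickets": rng.integers(0, 10, n),
    "balls_left": rng.integers(1, 120, n),
    "runs_left": rng.integers(1, 200, n),
})
# Made-up rule: the chase "succeeds" when the required rate is modest
toy_y = (toy["runs_left"] < 1.4 * toy["balls_left"]).astype(int)

model = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=0)
model.fit(toy[FEATURES], toy_y)


def probe_wickets(model, balls_left, runs_left):
    """Win probability for 0..9 wickets down in one fixed match situation."""
    grid = pd.DataFrame({
        "wickets": range(10),
        "balls_left": balls_left,
        "runs_left": runs_left,
    })
    # Column 1 of predict_proba is the probability of label 1 (chase succeeds)
    grid["win_prob"] = model.predict_proba(grid[FEATURES])[:, 1]
    return grid


# 5 runs needed off the last over, for every possible wicket count
probe = probe_wickets(model, balls_left=6, runs_left=5)
print(probe)
```

A large jump in `win_prob` between neighbouring wicket counts in a situation like this would point to the over-sensitivity to wickets described above.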
The problem can be simplified to simple binary classification. All we need to know is whether the team chasing won or not. So we can simply predict 0 or 1, 0 being the team lost and 1 being the team won.
The first step is to divide our dataset into X and y, where X holds the features and y is the result.
y = full_df.iloc[:,3]
X = full_df.iloc[:,:3]
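Positional slicing works as long as the column order never changes. An alternative sketch that selects by name is shown here; the label column name `chasing_won` and the toy rows are hypothetical, not taken from the original code.

```python
import pandas as pd

# Toy frame in the same layout: three feature columns, then the label.
# "chasing_won" is a hypothetical name for the label column.
full_df = pd.DataFrame({
    "wickets":     [1, 5],
    "balls_left":  [105, 6],
    "runs_left":   [137, 5],
    "chasing_won": [1, 0],
})

FEATURES = ["wickets", "balls_left", "runs_left"]

X = full_df[FEATURES]        # same columns as full_df.iloc[:, :3]
y = full_df["chasing_won"]   # same column as full_df.iloc[:, 3]
```

Keeping the feature list in one place also makes it easier to build the prediction input with the same columns in the same order.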
Next is to split the dataset into test and train sets.
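from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
)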
For the algorithm, since this is a simple binary classification, I just decided to use a Random Forest Classifier.
So the training can be run with the following code:
from sklearn.ensemble import RandomForestClassifier

RF = RandomForestClassifier(n_estimators=10000, max_depth=6, random_state=0)
RF.fit(X_train, y_train)
This will run training on the dataset. The training is quite quick and took a little over 3 minutes on my laptop.
Once the model has been trained, we can evaluate it on our test dataset.
from sklearn.metrics import accuracy_score

y_pred_test = RF.predict(X_test)
accuracy_score(y_test, y_pred_test)
We got a score of 0.77 which for such a basic model seems good enough.
To get a prediction, all we have to do is pass the model a data frame in which each row contains the features we trained on, and call the predict method on the model.
import pandas as pd

# Current match situation, using the same feature names as in training
current = {
    "wickets": 1,
    "balls_left": 105,
    "runs_left": 137,
}
current_df = pd.DataFrame(current, index=[0])
RF.predict(current_df)
The predict method returns the label the model has predicted (0 or 1 in our case). We, however, want the percentage chance of a win, so we can use the `predict_proba` method instead, which returns the probability of each label being the outcome.
RF.predict_proba(current_df)
array([[0.41718799, 0.58281201]])
This basically means there is about a 42% chance that the batting team is going to lose (in other words, that the team that batted first is going to win) and about a 58% chance that it will win. Based on these results, we can easily make our percentage prediction.
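Turning that probability row into percentages could look like the sketch below. `win_percentages` is a hypothetical helper; the probabilities are the ones shown above, and the label order `[0, 1]` is what a fitted scikit-learn classifier exposes through its `classes_` attribute for this kind of 0/1 target.

```python
import numpy as np


def win_percentages(proba_row, classes):
    """Map one predict_proba row to rounded win percentages.

    `classes` should be the model's `classes_` so each probability
    is matched to the right label (0 = chase fails, 1 = chase succeeds).
    """
    by_label = dict(zip(classes, proba_row))
    return {
        "chasing": round(float(by_label[1]) * 100, 1),
        "defending": round(float(by_label[0]) * 100, 1),
    }


# The output shown above, as returned by RF.predict_proba(current_df)
probs = np.array([[0.41718799, 0.58281201]])
result = win_percentages(probs[0], [0, 1])
print(result)
```

With a trained model in hand, the call would be `win_percentages(RF.predict_proba(current_df)[0], RF.classes_)`.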
My predictor (along with the Twitter bot) has been running for quite a few days now and has been making predictions on all live T20 matches. Honestly, I am quite surprised at how good the predictions have been. Predicting a cricket match, especially a T20 match, is not easy. Even with a big dataset, a complex model, and all kinds of features, you cannot account for when Afridi is going to go berserk and single-handedly win you a lost match.
What I was more interested in was how my predictor reacted to changes in a match. Based on the current situation of the match, can it make the right prediction, even though that prediction may not hold until the end of the match because of some match-changing performance?
By this measure, my predictor actually performed quite well. For a given scenario, it made predictions that made sense. If the chasing team was scoring well and had wickets in hand, it correctly gave the chasing team a higher probability of winning.
On the other hand, if the target was high and the chasing team had lost early wickets without many runs on the board, it rightly predicted that the bowling team had a higher chance of winning.
It also handled changing scenarios well. When the chasing team needs 25 runs in 11 balls and has already lost 5 wickets, it predicts that the bowling team has a slightly higher chance of winning.
But if Wahab Riaz out of nowhere decides to hit Steyn for two massive sixes, bringing the target within arm's reach, it quickly adjusts its prediction as well.
Across the multiple matches the predictor has run on, the results have been outstanding. The predictor adjusts well to changing scenarios, and its predictions are based on the current situation. With a little more tweaking, and maybe a more complex algorithm, we may be able to achieve even better predictions.
In the next part of this series, we will discuss how I created a Twitter bot that waits for a match to start, gets its score, and then makes a prediction and tweets it. Once the post is up, I will link it here.
Thank you for reading the article. If you have any questions, feel free to reach out to me on LinkedIn or Twitter.
PSL Data Set: https://www.kaggle.com/hassanj576/pakistan-super-leaguepsl-ball-by-ball-20162020
IPL Data Set: https://www.kaggle.com/patrickb1912/ipl-complete-dataset-20082020
Code for Transforming and Training: https://github.com/hassanj-576/t20_predictor
My LinkedIn: https://www.linkedin.com/in/shjalil/
My Twitter: @hj576