Building a sumo wrestling match predictor using machine learning

You can get a flavor of what actual sumo fighting looks like by watching one of the most beloved wrestlers of our time (circa 1980s), Chiyonofuji, in a series of matches. Note that this doesn’t convey the ritual, pageantry, and build-up that make this sport great; it shows only the actual clash, which is ultimately what my project focuses on.

This is just the tip of the iceberg so if this seems interesting to you I’ll recommend additional resources below.

Small disclaimer

I work in tech as a software manager and my opinions are strictly my own. I’m taking a Udacity course on Data Science for fun and make no claim of being an expert in data science or sumo. This is a learning experience and I would truly welcome any feedback or corrections.

Appreciation of sumo aside, the goal of this analysis is to tackle the following:

To create better visuals that help an interested fan understand the history of a particular wrestler. While there are some statistics available on a given wrestler in official sumo channels I personally find these stats difficult to digest. They are mostly writing, or plain dates, with no sense of context. Sure this wrestler weighs 300lbs now but is that more or less than what they have weighed in the past? Their current rank is M6 (see below) but how does their overall career look? I’d like to create a few visuals that answer questions of interest to me, specifically a heatmap which I find leads to a lot of insight.
To use some form of machine learning to predict who will win a given match. The ideal algorithm will take some information about two wrestlers and output who is most likely to win. Even after watching off and on for a few years, I still find that in many of the matches I have no clue who either wrestler is. If we could predict with some accuracy (let’s say even 55% of the time) that wrestler A would beat wrestler B, that would give me some indication of who to cheer on (depending on my mood, the likely winner or the underdog). Some initial reading suggests that Gaussian Naive Bayes, Logistic Regression, and a Decision Tree Classifier are the most likely candidates.

As mentioned above this article is meant to enhance the solution that can be found in my SumoPredictor jupyter notebook.

The article and notebook share the same headers to make it easy to follow along. I will leave the code in the notebook and keep most of the flavor and story in the article.

The analysis draws on two data sets from data.world that had exactly what I was looking for: a history of individual wrestlers (banzuke.csv) and the matches between them (results.csv).

For the most part, I will use English words in place of the Japanese counterpart but I think it makes for more interesting reading to sprinkle in a few Japanese terms.

Rikishi

The first of these terms is rikishi which is a sumo wrestler.

I found that rikishi was easy to remember when I was first learning sumo and will use it hereafter to discuss a particular wrestler.

Thanks to Wikipedia for this image

A rikishi can come from anywhere in the world (mostly Japan, Mongolia, but plenty of other countries) but lives in Japan. They typically start their journey early in their life and they are the star of this article.

We care about several aspects of a given rikishi as explained below.

To help understand more about rikishi let’s explore the banzuke.csv data set by each column.

Note that after the lower divisions were dropped, no column contained any NA values.

Column One — basho

Renamed: No

A “basho” is a tournament and is another Japanese term that I found easy to remember and will use throughout this article. Each basho takes place on an odd-numbered month of the year (Jan, March, May, July, Sept, Nov) and is held in various cities around Japan.

For each basho, there is a document released called a banzuke that outlines the wrestlers and their rankings. The data set used here is also called banzuke but outside of that, there is no need to remember this term.

The basho data we have goes from 1983 to 2020 (37 years) and is denoted in a “year.month” format (e.g. 2020.01). For my analysis, I was actually curious if the month played any part in a rikishi’s chance of winning and I split basho up into basho_year and basho_month.
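One way this split might look is sketched below. It assumes the basho value is handled as a string so the leading zero in the month survives, and the helper name split_basho is made up for illustration rather than taken from the notebook.

```python
def split_basho(basho):
    """Split a 'year.month' basho label such as '2020.01' into (year, month)."""
    year_text, month_text = str(basho).split(".")
    return int(year_text), int(month_text)


basho_year, basho_month = split_basho("2020.01")
print(basho_year, basho_month)  # 2020 1
```

With pandas, the same helper could be applied to the whole basho column, provided the column is read as text rather than as a floating point number.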

Column Two — id

Renamed: No

Each rikishi has a unique id in this dataset. This is our only immutable column that identifies a given rikishi (even their name can change).

Column Three — rank

Renamed: No

Rikishi are ranked in a hierarchy of divisions and rank.

For the purpose of this analysis, I reduced the dataset to the top division only (called Makuuchi which readers do not have to memorize). The lower divisions are perfectly respectable but this top division is the best-of-the-best and the only division of interest to me.

It is possible for rikishi to rise and fall between divisions and ranks but for our purposes, a rikishi is only relevant when they are in the top division.

The top division rankings work as follows:

The bottom ranks are all called Maegashira (hereafter abbreviated to “M” no need to memorize this term) and each rikishi is given a number to indicate their relative ranking starting with the lowest M16 up to M1. These are the rank and file of the top division. To get to this level a rikishi needs to fight through five other divisions which is no small feat! But for our purposes, the M’s are the lowest ranking members.
There are always two of each M ranking at a given time (2 x M16, 2 x M15…etc). M’s rank today typically stops at M16 but in our dataset, this actually goes all the way down to M18 which was an interesting insight. It appears the rules changed in 2004.
Next up from the M’s is the rank of Komusubi, which is colloquially referred to as “the meat grinder.” It is the most grueling rank, where you are pitted against all of the top-ranked rikishi, and it is a sort of testing ground to see if you are ready for the top tier.
The top tier of the top division is split between three ranks: Sekiwake to Ozeki to Yokozuna (the top).
Once a rikishi has made it to the rank of Yokozuna they are that rank for the rest of their career and cannot be demoted. All other ranks including Ozeki are ephemeral and a rikishi must maintain a certain number of wins to stay at that rank.

For the rank column I performed several data cleaning and transformations:

All rankings that did not fit in the highest division were removed. This is an analysis for the top ranking only and even though a given rikishi can drop into lower divisions and pop back up we consider all divisions below as a black box.
Each rank is divided into an idea of “east” and “west” which technically speaking denotes a slight difference in rank (east is higher than west) but practically speaking it’s irrelevant and the idea of east/west is dropped in this analysis.
This column contains alphanumeric codes like Y1e, Y2eHD that require too much thinking to digest so all ranks were converted to either MX (e.g. M16 to M1) or their proper name (e.g. Yokozuna).
All ranks are hierarchical and I created another column specifically for comparing rank. In this new column, each rank was assigned a number (starting with M18 the absolute lowest provided with a rank of 0 all the way to Yokozuna ranked at 21). This ranking value was a safe way to provide a number to categorical data as there is a clear linear hierarchy between the ranks.
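The cleaning steps above can be sketched roughly as follows. This is an illustration under assumptions, not the notebook's code: it assumes every top-division code starts with one letter (Y, O, S, K, or M), then a number, then "e" or "w", as in the Y1e and Y2eHD examples. The names clean_rank and RANK_AS_VALUE are hypothetical.

```python
import re

# Assumed code shape: letter, number, east/west marker, optional trailing flags.
_RANK_CODE = re.compile(r"^([YOSKM])(\d+)[ew]")

_PROPER_NAMES = {"Y": "Yokozuna", "O": "Ozeki", "S": "Sekiwake", "K": "Komusubi"}


def clean_rank(code):
    """Return 'M16', 'Yokozuna', etc., or None for codes outside the top division."""
    match = _RANK_CODE.match(code)
    if match is None:
        return None  # lower-division code: treated as a black box and dropped
    letter, number = match.group(1), int(match.group(2))
    if letter == "M":
        return f"M{number}"  # keep the number, drop east/west
    return _PROPER_NAMES[letter]


# Ordinal value per rank: M18 is 0, M1 is 17, then the four named ranks up to 21.
_RANK_ORDER = [f"M{n}" for n in range(18, 0, -1)] + [
    "Komusubi", "Sekiwake", "Ozeki", "Yokozuna",
]
RANK_AS_VALUE = {name: value for value, name in enumerate(_RANK_ORDER)}

print(clean_rank("Y2eHD"))                  # Yokozuna
print(RANK_AS_VALUE[clean_rank("M16w")])    # 2
```

Codes that do not match the pattern come back as None, which makes it easy to filter out the lower divisions in the same pass.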

Column Four — rikishi

Renamed: No

The rikishi column contains the wrestler’s name. This name is not their birth name but more of a stage name similar to what you might see in western wrestling (e.g. Hulk Hogan). These names are unique and serve as a sort of identifier. However, it cannot be relied upon as an immutable identifier as rikishi can change their name for various reasons. I like the idea of keeping this column as rikishi to throw some Japanese flair in there so this column is unaltered.

Column Five — heya

Renamed: Yes — Stable

A heya is where the wrestlers live and train and can be translated as stable. Life in a stable is super interesting but outside the scope of this analysis. If you’re curious about life in a stable I would recommend this documentary.

For our purposes, I altered the column name to stable as I constantly forget what heya means.

My assumption for the stable was that it would have a massive impact on their performance. Stables are run by retired rikishi and this is where current wrestlers live, train, and learn how to become the best wrestlers they can be. A rikishi does not typically wrestle against members of his stable unless there is some sort of playoff situation.

I found the stable to be the most difficult and disappointing aspect of this analysis. It just seems natural that where you train, where you live, where you learn would have a massive impact on your likelihood of winning a match. Because there are quite a number of stables I went through several ways of handling this categorical data as we learned and as found in this helpful article.

One-hot encoding presented too many columns and made it difficult to get any insight. Label encoding showed no correlation to winning which was a complete surprise, and just for good measure I tried hash encoding which showed zero correlation as well. I took this to mean that the stable really didn’t have as much impact as I would have thought.
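To make the difference between the first two encodings concrete, here is a small sketch using pandas. The rows and stable names are placeholders invented for illustration, not values from the data set, and hash encoding is left out because it needs an extra library.

```python
import pandas as pd

# Toy rows for illustration only; the stable names are placeholders.
toy = pd.DataFrame({"stable": ["A", "B", "A", "C"], "won": [1, 0, 1, 0]})

# One-hot encoding: one indicator column per stable, so the frame gets wider
# with every additional stable.
one_hot = pd.get_dummies(toy["stable"], prefix="stable")

# Label encoding: a single integer column. The integers follow order of first
# appearance and carry no real ordering between stables.
toy["stable_label"], stable_names = pd.factorize(toy["stable"])

print(one_hot.shape[1])               # 3
print(toy["stable_label"].tolist())   # [0, 1, 0, 2]
```

Because the label integers are arbitrary, a correlation between a label-encoded column and winning is hard to interpret either way, so a near-zero value may say more about the encoding than about the stables themselves.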

In all honesty, this was an aspect of sumo I was very excited to deep dive but after seeing the low correlation I decided to exclude it from the analysis as it just complicated the code and increased time to run.

Column Six- shusshin

Renamed: Yes — Hometown

Shusshin is the rikishi’s hometown. This column was unused throughout the analysis and only the column name was changed to make it easier to remember.

It’s possible that coming from a particular place would impact a given wrestler but I made a conscious choice to omit this from the analysis. This is more a matter of pride for locals and is actually a sore point for some Japanese sumo fans as Mongolians have been dominating sumo over the last few years.

Maybe an individual’s hometown or nationality has an impact on their ability to win but it feels like an ethical grey area to me so I decided to not use it even if that is at the expense of accuracy.

Column Seven — birth_date

Renamed: No

A typical rikishi begins their career quite young and this sport is grueling physically. My suspicion was that a rikishi’s age would play a big part in predicting their win. The birth date vs. the basho date was used to determine their age at the time of that particular tournament and stored in a separate column called age.
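A minimal sketch of the age calculation, assuming the birth date has already been parsed into a date object and treating each basho as starting on the first of its month (the data only gives year and month). The function name age_at_basho is hypothetical.

```python
from datetime import date


def age_at_basho(birth_date, basho_year, basho_month):
    """Whole years between birth_date and the first day of the basho month."""
    # Assumption: the basho is dated to the 1st of its month.
    basho_start = date(basho_year, basho_month, 1)
    years = basho_start.year - birth_date.year
    # Subtract one if the birthday has not yet occurred in the basho year.
    if (basho_start.month, basho_start.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


# Example birth date chosen for illustration.
print(age_at_basho(date(1985, 3, 11), 2020, 1))  # 34
```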

I also recently read the book Outliers, which demonstrated how a person’s birth month can impact their standing in a particular sport. That was super interesting, so I split the birth_date into the year and month to see if there was any correlation between their birth month and their ability to win. There wasn’t, and because finding the birth month for each row added quite a bit of additional run time, I removed it from the final analysis.

Column Eight and Nine — height and weight

Renamed: Yes-height_cm and weight_kg

I added two columns that express height in feet/inches and weight in pounds because those units are easier for me to digest. Often when watching matches someone will discuss a rikishi’s weight in kg and that just doesn’t mean anything to me. I either have to do some estimates using mental math or pull up a calculator, so having these conversions handy is just a time saver.
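The conversions themselves are simple arithmetic. A sketch, with function names that are illustrative rather than taken from the notebook:

```python
CM_PER_INCH = 2.54
LB_PER_KG = 2.20462


def cm_to_feet_inches(height_cm):
    """Convert centimetres to (feet, inches), rounded to the nearest inch."""
    # Round the total first so a result like 5 ft 12 in can never appear.
    total_inches = round(height_cm / CM_PER_INCH)
    return divmod(total_inches, 12)


def kg_to_lb(weight_kg):
    """Convert kilograms to pounds, rounded to one decimal place."""
    return round(weight_kg * LB_PER_KG, 1)


print(cm_to_feet_inches(192))  # (6, 4)
print(kg_to_lb(150))           # 330.7
```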

Column Ten, Eleven, and Twelve — prev, prev_w, prev_l

Renamed: Yes-previous_rank, previous_wins, previous_losses

What I really love about sumo is the amount of pageantry and build up associated with a basho. There is a very methodical and progressive build-up throughout each day, and equally throughout the entire tournament, that gives it a really epic feeling.

While this build-up is fun for fans I imagine it takes a psychological toll on the rikishi. A given match can last only seconds but thinking about their upcoming match takes around 24 hours. There is a lot of silent downtime, a lot of time to get psyched up…or psyched out.

If a wrestler fails to get at least 8 wins they are demoted (the number of ranks they drop is determined by the Sumo Association and can vary based on the number of losses). There is also a lot of pressure on each wrestler as they are always in danger of dropping rank unless they are Yokozuna. But even for Yokozuna who are not in danger of being demoted if they have enough losses over many basho they are asked to retire.

So the wrestlers grind it out battling one another every other month, for 15 days straight, with constant pressure to be there (if you are injured and can’t fight, that’s a loss). In addition, the Sumo Association doesn’t give anyone an easy ride; they constantly pit winners against one another to put them to the test. If two similarly ranked rikishi are doing well in a given basho, let’s say 5–0 each, it’s likely they will be pitted against one another to define a clear leader early on (one will emerge 6–0, the other 5–1). So if you’re a young upstart down in the low M’s and you are having a good basho, your confidence is probably running pretty high when “bam!” 24 hours before your next match you find out you’re pitted against someone far beyond your rank! What is that going to do to your confidence?

I wanted to capture the psychological impact of how a rikishi was performing (did they just rise to a new rank where they are forced to face a whole new class of wrestler? did they drop from a previous rank, and is worrying about dropping further affecting their judgment?) by using the previous rank, previous wins, and previous losses in my analysis.

For the analysis, the previous rank was modified to match the rank as outlined above.

With the data cleaned and massaged the visualizations were fairly straightforward.

The desired end state of this problem was to provide a given rikishi, by name, and have it pull up some relevant stats and visuals to give some more flavor to that wrestler.

While watching sumo, if you watch the full match and not just clips, there is a lot of time to explore both wrestlers. The rikishi go through a series of rituals as they prepare for their face-off (see image below for a small taste).

During this time what I wanted to have was a place where I could put their name and have it pull up things I typically wonder.

Specifically, I used some of pandas’ built-in visuals and seaborn’s visuals to answer the questions I typically want to know.

Here is an example of my personal favorite wrestler: Hakuho

By entering his name (since it’s a hassle to look for the id) the notebook automatically pulls up the identifier (since names can change) and pulls the data for this wrestler.
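The name-to-id lookup could be sketched like this. The column names rikishi, id, and basho come from the data set description above; the function names and the toy rows are placeholders for illustration.

```python
import pandas as pd


def find_rikishi_ids(banzuke, name):
    """Return every id that has ever appeared under the given ring name."""
    matches = banzuke.loc[banzuke["rikishi"].str.lower() == name.lower(), "id"]
    return sorted(matches.unique().tolist())


def history_for_id(banzuke, rikishi_id):
    """Return all rows for one id, including rows under an earlier name."""
    return banzuke[banzuke["id"] == rikishi_id].sort_values("basho")


# Toy rows for illustration only; names and ids are placeholders.
toy = pd.DataFrame({
    "basho": ["2020.01", "2019.11", "2020.01"],
    "id": [1, 1, 2],
    "rikishi": ["Beta", "Alpha", "Gamma"],
})

ids = find_rikishi_ids(toy, "beta")
print(ids)  # [1]
print(history_for_id(toy, ids[0])["rikishi"].tolist())  # ['Alpha', 'Beta']
```

Returning a list rather than a single value is a cautious choice: it leaves room for a name matching more than one id, which the caller can then resolve.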

Rank History

The first thing I’m always curious about is their rank history. Looking at the official sumo page you can get an idea of this but it’s painful to digest. I want a simple time-series visualization that tells me more of a picture of the relative rankings.

To accomplish this, I used the rank_as_value column that I set up earlier and ordered the graph by basho_year. The time series below gives me a pretty good sense (of what I already knew) that Hakuho climbed really fast through the ranks.

At first, I didn’t like the light blue block of color that demonstrated the rise and fall during a given year and wanted to see every rank laid out by basho. That ended up being way too challenging to digest and I got used to the light blue demonstrating the range of ranks during a given year which is all I really care about.
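For readers who want the per-year range as numbers, a small pandas sketch follows. It assumes a frame holding one rikishi's rows with the basho_year and rank_as_value columns described earlier; the function name and the toy values are placeholders. One caveat: if the chart was drawn with seaborn's lineplot defaults, the shaded band is a confidence interval around the yearly mean rather than the literal minimum and maximum, so the table below is the more direct view of the range.

```python
import pandas as pd


def yearly_rank_range(history):
    """Summarise one rikishi's rank values per year as min, mean, and max."""
    return (
        history.groupby("basho_year")["rank_as_value"]
        .agg(["min", "mean", "max"])
        .reset_index()
    )


# Toy values for illustration only, not taken from the data set.
toy = pd.DataFrame({
    "basho_year": [2004, 2004, 2005, 2005],
    "rank_as_value": [5, 9, 17, 19],
})

summary = yearly_rank_range(toy)
print(summary)
```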

In this image you can see that Hakuho entered the top division in 2004. He had some ups and downs, but by 2006 he was consistently in the top three ranks. He hit the rank of Yokozuna in 2008 and, as described above, will be at that rank until he retires.
