NBA Salary Evaluation & Projection with Python Machine Learning
On December 15, 2020, after consecutive NBA MVP wins, Giannis Antetokounmpo signed the largest contract in NBA history, worth $228 million over five years. This nine-figure contract came after multiple years of success in Milwaukee from the “Greek Freak,” and may result in a future championship. Large deals like this have become a fixture in basketball, and they show no signs of stopping. However, as in all professional sports, certain contracts work out better than others. When these deals are unsuccessful, teams are left paying large sums for players who should be on the bench. Professional teams have an idea of a player’s worth, but to sign him, managers inflate the numbers to make sure his signature goes on the paper and the star plays in their team’s jersey. I believe teams looking to build the best roster must manage their money well, and a player’s statistics should be the basis of his salary. This project uses Python’s machine learning capabilities to create a salary projection from individual player statistics. The projections are then used to evaluate which players were fairly paid, overpaid, and underpaid in the 2019–2020 season.
NBA Salary Cap — History and Explanation
A salary cap in professional sports is used to limit individual teams’ spending to maintain “competitive balance across the league” (Miller, 2018). Before the cap’s inaugural season in 1984, teams could spend as much as they wished to sign players. By implementing the cap, the league has tried to prevent the most financially successful teams from simply buying championships. Under the current Collective Bargaining Agreement (CBA), players receive “between 49 to 51 percent” of the basketball-related income, or BRI.
The BRI is generated from revenue across the league, including tickets, broadcast rights, sponsorships, and many other sources. Due to the large increase in league interest and broadcasting, the cap for the 2020–2021 season will be $109,140,000 (Salary Cap Rumors | Hoops Rumors, 2017). The cap determines the maximum team payroll and forces teams to be strategic about how they distribute pay. If a team exceeds the cap, it is penalized.
To avoid these penalties, teams must either make sacrifices by cutting or trading players or plan ahead for the long term. As many know, a team planning long-term can be thrown off by an injury or a disgruntled player. From Derrick Rose’s injury history derailing his large deal to Kyrie Irving demanding a trade away from LeBron James, things happen and force change. As such, teams need to find players who can fill the holes or slow the leaking, and they must do so within cap constraints. Through advancements in sports analytics, teams can find these “discount” players using machine learning and statistics.
Data Harvesting and Cleaning
Python’s machine learning capabilities made it the most appropriate language for this project, and as such, pandas, NumPy, csv, Matplotlib, and scikit-learn were imported to help shape and process the data efficiently. Specifically, scikit-learn’s library of models provided clear outlines for designing the system. To feed this system, data was collected from NBA Stats, Hoops Hype, and a dataset developed by Chris Davis.
NBAStats.net provides access to the seasonal statistics for every player since 1985. The purpose of this data was to match a player’s statistics to his salary. Theoretically, a player’s pay should be directly derived from his play. Two cleaning actions were applied to this data. The first was to identify each player’s primary position and eliminate the non-primary ones. The downloaded CSV file listed over 26 different positions, and because I believed that position might affect pay, I used Excel to correct the issue. The second was to replace NA values with 0s to avoid errors in the model’s training. The statistics data serve as the independent variables in the projections.
In addition to player statistics, the salary data came from Hoops Hype and a CSV file made by Chris Davis, imported from Data-world. The only change to this data was converting the salaries from strings with a “$” to a numeric type. Combining the salary datasets provided the yearly pay of every NBA player since 1985. The pandas merge function was then used to build a complete year-by-year data frame containing both salaries and player statistics. Once the data frame was complete, I wanted to evaluate trends in the market to gain a logical understanding of the data.
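The cleaning and merging steps described above can be sketched in pandas as follows. This is a minimal illustration, not the project's actual code: the tiny tables, the column names (`player`, `season`, `pos`, `pts`, `salary`), and the "-" separator for multi-position labels are all assumptions, and the position cleanup that was done in Excel is shown here with a pandas equivalent.

```python
import pandas as pd

# Tiny stand-in tables; names and values are made up for illustration.
stats = pd.DataFrame({
    "player": ["Player A", "Player B", "Player C"],
    "season": [2019, 2019, 2019],
    "pos": ["PG", "SF-PF", "C"],
    "pts": [21.4, None, 8.9],
})
salaries = pd.DataFrame({
    "player": ["Player A", "Player B", "Player C"],
    "season": [2019, 2019, 2019],
    "salary": ["$12,500,000", "$3,200,000", "$1,620,564"],
})

# Keep only the first listed position as the primary one (assumed convention).
stats["pos"] = stats["pos"].str.split("-").str[0]

# Replace missing statistics with 0 so model training does not fail on NaN.
stats = stats.fillna(0)

# Strip "$" and thousands separators, then convert the strings to integers.
salaries["salary"] = (
    salaries["salary"].str.replace(r"[$,]", "", regex=True).astype("int64")
)

# An inner join keeps only the player-seasons present in both tables.
merged = stats.merge(salaries, on=["player", "season"], how="inner")
print(merged)
```

Merging on both player and season, rather than on player alone, is what keeps one row per player per year in the combined data frame.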
Exploratory Data Analysis
Exploratory data analysis examines a dataset’s main characteristics using graphs and charts. In the full dataset used for this project, I wanted to understand the distribution of positions and the growth of salaries in the NBA.
The distribution of positions offered insight into the players in the data set and a way to explore the relationship between salary and position. In the data given, several players had two primary positions, such as forward/guard or forward/center. For this description, we will not consider them in the rankings. From the graph above, the largest position group is shooting guards, followed by small forwards. The smallest position group is point guards, followed by centers. With this distribution in mind, I was curious to see which of these positions had the highest median salary.
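A position count like the one described here can be produced with a single pandas call. The sketch below uses a made-up `pos` column and assumes dual-position players are labelled with a "-" separator; both are illustrative assumptions rather than details from the original data.

```python
import pandas as pd

# Made-up position labels; "SF-PF" stands in for a dual-position player.
players = pd.DataFrame(
    {"pos": ["SG", "SG", "SG", "SF", "SF", "PF", "C", "PG", "SF-PF"]}
)

# Leave dual-position players out of the ranking, as in the text.
single = players[~players["pos"].str.contains("-", regex=False)]

# Count players per position, largest group first.
counts = single["pos"].value_counts()
print(counts)
```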
Power forwards have the highest median salary at $2.06 million a year, while point guards have the lowest. While you see guards, such as Ben Simmons, sign contracts that pay over $35 million a year, they are not the typical player at their position. Outstanding players deserve the pay they earn. The median pay for backcourt players is substantially less than for frontcourt players, a difference of roughly $404 thousand. Since the position counts show that there are more frontcourt players in the league than backcourt players, I read this through the law of supply and demand: the larger the number of players available, the lower their median salary. After considering these findings, it’s essential to understand the growth of salaries throughout the years.
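The positional medians and the backcourt/frontcourt gap can be computed with a group-by. The following is a minimal sketch on made-up salaries; the column names and the mapping of PG and SG to backcourt and SF, PF, and C to frontcourt are assumptions for illustration.

```python
import pandas as pd

# Made-up salaries in dollars, two players per position.
df = pd.DataFrame({
    "pos": ["PG", "PG", "SG", "SG", "SF", "SF", "PF", "PF", "C", "C"],
    "salary": [1_000_000, 2_000_000, 1_500_000, 2_500_000, 2_000_000,
               3_000_000, 2_000_000, 4_000_000, 1_000_000, 3_000_000],
})

# Median salary per position, highest first.
median_by_pos = df.groupby("pos")["salary"].median().sort_values(ascending=False)

# Assumed grouping of positions into backcourt and frontcourt.
court = {"PG": "backcourt", "SG": "backcourt",
         "SF": "frontcourt", "PF": "frontcourt", "C": "frontcourt"}
median_by_court = df.groupby(df["pos"].map(court))["salary"].median()

# Gap between the frontcourt and backcourt medians.
gap = median_by_court["frontcourt"] - median_by_court["backcourt"]
print(median_by_pos)
print(gap)
```

The median is used rather than the mean so that a handful of very large contracts does not dominate the comparison.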
The data set contains data from approximately 1980, and since then, NBA player median salaries have grown more than 500%. The increase has come from growing interest in the NBA. While there’s evidence that the NBA’s viewership has declined, the league has shown an appeal to a “larger overall audience than the NFL, especially with younger fans” (Raphael, 2019). In the 2018–2019 season, the NBA set a record for sold-out games for the fifth year in a row, with 760. This trend suggests that the league will only continue to grow, and as such, the salaries will too. This exploration of the NBA’s salary growth and positional salaries has yielded interesting findings, especially the evidence of the law of supply and demand at work in the league. With the exploratory data analysis complete, the evaluation models were built.
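A growth figure of this kind comes from comparing the median salary in the first and last seasons of the data. Here is a small sketch with made-up numbers (chosen so the arithmetic is easy to follow, not taken from the real data set); the `season` and `salary` column names are assumptions.

```python
import pandas as pd

# Made-up salaries for three seasons.
df = pd.DataFrame({
    "season": [1985, 1985, 1985, 2000, 2000, 2000, 2019, 2019, 2019],
    "salary": [300_000, 400_000, 500_000,
               1_000_000, 1_200_000, 1_600_000,
               2_000_000, 2_400_000, 3_000_000],
})

# Median salary per season; groupby sorts the seasons in ascending order.
median_by_year = df.groupby("season")["salary"].median()

# Percentage growth from the first season to the last.
growth_pct = (median_by_year.iloc[-1] / median_by_year.iloc[0] - 1) * 100
print(median_by_year)
print(f"{growth_pct:.0f}% growth")
```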
Building the Model
The purpose of this experiment is to evaluate past salaries using a machine learning algorithm. To accomplish this goal, the model should use only the statistics most highly correlated with salary. That meant first determining these relationships and then training the models to create projections using only those values. The two models used in this project were logistic and linear regressions, chosen because of the theory of a direct relationship between performance and pay. The first step, however, was determining the most highly correlated values.
According to this heatmap, some variables had stronger correlations than others. Based on past experiments, such as my NBA Playoff predictions, I sought correlations above 0.4. While this threshold is weaker than preferred, the highest correlation with salary is only 0.52, for points per game. The categories that clear the threshold should therefore yield the best projections, and they are shown in the heatmap below:
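The selection rule described above, keeping only the statistics whose correlation with salary exceeds 0.4, can be sketched as follows. The data here is synthetic and the column names (`pts`, `min`, `noise`) are hypothetical; only the 0.4 threshold comes from the text.

```python
import numpy as np
import pandas as pd

# Synthetic data: salary depends on points, "noise" is unrelated to it.
rng = np.random.default_rng(0)
n = 200
pts = rng.normal(10, 5, n)
df = pd.DataFrame({
    "pts": pts,
    "min": pts * 2 + rng.normal(0, 3, n),
    "noise": rng.normal(0, 1, n),
    "salary": pts * 500_000 + rng.normal(0, 1_000_000, n),
})

# Pearson correlation of every statistic with salary.
corr_with_salary = df.corr()["salary"].drop("salary")

# Keep only the statistics that clear the 0.4 threshold.
selected = corr_with_salary[corr_with_salary > 0.4].index.tolist()
print(corr_with_salary.round(2))
print(selected)
```

The full matrix returned by `df.corr()` is also what a heatmap would be drawn from.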
After filtering to only these variables, the data frame is 12,749 rows by 17 columns. Following standard data analytics practice, the data was split into a training set and a testing set. Linear and logistic models then produced the projections of each player’s fair-value salary for the 2019–2020 season.
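The split-and-fit step can be sketched with scikit-learn as shown below. This is an illustration on synthetic data, not the project's code: the feature names, the 80/20 split, and the random seed are assumptions. Only the linear model is shown, because scikit-learn's `LogisticRegression` is a classifier that expects discrete labels, and how it was applied to salaries is not described here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the filtered data frame.
rng = np.random.default_rng(42)
n = 300
X = pd.DataFrame({
    "pts": rng.uniform(0, 30, n),
    "min": rng.uniform(5, 38, n),
})
y = 400_000 * X["pts"] + 50_000 * X["min"] + rng.normal(0, 500_000, n)

# Hold out 20% of the rows for testing (the split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training set, then project salaries for the held-out players.
model = LinearRegression().fit(X_train, y_train)
projected = model.predict(X_test)

# R^2 on the test set shows how much salary variance the model explains.
r2 = model.score(X_test, y_test)
print(f"test R^2: {r2:.3f}")
```

Comparing `projected` against `y_test` row by row is what allows each player's actual pay to be set against a projected fair value.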
Last season, the NBA had 353 players under contract. As you can see in the table above, several players are undervalued in their current contracts, and several are overvalued. At the beginning of the project, I posed three questions: which players are fairly paid, which players are underpaid, and which players are overpaid? First, the fairly paid.