
3. Import the Libraries
The first thing we need to do is import the most common data science libraries, such as numpy, scipy, and pandas. We also import matplotlib and seaborn so that we can plot graphs to help us visualize the data.
# Common imports
import numpy as np
import pandas as pd
import os
import seaborn as sns
from scipy import stats
import missingno as msno

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import style
sns.set(style='ticks', color_codes=True)
sns.set(style='darkgrid')
import plotly.express as px

# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")
warnings.filterwarnings('ignore')
4. Download and Explore the Data
I saved the dataset to my GitHub repository, which makes it easier for me to download the dataset without having to search for it on my machine. Having a function to download the data is really useful if the data changes frequently. It also saves the trouble of having to copy the dataset to multiple machines.
import urllib.request

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/jpzambranoleon/jpzl-ML/master/"
TRAIN_PATH = os.path.join("datasets", "players_20")
TRAIN_URL = DOWNLOAD_ROOT + "datasets/players_20.csv"

def fetch_train_data(train_url=TRAIN_URL, train_path=TRAIN_PATH):
    os.makedirs(train_path, exist_ok=True)
    csv_path = os.path.join(train_path, "players_20.csv")
    urllib.request.urlretrieve(train_url, csv_path)
The fetch_train_data() function is used to fetch the dataset.
fetch_train_data()
The dataset is downloaded and saved in the files section of Google Colab under the datasets/players_20 folder. Now all that's left to do is load the data into a pandas DataFrame. Luckily, we can write a small function that returns a pandas DataFrame object.
import pandas as pd

def load_train_data(train_path=TRAIN_PATH):
    csv_path = os.path.join(train_path, "players_20.csv")
    return pd.read_csv(csv_path)
I called load_train_data() and saved the returned DataFrame in the player_data variable to make things easier. The DataFrame's head() method shows the top five rows of the data.
player_data = load_train_data()
player_data.head()
I used the shape attribute to get a tuple representing the dimensionality of the DataFrame. From the shape, we can see how many rows and columns there are in the dataset.
player_data.shape

(18278, 104)
As we can see from the tuple, the dataset has 18,278 rows and 104 columns. We don’t need to use all the columns, so it’s best if we truncate the DataFrame to only include the target column and the player stats columns. There are 34 player stats columns so we’ll need to select those.
# player stat features
columns = ['overall', 'attacking_crossing', 'attacking_finishing', 'attacking_heading_accuracy',
           'attacking_short_passing', 'attacking_volleys', 'skill_dribbling', 'skill_curve',
           'skill_fk_accuracy', 'skill_long_passing', 'skill_ball_control', 'movement_acceleration',
           'movement_sprint_speed', 'movement_agility', 'movement_reactions', 'movement_balance',
           'power_shot_power', 'power_jumping', 'power_stamina', 'power_strength', 'power_long_shots',
           'mentality_aggression', 'mentality_interceptions', 'mentality_positioning', 'mentality_vision',
           'mentality_penalties', 'mentality_composure', 'defending_marking', 'defending_standing_tackle',
           'defending_sliding_tackle', 'goalkeeping_diving', 'goalkeeping_handling', 'goalkeeping_kicking',
           'goalkeeping_positioning', 'goalkeeping_reflexes']

player_df = player_data[columns]
player_df.head()
I used the info() method on the new DataFrame to get a quick description of the data.
player_df.info()
We can see from the output that every column has 18,278 non-null values, which means we don't have to worry about missing values.
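Since missingno was imported at the top, we can also confirm this visually. The snippet below is just a quick sketch of my own using msno.bar(), which draws one bar per column whose height reflects the number of non-null values:

# Quick visual check with missingno: every bar should reach the full
# 18,278 rows if nothing is missing.
msno.bar(player_df, figsize=(16, 5))
plt.show()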
To get a statistical description of the data, we can use the describe() method.
player_df.describe()
Just for good measure, let's use the shape attribute again to see the dimensionality of the new DataFrame.
player_df.shape

(18278, 35)
Another great way to get a feel of the data is to plot a histogram of every numerical attribute. To plot a histogram of every attribute, just call the hist() method on the DataFrame. The hist() method requires the matplotlib library. You can specify the number of bins for each histogram and you can also specify the size of the figure. I chose to go with 40 bins and a figure size of (27,17), but all of that is completely up to you. Histograms are useful because they show the number of instances (y-axis) that have a given value range (x-axis).
player_df.hist(bins=40, figsize=(27,17))
plt.show()
As we can see from the figure, each histogram has a value range of 0 to 100. This is a good indication that every attribute has the same scale, which means we don't have to do any feature scaling transformations.
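If you want to double-check that claim numerically, here is a minimal sketch (my own addition, not part of the original walkthrough) that prints the smallest and largest value across all 35 columns:

# All player stats should fall roughly in the 0-100 range,
# so no feature scaling is needed.
print("Global min:", player_df.min().min())
print("Global max:", player_df.max().max())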
For good practice, I like to be completely sure that there are no missing values in the data, so I use isna().any(), which returns True or False for each column depending on whether it contains missing values.
player_df.isna().any()
As we can see from the output, the return value is False on every attribute which means that there are no missing values. We are ready to continue!!!
5. Split the Data into Training and Testing Sets
After exploring the data, the next thing we need to do is split the data into training and testing sets. We could split the data ourselves, or we can let a function do it for us. Luckily, Scikit-Learn's train_test_split() function can easily split the data into the sets we need. I set test_size=0.2 so that 20% of the data is stored in the test set; the other 80% of the data is stored in the training set. The training set should have more of the data because it is the set the model learns from. The testing set is the set we use to validate the model.
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(player_df, test_size=0.2, random_state=42)

print("Length of training data:", len(train_set))
print("Length of testing data:", len(test_set))
print("Length of total data:", len(player_df))

Length of training data: 14622
Length of testing data: 3656
Length of total data: 18278
6. Look for Correlations
You can compute the standard correlation coefficient between each pair of attributes using the corr() method. Looking at all the correlations between each attribute will take up too much time, so let’s just see how much each attribute correlates with the overall value.
fifa_df = train_set.copy()

corr_matrix = fifa_df.corr()
corr_matrix['overall'].sort_values(ascending=False)
The correlation coefficient ranges from -1 to 1. When the correlation coefficient is close to 1, it means that there is a strong positive correlation; for example, overall tends to go up when movement reactions goes up. When the correlation coefficient is close to -1, it means that there is a strong negative correlation (i.e., the opposite of a strong positive correlation).
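Since seaborn is already imported, a heatmap of the full correlation matrix is another quick way to spot strongly correlated attributes. This is just an optional sketch of my own; the exact styling is up to you:

# Visualize the 35x35 correlation matrix; darker red means a stronger
# positive correlation, darker blue a stronger negative one.
plt.figure(figsize=(18, 14))
sns.heatmap(corr_matrix, cmap='coolwarm', center=0)
plt.show()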
To visualize the correlation between attributes, we can use the pandas scatter_matrix() function. This function plots every numerical attribute against every other numerical attribute. Since there are 35 numerical attributes, let's just look at the five most promising ones to keep things manageable.
from pandas.plotting import scatter_matrix

attributes = ['overall', 'movement_reactions', 'mentality_composure', 'power_shot_power', 'mentality_vision']

scatter_matrix(fifa_df[attributes], figsize=(15,12))
plt.show()
Strongly correlated attributes tend to show a clear linear pattern, and the most promising attribute for predicting overall is movement_reactions. As we can see from the figure, movement_reactions and overall have a very strong linear relationship.
Let’s take a closer look at the correlation scatterplot.
fifa_df.plot(kind='scatter', x='movement_reactions', y='overall', alpha=0.1, color='red')
plt.show()
The correlation is indeed very strong; you can clearly see an upward trend.
Now, it’s time to prepare the data for ML algorithms.
7. Prepare the Data for ML Algorithms
I split the training set and testing set into separate sets of features and targets. y_train and y_test contain the target values (the target is overall), while X_train and X_test contain the feature values (every other attribute that correlates with the target).
y_train = train_set['overall']
X_train = train_set.drop('overall', axis=1)
y_test = test_set['overall']
X_test = test_set.drop('overall', axis=1)
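As a quick sanity check (my own addition), the shapes of the four sets should line up with the 80/20 split we made earlier:

# Features should have 34 columns (35 minus the target); targets are 1-D.
print(X_train.shape, y_train.shape)   # (14622, 34) (14622,)
print(X_test.shape, y_test.shape)     # (3656, 34) (3656,)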
8. Select and Train a Model
Let’s first train a Linear Regression model using Scikit-Learn’s LinearRegression() function:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
Let’s measure the regression model’s RMSE on the whole training set using Scikit-Learn’s mean_squared_error() function:
from sklearn.metrics import mean_squared_error

y_predictions = lin_reg.predict(X_train)
lin_mse = mean_squared_error(y_train, y_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

2.493574884183225
Not a bad score!!! It seems that a Linear Regression model could do the trick. However, what we saw from the scatter matrix is that not every attribute has a linear relationship with the target. For nonlinear relationships we need a more powerful model, so let's train a Decision Tree Regression model. This is a powerful model, capable of finding complex nonlinear relationships in the data. Let's use Scikit-Learn's DecisionTreeRegressor() function:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(X_train, y_train)
Again, let’s measure the model’s RMSE:
y_predictions = tree_reg.predict(X_train)
tree_mse = mean_squared_error(y_train, y_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

0.0
What?! An RMSE of 0.0!!! Is this a perfect model?
Actually, it could be that this model has badly overfit the data. That’s why in every ML project it is important that you perform model validation, but we’ll get to that later.
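To see the overfitting for yourself before we get to proper cross-validation, one option is to hold out a slice of the training data and compare the tree's error on the data it was fit on against its error on the held-out slice. This is only an illustrative sketch of my own, not a step from the original project, and the exact numbers will vary:

from sklearn.model_selection import train_test_split

# Carve a small validation slice out of the training set just for illustration.
X_sub, X_val, y_sub, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

tree_check = DecisionTreeRegressor(random_state=42)
tree_check.fit(X_sub, y_sub)

# RMSE on the data the tree memorized vs. data it has never seen.
print("Train RMSE:", np.sqrt(mean_squared_error(y_sub, tree_check.predict(X_sub))))
print("Validation RMSE:", np.sqrt(mean_squared_error(y_val, tree_check.predict(X_val))))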
For now, let’s just train our final model which is a Random Forest Regression model.
We can train a Random Forest Regression model using Scikit-Learn’s RandomForestRegressor() function:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor()
forest_reg.fit(X_train, y_train)
Let’s look at the RMSE for this model:
y_predictions = forest_reg.predict(X_train)
forest_mse = mean_squared_error(y_train, y_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

0.4774070457668509
It’s a better score than our Linear Regression model, but it’s not as good as our Decision Tree model. Then again, it could be that our Decision Tree model has badly overfit the data, which is why we always need to validate each model before selecting the best one. An easy way to validate a model is Scikit-Learn’s K-fold cross-validation feature. Cross-validation randomly splits the training set into 10 distinct subsets called folds, then trains and evaluates each model 10 times, picking a different fold for evaluation every time and training on the other 9 folds. The result is an array containing the 10 evaluation scores. One detail to keep in mind: Scikit-Learn’s cross-validation expects a utility function (greater is better) rather than a cost function, which is why the code scores with the negative MSE and takes the square root of -scores to get RMSE values. We then look at the mean validation score before choosing the best model.
Let’s look at the cross-validation scores of our Linear Regression model:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(lin_reg, X_train, y_train, scoring='neg_mean_squared_error', cv=10)
lin_reg_scores = np.sqrt(-scores)

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard Deviation:", scores.std())

display_scores(lin_reg_scores)

Scores: [2.48985397 2.509888 2.50910312 2.4642167 2.46189444 2.51445854 2.48370267 2.4734443 2.48382446 2.59475861]
Mean: 2.4985144826450054
Standard Deviation: 0.03662607159427437
Now let’s look at the cross-validation scores of our Decision Tree model:
scores = cross_val_score(tree_reg, X_train, y_train, scoring='neg_mean_squared_error', cv=10)
tree_scores = np.sqrt(-scores)

display_scores(tree_scores)

Scores: [2.03976196 2.10100389 2.24933491 2.18751588 2.13289653 2.1158318 2.10529916 2.08391009 2.30935171 2.15395796]
Mean: 2.147886387907302
Standard Deviation: 0.07682311253910523
Now that’s a more believable score than the perfect 0.0 we got on the training set. The Decision Tree still beats our Linear Regression model, but let’s look at the cross-validation score of our Random Forest model:
scores = cross_val_score(forest_reg, X_train, y_train, scoring='neg_mean_squared_error', cv=10)
forest_scores = np.sqrt(-scores)

display_scores(forest_scores)

Scores: [1.29156727 1.24883845 1.2625419 1.288613 1.22436926 1.30612055 1.25621857 1.24755531 1.30399247 1.28656714]
Mean: 1.2716383911151843
Standard Deviation: 0.026074298378757566
Wow!!! This might just be the best score we’ve seen. Random Forests look very promising. This is the model that we’ll use for our project.
9. Fine-Tune The Model
Now that we have selected our model, we need to fine-tune it. Fine-tuning means fiddling with the hyperparameters until we find a great combination of values. Doing this manually is very tedious and can take a long time if the dataset is large; you may not even have time to explore many combinations. Instead of doing everything by hand, Scikit-Learn provides an easier option.
Instead of fiddling with the hyperparameters manually, you can use Scikit-Learn’s GridSearchCV to search for you. All you need to do is tell it which hyperparameters you want it to experiment with and what values to try, and it will use cross-validation to evaluate all the possible combinations of hyperparameter values. The code below shows the parameter grid I experimented with.
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
grid_search.fit(X_train, y_train)
This grid tells Scikit-Learn to evaluate 3 × 4 + 2 × 3 = 18 combinations of hyperparameter values, training each one 5 times (once per fold), for 90 rounds of training in total. When it’s done training, you can get the best combination of parameters like this:
grid_search.best_params_

{'max_features': 8, 'n_estimators': 30}
You can also get the best estimator directly:
grid_search.best_estimator_

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse', max_depth=None,
                      max_features=8, max_leaf_nodes=None, max_samples=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=30, n_jobs=None,
                      oob_score=False, random_state=None, verbose=0, warm_start=False)
Of course, you can experiment with any hyperparameter values, but I think this will work for now.
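One thing worth noting (my own addition, not part of the original walkthrough) is that the fitted Random Forest exposes a feature_importances_ attribute, which makes it easy to see which player stats drive the overall prediction:

# Rank the attributes by how much the best estimator relies on them.
importances = grid_search.best_estimator_.feature_importances_
for score, name in sorted(zip(importances, X_train.columns), reverse=True)[:5]:
    print(round(score, 3), name)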
10. Evaluate the Model on the Test Set
After tweaking the model on the training set, it’s finally time to evaluate the final model on the test set. The process is pretty similar, but this time we use the best estimator found by the grid search as the final model. Again, we need to measure the performance of the final model. The code to do that is below:
final_model = grid_search.best_estimator_
final_predictions = final_model.predict(X_test)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse

1.135337798136083
The score is the best we have seen so far, not to mention it was measured on data the model has never seen before. If I’m being honest, I didn’t expect the model to do this well, but we can’t just assume the RMSE will always be around this value. We want an idea of how precise this estimate is. For this, you can compute a 95% confidence interval for the generalization error using scipy.stats.t.interval():
confidence = 0.95
squared_errors = (final_predictions - y_test)**2

np.sqrt(stats.t.interval(confidence, len(squared_errors)-1,
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors)))

array([1.10181563, 1.16789818])
We can be roughly 95% confident that the test RMSE lies between 1.10 and 1.17.
We are pretty much done with the model, but just out of curiosity, let’s see how the predictions compare to the actual targets. To do this, just select five instances from the test set and use the predict() method on the final model:
some_data = X_test.iloc[:5]
some_label = y_test.iloc[:5]

print("Predictions:", final_model.predict(some_data))
print("Labels:", list(some_label))

Predictions: [64.2 73.8 68.76666667 68.13333333 63.4 ]
Labels: [64, 74, 69, 68, 63]
The predictions are pretty accurate. I’m surprised that the model performed this well. I’m very satisfied with the results.