Note:

- The location represents where a player is standing when the shot is thrown, and it is represented by x-y coordinates.
- The X coordinate is measured in feet and represents the distance from the center of the court, length-wise. -47 represents the baseline of the offensive team’s end. 47 represents the baseline of the defending team’s end.
- The Y coordinate is measured in feet and represents the distance from the basket, width-wise. -25 represents the right side of the court, 25 represents the left side of the court (for someone facing the offensive basket).
- In the target column, empty values indicate that the corresponding shot or free throw was made. Since we only care about scenarios where there is a rebound, we will need to drop these rows.
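As a quick sanity check on that coordinate convention, a location is on the court only if it falls within those bounds (the helper below is purely illustrative, not part of the original pipeline):

```python
def on_court(x, y):
    """Return True if an (x, y) location lies inside the court bounds."""
    return -47 <= x <= 47 and -25 <= y <= 25

print(on_court(0, 0))    # center court: True
print(on_court(50, 0))   # beyond a baseline: False
```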

## Data Cleaning

To begin with, let's remove the undesirable rows:

```python
# remove the rows where the shot or free throw was made
train = train[train['f.oreb'].notna()]
```
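On a toy frame, this filter behaves as follows (the values are made up for illustration; `NaN` stands in for a made shot with no rebound):

```python
import pandas as pd

# toy data: NaN in 'f.oreb' marks a made shot, i.e. no rebound to predict
toy = pd.DataFrame({'f.oreb': [1.0, None, 0.0]})
toy = toy[toy['f.oreb'].notna()]
print(len(toy))  # 2 — the made-shot row is gone
```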

Next, match the rebounding player's id with his position in his team and his team's role (offensive/defensive):

```python
# target_columns is a list containing the input columns' names
target_columns = []
for event in ['off', 'def']:
    for i in range(1, 6):
        target_columns.append('playerid_' + event + '_player_' + str(i))

# find which on-court column matches the rebounder's id in each row
reb_player_id_df = train[target_columns].eq(train['reb_player_id'], axis=0)
reb_player_position_df = reb_player_id_df.idxmax(1).where(reb_player_id_df.any(1)).dropna()

# encode all players on court:
# 1~5 means a player is an offensive one while 6~10 means a defensive one
position_code = {
    'playerid_off_player_1': 1,
    'playerid_off_player_2': 2,
    'playerid_off_player_3': 3,
    'playerid_off_player_4': 4,
    'playerid_off_player_5': 5,
    'playerid_def_player_1': 6,
    'playerid_def_player_2': 7,
    'playerid_def_player_3': 8,
    'playerid_def_player_4': 9,
    'playerid_def_player_5': 10
}
output = reb_player_position_df.apply(lambda x: position_code[x])

# reset the index
output = output.reset_index(drop=True)
```
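The `eq`/`idxmax`/`where` combination is the subtle part of the matching step. Here is the same trick on a two-row toy frame (the column names and ids are made up):

```python
import pandas as pd

# which column, if any, equals the per-row target value?
df = pd.DataFrame({'a': [10, 11], 'b': [20, 21]})
target = pd.Series([20, 99])

match = df.eq(target, axis=0)                            # boolean mask per cell
pos = match.idxmax(axis=1).where(match.any(axis=1)).dropna()
print(pos.tolist())  # ['b'] — row 0 matched column 'b'; row 1 matched nothing
```

`idxmax` alone would report the first column even for rows with no match, which is why the result is masked with `where(match.any(axis=1))` before dropping the unmatched rows.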

Normalized data usually performs better in machine learning because it reduces the influence of outliers and of differing feature scales. Since both the x and y coordinates have a fixed range, we can apply a min-max normalizer:

```python
# scale both coordinates into the [0, 1] range
train[[col for col in location_columns if '_y_' in col]] = (25 - train[[col for col in location_columns if '_y_' in col]]) / (25 - (-25))
train[[col for col in location_columns if '_x_' in col]] = (47 - train[[col for col in location_columns if '_x_' in col]]) / (47 - (-47))
```
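Note that this variant is inverted relative to the textbook `(x - min) / (max - min)` form: it maps the maximum coordinate to 0 and the minimum to 1, which is harmless for the models used here. A quick check of the y scaling at its endpoints and midpoint:

```python
# the y coordinate under (25 - y) / (25 - (-25))
for y in (25, 0, -25):
    print(y, (25 - y) / (25 - (-25)))
# 25 -> 0.0, 0 -> 0.5, -25 -> 1.0
```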

Now the data is ready to go!

## Model selection

I tried a suite of models that are expected to perform well in probability prediction, including Logistic Regression, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Gaussian Naive Bayes, and Multinomial Naive Bayes.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# define models
models = [LogisticRegression(n_jobs=-1), LinearDiscriminantAnalysis(),
          QuadraticDiscriminantAnalysis(), GaussianNB(), MultinomialNB()]
```

Cross-validation:

```python
# evaluate each model one by one
# and store their names and log loss values
names, values = [], []
for model in models:
    # get a name for the model
    name = type(model).__name__[:15]
    scores = evaluate_model(train, LabelEncoder().fit_transform(output), model)
    # output the results
    print('>%s %.3f (+/- %.3f)' % (name, np.mean(scores), np.std(scores)))
    names.append(name)
    values.append(scores)
```

The results show that Linear Discriminant Analysis outperforms all of its counterparts, so I selected it as the final model.

```python
from joblib import dump

# fit the selected model on the full training set, then save it
LDA = LinearDiscriminantAnalysis().fit(train, LabelEncoder().fit_transform(output))
dump(LDA, 'LDA.joblib')
```