The site has a dropdown for selecting individual tournaments, a set of selectable years for each tournament, and a large embedded chart. Collecting the data taught me a few things and required a fairly complex script.
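import time

from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.support.ui import Select

# driver, dropdown, and tourney_list are assumed to be defined before this excerpt.
# The find_element(s)_by_* locators are Selenium 3 style calls.
results_list = []
for tourney in tourney_list:
    # choose the tournament in the dropdown, then wait for the page to update
    sel = Select(dropdown)
    sel.select_by_visible_text(f'{tourney}')
    time.sleep(5)
    years = driver.find_elements_by_css_selector("text.yearoptions")
    for year in years:
        # load the chart for this year
        year.click()
        time.sleep(5)
        graph = driver.find_element_by_css_selector("div.table")
        rows = graph.find_elements_by_class_name("datarow")
        i = 0
        e = 0
        v = 0
        scorelist = []
        for row in rows:
            player_dict = {}
            player_dict["tournament"] = tourney
            player_dict["year"] = year.text
            # fall back to an empty name when the row has no golfer cell
            try:
                golfer = row.find_element_by_id("col_text1").text
            except NoSuchElementException:
                golfer = ''
            player_dict["golfer"] = golfer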
Selenium (a Python scraping package) includes a simple helper class, Select, for choosing options in a dropdown, so I did not need anything more complex and the scraping turned out to be quite simple in the end. I went on to collect data on about 120 golfers per tournament, for almost every tournament dating back to 2010.
Totaling 59,000 rows, the dataset is large. After pruning tournaments that lack the four key statistics I use as the Xs in the models (mostly tournaments on foreign courses), I am left with 52,000 rows. Each row contains the tournament name, the course for the tournament, the year of the tournament (unfortunately not the exact date), and the golfer’s name as identifiers. This panel structure makes the data well suited to fixed-effects regressions.
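The pruning step can be sketched as follows. This is a minimal illustration, not the script used for the project: it assumes the four statistics are stored in columns named sg_putting, sg_arg, sg_approach, and sg_tee (the names used in the regressor list further down) and that missing values are stored as NaN. The toy data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the scraped dataset.
raw = pd.DataFrame({
    "tournament": ["A", "A", "B", "B"],
    "golfer": ["p1", "p2", "p1", "p2"],
    "sg_putting": [0.5, -0.2, np.nan, 0.1],
    "sg_arg": [0.1, 0.0, np.nan, 0.3],
    "sg_approach": [1.2, -0.4, np.nan, np.nan],
    "sg_tee": [0.3, 0.2, np.nan, 0.4],
})

# Assumed column names for the four strokes gained statistics.
SG_COLS = ["sg_putting", "sg_arg", "sg_approach", "sg_tee"]

# Keep only rows where all four statistics are present.
pruned = raw.dropna(subset=SG_COLS).reset_index(drop=True)
print(len(raw), "->", len(pruned))
```

Dropping on a subset of columns leaves rows intact when some other, unused field is missing.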
The four key statistics in the dataset are the strokes gained measures, which capture how well a golfer performs off the tee, approaching the green, around the green, and putting, relative to the field. Mark Broadie, a professor at Columbia University’s business school, developed the statistic using data provided to academia by the PGA Tour. Here is how strokes gained: off-the-tee is explained:
The number of strokes a player takes from a specific distance off the tee on Par 4 & par 5’s is measured against a statistical baseline to determine the player’s strokes gained or lost off the tee on a hole. The sum of the values for all holes played in a round minus the field average strokes gained/lost for the round is the player’s Strokes gained/lost for that round.
https://www.pgatour.com/stats/stat.02567.html (bottom of the webpage)
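As a small worked example of the definition quoted above: the round value is the sum of the per-hole values minus the field average for that round. The numbers below are invented for illustration, and the function name is hypothetical.

```python
def round_strokes_gained(per_hole_values, field_average):
    """Sum a player's per-hole strokes gained/lost and subtract the field average."""
    return sum(per_hole_values) - field_average


# Hypothetical per-hole values (already measured against the baseline)
# for a handful of par 4s and par 5s in one round.
player_holes = [0.25, -0.5, 0.125, 0.0, 0.375]

# Hypothetical field average strokes gained/lost for the same round.
field_avg = 0.125

sg_off_the_tee = round_strokes_gained(player_holes, field_avg)
print(sg_off_the_tee)
```

A positive result means the player gained strokes on the field off the tee in that round; a negative result means strokes were lost.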
Next, I create binary dummy variables for each course to absorb unobserved factors that vary across courses but stay constant over time. I create the same kind of dummy variables for each year to absorb unobserved factors that vary over time but stay constant across courses.
After making a list of the unique courses and a separate list of the years in the dataset, I use for loops to add new columns (regressors) to the dataframe, fill them with zeros, and then set them to one in the rows that match the corresponding course or year.
# creating columns for each course
for course in courses:
    df1[f'{course}'] = 0
# populating each course column with a 1 for its respective course
for x in courses:
    df1.loc[df1.course == f'{x}', f'{x}'] = 1
# creating columns for each year
for year in years:
    df2[year] = 0
# populating each year column with a 1 for its respective year
for x in years:
    df2.loc[df2.year == x, x] = 1
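The same indicator columns can also be built without explicit loops using pandas' get_dummies. This is an alternative sketch, not the code used for the project; the toy dataframe and its values are hypothetical, and the year is cast to a string so that the column names match the quoted year names in the regressor list.

```python
import pandas as pd

# Hypothetical toy frame with the same 'course' and 'year' columns.
df = pd.DataFrame({
    "course": ["TPC Louisiana", "Sedgefield CC", "TPC Louisiana"],
    "year": [2019, 2020, 2020],
    "score": [-10, -3, 2],
})

# One 0/1 column per unique course and per unique year.
course_dummies = pd.get_dummies(df["course"], dtype=int)
year_dummies = pd.get_dummies(df["year"].astype(str), dtype=int)

# Attach the dummy columns to the original data.
panel = pd.concat([df, course_dummies, year_dummies], axis=1)
print(panel.columns.tolist())
```

The loop version makes each step visible, while get_dummies is shorter and cannot leave a column unpopulated by mistake.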
With 104 courses, 12 years, and the four strokes gained statistics, I now have a total of 120 regressors on which to regress the golfers’ final scores.
I read in the final dataset, and randomize its order, to ensure the models are trained using data from all years and courses.
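import numpy as np
import pandas as pd

dta2 = pd.read_csv("panel_data_timeandcourse.csv")
# assign each row a random group number from 0 to 99, then sort by it
dta2['ML_group'] = np.random.randint(100,size = dta2.shape[0])
dta2 = dta2.sort_values(by='ML_group')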
Using a TVT split (train, validate, test), I create filters that split the randomized dataset by its “ML group” number (below). This leaves roughly 80% of the data to train the model, 10% to validate it, and 10% to test its predictions.
inx_train2 = dta2.ML_group<80
inx_valid2 = (dta2.ML_group<90)&(dta2.ML_group>=80)
inx_test2 = (dta2.ML_group>=90)
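Because the group numbers are random draws, the split is only approximately 80/10/10. The sketch below checks this on simulated data, assuming group numbers from 0 to 99 as above; the seed, the row count, and the variable names are arbitrary choices for the illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
demo = pd.DataFrame({"score": rng.normal(size=10_000)})
demo["ML_group"] = rng.integers(0, 100, size=len(demo))  # values 0..99

# Same thresholds as the filters above.
train = demo.ML_group < 80
valid = (demo.ML_group >= 80) & (demo.ML_group < 90)
test = demo.ML_group >= 90

# Every row falls into exactly one of the three masks.
membership = train.astype(int) + valid.astype(int) + test.astype(int)
assert membership.eq(1).all()

# Share of rows in each part of the split.
shares = {"train": train.mean(), "valid": valid.mean(), "test": test.mean()}
print(shares)
```

The shares drift slightly from run to run unless a seed is fixed, which is usually acceptable for a dataset of this size.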
I designate the final score as the Y (dependent variable) and the 120 regressors discussed above as the Xs (independent variables, abbreviated below).
Y_train2 = dta2.score[inx_train2].to_list()
Y_valid2 = dta2.score[inx_valid2].to_list()
Y_test2 = dta2.score[inx_test2].to_list()

X_train2 = dta2.loc[inx_train2, ['sg_putting', 'sg_arg', 'sg_approach', 'sg_tee',
    'Muirfield Village GC', 'Muirfield Village Golf Club', 'TPC Louisiana',
    'Sherwood Country Club', 'Sedgefield CC', ....... '2016', '2017', '2018', '2019', '2020',
    '2021']]
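Typing out all 120 column names is error prone. One alternative, sketched here on a hypothetical miniature of the panel, builds the list from the data itself. It assumes the dummy columns are named after the values in the course and year columns, with years stored as strings; the names panel, feature_cols, and the toy values are made up for the example.

```python
import pandas as pd

# Hypothetical miniature panel: 4 statistics, 2 course dummies, 2 year dummies.
panel = pd.DataFrame({
    "score": [-5, 1, 3],
    "sg_putting": [0.4, -0.1, 0.0],
    "sg_arg": [0.2, 0.1, -0.3],
    "sg_approach": [1.0, -0.5, 0.2],
    "sg_tee": [0.3, 0.0, -0.2],
    "course": ["TPC Louisiana", "Sedgefield CC", "TPC Louisiana"],
    "year": [2019, 2020, 2020],
    "TPC Louisiana": [1, 0, 1],
    "Sedgefield CC": [0, 1, 0],
    "2019": [1, 0, 0],
    "2020": [0, 1, 1],
    "ML_group": [10, 85, 95],
})

# Assemble the regressor names instead of writing them out by hand.
sg_cols = ["sg_putting", "sg_arg", "sg_approach", "sg_tee"]
course_cols = sorted(panel["course"].unique())
year_cols = sorted(panel["year"].astype(str).unique())
feature_cols = sg_cols + course_cols + year_cols

# Select the training rows and regressors in one step.
inx_train = panel.ML_group < 80
X_train = panel.loc[inx_train, feature_cols]
Y_train = panel.score[inx_train].to_list()
print(X_train.shape, Y_train)
```

The same feature_cols list can then be reused for the validation and test sets, which keeps the column order identical across all three.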