Beautiful Soup
Now we will use the Beautiful Soup library to parse the response from our HTTP request and retrieve the batting averages of the cricket players. We will then load that information into a pandas dataframe and save it to a CSV file.
Installing the Beautiful Soup Library:
Before we move on, we must install the Beautiful Soup library with the following command:
pip install beautifulsoup4
Install HTML Parser:
We must then install a parser for the HTML that we retrieved from our HTTP request. We will install the lxml parser using the following command:
pip install lxml
Importing Beautiful Soup library:
We will then import the Beautiful Soup library:
from bs4 import BeautifulSoup
Creating a Beautiful Soup Object:
Now that we have our source variable from above, we can pass it into our BeautifulSoup constructor to create a BeautifulSoup object below:
soup = BeautifulSoup(source, 'lxml')
Note: We passed in both the HTML that we retrieved from the text attribute of our response object and our HTML parser. We then assigned this BeautifulSoup instance to the variable soup.
If we print this soup object, we will see the HTML from the webpage. However, to format it into a more readable form, we can use the prettify method:
print(soup.prettify())
Notice that we now can clearly see which HTML tags are nested within each other.
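To see this on a small scale (using a minimal, made-up HTML snippet rather than the actual webpage), here is how prettify lays out nested tags, one per line and indented by depth:

```python
from bs4 import BeautifulSoup

# A minimal, made-up HTML snippet for illustration
html = "<html><body><div><p>Hello</p></div></body></html>"
soup = BeautifulSoup(html, 'lxml')

# prettify() returns the document with one tag per line,
# indented according to how deeply each tag is nested
print(soup.prettify())
```

The indentation makes the parent-child relationships between tags immediately visible.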
Accessing Tags in a Beautiful Soup Object:
HTML is organized into tags. There are two ways to access a tag and its content from the soup object that we created: 1) using the dot operator (similar to accessing an object’s attribute), or 2) using the find method.
Using the dot operator:
For example, if we want to access everything in the body tags from our webpage, we would use the following:
soup.body
Note: this will return both the tags and the content within those tags.
However, the issue with using the dot operator is that it’ll only return the first tag it encounters within the HTML with the specified tag name. For example, if we use soup.div, it’ll return the very first div tag and its contents. Therefore, it is difficult to specify which tag and its contents we are after using the dot operator. That’s where the find method comes in.
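A quick sketch (again with made-up HTML, not our real page) shows this limitation in action:

```python
from bs4 import BeautifulSoup

# Made-up HTML containing two div tags
html = "<html><body><div id='first'>one</div><div id='second'>two</div></body></html>"
soup = BeautifulSoup(html, 'lxml')

# The dot operator only ever reaches the first matching tag,
# no matter how many divs the page contains
print(soup.div)       # <div id="first">one</div>
print(soup.div.text)  # one
```

The second div is simply unreachable this way, which is why we need find when the tag we want is not the first of its kind.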
Using the find method:
With the find method, we can pass in multiple arguments that specify exactly which tag we are seeking, using tag attributes such as id or class. We first pass in the name of the tag, and then we can specify its attributes, such as a specific class or id. For example, if we want the tag that contains the batting averages table on our webpage, we can right-click on the table, click Inspect, and see that our table is contained within a table tag whose class attribute equals engineTable.
So to access the first occurrence of a table tag with the class of engineTable, we can use:
soup.find('table', class_='engineTable')
This will return the first occurrence of a table tag with the class engineTable, along with all of its contents. Notice that we used the class_ parameter (instead of class) because class is a reserved keyword in Python.
In order to find all instances of a tag with the specified attributes and not just the first occurrence, we would need to use the find_all method. The find_all method will return a list of all the tags that match our arguments. We can then loop over that list and extract specific information from it. However, for our example, the find method works since we are only interested in the first instance of that tag with that specific class.
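The difference is easy to see with a small made-up example containing two tables that share a class:

```python
from bs4 import BeautifulSoup

# Made-up HTML with two tables sharing the same class
html = """
<table class="engineTable"><tr><td>first table</td></tr></table>
<table class="engineTable"><tr><td>second table</td></tr></table>
"""
soup = BeautifulSoup(html, 'lxml')

# find returns only the first match...
first = soup.find('table', class_='engineTable')
print(first.td.text)  # first table

# ...while find_all returns a list of every match, which we can loop over
for table in soup.find_all('table', class_='engineTable'):
    print(table.td.text)
```

On our actual webpage, the first engineTable table happens to be the one we want, so find is sufficient.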
Retrieving The Information:
For this tutorial, we will instead target the tbody tag to retrieve the batting averages, since it is a bit cleaner to work with.
table_body = soup.find('tbody').text
Note: We used the text attribute to only extract the text without the HTML tags and assigned it to the variable table_body.
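To see what the text attribute does (on made-up markup standing in for our real table), note that it drops the tags but keeps the text, including the line breaks that appear between tags in the original HTML source:

```python
from bs4 import BeautifulSoup

# Made-up markup; the line breaks between the tags are part of the
# document itself, which is why they survive in the extracted text
html = """<table><tbody>
<tr><td>Bradman</td></tr>
<tr><td>Sobers</td></tr>
</tbody></table>"""
soup = BeautifulSoup(html, 'lxml')

# .text strips the tags but keeps the text content and surrounding whitespace
print(repr(soup.find('tbody').text))
```

Those leftover line breaks are exactly what we will exploit in the next step to split the string back into rows and columns.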
Cleaning Up Our String:
If we look at our table_body variable, we can see that it is a very large string that contains the contents of our first tbody tag from our soup object. We are essentially done with the web scraping process at this point. We will use our knowledge of string methods to turn this string object into something that we can input into a pandas dataframe.
\n represents a line break. We can see that at the very beginning and end of the string we have two line breaks, or \n\n. We can also see that the entries within each row of our table are separated by a single line break, or \n. Lastly, each row is separated from the next by three line breaks, or \n\n\n. Using this knowledge, we can convert this string into an object that we can pass to the pandas dataframe method to create a pandas dataframe. This object can be a list of lists, with each inner list representing a row in our dataframe.
First, let’s remove the two line breaks at the beginning and end of our string using slice notation.
table_body = table_body[2:-2]
We will then use the string split method and split on the three line breaks (\n\n\n) to separate our string into a list of strings, with each element of the list corresponding to the entries of one row. We will assign this list to the variable table_list.
table_list = table_body.split('\n\n\n')
table_list is now a list of strings, one string per row. We can now loop through this list and apply the split method to each element, this time splitting on a single line break (\n), to create a list of lists. Each inner list will contain the values of one entire row, with each element corresponding to the entry of one column.
for index, element in enumerate(table_list):
    table_list[index] = element.split('\n')
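Putting the cleaning steps together, here is the whole pipeline run on a small made-up stand-in for table_body (two rows of two entries each, rather than the real scraped string):

```python
# Made-up stand-in for table_body: two rows of two entries each,
# framed by two line breaks and separated by three line breaks
table_body = "\n\nBradman\n99.94\n\n\nSobers\n57.78\n\n"

# Drop the two line breaks at each end
table_body = table_body[2:-2]

# Split into one string per row
table_list = table_body.split('\n\n\n')

# Split each row string into its individual column entries
for index, element in enumerate(table_list):
    table_list[index] = element.split('\n')

print(table_list)  # [['Bradman', '99.94'], ['Sobers', '57.78']]
```

The result is a list of lists ready to be passed to pandas to build our dataframe.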