When we start a new data analysis or machine learning project, there are several tasks to do before diving into the code.
Collecting the data is one of the most important. There are multiple options for getting a dataset (after scoping the project). We can find a dataset on the internet: the web offers thousands of links where data can be downloaded (for free or for a fee) and loaded into your project. The dataset can also be created manually (I play golf, record my stats by hand in a notebook, and then fill in a text file).
In this post I would like to cover a third option: web scraping.
Wikipedia sums up the topic as “extracting data from websites”.
In this story I will explain how I built a web scraping program to collect data about the Coronavirus and scheduled the notebook to get fresh data every day.
The notebook defines a function to collect data from https://www.worldometers.info/coronavirus/. A quick look at the HTML source code lets us isolate the section where the table is built and identify the structure of the per-country detailed statistics [‘Total Cases’, ‘New Cases’, …, ‘Population’].
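For instance, a couple of lines in a scratch cell are enough to confirm that column layout before writing the full scraper (a sketch reusing the same table selector as the function below; the page markup may of course change over time):
import bs4
import requests

page = requests.get('https://www.worldometers.info/coronavirus/',
                    headers={"Accept-Language": "en-US"})
soup = bs4.BeautifulSoup(page.text, "html.parser")
table = soup.find('table',
                  attrs={'class': 'table table-bordered table-hover main_table_countries'})
# print the header cells to check the column order
if table is not None:
    print([th.text.strip() for th in table.find_all('th')])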
Let’s import a few libraries
import bs4
import pandas as pd
import requests
import time
from datetime import datetime
import threading
import numpy as np
The bs4 and threading libraries are the focus of this story. Beautiful Soup offers methods to crawl the HTML source code and isolate the tags that hold the data. The collected data is simply added to a dataframe, which is saved after concatenation with the history (I load the original file into a dataframe, perform the web scraping into a temporary dataframe, and finally merge both dataframes and save the result under the original name).
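As a rough sketch of that load/scrape/merge pattern (using pd.concat here, whereas the version inside the scraping function below uses DataFrame.append; the file name is the one used later in the story):
def merge_with_history(new_rows, filename='covid_data.xlsx'):
    # load the history if it already exists, otherwise start from the fresh rows only
    try:
        history = pd.read_excel(filename, index_col=0)
        merged = pd.concat([history, new_rows], ignore_index=True, sort=False)
    except FileNotFoundError:
        merged = new_rows.copy()
    # overwrite the original file with history + fresh rows
    merged.to_excel(filename)
    return merged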
Read the URL and return content
def get_page_contents(url):
    page = requests.get(url, headers={"Accept-Language": "en-US"})
    return bs4.BeautifulSoup(page.text, "html.parser")
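For example, a quick sanity check of what the function returns (the output depends on whatever the page serves at that moment):
soup = get_page_contents('https://www.worldometers.info/coronavirus/')
print(soup.title.text)            # page title
print(len(soup.find_all('tr')))   # rough number of table rows found in the page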
Scraping function & threading
The code that scrapes the content is defined in a function, which is called through the threading Timer method. I start by creating the scrapping_url() function and setting up a few constants.
def scrapping_url():
    data = []
    url = 'https://www.worldometers.info/coronavirus/'
    root_url = url
Then comes the magic line that schedules the function to run again and again.
    threading.Timer(30.0, scrapping_url).start()
The 30.0 tells the scheduler to start the scrapping_url function every 30 seconds (handy for testing), but in real life I set this parameter to 36000 to collect data every 10 hours. During my analysis I then manage the duplicate rows per day (I did that to be sure to have consistent data and not be too affected by time zones).
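A possible way to manage those duplicates afterwards is to keep only the last scrape per country and per day (a sketch, assuming the 'Location' and 'DateTime' columns that the function creates further down):
covid = pd.read_excel('covid_data.xlsx', index_col=0)
covid['Day'] = pd.to_datetime(covid['DateTime']).dt.date
# keep the most recent scrape of each country for each calendar day
covid = (covid.sort_values('DateTime')
              .drop_duplicates(subset=['Location', 'Day'], keep='last'))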
Dealing with the page content
    soup = get_page_contents(url)
    entries = soup.findAll('div', class_='maincounter-number')
    rat = [e.find('span', class_='').text for e in entries]
    table = soup.find('table',
                      attrs={'class': 'table table-bordered table-hover main_table_countries'})
    table_body = table.find('tbody')
    rows = table_body.find_all('tr')
rows contains the table rows. With a simple for row in rows: loop it’s easy to extract the data of each row (cell by cell) and store the result in a list.
    # I don't want to get data from the 1st row of the table
    dont_do_1_row = False
    for row in rows:
        if dont_do_1_row:
            cols = row.find_all('td')
            d = []
            for c in cols:
                v = c.text.strip()
                d.append(v)
            temp = []
            for a in d:
                if a != '':
                    try:
                        i = int(a.replace(',', ''))
                    except:
                        i = a
                else:
                    i = a
                temp.append(i)
            data.append(temp)
        dont_do_1_row = True
At the end of the loop, the content of the list has been appended to a list of lists, which is transformed into a temporary dataframe and finally merged into the global file.
    colname = ['Rank', 'Location', 'Cases', 'New Cases',
               'Deaths', 'New Deaths', 'Recovered',
               'Actives', 'Serious', 'Case-1M',
               'Deaths-1M', 'Tests', 'Tests-1M',
               'Population', 'Continent',
               'dummy1', 'dummy2', 'dummy3', 'dummy4']
    df = pd.DataFrame(data, columns=colname)
    df['DateTime'] = datetime.now()

    # read the raw data collected so far
    try:
        covid_global = pd.read_excel('covid_data.xlsx', index_col=0)
    except:
        covid_global = None

    # append the last read line(s)
    try:
        covid_fusion = covid_global.append(df,
            ignore_index=True, verify_integrity=False, sort=None)
    except:
        covid_fusion = df.copy()

    # save the file with just the raw data scraped from the website
    covid_fusion.to_excel('covid_data.xlsx')
Call the function and the job is done!
scrapping_url()
- the magic line is the call to threading.Timer (see the sketch after this list for a way to stop it)
- the program works only as long as the structure of the website doesn’t change
- some websites can be very difficult to analyse, and those that are not plain HTML are out of scope
- I imagine that in a modeling project, fresh data can be used to validate model accuracy and trigger some fine tuning
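About that Timer: because scrapping_url re-arms a new Timer at each run, stopping the loop requires keeping a reference to the latest Timer object. A minimal sketch (the wrapper names here are hypothetical, not part of the original notebook):
_timer = None

def scheduled_scraping():
    global _timer
    # re-arm the next run and keep a handle on it so it can be cancelled
    _timer = threading.Timer(36000.0, scheduled_scraping)
    _timer.start()
    scrapping_url()  # assumes the Timer line has been removed from scrapping_url itself

def stop_scraping():
    # cancels the next pending run; the current one, if any, finishes normally
    if _timer is not None:
        _timer.cancel()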