In past posts I have written about how COV19 new cases can be predicted with time series analysis, using both Facebook Prophet and statsmodels. A link to a recent post can be found here:- How Artificial Intelligence is used to predict new cases of Coronavirus in the UK for January 2021 | by Tracyrenee | Artificial Intelligence in Plain English | Dec, 2020 | Medium
In the post I cited, I predicted the COV19 new cases using Facebook Prophet, and the root mean squared error (rmse) of the prediction on the validation set was 8462.
In this post I intend to explore another method of predicting on a univariate time series using only numpy for algebraic equations and predictions, pandas for data manipulation, matplotlib for graphical analysis, and finally sklearn for making predictions.
I created the program in Google Colab because it is a free online Jupyter Notebook that has Python and several libraries already installed. I would like to point out, however, that the most current versions of the libraries are not always installed, so it may be necessary to install those upgrades before the most current functions can be used.
Once I had created the program in the Jupyter Notebook, I imported the main libraries that I would need, namely pandas, numpy and matplotlib.
I then read in the csv file containing the COV19 statistics from the Our World in Data website. This dataset is updated daily, so it needs to be reloaded into the directory it is stored in each day. One observation I have made is that the file gets larger every day. The link to the Our World in Data csv file can be found here:- Coronavirus Source Data - Our World in Data
Because I was looking for the new cases concerning the UK in particular, I selected only the UK’s rows from the dataframe:-
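A minimal sketch of the loading and filtering steps is shown below. It is not the original notebook code: the tiny in-memory csv is a made-up stand-in so the example runs anywhere, and it assumes the file has `location`, `date` and `new_cases` columns, as the Our World in Data download does.

```python
import io
import pandas as pd

# Made-up stand-in for the Our World in Data csv (illustrative values only).
# With the real download, pass the file path to pd.read_csv instead.
sample_csv = io.StringIO(
    "iso_code,location,date,new_cases\n"
    "GBR,United Kingdom,2020-12-01,100\n"
    "GBR,United Kingdom,2020-12-02,120\n"
    "FRA,France,2020-12-01,80\n"
    "FRA,France,2020-12-02,95\n"
)
df = pd.read_csv(sample_csv)

# Keep only the rows that belong to the UK
uk = df[df["location"] == "United Kingdom"].reset_index(drop=True)
print(uk)
```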
I then created a univariate dataframe that contains time series data on the new cases of COV19:-
I then time stamped the date column and set it as the index to enable the time series analysis to take place:-
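The two steps above can be sketched as follows. The small `uk` dataframe and its values are hypothetical placeholders; only the `date` and `new_cases` column names are taken from the Our World in Data layout.

```python
import pandas as pd

# Hypothetical UK rows with more than one statistic per day
uk = pd.DataFrame({
    "location": ["United Kingdom"] * 4,
    "date": ["2020-12-01", "2020-12-02", "2020-12-03", "2020-12-04"],
    "new_cases": [100, 120, 90, 150],
    "new_deaths": [1, 2, 1, 3],
})

# Univariate dataframe: keep the date and the single variable of interest
series_df = uk[["date", "new_cases"]].copy()

# Convert the date strings to timestamps and use them as the index
series_df["date"] = pd.to_datetime(series_df["date"])
series_df = series_df.set_index("date")
print(series_df)
```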
Using matplotlib, I created a graph of the new cases time series. As can be seen in the graph, there have been three surges of COV19 new cases, which have led to three separate lockdowns. The two most recent lockdowns are the result of a UK variant of the virus that is much more infectious than the original version of the disease:-
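A plot of this kind could be produced roughly as below. The series is synthetic (a noisy sine wave), so the shape of the curve says nothing about the real UK data; it only illustrates the plotting calls.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic daily series standing in for the UK new cases
dates = pd.date_range("2020-03-01", periods=300, freq="D")
rng = np.random.default_rng(0)
values = 1000 + 800 * np.sin(np.arange(300) / 40) + rng.normal(0, 50, 300)
series_df = pd.DataFrame({"new_cases": values}, index=dates)

# Line plot of the series against its datetime index
ax = series_df["new_cases"].plot(figsize=(10, 4), title="Daily new cases (synthetic)")
ax.set_xlabel("date")
ax.set_ylabel("new cases")
plt.show()
```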
Pandas has a lag_plot function that draws a scatter plot of a time series against the same data lagged. A lag plot can be graphed to see if the time series data follows any pattern. It is drawn with the value at time t on the x axis and the value at time t plus the lag on the y axis. By drawing a lag plot, patterns such as randomness, seasonality and trends can be searched for. The diagram below is the lag plot that I created on the UK’s COV19 new cases:-
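As a rough illustration, `pandas.plotting.lag_plot` can be called as below. The random-walk series is synthetic and only stands in for the new cases column.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import lag_plot

# Synthetic random walk standing in for the new cases series
rng = np.random.default_rng(0)
series = pd.Series(1000 + np.cumsum(rng.normal(0, 20, 200)))

# Scatter of each value against the value one step later
ax = lag_plot(series, lag=1)
plt.show()
```

Points that hug the diagonal suggest that each value is close to the one before it; a shapeless cloud suggests randomness.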
A correlation matrix is a table that helps to analyse the relationship between variables. Each entry is a correlation coefficient in the range -1 to 1. A value close to 1 represents a strong positive correlation, a value close to -1 represents a strong negative correlation, and a value close to zero represents no linear dependency between the pair of variables. Below is a correlation matrix I created on the data that represents the UK’s new cases. As can be seen, there is a strong correlation between the data at various points in time:-
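One common way to build such a matrix for a single series is to place the series next to a lagged copy of itself and call `DataFrame.corr()`. The sketch below assumes that approach; the smoothly rising values and the `t-1` / `t` column names are illustrative only.

```python
import numpy as np
import pandas as pd

# Smoothly rising synthetic values standing in for new cases
values = pd.Series(np.arange(1, 31, dtype=float) ** 1.5)

# Put yesterday's value (t-1) next to today's value (t)
lagged = pd.concat([values.shift(1), values], axis=1)
lagged.columns = ["t-1", "t"]

# Pearson correlation between the two columns (NaN rows are ignored pairwise)
corr = lagged.corr()
print(corr)
```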
Autocorrelation plots are a commonly used tool for checking randomness in a data set. This randomness is ascertained by computing autocorrelations for data values at varying time lags. The diagram below is an illustration of the autocorrelation plot of COV19 time series new cases data:-
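A sketch of how such a plot can be drawn with `pandas.plotting.autocorrelation_plot` follows, again on a synthetic random walk rather than the real data. `Series.autocorr` gives the same quantity for a single lag.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import autocorrelation_plot

# Synthetic random walk standing in for the new cases series
rng = np.random.default_rng(0)
series = pd.Series(np.cumsum(rng.normal(0, 1, 200)))

# Autocorrelation at every lag, with confidence bands drawn by pandas
ax = autocorrelation_plot(series)
plt.show()

# Autocorrelation at lag 1 as a single number
lag1 = series.autocorr(lag=1)
print(lag1)
```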
Using pandas, it is possible to use autoregression to carry out time series predictions. Autoregression is a time series model that uses observations from previous time steps as input to a regression equation to predict the value of the next time step. An equation for autoregression is found below:-
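The equation image did not survive in this copy of the post. For reference, the standard textbook form of an autoregressive model of order p is given here, which may differ in notation from the original figure. The symbols c (intercept), \varphi_i (lag coefficients) and \varepsilon_t (error term) are the usual conventions, not names taken from the post.

```latex
y_t = c + \varphi_1 y_{t-1} + \varphi_2 y_{t-2} + \dots + \varphi_p y_{t-p} + \varepsilon_t
```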
Because the regression equation stated above uses data from the same input variable at previous time steps, it is referred to as autoregression, or regression of the self. The code below refers to a popular autoregression technique. The rmse for the autoregressive prediction below is 5534, which is 2928 less than the rmse of 8462 achieved when I used Facebook Prophet as the model in my earlier post:-
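The original code is not reproduced in this copy of the post, so the following is only a sketch of one way to do autoregression with numpy alone, not necessarily the technique used above. It assumes a least-squares fit of the lag coefficients and one-step-ahead (walk-forward) predictions on the validation set. The helper names `fit_ar` and `predict_next`, the order p=2 and the synthetic series are all hypothetical.

```python
import numpy as np

def fit_ar(series, p):
    """Least-squares fit of y_t = c + sum_i phi_i * y_{t-i}."""
    y = np.asarray(series, dtype=float)
    n = len(y)
    # Row for target y[t] holds [1, y[t-1], ..., y[t-p]]
    X = np.column_stack(
        [np.ones(n - p)] + [y[p - i:n - i] for i in range(1, p + 1)]
    )
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef  # [c, phi_1, ..., phi_p]

def predict_next(history, coef):
    """One-step-ahead prediction from the most recent p observations."""
    p = len(coef) - 1
    lags = np.asarray(history[-p:], dtype=float)[::-1]  # y[t-1], ..., y[t-p]
    return coef[0] + coef[1:] @ lags

# Synthetic AR(2) series standing in for the new cases data
rng = np.random.default_rng(0)
n = 200
y = np.zeros(n)
for t in range(2, n):
    y[t] = 5 + 0.6 * y[t - 1] + 0.3 * y[t - 2] + rng.normal(0, 1)

train, valid = y[:150], y[150:]
coef = fit_ar(train, p=2)

# Walk forward: predict each day, then reveal the true value
history = list(train)
preds = []
for actual in valid:
    preds.append(predict_next(history, coef))
    history.append(actual)

rmse = np.sqrt(np.mean((valid - np.array(preds)) ** 2))
print("rmse:", rmse)
```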
Once I completed the prediction using only pandas, numpy and matplotlib, I wanted to see what the result would be if I used a function from the sklearn library and see if I could achieve a better result. I therefore created a new dataframe that contained two new columns to represent the data from the previous day and the difference between the two:-
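A dataframe of this shape can be built with `shift`, as sketched below. The column names `prev_day` and `diff` and the values are hypothetical.

```python
import pandas as pd

# Hypothetical new cases indexed by date
series_df = pd.DataFrame(
    {"new_cases": [100.0, 120.0, 90.0, 150.0, 170.0]},
    index=pd.date_range("2020-12-01", periods=5, freq="D"),
)

features = series_df.copy()
# Value from the previous day
features["prev_day"] = features["new_cases"].shift(1)
# Difference between today's value and the previous day's value
features["diff"] = features["new_cases"] - features["prev_day"]
# The first row has no previous day, so drop it
features = features.dropna()
print(features)
```

Note that `new_cases` equals `prev_day` plus `diff` exactly, so if the difference is used as a predictor it would normally need to be lagged as well.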
I then defined the training and validation datasets, and checked their shapes:-
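For a time series the split is usually chronological rather than random. The sketch below assumes an 80/20 split and a single `prev_day` predictor; both are illustrative choices, not figures from the post.

```python
import numpy as np
import pandas as pd

# Synthetic series and its one-day lag
cases = pd.Series(np.arange(100, dtype=float) * 10)
features = pd.DataFrame({"prev_day": cases.shift(1), "new_cases": cases}).dropna()

# Chronological split: the first 80% of rows train, the rest validate
split = int(len(features) * 0.8)
X_train = features[["prev_day"]].iloc[:split]
y_train = features["new_cases"].iloc[:split]
X_valid = features[["prev_day"]].iloc[split:]
y_valid = features["new_cases"].iloc[split:]

print(X_train.shape, y_train.shape, X_valid.shape, y_valid.shape)
```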
Once I had created the training and validation datasets, I selected the model. I tried out several models before I settled on linear regression, which gave me the best accuracy and an acceptable rmse.
Linear regression models are used to describe or predict the relationship between variables. The variable that is being predicted is called the dependent variable. The variables that are used to predict the value of the dependent variable are called the independent variables. In its simplest form, linear regression relates two continuous variables: one independent variable and one dependent variable. The independent variable is the input that is used to calculate the dependent variable or outcome.
The equation for linear regression is stated below:-
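The equation image did not survive in this copy of the post. The standard form of simple linear regression is given here for reference, and its notation may differ from the original figure. The symbols \beta_0 (intercept), \beta_1 (slope) and \varepsilon (error term) are the usual conventions, not names taken from the post.

```latex
y = \beta_0 + \beta_1 x + \varepsilon
```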
Having selected the model, I fitted sklearn’s LinearRegression() estimator to the training data. I achieved 95.88% accuracy using this model.
When I predicted on the validation set, I achieved 85.89% accuracy and an rmse of 5703, which is higher than the rmse of 5534 from the pandas autoregression technique that I had used earlier:-
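The fit, score and prediction steps can be sketched as below, on synthetic data with a single hypothetical `prev_day` predictor. For a regressor, sklearn's `score` method returns the coefficient of determination (R²) rather than a classification accuracy, which is worth keeping in mind when reading percentage figures like the ones above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic random walk standing in for the new cases series
rng = np.random.default_rng(0)
cases = pd.Series(1000 + np.cumsum(rng.normal(0, 20, 200)))
features = pd.DataFrame({"prev_day": cases.shift(1), "new_cases": cases}).dropna()

# Chronological 80/20 split
split = int(len(features) * 0.8)
train, valid = features.iloc[:split], features.iloc[split:]

# Fit on the training rows only
model = LinearRegression()
model.fit(train[["prev_day"]], train["new_cases"])

# R^2 on the training and validation sets
train_r2 = model.score(train[["prev_day"]], train["new_cases"])
valid_r2 = model.score(valid[["prev_day"]], valid["new_cases"])

# Predictions and root mean squared error on the validation set
preds = model.predict(valid[["prev_day"]])
rmse = np.sqrt(mean_squared_error(valid["new_cases"], preds))
print(train_r2, valid_r2, rmse)
```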
Below is the graph I plotted to compare the actual values, depicted by a blue line, to the predicted values, which are depicted by the red dots:-
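A comparison plot in this style could be drawn as follows. The `actual` and `predicted` arrays are synthetic placeholders for the validation values and the model output.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic placeholders for the validation values and the predictions
rng = np.random.default_rng(0)
actual = 1000 + np.cumsum(rng.normal(0, 20, 40))
predicted = actual + rng.normal(0, 15, 40)

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(actual, color="blue", label="actual")   # blue line
ax.plot(predicted, "ro", label="predicted")     # red dots
ax.set_xlabel("validation day")
ax.set_ylabel("new cases")
ax.legend()
plt.show()
```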
I plotted a second graph to plot the time series of the actual values as compared to the predicted values:-
In conclusion, the pandas autoregression method of predicting the COV19 new cases in a time series achieved a lower rmse (5534) than both sklearn’s LinearRegression() (5703) and Facebook Prophet (8462). It just goes to show that sometimes simpler methods of prediction can be better.
The code for this post can be found in its entirety in my personal GitHub account, the link of which is here:- COV19/COV19_UK_pandas,_numpy,_stats_&_sklearn.ipynb at main · TracyRenee61/COV19 (github.com)