
As a data science novice seeking to improve my skills, I regularly revisit competitions I have previously entered and try to improve my results. One such competition is Analytics Vidhya’s JetRail time series analysis.
There are many ways to predict future figures, such as Random Forest, statsmodels functions and Facebook Prophet. Both statsmodels and Prophet offer ways to check whether a date is a weekend or a holiday, but this can be tricky. For example, I do not know what country JetRail is based in, and an accurate prediction depends on knowing that country's holiday schedule. With this in mind, I decided to assume that JetRail is based in a western country where the work week runs from Monday to Friday and the weekend is Saturday and Sunday.
I have previously written about the JetRail dataset as a univariate time series analysis problem, with the link to this post being found here:- How I solved the JetRail time series problem with FB Prophet | by Tracyrenee | Python In Plain English | Nov, 2020 | Medium
In this post, however, I have converted the univariate dataset to a multivariate dataset in an attempt to improve the accuracy. If you would like to know what happened then please read on.
The problem statement and datasets can be found on Analytics Vidhya’s JetRail competition page, the link being here:- Time Series Forecasting (analyticsvidhya.com)
The .ipynb file for this competition question was created in Google Colab, a free hosted Jupyter Notebook environment that can be used from any computer that has internet access.
The problem statement for this competition question reads as follows:-
“Welcome DataHacker!
Congratulations on your new job! This time you are helping out Unicorn Investors with your data hacking skills. They are considering making an investment in a new form of transportation — JetRail. JetRail uses Jet propulsion technology to run rails and move people at a high speed! While JetRail has mastered the technology and they hold the patent for their product, the investment would only make sense, if they can get more than 1 Million monthly users with in next 18 months.
You need to help Unicorn ventures with the decision. They usually invest in B2C start-ups less than 4 years old looking for pre-series A funding. In order to help Unicorn Ventures in their decision, you need to forecast the traffic on JetRail for the next 7 months. You are provided with traffic data of JetRail since inception in the test file.”
Because many of the libraries I need to solve this question are already installed on Google Colab, I only needed to import those libraries into the program, being pandas, numpy, seaborn, matplotlib, fbprophet and sklearn.
I then loaded and read the datasets into the program, being train, test and sample:-
I decided to convert the univariate time series dataset to a multivariate one by adding a column, “dayofweek”. The pandas dayofweek attribute returns a value from 0 (Monday) to 6 (Sunday) indicating the day of the week on which each observation was recorded:-
I then created an additional column from the index, which is in datetime format. This column is necessary to perform a datetime analysis.
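A minimal sketch of these two steps might look like the following. The toy data, the column names `Datetime` and `Count`, and the date format are assumptions made for illustration, not the actual competition files.

```python
import pandas as pd

# Toy stand-in for the train file; column names and values are assumptions.
train = pd.DataFrame({
    "ID": [0, 1, 2, 3],
    "Datetime": ["25-08-2012 00:00", "26-08-2012 00:00",
                 "27-08-2012 00:00", "28-08-2012 00:00"],
    "Count": [8, 2, 6, 2],
})

# Parse the timestamps and use them as the index.
train["Datetime"] = pd.to_datetime(train["Datetime"], format="%d-%m-%Y %H:%M")
train = train.set_index("Datetime")

# dayofweek is an attribute of the DatetimeIndex: Monday=0 ... Sunday=6.
train["dayofweek"] = train.index.dayofweek

# Copy the index back into an ordinary column for later datetime work.
# The column name "date" is a placeholder chosen for this sketch.
train["date"] = train.index
```

Here 25 August 2012 falls on a Saturday, so the first row gets the value 5.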
I then changed the names of the columns to names that Prophet wants to see when it is training and fitting the data:-
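As a rough illustration, the renaming could be done with `DataFrame.rename`. Prophet requires the timestamp column to be called `ds` and the target to be called `y`; the starting column names below are assumptions, and `add1` is the name used later in this post for the extra variable.

```python
import pandas as pd

# Toy frame; the original column names here are assumed for illustration.
train = pd.DataFrame({
    "date": pd.to_datetime(["2012-08-25", "2012-08-26"]),
    "Count": [8, 2],
    "dayofweek": [5, 6],
})

# "ds" and "y" are the names Prophet looks for; "add1" is the extra variable.
train = train.rename(columns={"date": "ds", "Count": "y", "dayofweek": "add1"})
```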
I created variables, ID_train and id_test, which stored the data train.ID and test.ID respectively. These columns were then dropped from the datasets because they are not needed to carry out the computations:-
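One way this step could be written is sketched below, using toy frames in place of the real files. Only the variable names `ID_train` and `id_test` come from the text; everything else is assumed.

```python
import pandas as pd

# Toy stand-ins for the real train and test frames.
train = pd.DataFrame({"ID": [0, 1, 2], "y": [8, 2, 6]})
test = pd.DataFrame({"ID": [3, 4], "add1": [0, 1]})

# Keep the identifiers so they can be attached to the submission later.
ID_train = train["ID"]
id_test = test["ID"]

# drop() returns new frames, so the saved ID Series are unaffected.
train = train.drop(columns="ID")
test = test.drop(columns="ID")
```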
I plotted a graph of the train dataset because it is important to have a visual representation of how the number of passengers has increased with time:-
I split the train dataframe in two to separate it into training and validation sets. The splitting is based on the date of this time series analysis:-
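A date-based split of this kind can be sketched with a boolean mask on the `ds` column. The cutoff date and the toy data below are hypothetical and are not the values used in my notebook.

```python
import pandas as pd

# Toy daily series standing in for the prepared train frame.
df = pd.DataFrame({
    "ds": pd.date_range("2014-01-01", periods=10, freq="D"),
    "y": list(range(10)),
})

# Hypothetical cutoff: everything before it trains, the rest validates.
cutoff = pd.Timestamp("2014-01-08")
train_part = df[df["ds"] < cutoff]
valid_part = df[df["ds"] >= cutoff]
```

Splitting on the date rather than at random keeps the validation period strictly after the training period, which is what a forecast has to cope with.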
I defined the model, Facebook Prophet. Prophet normally expects only two columns, “ds” and “y”, but it can also take additional regressors, so in this instance I added one, “add1”.
I forecasted on the validation set to obtain yhat:-
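The model definition and the validation forecast could look roughly like the sketch below. It assumes the extra variable is registered with Prophet's `add_regressor` method, and it uses toy data. The function name `fit_and_forecast` is hypothetical, and the final call is left commented out because it needs Prophet installed.

```python
import pandas as pd

def fit_and_forecast(train_df, valid_df):
    """Fit Prophet with one extra regressor and predict on valid_df."""
    # Imported here so the rest of the sketch runs without Prophet installed.
    # Newer releases ship as "prophet"; older ones as "fbprophet".
    try:
        from prophet import Prophet
    except ImportError:
        from fbprophet import Prophet

    model = Prophet()
    model.add_regressor("add1")  # must be registered before fit()
    model.fit(train_df[["ds", "y", "add1"]])
    # The frame passed to predict() must also carry the regressor column.
    forecast = model.predict(valid_df[["ds", "add1"]])
    return forecast[["ds", "yhat"]]

# Toy data: 60 daily observations with day of week as the extra variable.
dates = pd.date_range("2014-01-01", periods=60, freq="D")
data = pd.DataFrame({"ds": dates, "y": list(range(60)), "add1": dates.dayofweek})
train_part, valid_part = data.iloc[:50], data.iloc[50:]

# forecast = fit_and_forecast(train_part, valid_part)  # requires Prophet
```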
I then plotted a graph of the training and validation datasets’ time series analysis to visually illustrate how Prophet has predicted the number of JetRail passengers:-
I then forecast on the test dataset to obtain yhat for that dataset:-
I produced a graph of Prophet’s predictions for the test dataset, and it can be seen that the number of passengers is anticipated to keep rising:-
I prepared the submission from the value, yhat, and put it on a dataframe, which I then converted to a .csv file:-
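A sketch of how such a submission file might be assembled is shown below. The `ID` and `Count` column names, the output file name and the numbers are all assumptions for illustration; the sample submission file defines the real layout.

```python
import pandas as pd

# Toy stand-ins for the saved test IDs and Prophet's forecast frame.
id_test = pd.Series([18288, 18289, 18290], name="ID")
forecast = pd.DataFrame({"yhat": [350.4, 362.9, 371.2]})

# Pair each test ID with its predicted passenger count.
submission = pd.DataFrame({
    "ID": id_test.values,
    "Count": forecast["yhat"].values,
})

# index=False keeps pandas from writing its row index as an extra column.
submission.to_csv("submission.csv", index=False)
```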
When I submitted the predictions to Analytics Vidhya’s solution checker I achieved a score of 365.16, which was less than 1 point better than my previously submitted univariate model. I then converted the predictions to integers, which narrowed the improvement to about half a point over the univariate version.
I reasoned that if the day of the week had improved the predictions, then knowing whether or not the day was a weekend might help further. I added code to create an extra boolean column stating whether the day in question was a weekend, and submitted the amended code to the solution checker. Sadly, this extra data did not improve the model's score but made it worse. The code for this amendment is on my personal Google Colab account, and if anyone wants me to post it, I will be more than happy to.
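For illustration only, a weekend flag of this kind could be derived roughly as follows. This is a sketch under the Monday-to-Friday work-week assumption stated earlier, with toy dates and a hypothetical column name `is_weekend`. It is not the code from the Colab notebook.

```python
import pandas as pd

# Four consecutive days starting on Friday 3 January 2014.
df = pd.DataFrame({"ds": pd.date_range("2014-01-03", periods=4, freq="D")})

# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend.
df["is_weekend"] = df["ds"].dt.dayofweek >= 5
```

Because the weekend flag is fully determined by the day of the week, it adds no information that the day-of-week column does not already carry, which may be part of why it did not help.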
The code for this post can be found in its entirety in my personal GitHub account at the following link:- Jet-Rail-TS/AV_JetRail_Multivariate_Prophet.ipynb at main · TracyRenee61/Jet-Rail-TS (github.com)