To make our dashboard more approachable for newcomers and people with limited knowledge of the coronavirus, we added a chatbot that answers questions about the pandemic.
The data is scraped from the Frequently Asked Questions section of the official website of the Centers for Disease Control and Prevention (CDC) using the requests and BeautifulSoup libraries. It covers 70 questions on general awareness of the 2019 novel coronavirus. The questions and their answers are collected separately and dumped into JSON files, which are then combined into a single data frame.
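As a rough illustration of the aggregation step, the sketch below pairs a JSON list of questions with a JSON list of answers. The in-memory strings, the placeholder contents, and the helper name `build_faq_records` are all assumptions made for this example; the article does not describe the real file names or layout.

```python
import json

# Hypothetical stand-ins for the two scraped JSON dumps. The contents are
# placeholders, not the actual CDC text.
questions_json = '["What is COVID-19?", "How does the virus spread?"]'
answers_json = '["Placeholder answer 1.", "Placeholder answer 2."]'


def build_faq_records(questions_raw, answers_raw):
    """Pair each question with its answer, assuming both lists share one order."""
    questions = json.loads(questions_raw)
    answers = json.loads(answers_raw)
    if len(questions) != len(answers):
        raise ValueError("questions and answers must line up one-to-one")
    return [
        {"question": q.strip(), "answer": a.strip()}
        for q, a in zip(questions, answers)
    ]


faq_records = build_faq_records(questions_json, answers_json)
```

A list of dictionaries like `faq_records` can be passed directly to `pandas.DataFrame(...)` to obtain a two-column data frame.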
For the chatbot, we used a bag-of-words model with TF-IDF vectorization. Text cannot be fed to a model directly; it first has to be converted into feature vectors, and this is where TF-IDF helps. TF-IDF stands for “Term Frequency-Inverse Document Frequency” and assigns each word a score built from two components. Words such as “the” and “is” appear very often in our documents, but they do little to distinguish one question from another. The goal of TF-IDF is to give higher scores to the words that are more informative. “Term Frequency (TF)” counts how often each word occurs in a document, whereas “Inverse Document Frequency (IDF)” downscales the scores of words that occur across many documents.
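To make the two components concrete, here is a minimal standard-library sketch of TF-IDF. It is not the dashboard's code: the tokenizer, the function names, the toy sentences, and the smoothed IDF formula log((1 + n) / (1 + df)) + 1 (one common variant) are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(documents):
    """Compute a smoothed inverse document frequency for every term."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document, however often it repeats.
        doc_freq.update(set(tokenize(doc)))
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def tfidf_vector(text, idf):
    """Weight raw term counts by IDF; terms unseen during fitting are dropped."""
    counts = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


# Made-up toy corpus, used only to show the effect of the weighting.
corpus = [
    "the virus spreads between people",
    "the symptoms include fever",
    "the test is free",
]
idf = fit_idf(corpus)
vector = tfidf_vector(corpus[0], idf)
```

Because “the” occurs in every toy sentence, it receives the lowest possible weight, while a word such as “virus” that occurs in only one sentence is weighted higher.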
Users are unlikely to type a question exactly as it is stored in our corpus; we can expect the meaning to match, but not the wording. To handle this, we use cosine similarity, which measures how similar two texts are regardless of their length. It computes the cosine of the angle between two vectors in a multidimensional space: the closer the value is to 1, the more similar the texts, so the stored question closest to the user's query can be selected.
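The matching step can be sketched as follows. To keep the example short it uses raw term counts as vectors, whereas the approach described above would use TF-IDF weights; the function names and sample questions are hypothetical.

```python
import math
import re
from collections import Counter


def to_vector(text):
    """Turn text into a sparse term-count vector (a stand-in for TF-IDF weights)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(user_query, faq_questions):
    """Return the index and score of the stored question closest to the query."""
    query_vec = to_vector(user_query)
    scores = [cosine_similarity(query_vec, to_vector(q)) for q in faq_questions]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores[best_index]


# Hypothetical stored questions and user input.
faq_questions = [
    "How does the virus spread?",
    "What are the symptoms of COVID-19?",
]
index, score = best_match("what symptoms should I look for", faq_questions)
```

Here the query shares the words “what” and “symptoms” with the second stored question and nothing with the first, so the second question is selected even though the wording differs.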
Another key feature of our dashboard is the prediction of active, recovered, and death cases. The fetched data is continuous and is therefore well suited to regression analysis, which predicts a continuous dependent variable from one or more independent variables. The relationship between the dependent and independent variables is captured by the coefficients in the regression equation.
Since linear regression is a supervised learning method, it needs past data to learn from. We collected data starting from 1 Jan 2020 and supplied the actual values to fit the hyperplane and predict future values for active, recovered, and death cases.
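A minimal sketch of this idea, assuming a single feature (the day index) and ordinary least squares, is shown below. The case counts are synthetic and perfectly linear so the result is easy to check; they are not real figures, and the function names are illustrative rather than taken from the project.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(day, slope, intercept):
    """Evaluate the fitted line at a given day index."""
    return slope * day + intercept


# Synthetic data: day index since the start date and a made-up case count.
days = [0, 1, 2, 3, 4]
cases = [10, 14, 18, 22, 26]

slope, intercept = fit_line(days, cases)
next_day_cases = predict(5, slope, intercept)
```

The slope and intercept are the coefficients of the regression equation; with more than one independent variable the fitted line generalizes to a hyperplane.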
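The same approach was followed for SVM regression. SVM is primarily known as a classifier, but its regression variant, Support Vector Regression (SVR), turns the idea around: instead of keeping points out of the margin, it tries to fit as many points as possible inside a margin of tolerance around the fitted function, which makes it usable for predictive modelling.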
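The margin idea can be illustrated with the epsilon-insensitive loss used in the standard SVR formulation: errors smaller than a tolerance `epsilon` cost nothing, and only points outside that tube are penalized. This is just the loss function, not a full SVR solver, and the `epsilon` values and sample numbers are arbitrary assumptions.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero inside the tolerance tube, linear in the error outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def total_tube_loss(targets, predictions, epsilon=0.1):
    """Sum the per-point losses over a set of predictions."""
    return sum(
        epsilon_insensitive_loss(t, p, epsilon)
        for t, p in zip(targets, predictions)
    )


# Arbitrary illustrative values: the first prediction falls inside the tube,
# the second falls outside it.
inside = epsilon_insensitive_loss(10.0, 10.05, epsilon=0.1)
outside = epsilon_insensitive_loss(10.0, 10.5, epsilon=0.1)
total = total_tube_loss([10.0, 10.0], [10.05, 10.5], epsilon=0.1)
```

A wider tube (larger `epsilon`) lets more points count as fitted, at the price of a coarser prediction.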
Predicting worldwide recovered cases:
Predicting worldwide death cases:
Predicting death cases for the US:
Link to repository: dakshtrehan/Interactive-Covid-19-Dashboard
Link to Dashboard: http://interactivecovid19dashboard.herokuapp.com/
Link to Published Paper: COVID 19 Trend Analysis using Machine Learning Techniques — IJSER Journal Publication
Link to Portfolio: www.dakshtrehan.com