This project was originally my capstone for the Udacity Machine Learning Engineer Nanodegree. I found the dataset on Kaggle.
I am very proud to have completed this project because it challenged my skills not only in Machine Learning Engineering but also in Data Engineering and Software Engineering. Along the way, I learned how to use the Streamlit library in Python to build the whole ML web app. On the web interface, you simply start by choosing your ML model type, then adjust the model's hyperparameters, and finally select your evaluation metrics.
Below is an overview:
- Machine Learning Models: Random Forest Classifier (RF) & Logistic Regression Classifier (LR).
- Hyperparameters: n_estimators for RF and C for LR.
- Evaluation Metrics: Confusion Matrix; Classification Report; Accuracy Score.
As for this post, I will try my best to guide you through my project as a very practical example of using Streamlit to build a web application.
According to the dataset's information page on Kaggle, there are a few particular facts we need to be aware of from the beginning.
- First, we only need the combined dataset for this project. The publisher has kindly combined the other two datasets for us, so Combined_News_DJIA.csv is the one we will be working with.
- Second, the train_test_split step has a special requirement from the publisher, copied below:
For task evaluation, please use data from 2008-08-08 to 2014-12-31 as Training Set, and Test Set is then the following two years data (from 2015-01-02 to 2016-07-01). This is roughly a 80%/20% split.
The dataset contains 27 columns; 25 of them are the top 25 headlines crawled from the Reddit WorldNews Channel. The Label column indicates whether the Dow Jones Industrial Average (DJIA) rose or stayed the same (1) or decreased (0). Each row corresponds to a specific date. In total, there are 1,989 rows in this dataset.
Plotting out the distribution of labels, we can see the data is only slightly uneven.
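A quick sketch of that check, assuming the CSV has been loaded into a dataframe called data (the toy values below are illustrative, not the real counts):

```python
import pandas as pd

# toy stand-in for the loaded Combined_News_DJIA.csv dataframe;
# in the real project: data = pd.read_csv('Combined_News_DJIA.csv')
data = pd.DataFrame({'Label': [1, 0, 1, 1, 0, 1, 0, 1]})

# relative frequency of each class (1 = DJIA rose/flat, 0 = DJIA fell)
distribution = data['Label'].value_counts(normalize=True)
print(distribution)

# a bar chart of these counts is what reveals the slight imbalance:
# distribution.plot(kind='bar')
```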
Data preprocessing steps involved:
- Fill NaN values with medians.
- Clean the text in each news column.
- Combine the news columns into a single column named headlines.
The last two steps can be achieved with a data preprocessing function like the one below:
def create_df(dataset):
    dataset = dataset.drop(columns=['Date', 'Label'])
    dataset.replace("[^a-zA-Z]", " ", regex=True, inplace=True)
    for col in dataset.columns:
        dataset[col] = dataset[col].str.lower()
    headlines = []
    for row in range(0, len(dataset.index)):
        headlines.append(' '.join(str(x) for x in dataset.iloc[row, 0:25]))
    df = pd.DataFrame(headlines, columns=['headlines'])
    # data is the dataset after filling NaNs, defined outside the function scope
    df['label'] = data.Label
    df['date'] = data.Date
    return df
Implementation steps involved:
- Tokenize the texts.
- Build a Machine Learning Pipeline.
- Perform train_test_split.
- Fit the data pipeline.
- Evaluate the Results.
To tokenize the text of the combined headlines column, we need to create another function:
def tokenize(text):
    # remove any character that is not a word character or whitespace
    text = re.sub(r'[^\w\s]', '', text)
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()
    clean_tokens = []
    for token in tokens:
        clean_token = lemmatizer.lemmatize(token).lower().strip()
        clean_tokens.append(clean_token)
    return clean_tokens
This function takes a paragraph of text as input and returns a tokenized list of words as output.
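To illustrate the idea without pulling in NLTK, here is a lightweight, dependency-free stand-in (the function name simple_tokenize is mine; it skips the lemmatization step and just strips punctuation, lowercases, and splits on whitespace):

```python
import re

def simple_tokenize(text):
    # drop punctuation, then lowercase and split on whitespace
    text = re.sub(r'[^\w\s]', '', text)
    return [token.lower().strip() for token in text.split()]

print(simple_tokenize("Dow Jones rises, markets rally!"))
# ['dow', 'jones', 'rises', 'markets', 'rally']
```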
After tokenization, we can start to build our ML pipeline for this project.
I started with the Random Forest Classifier as a benchmark model.
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize, stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier())
])
Since the train/test split is already defined by the publisher, we can simply code it as below:
# separating the data into train and test by date, following the publisher's instructions
train = df[df['date'] < '20150101']
test = df[df['date'] > '20141231']

# selecting features and targets
x_train = train.headlines
y_train = train.label
x_test = test.headlines
y_test = test.label
Fit the data pipeline and evaluate the results using the classification report.
# fit the pipeline
pipeline.fit(x_train, y_train)

# predict the results
y_pred = pipeline.predict(x_test)
print(classification_report(y_test, y_pred))
Our benchmark ML model (RF) achieved an accuracy of 81%.
Alternatively, we can use a Logistic Regression model as the benchmark.
Logistic Regression achieved an accuracy of 82%, outperforming the Random Forest by one percentage point.
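The Logistic Regression code is not shown in the original post, but it is the same pipeline with the classifier swapped. A hedged sketch (the toy headlines, labels, and the max_iter setting are my own; the full project also passes tokenizer=tokenize to CountVectorizer):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

# same structure as the RF benchmark, with LogisticRegression as the classifier
lr_pipeline = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression(C=1.0, max_iter=1000)),  # C is the hyperparameter exposed in the app
])

# tiny illustrative fit (not the real dataset)
texts = ["stocks rally on strong earnings", "markets fall amid recession fears",
         "dow surges to record high", "shares plunge after weak data"]
labels = [1, 0, 1, 0]
lr_pipeline.fit(texts, labels)
print(lr_pipeline.predict(["markets fall sharply"]))
```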
To further improve model performance, I chose GridSearchCV to loop over the parameter space and return the best parameters. Since Logistic Regression somehow does not perform very well after hyperparameter tuning, I will only show how to build a simple GridSearchCV for the Random Forest model.
from sklearn.model_selection import GridSearchCV

# hyperparameter tuning using GridSearchCV
parameters = {
    'vect__ngram_range': ((1, 1), (2, 2)),
    'clf__n_estimators': [50, 100, 150, 200, 250, 300]
}

Grid = GridSearchCV(pipeline, param_grid=parameters, cv=3, verbose=10)
Grid.fit(x_train, y_train)
When the fitting is done, you can return the best parameters with a single line of code.
Grid.best_params_
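Besides best_params_, the fitted grid also exposes best_score_ and best_estimator_. A tiny self-contained demo (toy data and an illustrative parameter grid, not the project's):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# toy data: the label depends on the first feature
X = np.random.RandomState(0).rand(60, 3)
y = (X[:, 0] > 0.5).astype(int)

grid = GridSearchCV(LogisticRegression(), {'C': [0.1, 1.0, 10.0]}, cv=3)
grid.fit(X, y)

print(grid.best_params_)           # the winning hyperparameters (a dict)
print(round(grid.best_score_, 2))  # mean cross-validated accuracy of that candidate
best_model = grid.best_estimator_  # model refitted on all of the training data
```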
After refinement, we can re-evaluate the tuned model's performance with the classification report.
The tuned Random Forest model achieved an accuracy of 84%, a three-percentage-point improvement over the benchmark version.
With the workflow finished in Jupyter Notebook, we now move to Visual Studio to build the ML web app with Streamlit!
We will use the Streamlit library in Python to validate our ML model.
If you haven't heard of Streamlit before, I recommend the guided project tutorial on Coursera as an introduction.
My web app's interface looks like the screenshot below.
It's a little complicated to explain code without showing it, so I will summarize the coding steps in as much detail as possible.
To keep things clear, I break the work into the following steps.
- Import all Python libraries you need to complete the Web app.
- Decompose your workflow into separate functions. For example, load_data() loads the dataset, create_df() cleans and combines the news columns into one headlines column, and tokenize() converts text into a list of words.
- Build your web app interface with Streamlit commands. It sounds difficult, but it's actually extremely easy to understand!
- Finally, you can write your code as if you were in an ordinary Python environment.
Try Streamlit commands to build your interface:
- st.sidebar.title('NLP News Sentimental Analysis') creates a title on the sidebar.
- st.subheader('Confusion Matrix') displays a subheader on the main page.
- st.write("Confusion Matrix ", matrix) displays the confusion matrix results.
If you are interested in how I create my Web app, you are welcome to check the code logic below:
Libraries:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import streamlit as st
import pandas as pd
import numpy as np
import warnings
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import nltk
import re
nltk.download(['punkt', 'wordnet'])
Functions:
- load_data() returns loaded dataset from your local file path.
- create_df(dataset) inputs original dataset & outputs dataset of combined headlines column.
- tokenize(text) inputs text & outputs list of words.
- split(df) inputs dataframe & outputs x_train, x_test, y_train, y_test.
- Vectorize() implemented as a pipeline of CountVectorizer and TfidfTransformer.
Other function for Web app interface:
def plot_metrics(metrics_list):
    if 'Confusion Matrix' in metrics_list:
        st.subheader('Confusion Matrix')
        predictions = model.predict(x_test)
        matrix = confusion_matrix(y_test, predictions)
        st.write("Confusion Matrix ", matrix)
    if 'Classification_Report' in metrics_list:
        st.subheader('Classification_Report')
        predictions = model.predict(x_test)
        report = classification_report(y_test, predictions)
        st.write("Classification_Report ", report)
    if 'Accuracy_Score' in metrics_list:
        st.subheader('Accuracy_Score')
        predictions = model.predict(x_test)
        score = accuracy_score(y_test, predictions)
        st.write("Accuracy_Score: ", score.round(2))
To conclude, we walked through the whole workflow of building an ML project in Jupyter Notebook and finally implemented the model as an ML web app using the Streamlit library.
Random Forest and Logistic Regression models were tested as benchmarks first, and only the Random Forest model was taken through the refinement steps.
Before refinement, the RF model was evaluated with the classification report and achieved an accuracy of 81%. After refinement with GridSearchCV, the RF model improved its accuracy by three percentage points, to 84%.
Finally, as a challenge, we deployed the ML model as a web app. With Streamlit, we only needed to decompose the workflow into functions, and most of those functions could be copied straight from the Jupyter Notebook workspace.
I sincerely hope this article is useful to learners in Data Science & Machine Learning.
Thank you for your time reading this article!
See you next time.