There are several steps involved in sentiment analysis:
- Data collection.
- Data analysis.
- Indexing.
- Delivery.
Data Collection
- Public sentiment expressed by consumers is collected from public platforms such as Twitter, Facebook, and so on.
- Opinions, feelings, and behaviors are expressed in different ways depending on the context of writing, the use of slang, and short forms.
Data Analysis
The data analysis process has the following steps:
1. Text Preparation
- Data is extracted and filtered before any analysis is performed.
- Non-textual and otherwise irrelevant content is identified and removed.
2. Sentiment Detection
- Each sentence and word is examined to determine whether it is subjective.
- Sentences with subjective information are retained, and the ones that convey objective information are discarded.
Indexing
- Sentiments can be broadly classified into two groups: positive and negative.
- Each subjective sentence is classified according to whether it expresses a like or a dislike.
Delivery
- Delivery is the last stage of the process.
- The result is the conversion of unstructured data into meaningful information.
- The results are displayed as graphs for better visualization.
In sentiment analysis, we use polarity to identify the sentiment orientation of a written sentence, such as positive, negative, or neutral. Fundamentally, polarity represents the emotion expressed in a sentence.
Based on the rating, the “Rating Polarity” can be calculated as below:
df['Rating_Polarity'] = df['Rating'].apply(lambda x: 'Positive' if x > 3 else('Neutral' if x == 3 else 'Negative'))
Essentially, sentiment analysis finds the emotional polarity in different texts, such as positive, negative, or neutral. There are two different methods to perform sentiment analysis:
- Lexicon-based method
- Machine Learning method
Lexicon-based method
Lexicon-based sentiment analysis calculates the sentiment from the semantic orientation of words or phrases present in a text.
The lexicon-based method has the following ways to handle sentiment analysis:
Dictionary
A dictionary of positive and negative words is created, and each word is assigned a positive or negative sentiment value. Thus, the method builds a dictionary-like schema.
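For illustration, here is a minimal sketch of such a schema as a plain Python dictionary (the words and scores are hypothetical examples, not taken from any published lexicon):
# Hypothetical sentiment lexicon: each word maps to a polarity score
sentiment_lexicon = {
    'love': 5.0,
    'like': 2.0,
    'good': 3.0,
    'bad': -3.0,
    'terrible': -1.5,
}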
Based on the defined dictionary, the algorithm looks through the text, finds all known words, and consolidates their individual scores. Sometimes it also applies grammatical rules such as negation handling or sentiment modifiers.
For instance, applying sentiment analysis to the following sentence using the lexicon-based method:
“I do not love you because you are a terrible guy, but you like me.”
Consequently, it finds the following words in the lexicon-based dictionary:
- love: +5
- like: +2
- terrible: -1.5
Overall sentiment = +5 + 2 + (-1.5) = +5.5
Accordingly, the sentence is scored as expressing positive sentiment (note that this naive lookup ignores the negation in “do not love”; negation is discussed later as a challenge).
The dictionary would process the sentence in the following way.
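Below is a minimal sketch of this lookup, reusing the hypothetical sentiment_lexicon defined earlier; the tokenization is just lowercasing, punctuation stripping, and a whitespace split:
import string

def lexicon_score(text, lexicon):
    # Lowercase, strip punctuation, and split on whitespace
    cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))
    # Sum the scores of every word found in the lexicon
    return sum(lexicon.get(word, 0.0) for word in cleaned.split())

sentence = "I do not love you because you are a terrible guy, but you like me."
print(lexicon_score(sentence, sentiment_lexicon))  # 5.0 + 2.0 - 1.5 = 5.5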
Machine Learning method
The machine learning method can outperform the lexicon-based method, but it requires annotated data sets. It needs a training dataset in which the sentiments are manually labeled, and those labels are specific to the data and domain, so care is needed at prediction time because the model can easily be biased.
If the algorithm has been trained on data about clothing items and is then used to predict food- or travel-related sentiments, it will predict poorly. Therefore, sentiment analysis is highly domain-centric: a model developed for one domain, such as movies or restaurants, will not work for other domains such as travel, news, or education.
The following techniques and algorithms are commonly used in machine-learning-based sentiment analysis:
- Feature extraction.
- Tokenization.
- SVM (Support Vector Machines).
- Naive Bayes.
- MaxEnt (Maximum Entropy).
Feature Extraction
The feature extraction step takes text as input and produces extracted features in forms such as lexico-syntactic, stylistic, syntactic, and discourse-based features. Primarily, it identifies the product aspects that customers are commenting on.
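As a minimal illustration of the simplest case, here is a sketch of bag-of-words feature extraction with scikit-learn’s CountVectorizer (the example reviews are made up):
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["The fabric feels great", "The fit is terrible", "Great fit, great price"]
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(reviews)

print(vectorizer.get_feature_names_out())  # the extracted vocabulary
print(features.toarray())                  # one word-count row per review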
Tokenization
Tokenization is the process of splitting a large body of text into smaller units such as lines or words. It helps in interpreting the meaning of the text by analyzing the sequence of words.
For example:
“This movie is really good.”
After applying tokenization:
[This, movie, is, really, good]
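A quick sketch of the same example with NLTK’s word_tokenize (the punkt tokenizer data is downloaded on first use):
import nltk
nltk.download('punkt')  # tokenizer models, only needed once
from nltk.tokenize import word_tokenize

print(word_tokenize("This movie is really good."))
# ['This', 'movie', 'is', 'really', 'good', '.']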
Note: MaxEnt and SVM tend to perform better than the Naive Bayes algorithm in sentiment analysis use cases; a quick comparison is sketched below.
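For reference, all three classifiers can be compared with scikit-learn on the same bag-of-words features; here is a minimal sketch on a tiny, made-up corpus (LogisticRegression stands in for MaxEnt, since maximum entropy classification is equivalent to multinomial logistic regression):
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

texts = ["I love this dress", "terrible quality", "really good fit", "I hate the color"]
labels = ["Positive", "Negative", "Positive", "Negative"]

for clf in (MultinomialNB(), LinearSVC(), LogisticRegression(max_iter=1000)):
    model = make_pipeline(CountVectorizer(), clf)
    model.fit(texts, labels)
    print(type(clf).__name__, model.predict(["good dress", "terrible fit"]))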
Sentiment analysis is fascinating for real-world scenarios. However, it faces many problems and challenges during its implementation.
Below are some challenges in sentiment analysis:
- It is harder than topical classification, where simple bag-of-words features perform well.
- In many cases, words or phrases express different meanings in different contexts and domains.
Other challenges of sentiment analysis:
- The main challenge in sentiment analysis is the complexity of language.
- Negation has a major influence on the contextual polarity of opinion words and texts. Negation phrases such as never, none, nothing, and neither can reverse the polarity of opinion words (see the sketch after this list).
- Convoluted sentences and complex linguistics, e.g., “Admission to the hospital was complicated, but the staff was very nice even though they were swamped.” Here the sentiment flows negative → positive → implicitly negative.
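One very simple (and admittedly crude) way to handle negation in a lexicon-based scorer is to flip the score of the opinion word that immediately follows a negation term; a sketch under that assumption:
NEGATIONS = {'not', 'no', 'never', 'none', 'nothing', 'neither'}

def score_with_negation(tokens, lexicon):
    score, negate = 0.0, False
    for word in tokens:
        if word in NEGATIONS:
            negate = True  # flip the polarity of the next word
            continue
        value = lexicon.get(word, 0.0)
        score += -value if negate else value
        negate = False
    return score

# 'not' flips the polarity of 'love' (+5.0 in the hypothetical lexicon)
print(score_with_negation(['i', 'do', 'not', 'love', 'you'], {'love': 5.0}))  # -5.0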
These are some problems in sentiment analysis:
- It is challenging to decide which features to use, because they can be words, phrases, or whole sentences.
- How should features be represented? They can be a bag of words, annotated lexicons, syntactic patterns, or paragraph structure.
Before applying any machine learning or deep learning library for sentiment analysis, it is crucial to clean and preprocess the text. Reducing the noise in human-generated text is essential for good accuracy. The data is processed with the help of a natural language processing pipeline.
These steps are applied during data preprocessing (a short sketch follows this list):
- Normalizing words.
- Removing stop words.
- Tokenizing sentences.
- Vectorizing text.
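As a compact sketch of these four steps (assuming NLTK’s stopwords corpus is available; the sections below apply the same ideas to the real dataset):
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

stop_words = set(stopwords.words('english'))

def preprocess(text):
    tokens = text.lower().split()  # normalize and tokenize
    return ' '.join(w for w in tokens if w not in stop_words)  # remove stop words

docs = ["This dress is really good", "The fabric was terrible"]
tfidf = TfidfVectorizer()
vectors = tfidf.fit_transform([preprocess(d) for d in docs])  # vectorize text
print(vectors.shape)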
Nowadays, online shopping is popular for many kinds of products, such as electronics, clothes, and food items. E-commerce sites sell products and give consumers the option to rate them and write comments, which is a handy and important way to gauge a product’s quality. Based on these reviews, other consumers can decide whether to purchase a product. They are also useful to sellers and manufacturers, who can learn how their products are perceived and improve them.
Code implementation in deep learning:
Import all required packages:
import pandas as pd
import numpy as np
import seaborn as sns
import re
import string
from string import punctuation
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.callbacks import EarlyStopping
Read data:
df = pd.read_csv('women_clothing_review.csv')
df.head()
Drop unnecessary columns:
df = df.drop(['Title', 'Positive Feedback Count', 'Unnamed: 0'], axis=1)
df.dropna(inplace=True)
Calculate the Polarity Rating based on the ratings that previous consumers gave to the dresses:
Apply the following rules:
- If the existing rating > 3 then polarity_rating = “Positive”
- If the existing rating == 3 then polarity_rating = “Neutral”
- If the existing rating < 3 then polarity_rating = “Negative”
Code implementation based on the above rules to calculate Polarity Rating:
df['Polarity_Rating'] = df['Rating'].apply(lambda x: 'Positive' if x > 3 else('Neutral' if x == 3 else 'Negative'))
Visualization
Plotting the rating count visualization:
sns.set_style('whitegrid')
sns.countplot(x='Rating', data=df, palette='YlGnBu_r')
Plot the Polarity rating count graph:
sns.set_style('whitegrid')
sns.countplot(x='Polarity_Rating', data=df, palette='summer')
Data Preprocessing
Split the data by polarity and cap the positive class at 8,000 reviews:
df_Positive = df[df['Polarity_Rating'] == 'Positive'][0:8000]
df_Neutral = df[df['Polarity_Rating'] == 'Neutral']
df_Negative = df[df['Polarity_Rating'] == 'Negative']
Oversample the neutral and negative classes (with replacement) to 8,000 rows each, then build the final balanced dataset:
df_Neutral_over = df_Neutral.sample(8000, replace=True)
df_Negative_over = df_Negative.sample(8000, replace=True)
df = pd.concat([df_Positive, df_Neutral_over, df_Negative_over], axis=0)
Text Preprocessing:
def get_text_processing(text):
    # English stop words to filter out
    stpword = stopwords.words('english')
    # Remove punctuation characters
    no_punctuation = [char for char in text if char not in string.punctuation]
    no_punctuation = ''.join(no_punctuation)
    # Drop stop words and rejoin the remaining words
    return ' '.join([word for word in no_punctuation.split() if word.lower() not in stpword])
Apply the method “get_text_processing” into column “Review Text”:
df['review'] = df['Review Text'].apply(get_text_processing)
df.head()
This removes punctuation and English stop words from the review text.
View the review text together with the Polarity_Rating column:
df = df[['review', 'Polarity_Rating']]
df.head()
Apply one-hot encoding to the negative, neutral, and positive labels:
one_hot = pd.get_dummies(df["Polarity_Rating"])
df.drop(["Polarity_Rating"], axis=1, inplace=True)
df = pd.concat([df, one_hot], axis=1)
df.head()
Apply train test split:
X = df["review"].values
y = df.drop("review", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)
Apply vectorization:
vect = CountVectorizer()
X_train = vect.fit_transform(X_train)
X_test = vect.transform(X_test)
Apply term frequency-inverse document frequency (TF-IDF):
tfidf = TfidfTransformer()
X_train = tfidf.fit_transform(X_train)
X_test = tfidf.transform(X_test)
X_train = X_train.toarray()
X_test = X_test.toarray()
Build a Model with Deep Learning
Add different layers to models:
model = Sequential()
# Fully connected hidden layers with dropout for regularization
model.add(Dense(units=12673, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(units=4000, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(units=500, activation="relu"))
model.add(Dropout(0.5))
# Three output units, one per polarity class (Negative, Neutral, Positive)
model.add(Dense(units=3, activation="softmax"))
opt = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"])
# Stop training when the validation loss stops improving
early_stop = EarlyStopping(monitor="val_loss", mode="min", verbose=1, patience=2)
Fit the model:
model.fit(
    x=X_train,
    y=y_train,
    batch_size=256,
    epochs=100,
    validation_data=(X_test, y_test),
    verbose=1,
    callbacks=[early_stop],
)
Evaluation of Model
Evaluation of the model:
model_score = model.evaluate(X_test, y_test, batch_size=64, verbose=1)
print("Test accuracy:", model_score[1])
Prediction of Result
preds = model.predict(X_test)
preds
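model.predict returns one softmax probability per class. Assuming the column order produced by pd.get_dummies (Negative, Neutral, Positive, sorted alphabetically), the predicted labels can be recovered with argmax; a small sketch:
class_names = ['Negative', 'Neutral', 'Positive']  # column order from pd.get_dummies
predicted_labels = [class_names[i] for i in np.argmax(preds, axis=1)]
print(predicted_labels[:10])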
These are some of the famous Python libraries for sentiment analysis:
- NLTK (Natural Language Toolkit).
- SpaCy.
- TextBlob.
- Stanford CoreNLP.
There are many applications where sentiment analysis methods can be applied. Some of these are:
- Market monitoring.
- Keeping track of customer feedback.
- Improving customer support.
- Keeping an eye on competitors.
- Recommendation systems.
- Display of ads on webpages.
- Filtering spam or abusive emails.
- Psychological evaluation.
- Online e-commerce, where customers give feedback.
- Sentiment analysis on social sites such as Twitter or Facebook.
- Understanding viewers’ TRP-related sentiments toward broadcasting channels.
Sentiment analysis aims at extracting sentiment-related knowledge from data, which matters especially now because of the enormous amount of information on the internet. In other words, we can generally use a sentiment analysis approach to understand the opinions expressed in a set of documents.
Sentiment analysis is sometimes referred to as opinion mining, where we can use NLP, statistics, or machine learning methods to extract, identify, or otherwise characterize a text unit’s sentiment content.
Consumers can use sentiment analysis to research products and services before a purchase. Companies can use public opinion to determine how well their products are being accepted in the market.
For example, moviegoers can look at a movie’s reviews and then decide whether to watch it. Perceiving sentiment is natural for humans, and sentiment analysis gives applications a similar ability to understand the opinions in a set of documents. Hence, sentiment analysis is a powerful mechanism that allows applications to understand the underlying subjective nature of a piece of writing, with NLP playing a vital role in the approach.