Personality Classification from US Election Tweets using Machine Learning

Personality refers to an individual’s characteristic patterns of behavior, thought, and emotion. In this project, our goal is to train a personality prediction model on the MBTI dataset and then use this model to analyze the relationship between personality and political views. Personality recognition is an important topic because it can help identify a person’s behavioral and emotional patterns, which is useful for job profiling, matrimonial profiling, and similar applications. Our project shows one way in which personality traits can be used to generate useful inferences, for example by allowing candidates to focus more on the personality groups that do not like them. Similar work could examine the effect of personality in the hiring process and in bullying and harassment detection.

The US elections are over and Joe Biden has been elected president, so we were curious to find out the personalities of those who favored Joe Biden and of those who were against Donald Trump.

The whole project was divided into two phases:

Phase 1

In this phase, we used the standard MBTI dataset available on Kaggle to train our Personality Prediction Model (let us call it Model 1).

Phase 2

In this phase, Model 1 was improved by using a different classification approach (let us call it Model 2). In the improved approach, personality is predicted by four binary models, one for each dimension: I/E, S/N, T/F, and J/P. User tweets were extracted with the Twitter API based on common US Election 2020 hashtags such as #USElections, #Biden, #Trump, #MAGA, and #USElections2020. Sentiment analysis was then performed on the extracted tweets to separate positive- and negative-sentiment tweets for both Trump and Biden. Finally, these positive- and negative-sentiment user tweets were given as input to Model 2 to predict the users’ personalities.

We now turn to the MBTI dataset, which contains 8600 rows of data. Each row contains a person’s personality type and a section of each of the last 50 tweets they have posted. Personality type refers to the psychological classification of different types of individuals. The Myers-Briggs Type Indicator (MBTI) describes psychological preferences in how people make decisions and perceive the world. This system divides people into 16 distinct personality types by assigning each person one letter from each of 4 categories:

· Introversion (I) or Extroversion (E)

· Intuition (N) or Sensing (S)

· Thinking (T) or Feeling (F)

· Judging (J) or Perceiving (P)

The dataset has to be pre-processed before it can be used for data analysis and model building. After manually inspecting the data, we found that cleaning requires removing URLs, digits, and special characters. We did not see emoticons in this data, so no emoticon handling is applied. After data cleaning, we convert the data to lower case and remove stop words. Next, we convert chat words to their full forms (AFAIK -> As Far As I Know) and perform spelling correction using a Python library. We then lemmatize the data, which is now ready for vectorization. Term frequency-inverse document frequency (TF-IDF) is used to compute the word representation. Frequent words, i.e. those that appear in more than 50% of the documents, are ignored, as are infrequent words, i.e. those that appear in less than 10% of the documents. This step reduces the dimension of the feature vectors, so the model is trained on fewer but more relevant features.

Data Pre-Processing
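The cleaning and TF-IDF steps described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the stop-word and chat-word lists are tiny placeholders, spelling correction and lemmatization are left out, and the IDF uses the plain log(N/df) form, which may differ from the weighting used in the project. The 50% and 10% document-frequency cut-offs are the ones given in the text.

```python
import math
import re
from collections import Counter

# Tiny illustrative word lists (assumptions); real pipelines use much larger ones.
STOP_WORDS = {"a", "an", "the", "is", "are", "at", "to", "and", "of"}
CHAT_WORDS = {"afaik": "as far as i know", "imo": "in my opinion"}


def clean_text(text):
    """Remove URLs, digits and special characters, lowercase, drop stop words, expand chat words."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[^A-Za-z\s]", " ", text)            # digits and special characters
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    expanded = []
    for token in tokens:
        # Replace a chat word by its full form, token by token.
        expanded.extend(CHAT_WORDS.get(token, token).split())
    return expanded


def tfidf(docs, max_df=0.5, min_df=0.1):
    """TF-IDF over tokenized documents.

    Terms found in more than max_df or fewer than min_df of the documents are ignored.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(t for t, c in df.items() if min_df <= c / n <= max_df)
    rows = []
    for doc in docs:
        tf = Counter(doc)
        rows.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, rows


tokens = clean_text("AFAIK the results at https://t.co/abc are 100% great!!!")
vocab, rows = tfidf([["trump", "rally"], ["biden", "rally"], ["rally", "vote"], ["vote", "biden"]])
```

In the toy corpus, "rally" occurs in three of the four documents, so it exceeds the 50% cut-off and is dropped from the vocabulary.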

The class distribution of the entire dataset has been visualized in the form of a bar plot given below, showing the number of instances in a particular MBTI class.

Class-wise Distribution of users

In the above bar plot, the x-axis denotes the MBTI personality types and the y-axis shows the number of instances (users) in each class. The exact number of instances in each class is reported in the table below. Both the bar plot and the table show that the classes are unevenly distributed, which means that the dataset is highly imbalanced. Classes containing the trait I (Introversion) generally have more instances than those containing E (Extroversion). Similarly, there are more users with trait N (Intuition) than with trait S (Sensing).

Exact Distribution of users and classes

The skew observed in the dataset is undesirable because it biases the model towards predicting the majority class for any unseen instance. We therefore need to address this class imbalance by resampling the dataset. For this dataset, we used random oversampling, which duplicates minority-class instances to reduce the imbalance. This duplication removes the large skew in our dataset. The figure below shows the bar plot after applying random oversampling to our dataset.

Class-wise Distribution after Oversampling
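Random oversampling as described above can be sketched in a few lines. This is a minimal illustration, assuming every class is grown to the size of the largest class; the function name, the seed, and the sample data are placeholders, not the project’s code.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until all classes match the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw duplicates with replacement from the original items of this class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Placeholder data: four INFP users, one ESTJ user, one ENTP user.
posts = ["p1", "p2", "p3", "p4", "p5", "p6"]
types = ["INFP", "INFP", "INFP", "INFP", "ESTJ", "ENTP"]
balanced_posts, balanced_types = random_oversample(posts, types)
class_counts = Counter(balanced_types)  # every class now has four samples
```

Because the added rows are exact copies, no new information is created; the duplicates only change how often the minority classes are seen during training.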

After pre-processing the dataset, we divided our dataset into training features and corresponding labels (ground truth). We train different machine learning models and then test their performance on the test set. We have used two different approaches to train our Personality Prediction Model which have been illustrated in the following sub-sections.

Basic Approach

In the basic approach, we use a single machine learning model to perform the classification of different personality types. The figure below shows the flow of training a model using this approach. The raw tweets in the MBTI dataset are pre-processed followed by feature extraction and training the model on them.

Baseline Model for Personality Prediction

Improved Model1 for Personality Prediction: Baseline with oversampling of data

We applied three standard machine learning models: SVM, Logistic Regression, and Multi-Layer Perceptron (MLP). We were also able to improve the accuracy by oversampling the dataset.

Improved Approach

The improved approach for training a Personality Prediction model is an enhancement of the commonly used basic approach described above. Here, personality is predicted by four binary models, one for each dimension: I/E, S/N, T/F, and J/P. The figure below shows the flow of training a model using this approach. After the raw tweets are preprocessed and features are extracted from them, the four binary models are trained independently on the preprocessed data. The trained models are then evaluated on the original MBTI dataset, and their outputs are combined to produce the final predicted personality class.

Improved Model 2 for Personality Prediction: 4 Binary Classifiers with oversampling of data
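To make the label handling of the four-classifier approach concrete, the sketch below splits a 16-class MBTI type into four binary labels (one per classifier) and combines four binary predictions back into a type. The function names and the 0/1 encoding are assumptions for illustration; the classifiers themselves are not shown.

```python
# One binary classifier per letter position; 0 means the first letter of the pair.
DIMENSIONS = ["IE", "NS", "TF", "JP"]


def type_to_labels(mbti_type):
    """Map e.g. 'ENFJ' to four 0/1 labels, one per dimension."""
    return [pair.index(letter) for pair, letter in zip(DIMENSIONS, mbti_type.upper())]


def labels_to_type(labels):
    """Combine four binary predictions back into one of the 16 MBTI types."""
    return "".join(pair[label] for pair, label in zip(DIMENSIONS, labels))


labels = type_to_labels("ENFJ")          # [1, 0, 1, 0]
restored = labels_to_type([1, 0, 1, 0])  # "ENFJ"
```

Each binary model then faces a two-class problem instead of a sixteen-class one, which is typically easier to learn from a skewed dataset.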

We applied the same standard machine learning models (SVM, Logistic Regression, and Multi-Layer Perceptron) to train each of the four binary models. Oversampling the dataset again improved the accuracy. The results of all three models are summarized in the table below.

Comparison of Results for Personality Prediction Model

The results from both the basic and improved approaches show that the best accuracy, 97.25% on the MBTI dataset, was achieved by the improved approach: four binary MLP models trained with random oversampling. We use this trained Personality Prediction Model with four binary classifiers to make predictions on the extracted US Election 2020 tweets.

In this work, we targeted the U.S. Elections 2020 because they were one of the most trending topics on Twitter. Our goal was to analyze the personality types of people who like or hate each candidate. The figure below shows the proposed architecture. We first extracted US Election tweets and applied sentiment analysis to them to determine each user’s sentiment towards a particular candidate. Based on these sentiments, we used the trained Personality Prediction Model to predict the personalities of the users in each sentiment group.

The architecture of sentiment-based user (US Election) Personality Prediction

Twitter data can be mined through the Twitter API, for example using the Tweepy library. Two steps are needed before tweets can be extracted: obtaining Twitter API keys and connecting to the Twitter API. We first created a Twitter Developer Account and requested the keys needed to make a successful connection. Once Twitter approves the request, it provides a consumer key, consumer secret, access key, and access secret, which are needed to connect to the Twitter API. We then extracted tweets based on common US Election 2020 hashtags such as #USElections, #Biden, #Trump, #MAGA, and #USElections2020. During extraction, we filtered users on the basis of their total number of tweets, their number of followers, and their tweet language. For each tweet, we stored the username, followers, total number of tweets, text, and hashtags. We ran this extraction for 2 weeks to create a database of around 36000 tweets related to the US elections.
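The user filter applied during extraction can be sketched as a simple predicate over the stored fields. Everything specific here is an assumption: the record layout, the threshold values, and the choice of lower bounds are hypothetical, since the article names the criteria (total tweets, followers, language) but not the values used.

```python
from dataclasses import dataclass, field


@dataclass
class Tweet:
    # Field names mirror the stored attributes listed above; the layout is illustrative.
    username: str
    followers: int
    total_tweets: int
    lang: str
    text: str
    hashtags: list = field(default_factory=list)


# Hypothetical thresholds; the values used in the project are not stated.
MIN_FOLLOWERS = 10
MIN_TOTAL_TWEETS = 100


def keep_tweet(tweet, lang="en"):
    """Keep tweets from sufficiently active accounts that are written in the target language."""
    return (
        tweet.lang == lang
        and tweet.followers >= MIN_FOLLOWERS
        and tweet.total_tweets >= MIN_TOTAL_TWEETS
    )


sample = [
    Tweet("user_a", 250, 1200, "en", "Go vote! #USElections2020", ["USElections2020"]),
    Tweet("user_b", 3, 15, "en", "first tweet #Biden", ["Biden"]),
    Tweet("user_c", 900, 4000, "es", "Elecciones #Trump", ["Trump"]),
]
kept = [t.username for t in sample if keep_tweet(t)]
```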

We split the extracted US Election tweets into separate databases for Biden and Trump using simple keyword matching. We then applied the same preprocessing techniques used for the MBTI dataset to remove unwanted information from the tweets. We experimented with different sentiment analysis models on our data and found that Flair performs better than the other two models, VADER and TextBlob. The figure below shows a sample of Trump-related tweets; all of them have highly negative sentiment towards Trump, which only Flair predicts correctly. VADER and TextBlob perform poorly because they are rule-based: they calculate text sentiment from a list of lexical features (e.g. words) labeled as positive or negative according to their semantic orientation, rather than using machine learning. TextBlob ignores words that are unknown to it, while VADER is optimized for social media data and gives better results than TextBlob. The main drawback of rule-based sentiment analysis is that these methods look only at individual words and largely ignore the context in which they are used. Recent research has found that word embeddings perform better than traditional word representations.

In text embeddings, similar words are represented by vectors that are close to each other. Flair is a simple Python package that uses this form of text representation to predict text sentiment and, as a result, provides better predictions.

Subjective Evaluation of results by different Sentiment Analysis models
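The weakness of purely word-level scoring can be seen in a toy example. The scorer below is not VADER or TextBlob; it is a deliberately simplified stand-in with an invented lexicon, and real libraries such as VADER add heuristics (for example for negation and intensifiers) that this sketch leaves out.

```python
# Toy polarity lexicon, invented for illustration only.
LEXICON = {"great": 1.0, "good": 0.5, "love": 1.0, "bad": -0.5, "terrible": -1.0, "hate": -1.0}


def lexicon_score(text):
    """Average the polarity of the known words; unknown words are ignored."""
    words = text.lower().split()
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


negated = lexicon_score("not a great president")
plain = lexicon_score("what a great president")
# Both sentences get the same positive score, because only "great" is in the lexicon
# and the surrounding words (including "not") carry no weight.
```

A model that reads the sentence as a sequence, as embedding-based classifiers do, has at least the chance to treat the two sentences differently.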

Sentiment analysis told us which users had positive and which had negative sentiment towards each of the candidates. To test our Personality Prediction model on these users, we extracted around 100 tweets for each of them. We again used the obtained Twitter keys to extract tweets from users who have public profiles and more than 100 tweets, filtering out retweets and non-English tweets.

We now have four databases, corresponding to positive and negative sentiment towards each candidate, with around 100 tweets for each user. We then used our saved Personality Prediction MLP model, developed using the improved approach, to predict the personalities of these users.
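The way tweets end up in four user groups can be sketched as follows. The keyword lists are hypothetical (the article only says that simple keyword matching was used), the sentiment labels are assumed to come from a separate sentiment model, and skipping tweets that mention both candidates is a choice made here for illustration.

```python
from collections import defaultdict

# Hypothetical keyword lists for the keyword-matching step.
CANDIDATE_KEYWORDS = {
    "biden": ("biden",),
    "trump": ("trump", "maga"),
}


def candidate_of(text):
    """Return the single candidate a tweet mentions, or None if it mentions none or both."""
    lowered = text.lower()
    hits = [name for name, keywords in CANDIDATE_KEYWORDS.items()
            if any(keyword in lowered for keyword in keywords)]
    return hits[0] if len(hits) == 1 else None


def build_databases(tweets):
    """Group usernames by (candidate, sentiment).

    tweets: iterable of (username, text, sentiment), with sentiment 'positive' or 'negative'.
    """
    databases = defaultdict(set)
    for username, text, sentiment in tweets:
        candidate = candidate_of(text)
        if candidate is not None:
            databases[(candidate, sentiment)].add(username)
    return databases


# Placeholder tweets with sentiment labels already attached.
labelled = [
    ("u1", "Four more years #MAGA", "positive"),
    ("u2", "Trump lost, finally", "negative"),
    ("u3", "So proud of Biden tonight", "positive"),
    ("u4", "Biden vs Trump debate was a mess", "negative"),
]
databases = build_databases(labelled)
```

The users in each of the four groups are the ones whose recent tweets are then passed to the personality model.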

Running the extracted tweets through the trained personality prediction model gave the distribution of personality types among users with positive and negative sentiment towards each candidate.

Distribution of Personality types of users with positive sentiment for Trump

Distribution of Personality types of users with negative sentiment for Trump

Distribution of Personality types of users with positive sentiment for Biden

Distribution of Personality types of users with negative sentiment for Biden

The figures above show bar plots of the personality types which have a positive and negative sentiment towards Donald Trump and bar plots of the personality types which have a positive and negative sentiment towards Joe Biden.

We then analyzed the predictions by focusing on individual aspects of personality. We looked at the separate characteristics that make up each personality type, which are described below:

· How do you prefer to take in information? S vs. N

· How do you prefer to make decisions? T vs. F

· Are you outward or inward focused? E vs. I

· How do you prefer your outer life? J vs. P

Statistical Analysis of Model results

The figure above contains pie charts showing, for both candidates, the different aspects of personality among positive- and negative-sentiment users. We observed that both Biden and Trump are liked by I, S, and T personality types, which are associated with being bold, fact-minded, practical logicians. Biden is disliked by I, S, and J types, who are generally responsible and warm. Trump is disliked by E, N, and J types, which are associated with popular and organized people who are suited to be leaders.
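The per-dimension shares behind such pie charts can be computed directly from the predicted types. The sketch below is a minimal illustration; the function name is hypothetical and the example predictions are invented placeholder data, not results from the project.

```python
from collections import Counter

PAIRS = ["EI", "SN", "TF", "JP"]


def dimension_shares(predicted_types):
    """For each dimension, return the fraction of users holding each letter."""
    total = len(predicted_types)
    if total == 0:
        raise ValueError("need at least one predicted type")
    # Every MBTI type holds exactly one letter from each pair, so letter counts suffice.
    counts = Counter(letter for mbti_type in predicted_types for letter in mbti_type.upper())
    return {pair: {letter: counts[letter] / total for letter in pair} for pair in PAIRS}


# Invented placeholder predictions for one sentiment group.
shares = dimension_shares(["ISTJ", "ISTP", "ENTJ", "ISFJ"])
```

Comparing these shares between the positive and negative groups of a candidate gives the kind of per-trait reading reported above.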

Rajat Agarwal (https://in.linkedin.com/in/rajatag27)

Implementation of the Basic Personality Prediction Model, better feature extraction approaches such as GloVe and BERT, and Sentiment Analysis of US Election Tweets.

Harshit Singh Chhabra (https://www.linkedin.com/in/harshit-chhabra)

Extraction and preprocessing of US Election tweets, extraction and preprocessing of users’ last 100 tweets, and testing the saved Personality Prediction model on the extracted user tweets to generate results.

Kartikey Arora (https://www.linkedin.com/in/kartikey-arora-714b23140)

MBTI data preprocessing and analysis, resampling of the MBTI dataset to handle the class imbalance problem, and implementation of the Improved Personality Prediction Model.

1. Professor: https://www.linkedin.com/in/tanmoy-chakraborty-89553324/ (website: faculty.iiitd.ac.in/~tanmoy/)

2. Teaching Fellow: Ms. Ishita Bajaj

3. Teaching Assistants: Pragya Srivastava, Shiv Kumar Gehlot, Chhavi Jain, Vivek Reddy, Shikha Singh, and Nirav Diwan.

