Using well-established algorithms and models to obtain useful results for topic modeling has never been easier
Topic modeling was not always this easy. Not long ago, the prevalent method for topic modeling was Latent Dirichlet Allocation (LDA). Using LDA with Gensim is simple to code, but I often struggled to get any useful insights from the results. I was therefore impressed when I explored the Top2Vec module and found it so easy to use and its results so useful.
Top2Vec was created by Dimo Angelov and released in March 2020, with its paper published on arXiv in August 2020. Although the library is new, the algorithms it uses are well-established: Doc2Vec, UMAP, and HDBSCAN. It also supports pre-trained embedding models such as Universal Sentence Encoder and BERT.
In this article, we will look at the high-level workings of Top2Vec and illustrate its use through topic modeling of hotel reviews. You can imagine how useful this would be to hotel management: the reviews get organized into the key aspects of customer feedback. Needless to say, the same approach can be applied to other kinds of reviews too.
Top2Vec automatically detects the topics present in text and does not require traditional text preprocessing such as stop-word removal, stemming, or lemmatization.
There are three key steps taken by Top2Vec.
1. Transform documents to numeric representations
Given a list of documents, Top2Vec converts each document to a numeric representation (or document vector) through Doc2Vec or a pre-trained model. Similarly, each unique word gets its own numeric representation (word vector).
The numeric representations are created such that the semantic content of the documents is captured: similar documents and words end up close to each other. In the example below, we see that the documents (blue points) and words (green points) relating to washrooms are close to each other, while those related to parking bunch together.
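To make this more concrete, here is a minimal sketch of learning document and word vectors with Gensim's Doc2Vec. The two toy reviews and the hyperparameter values are purely illustrative and not the exact settings Top2Vec uses internally.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus; in practice this would be the full list of hotel reviews.
docs = ["the washroom was dirty and the shower drain was clogged",
        "parking was expensive and hard to find near the hotel"]
tagged = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(docs)]

# DBOW Doc2Vec with dbow_words=1 learns document and word vectors jointly,
# which is broadly the setup Top2Vec relies on.
d2v = Doc2Vec(tagged, vector_size=50, min_count=1, dm=0, dbow_words=1, epochs=40)

doc_vector = d2v.dv[0]            # numeric representation of the first review
word_vector = d2v.wv["parking"]   # numeric representation of a word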
2. Dimensionality reduction
With the numeric representations, we could proceed to cluster the documents to find the topics, but document vectors in high-dimensional space tend to be sparse. Dimensionality reduction is therefore performed using UMAP (Uniform Manifold Approximation and Projection) to help find dense areas. In the paper, the author finds that 5 dimensions give the best results for the downstream task of clustering.
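As an illustration, this kind of reduction can be done with the umap-learn package. The random array below stands in for the real document vectors, and the parameter values are assumptions for the sketch rather than Top2Vec's exact settings.

import numpy as np
import umap  # umap-learn package

# Stand-in for the (n_documents, 300) matrix of document vectors.
doc_vectors = np.random.rand(1000, 300)

# Compress to 5 dimensions, the setting the paper reports working best.
reducer = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine")
reduced_vectors = reducer.fit_transform(doc_vectors)  # shape: (1000, 5)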
3. Clustering of documents to find topics
After compressing the numeric representations into a lower dimensional space, we are now ready to find dense areas of documents using HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise).
In the diagram below, each dot represents a document and the dense areas of documents are colored, with each area representing a topic. The gray dots are outliers not belonging to any cluster.
For each dense area, a numeric representation of the topic (topic vector) is then obtained by taking the arithmetic mean of all the document vectors in the same cluster. Finally, each document is assigned a topic number based on the topic vector nearest to its document vector.
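A rough sketch of this clustering and topic-assignment step, using the hdbscan package and NumPy, might look like the following. The synthetic data and the min_cluster_size value are assumptions for illustration only.

import numpy as np
import hdbscan

rng = np.random.default_rng(0)

# Stand-ins: three loose blobs play the role of the UMAP-reduced vectors,
# and a random matrix plays the role of the original document vectors.
reduced_vectors = np.vstack([rng.normal(loc=c, scale=0.05, size=(300, 5))
                             for c in (0.0, 0.5, 1.0)])
doc_vectors = rng.random((900, 300))

# Find dense areas of documents; label -1 marks outliers (the gray dots).
labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(reduced_vectors)

# Topic vector = arithmetic mean of the document vectors in each cluster.
topic_vectors = np.vstack([doc_vectors[labels == k].mean(axis=0)
                           for k in sorted(set(labels)) if k != -1])

# Assign each document to the nearest topic vector (cosine similarity).
norm_docs = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
norm_topics = topic_vectors / np.linalg.norm(topic_vectors, axis=1, keepdims=True)
doc_topics = (norm_docs @ norm_topics.T).argmax(axis=1)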
With a better understanding of Top2Vec, we are now ready to dive into our hotel reviews example.
The dataset that we are using can be downloaded from Kaggle and contains 515,000 customer reviews of luxury hotels across Europe.
It contains various columns but we will just be using one column in our illustration — Negative_Review.
An example of Negative_Review:
Room was not cleaned correctly Wine Champagne glasses left dirty in the room the floors felt dirty and the shower drain was clogged From the day we arrived
After some cleaning of the data, we are ready to feed our list of reviews to Top2Vec.
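For reference, the cleanup can be as simple as the sketch below. The file name and the "No Negative" placeholder text reflect how the Kaggle dataset is typically distributed, so treat them as assumptions to verify against your copy of the data.

import pandas as pd

# File name as distributed on Kaggle (assumption; adjust to your download).
df = pd.read_csv("Hotel_Reviews.csv")

# Keep the Negative_Review column and drop the "No Negative" placeholder
# used when a guest left no complaint.
reviews = df["Negative_Review"].str.strip()
hotel_reviews = reviews[reviews != "No Negative"].tolist()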
The code to run the modeling is as follows, where hotel_reviews is our data.
model = Top2Vec(documents=hotel_reviews)
And that’s it! This single line of code will pre-process the documents for training, create document and word vectors using Doc2Vec, perform dimensionality reduction on the vectors, and finally find the topics through clustering.
To use a pre-trained model instead, we just need to add the embedding_model parameter.
model = Top2Vec(documents=hotel_reviews, embedding_model='distiluse-base-multilingual-cased')
In our case, we experimented with Doc2Vec and other pre-trained models for creating the document vectors, and found that the “distiluse-base-multilingual-cased” pre-trained model works best for our example.
We are now ready to interpret the topics. Top2Vec has methods that make evaluation easier by generating word clouds and retrieving the top similar documents for each topic.
Let’s look at our top 3 topics.
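One way to inspect them with Top2Vec's built-in methods is sketched below; printing three sample reviews per topic is an arbitrary choice.

# Topics are ordered by size, so the first three topic numbers are our top 3.
topic_sizes, topic_nums = model.get_topic_sizes()

for topic_num in topic_nums[:3]:
    # Word cloud of the words closest to this topic's vector.
    model.generate_topic_wordcloud(topic_num)

    # Reviews most similar to the topic vector.
    documents, document_scores, document_ids = model.search_documents_by_topic(
        topic_num=topic_num, num_docs=3)
    for doc, score in zip(documents, document_scores):
        print(f"Score: {score:.2f}\n{doc}\n")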
From the word clouds and sample reviews, we can clearly infer that topic 1 has to do with bad washrooms, topic 2 is about small rooms, and topic 3 covers complaints about parking.
All these results were obtained without us telling the computer which topics to look for, and with only a few lines of code. Power! 💪
If you would like to try it out on your own, our example notebook can be found here for your reference.