At times of great inspiration, words just flow through our keyboard. Proficient writers can produce several posts in a very short time.
As a writer, you might once have experienced the feeling of being unstoppable, capable of generating one piece of writing after another.
After taking a break from writing, I asked myself a few questions: what is it that I have been writing about? Were my publications scattered all over the place, or did they cover the same topics repeatedly? Which keywords would best describe my content and help it find its true audience?
I realized that my questions related to the topic aspect of publishing. One day I recalled a vivid memory of my former literature mentor in high school. Mrs. Dembawai always reminded us:
“An essay should have an introduction, a body, and a conclusion. But above all, it should have a topic.”
Did my posts convey a key message for my audience? Could I easily put them together into a book that makes sense to readers? This is exactly what our mentor tried to find out.
One day, she asked the whole class to write about anything that crossed our minds. The essays would be collected and handed out to students of another class. Their task would be to analyze the writings and to find the overall topics covered in our essays, the key messages.
We were quite surprised by this idea. Normally, academic essays are responses to questions or directions. This time, we were dealing with a reverse journey. There were no directive words, no topic keywords to guide us. We were left alone in an empty playground of imagination, without any boundaries.
Most people were trained to write around a given narrow topic at school. Yet, our most creative thoughts appear chaotic, disorganized, like random sparks popping at the surface of boiling water. How can we possibly structure everything that we think of?
Mrs. Dembawai’s little experiment set out to discover whether there is an overall topic that governs the written thoughts of a class of students. If so, could that topic be reconstructed by a group of reviewers?
Discovering the abstract topics that occur in a collection of texts or documents is a focus area of AI called Topic Modeling. The goal is to find the most representative number of topics and their most relevant keywords.
Topic Modeling has a wide range of applications in industry. It can help writers answer the following questions:
- is my content coherent, or is it covering too many unrelated topics?
- which keywords best describe my content?
- how different is my content from others?
When we are interested in classifying a set of posts based on topic, we often do not want to look at the sequential pattern of words. Rather, we represent the text as a bag of words: an unordered collection of words that ignores their original position in the text and keeps only their frequency.
As you write a post, you use certain words more often to express a common idea. Your post is, therefore, a mixture of topics, each of which consists of a collection of keywords.
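To make the bag-of-words idea concrete, here is a minimal sketch using only the Python standard library. The helper name `bag_of_words` and the two toy posts are invented for illustration; they are not taken from my actual posts.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return a word -> count mapping, discarding word order."""
    # Assumption: lowercase alphabetic tokens are good enough for a toy example.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

# Two hypothetical posts with the same words in a different order.
post_a = "the model reads the sentence"
post_b = "the sentence reads the model"

bag_a = bag_of_words(post_a)
bag_b = bag_of_words(post_b)

print(bag_a.most_common(3))
print(bag_a == bag_b)  # same words and counts, so the bags are identical
```

Because only the counts are kept, the two posts become indistinguishable, which is exactly the simplification a bag of words makes.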
Below you can see the most frequent keywords I utilized in my posts.
If you want to perform Mrs. Dembawai’s experiment on a set of posts, you have to go beyond counting words. Topics in a text are related to its semantics, so they are governed by hidden variables that we do not observe directly in the text.
In Topic Modeling we typically use a weight, rather than a raw count, to give more importance to discriminative keywords. Intuitively, a keyword has a large weight when it occurs frequently within a post but infrequently across all posts.
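One common weighting of this kind is TF-IDF (term frequency times inverse document frequency). The sketch below assumes that scheme in its simplest textbook form; the exact weighting used for the analysis in this post may differ, and the function name `tf_idf` and the toy posts are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(posts):
    """posts: list of token lists. Returns one {word: weight} dict per post."""
    n_posts = len(posts)
    # Document frequency: in how many posts does each word appear?
    doc_freq = Counter(word for post in posts for word in set(post))
    weights = []
    for post in posts:
        counts = Counter(post)
        weights.append({
            # Term frequency within the post, scaled by how rare the word is overall.
            word: (count / len(post)) * math.log(n_posts / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

# Hypothetical toy posts, already tokenized.
posts = [
    "the model reads the sentence".split(),
    "the network learns the weight".split(),
    "the prior shapes the posterior".split(),
]
weights = tf_idf(posts)
print(weights[0])
```

In this toy corpus "the" appears in every post, so its weight is zero, while a word such as "model" that appears in only one post gets a positive weight.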
After applying this technique, the most informative or useful keywords for identifying topics in my posts can be seen below. Higher values indicate that a keyword is more useful for identifying a specific topic.
If we look at posts as a distribution of topics, and topics themselves as a distribution of keywords, then we can try to find the distributions that would generate the original posts with the highest coherence.
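A well-known model built on this view is Latent Dirichlet Allocation (LDA). The sketch below is a small collapsed Gibbs sampler for LDA written with the standard library only. It is an illustration under assumptions, not the tooling behind the results in this post: the function name `fit_lda`, the hyperparameters `alpha` and `beta`, the iteration count and the toy documents are all chosen for the example.

```python
import random

def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized docs.

    Returns (doc_topic, topic_word, vocab): per-post topic mixtures and
    per-topic keyword distributions, each row summing to 1.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    # Count tables that are updated during sampling.
    doc_topic_counts = [[0] * n_topics for _ in docs]
    topic_word_counts = [[0] * n_words for _ in range(n_topics)]
    topic_totals = [0] * n_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assign = []
        for w in doc:
            k = rng.randrange(n_topics)
            doc_assign.append(k)
            doc_topic_counts[d][k] += 1
            topic_word_counts[k][word_id[w]] += 1
            topic_totals[k] += 1
        assignments.append(doc_assign)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic_counts[d][k] -= 1
                topic_word_counts[k][v] -= 1
                topic_totals[k] -= 1
                # Unnormalized probability of each topic for this token.
                probs = [
                    (doc_topic_counts[d][t] + alpha)
                    * (topic_word_counts[t][v] + beta)
                    / (topic_totals[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=probs)[0]
                assignments[d][i] = k
                doc_topic_counts[d][k] += 1
                topic_word_counts[k][v] += 1
                topic_totals[k] += 1

    # Smoothed estimates of the two families of distributions.
    doc_topic = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic_counts)
    ]
    topic_word = [
        [(c + beta) / (topic_totals[t] + n_words * beta) for c in row]
        for t, row in enumerate(topic_word_counts)
    ]
    return doc_topic, topic_word, vocab

# Hypothetical toy corpus: two posts about language, two about statistics.
docs = [
    "word sentence language word text".split(),
    "language text sentence word".split(),
    "prior posterior probability prior".split(),
    "probability posterior prior estimate".split(),
]
doc_topic, topic_word, vocab = fit_lda(docs, n_topics=2)
for t, row in enumerate(topic_word):
    top = sorted(zip(row, vocab), reverse=True)[:3]
    print("topic", t, [w for _, w in top])
```

Each row of `doc_topic` is the topic mixture of one post, and each row of `topic_word` is the keyword distribution of one topic. On a real collection you would more likely reach for an established library than for a hand-written sampler like this one.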
By running this method on my posts, the results suggest that 3 to 8 major topics were covered, as illustrated by the colored lines below.
By drilling through the posts, I could agree on classifying my content into 4 major topics.
By applying Topic Modeling, I could extract human-interpretable topics from my posts, where each topic is characterized by the keywords it is most strongly associated with.
Topic 1 is represented by keywords such as word, model, language, sequence, attention, sentence, text, transformer, etc. Topic 1 seems to relate to Natural Language Processing, a field of AI dealing with text and speech processing.
Topic 2 is characterized by keywords such as network, neural, learn, generator, weight, step, agent, etc. This is obviously related to Deep Learning and Reinforcement Learning. Both fields are parts of a broader family of machine learning methods.
Topic 3 is distinguished by keywords such as distribution, probability, posterior, estimate, variable, random, prior, etc. This brings us towards Bayesian Statistics, an approach to statistical inference taking uncertainty into account.
Topic 4 is characterized by a mixture of unrelated terms such as network, image, cell, biophoton, generator, celebrity, etc. Topic 4 seems to cover various fields ranging from Physics to Generative Adversarial Networks.
I was personally thrilled by the results obtained from analyzing my own posts. The content was very specialized around a few highly technical topics.
Although my past posts ranked me among the Top 50 Writers about AI on Medium, this analysis inspired me to diversify my future writing to appeal to a less technical audience.
Discovering the abstract topics that occur in a collection of posts can help writers answer the question of what to write about next.
With these insights, writers can combine the keywords to find a representative title or an abstract for their content. They can also edit their work to make it more coherent around its main topic and to exceed the usual standards.
For example, given a new post, we can obtain its topic mixture, e.g. 5% topic 1, 70% topic 2, 10% topic 3, and 15% topic 4. This insight is often very useful for downstream applications, such as finding the right ad-words and publication for promoting content.
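As a rough illustration of how such a mixture can be estimated, the sketch below holds the per-topic keyword probabilities fixed and fits the topic proportions of a new post with a few EM iterations. This is one simple approach among several, and everything in it is hypothetical: the function name `infer_topic_mixture`, the two toy topics and their probabilities, and the small fallback probability for words a topic has never seen.

```python
def infer_topic_mixture(tokens, topic_word, n_iter=50, unseen=1e-9):
    """Estimate the topic proportions of a new post by EM,
    keeping the per-topic keyword probabilities fixed."""
    n_topics = len(topic_word)
    # Ignore words that no topic knows about.
    tokens = [w for w in tokens if any(w in tw for tw in topic_word)]
    theta = [1.0 / n_topics] * n_topics
    if not tokens:
        return theta  # nothing to learn from, keep the uniform mixture
    for _ in range(n_iter):
        totals = [0.0] * n_topics
        for w in tokens:
            # E-step: share of this token attributed to each topic.
            joint = [theta[t] * topic_word[t].get(w, unseen) for t in range(n_topics)]
            norm = sum(joint)
            for t in range(n_topics):
                totals[t] += joint[t] / norm
        # M-step: new proportions are the average shares.
        theta = [x / len(tokens) for x in totals]
    return theta

# Hypothetical keyword distributions for two toy topics.
topic_word = [
    {"word": 0.4, "sentence": 0.3, "language": 0.3},   # language-like topic
    {"network": 0.5, "neural": 0.3, "agent": 0.2},     # learning-like topic
]
new_post = "neural network agent learns language".split()
theta = infer_topic_mixture(new_post, topic_word)
print([round(p, 2) for p in theta])
```

Three of the four recognized words belong to the second toy topic, so the estimated mixture leans towards it.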
At the beginning of this article, I described Mrs. Dembawai’s experiment, in which the students of one class handed over a bunch of essays without any assigned topic to the students of another class, whose task was to recover the topics.
Do you wish to know what happened next? Stay tuned.