How a data scientist deals with text
This article explores the most common use cases a data scientist faces when dealing with Natural Language Processing (NLP). Setting the scene, we assume you are new to NLP, but not necessarily to data science.
As data scientists, we often have to work with unstructured data. If you are not familiar with the term, think of structured data as a table in Excel or in an actual database, where each column is a feature or an attribute. Unstructured data is everything that does not fit this format, which usually means images or text. NLP deals with the computational processing of text, and it is about making a computer understand our written language. The overall objective, or universal goal if you like, is to create models that understand text the way a human would. We are not there yet, but we can already create models that are useful in many settings. Typically, as data scientists, we face three common text-related tasks.
Topic modeling
Topic modeling is needed when we have lots of documents and want to group them “automatically” into sets (clusters), each related to a specific topic. Arguably, the most famous approach is Latent Dirichlet Allocation (LDA). In short, LDA assumes that each document is a mixture of topics, and that each topic is a distribution over words. So, when we fit LDA to a corpus, it essentially tells us which topics generated each document, and in what proportions.
This is useful when we have no prior labels for a given set of documents and want to discover the groups that we suspect, or would like, the documents to fall into. A minimal sketch of this with the gensim library is shown below; the toy corpus is made up purely for illustration.
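```python
# Minimal LDA sketch with gensim; the tiny pre-tokenized corpus is invented for illustration.
from gensim import corpora, models

docs = [
    ["cat", "dog", "pet", "vet"],
    ["stock", "market", "trading", "price"],
    ["dog", "vet", "pet", "food"],
    ["price", "market", "economy", "stock"],
]

# Map each token to an integer id, then convert each document to bag-of-words counts.
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Fit an LDA model, asking for two topics.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=42)

# Each topic is a distribution over words...
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)

# ...and each document is a mixture of topics.
print(lda.get_document_topics(corpus[0]))
```

Notice how this matches the description above: the model output is exactly the topic-per-document mixture and the word-per-topic distribution.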
For more information on topic modeling, you can check here and here.
Named Entity Recognition
Named Entity Recognition (NER) is the task of identifying the entities mentioned in a document: for example, names of people, companies, addresses, products, and so on. To achieve high accuracy, you typically need a large amount of annotated training data, from which a machine learning model can learn to recognize the different entities based on their context in a document. This means that someone has to go through many documents manually and annotate every occurrence of the entities we are interested in.
At the heart of such a system there are usually two steps: first, the model decides whether a word (or span of words) is an entity at all; second, given that it is an entity, the model assigns it to the correct category.
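In practice, you rarely build this from scratch; libraries such as spaCy ship with pretrained NER models. Here is a minimal sketch, assuming the small English model has been downloaded:

```python
# Minimal NER sketch using spaCy's pretrained English pipeline.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Each detected span has already been both located and categorized.
for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. "Apple" -> ORG, "U.K." -> GPE, "$1 billion" -> MONEY
```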
A very nice blog to understand this further can be found here.
Also, here is a blog focusing more on the implementation side using Python and relevant libraries.
Text Classification
NER can be seen as a form of text classification at the word level, but the most common task is simply to predict something about a given text as a whole. This works like any other data science use case: you train a machine learning model on relevant data, in this case text, and try to predict something about new, unseen text.
Oversimplifying a little: as in any other supervised machine learning task, given a set of features with corresponding labels, we can build models to predict those labels. Here, our features are extracted from the text. Because text can be represented in many ways, and because there are usually large volumes of data available to train on, deep learning approaches are frequently used.
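Before reaching for deep learning, a classic baseline is worth knowing. Here is a minimal sketch with scikit-learn, using TF-IDF features and logistic regression on a made-up toy dataset:

```python
# Minimal text classification sketch: TF-IDF features + logistic regression.
# The tiny dataset is invented for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "great product, works perfectly",
    "terrible quality, broke after a day",
    "absolutely love it, highly recommend",
    "waste of money, very disappointed",
]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# The vectorizer turns raw text into numeric features; the classifier predicts labels.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["this was a great purchase"]))  # expected: [1]
```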
A good source to get going is the following.
As we mentioned, deep learning approaches tend to work best when a large volume of data is available. This example classifies fake news using a deep learning approach.
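For a flavor of what such a deep learning classifier can look like, here is a minimal Keras sketch. The two-example dataset and its labels are made up, and a real fake-news model would need far more data and a stronger architecture:

```python
# Minimal deep learning text classifier sketch with Keras.
# The two texts and labels are invented; plug in a real labeled corpus.
import tensorflow as tf
from tensorflow.keras import layers

texts = ["breaking: celebrity spotted on the moon", "central bank raises interest rates"]
labels = [1, 0]  # 1 = fake, 0 = real (toy labels)

# Learn a vocabulary from the training texts and map words to integer ids.
vectorizer = layers.TextVectorization(max_tokens=10_000, output_sequence_length=50)
vectorizer.adapt(texts)

model = tf.keras.Sequential([
    vectorizer,                                     # raw strings -> integer ids
    layers.Embedding(input_dim=10_000, output_dim=16),  # ids -> dense vectors
    layers.GlobalAveragePooling1D(),                # average word vectors per text
    layers.Dense(1, activation="sigmoid"),          # binary prediction
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(tf.constant(texts), tf.constant(labels), epochs=3)
```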