NLP is a sub-field of AI that enables computers to understand and process human-generated text data. In this blog we will learn the basic tasks of NLP and some of its applications.
Once we have the text, the first task is to pre-process the data.
Sentence segmentation: break the text into individual sentences.
Tokenization: create words/tokens (the vocabulary) from each sentence.
Stop word removal: remove the most common and least informative words, e.g. the, a, an, of, in.
Stemming: removing affixes and keeping the stem.
Lemmatization: finding the root form of a word.
Standardization of text.
Domain-specific cleaning, depending on the domain of the text corpus.
Punctuation removal and number removal.
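Several of the steps above can be sketched with only the Python standard library. This is a minimal illustration, not a production pipeline: the stop-word list is a tiny made-up sample, the sentence splitter is a naive regular expression, and `naive_stem` is a hypothetical crude suffix stripper rather than a real stemming algorithm such as Porter's.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and", "to"}

def split_sentences(text):
    """Naive sentence segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Lowercase, drop punctuation and digits, then split on whitespace."""
    table = str.maketrans("", "", string.punctuation + string.digits)
    return sentence.lower().translate(table).split()

def remove_stop_words(tokens):
    """Keep only tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    """Very crude suffix stripping (hypothetical helper, not the Porter stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Return one list of cleaned, stemmed tokens per sentence."""
    return [
        [naive_stem(t) for t in remove_stop_words(tokenize(s))]
        for s in split_sentences(text)
    ]

text = "The cats are playing in the garden. A dog barked at 2 birds!"
print(preprocess(text))
# [['cat', 'are', 'play', 'garden'], ['dog', 'bark', 'at', 'bird']]
```

In practice, libraries such as NLTK or spaCy provide far more robust versions of each step.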
NLP outcomes are not very visually intuitive, so to make them more engaging we rely on certain visualization methods to present the results. The following are a few of those techniques.
Before a machine can process text, we need to convert the text into numeric vectors. The following are some models that do so.
Bag of words
In this method we represent a sentence as a vector with one entry per word in the vocabulary (the "bag"). Each entry says whether that word is present in the sentence, or how many times it occurs.
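As a rough sketch of the idea, the snippet below builds a vocabulary from a two-sentence toy corpus and turns each sentence into a vector. The function names and the corpus are made up for illustration, and tokenization is just a whitespace split.

```python
def build_vocabulary(sentences):
    """Sorted list of the unique lowercase words across all sentences."""
    return sorted({w for s in sentences for w in s.lower().split()})

def bag_of_words(sentence, vocabulary, binary=True):
    """One entry per vocabulary word: presence (0/1) or raw count."""
    words = sentence.lower().split()
    if binary:
        return [1 if v in words else 0 for v in vocabulary]
    return [words.count(v) for v in vocabulary]

corpus = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocabulary(corpus)
print(vocab)
# ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print([bag_of_words(s, vocab) for s in corpus])
# [[1, 0, 1, 1, 1, 1], [0, 1, 0, 0, 1, 1]]
print(bag_of_words(corpus[0], vocab, binary=False))
# [1, 0, 1, 1, 1, 2]
```

Note that word order is lost: the vector only records which words occur, not where.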
TF-IDF
TF-IDF improves upon the bag-of-words approach. It penalizes words that are very common and carry little information. For example, if a word appears in all documents, it is not very important. This is done by weighting each word by its inverse document frequency, which is defined as:
IDF(word) = log(number of documents / number of documents containing the word)
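To make the formula concrete, here is a small sketch that computes the inverse document frequency exactly as defined above and multiplies it by a term frequency. The term-frequency definition (count divided by document length) is one common choice and an assumption here, as are the function names and the toy documents. Returning 0.0 for a word that appears in no document is also just a convenience to avoid dividing by zero.

```python
import math

def idf(word, documents):
    """log(number of documents / number of documents containing the word)."""
    containing = sum(1 for doc in documents if word in doc)
    if containing == 0:
        return 0.0  # assumption: treat unseen words as carrying no weight
    return math.log(len(documents) / containing)

def tf_idf(word, doc, documents):
    """Term frequency (count / document length) times inverse document frequency."""
    tf = doc.count(word) / len(doc)
    return tf * idf(word, documents)

# Each document is a list of tokens.
docs = [s.split() for s in ["the cat sat", "the dog barked", "the cat purred"]]

print(idf("the", docs))   # 0.0, because "the" appears in every document
print(idf("cat", docs))   # log(3/2), appears in two of three documents
print(idf("dog", docs))   # log(3), appears in only one document
print(tf_idf("dog", docs[1], docs))
```

A word that occurs everywhere gets a weight of zero, while a rare word such as "dog" gets the highest weight.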
Word embeddings
A representation of a word that conveys its meaning, semantic relationships, and context.
Examples: Word2Vec, BERT, ELMo
Let us look at a few prominent applications of NLP. The text is from this page.
Key Phrase Extraction
Understand the relative prominence of the key phrases within the text. This gives a high-level idea of what the text is about.
Sentiment Analysis
Understanding the sentiment (e.g. positive, negative, neutral, angry, enthusiastic) expressed about a given subject in the text.
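One very simple way to approximate this is a lexicon-based scorer that counts positive and negative words. The sketch below is only an illustration: the two word lists are tiny made-up samples, and the approach ignores negation, sarcasm, and context, which real sentiment models handle with trained classifiers.

```python
import string

# Tiny made-up lexicons for illustration only.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "sad"}

def sentiment(text):
    """Label text by comparing counts of positive and negative words."""
    words = [w.strip(string.punctuation) for w in text.lower().split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this great phone!"))   # positive
print(sentiment("The battery is terrible."))   # negative
print(sentiment("It arrived on Monday."))      # neutral
```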
Topic Modeling
Discovering hidden semantic structures or abstract concepts in documents.
Retrieve the documents which are contextually and semantically similar to the user’s query.
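A basic version of retrieval can be sketched by turning the query and each document into word-count vectors and ranking documents by cosine similarity. Note the assumption: word counts only capture word overlap, so this is a lexical stand-in. A semantic system would swap in embedding vectors while keeping the same ranking logic. The function names and documents are hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Word-count vector for a text, using a simple whitespace split."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents, top_k=2):
    """Return the top_k documents most similar to the query."""
    q = vectorize(query)
    ranked = sorted(documents, key=lambda d: cosine(q, vectorize(d)), reverse=True)
    return ranked[:top_k]

documents = [
    "cats chase mice",
    "stock prices fell sharply",
    "dogs and cats play together",
]
print(retrieve("cats and dogs", documents))
# ['dogs and cats play together', 'cats chase mice']
```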
Abstractive summarization: generate new sentences that convey the meaning of the original text in fewer sentences.
Extractive summarization: important sentences from the original text are identified and extracted.
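The extractive approach can be illustrated with a simple frequency heuristic: score each sentence by the average frequency of its words in the whole text and keep the highest-scoring ones. This is one classic baseline and an assumption on my part, not necessarily what any particular tool does. Stop words are not removed here to keep the sketch short.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=1):
    """Pick the sentences whose words are, on average, most frequent in the text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Return the selected sentences in their original order.
    return [s for s in sentences if s in top]

text = (
    "NLP helps computers process text. Text can be noisy. "
    "Computers process text with NLP models. The weather is nice."
)
print(extractive_summary(text, num_sentences=1))
# ['NLP helps computers process text.']
```

Abstractive summarization, by contrast, needs a generative model and cannot be reduced to sentence selection like this.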
Named Entity Recognition
Locate named entities in a text and classify them into pre-defined categories.
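The simplest possible illustration of locating and classifying entities is a dictionary (gazetteer) lookup. The sketch below assumes a tiny hypothetical gazetteer and only matches single words. Real systems use trained sequence models that can recognize unseen names and multi-word entities.

```python
import re

# Hypothetical gazetteer mapping lowercase names to categories.
GAZETTEER = {
    "london": "LOCATION",
    "paris": "LOCATION",
    "google": "ORGANIZATION",
    "alice": "PERSON",
}

def find_entities(text):
    """Return (entity text, category, start offset) for each gazetteer match."""
    entities = []
    for match in re.finditer(r"[A-Za-z]+", text):
        label = GAZETTEER.get(match.group().lower())
        if label:
            entities.append((match.group(), label, match.start()))
    return entities

print(find_entities("Alice joined Google in London."))
# [('Alice', 'PERSON', 0), ('Google', 'ORGANIZATION', 13), ('London', 'LOCATION', 23)]
```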