
Natural Language Processing (NLP) is a sub-field of AI that enables computers to understand and process human-generated text. In this blog we will learn the basic tasks of NLP and look at some of its applications.
Once we have the text, the first task is to pre-process it.
Sentence Segmentation
Break the text into individual sentences.
Tokenization
Split each sentence into words (tokens); the distinct tokens together form the vocabulary.
Stop-words removal
Remove the most common words that carry little meaning, e.g. the, a, an, of, in.
Stemming/Lemmatization
Stemming: removing affixes and keeping the stem, e.g. "playing" becomes "play".
Lemmatization: finding the root (dictionary) form of a word, e.g. "better" becomes "good".
Standardization of text
Domain-specific cleaning, which depends on the domain of the text corpus.
Noise removal
Remove punctuation and numbers.
NLP outcomes are not visually intuitive, so we rely on certain visualization methods to present the results in an interesting way. The following are a few of those techniques.
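The pre-processing steps above can be sketched in plain Python. This is only a rough illustration and not a production pipeline: the stop-word list, the suffix-stripping rule and all function names are assumptions made for this example, and in practice a library such as NLTK or spaCy would usually be used instead.

```python
import re
import string

# A tiny hand-picked stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "are", "it", "and", "to"}

def split_sentences(text):
    """Naive sentence segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Lowercase the sentence and split it on whitespace."""
    return sentence.lower().split()

def remove_noise(tokens):
    """Strip punctuation from each token, then drop empty and purely numeric tokens."""
    table = str.maketrans("", "", string.punctuation)
    cleaned = (tok.translate(table) for tok in tokens)
    return [tok for tok in cleaned if tok and not tok.isdigit()]

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [tok for tok in tokens if tok not in STOP_WORDS]

def naive_stem(token):
    """Very crude suffix stripping (not the Porter algorithm)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Run all steps and return one list of stems per sentence."""
    result = []
    for sentence in split_sentences(text):
        tokens = remove_stop_words(remove_noise(tokenize(sentence)))
        result.append([naive_stem(tok) for tok in tokens])
    return result

print(preprocess("The cats are playing in the garden. It rained 3 times!"))
# → [['cat', 'play', 'garden'], ['rain', 'time']]
```

Note that this sketch only stems; lemmatization needs a dictionary of word forms, which is why it is normally left to a library.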
Word Cloud
Key Phrases
Text Network
Parts-of-Speech Tagging
Before a machine can process text, we need to convert the text into numeric vectors. The following are some models that do so.
Bag of words
In this method we represent a sentence as a vector with one entry per word in the corpus vocabulary (the "bag"); each entry says whether, or how many times, that word is present in the sentence.
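One way to read the description above is as a count vector over a shared vocabulary. The sketch below assumes whitespace tokenization and lowercase text; the function names and the two-sentence corpus are made up for illustration.

```python
def build_vocabulary(sentences):
    """Collect the distinct words of all sentences, sorted for a stable order."""
    return sorted({word for s in sentences for word in s.lower().split()})

def bag_of_words(sentence, vocabulary):
    """Return one count per vocabulary word (0 means the word is absent)."""
    words = sentence.lower().split()
    return [words.count(term) for term in vocabulary]

corpus = ["the cat sat", "the dog sat on the mat"]
vocab = build_vocabulary(corpus)
print(vocab)
# → ['cat', 'dog', 'mat', 'on', 'sat', 'the']
for sentence in corpus:
    print(bag_of_words(sentence, vocab))
# → [1, 0, 0, 0, 1, 1]
# → [0, 1, 1, 1, 1, 2]
```

Word order is lost in this representation: only the counts survive, which is where the name "bag" comes from.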
TF-IdF
It improves upon the bag-of-words approach by penalizing words that are common across documents and carry little information. For example, if a word appears in every document, it is not very important. This is done by multiplying the term frequency (how often the word occurs in a document) by the inverse document frequency, which is defined as:
log(number of documents/number of documents containing the word)
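The formula above can be tried out on a toy corpus. This sketch assumes the natural logarithm and defines term frequency as the word count divided by the document length, which is one common convention among several; the function names and the three small documents are invented for the example.

```python
import math

def inverse_document_frequency(term, documents):
    """log(number of documents / number of documents containing the term)."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        raise ValueError(f"{term!r} does not occur in any document")
    return math.log(len(documents) / containing)

def tf_idf(term, document, documents):
    """Term frequency within one document, weighted by the IDF of the term."""
    tf = document.count(term) / len(document)
    return tf * inverse_document_frequency(term, documents)

# Each document is already tokenized into a list of words.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

print(inverse_document_frequency("the", docs))  # in every document → 0.0
print(round(inverse_document_frequency("cat", docs), 3))  # log(3/2) → 0.405
print(round(tf_idf("dog", docs[1], docs), 3))  # (1/3) * log(3) → 0.366
```

The word "the" appears in all three documents, so its weight drops to zero, while the rarer "dog" gets the highest score.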
Word Embedding
A representation of a word as a dense vector that conveys its meaning, semantic relationships and context.
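To see how vectors can convey semantic relationships, a common approach is to compare them with cosine similarity. The three-dimensional vectors below are hypothetical values invented for this illustration; they do not come from any trained model, where embeddings typically have hundreds of dimensions.

```python
import math

# Hypothetical embeddings, hand-written for illustration only.
embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: values near 1.0 mean similar direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity(embeddings["king"], embeddings["queen"]))
print(cosine_similarity(embeddings["king"], embeddings["apple"]))
```

With these made-up values, "king" comes out closer to "queen" than to "apple", which is the kind of relationship a trained embedding is expected to capture.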
Language Models
Word2Vec, BERT, ELMo
Let us look at a few prominent applications of NLP. The text is from this page.
Key Phrase Extraction
Understand the relative prominence of the key phrases within the text. This gives a high-level idea of what the text is about.
Sentiment analysis
Understanding the sentiment (e.g. positive, negative, neutral, angry, enthusiastic) about a given subject from text.
Topic Modelling
Discovering hidden semantic structures or abstract concepts in documents.
Contextual Search
Retrieve the documents which are contextually and semantically similar to the user’s query.
Text Summarization
Abstractive Summarization: generate new sentences that convey the meaning of the original text in a smaller number of sentences.
Extractive Summarization: important sentences from the original text are identified and extracted.
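Extractive summarization can be sketched with a simple word-frequency heuristic: score each sentence by the average corpus frequency of its words and keep the top ones. This scoring rule is an assumption chosen for brevity, not the only or the best method, and the function name and sample text are made up for the example.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text.
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        # Average frequency of the words in the sentence.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return [s for s in sentences if s in top]

text = "NLP helps computers process text. Text data is everywhere. I like tea."
print(extractive_summary(text, num_sentences=1))
# → ['Text data is everywhere.']
```

Abstractive summarization cannot be reduced to a few lines like this, because it has to generate new sentences rather than select existing ones.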
Entity Recognition
Locate and classify named entities in a text into pre-defined categories.
Happy understanding!!