
To preprocess your text simply means to bring your text into a form that is predictable and analyzable for your task. A task here is a combination of approach and domain: for example, extracting top keywords with TF-IDF (approach) from Tweets (domain) is a task. Pre-processing falls under Data Preparation, one of the six phases of the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology.
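The TF-IDF example above can be sketched in a few lines of pure Python. This is an illustrative, hypothetical helper (not part of the package), using the plain tf × log(N/df) formulation:

```python
import math
from collections import Counter

def top_keywords(docs, n=3):
    """Rank each document's words by TF-IDF and return the top n per document.

    Illustrative sketch: tokenization is a naive lowercase split.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for doc in tokenized for w in set(doc))
    results = []
    for doc in tokenized:
        tf = Counter(doc)
        # TF-IDF = (term frequency in doc) * log(total docs / docs containing term)
        scores = {w: (tf[w] / len(doc)) * math.log(n_docs / df[w]) for w in tf}
        results.append([w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:n]])
    return results
```

Note that a word appearing in every document gets an IDF of log(1) = 0, so ubiquitous words drop out of the ranking automatically.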
In NLP, text preprocessing is the first step in the process of building a model. The main text preprocessing steps are:
- Tokenization
- Lower casing
- Stop words removal
- Stemming
- Lemmatization
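The steps above can be sketched as a small pipeline. This is a toy, pure-Python illustration, not the package's implementation: the stop-word list, suffix-stripping "stemmer", and lemma lookup table are all deliberately tiny stand-ins for what libraries such as NLTK or spaCy provide.

```python
import re

STOP_WORDS = {"the", "is", "a", "an", "and", "of"}            # toy list; real lists are app-specific
LEMMAS = {"ran": "run", "running": "run", "better": "good"}   # toy lookup table for illustration

def preprocess(text):
    tokens = re.findall(r"[A-Za-z]+", text)                   # 1. tokenization
    tokens = [t.lower() for t in tokens]                      # 2. lower casing
    tokens = [t for t in tokens if t not in STOP_WORDS]       # 3. stop words removal
    stems = [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]   # 4. naive suffix stemming
    lemmas = [LEMMAS.get(t, t) for t in tokens]               # 5. lemmatization via lookup
    return stems, lemmas
```

Note how the naive stemmer can produce non-words ("running" becomes "runn") while the lemma lookup maps to a real dictionary form ("run"); that difference is exactly the stemming-vs-lemmatization distinction discussed below.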
As part of this package, we aim to make data cleansing and exploratory analysis of text data easier for users by bundling various functionalities into a single function. The idea is to extend these functionalities to perform sentiment analysis on any given text data.
1. Basic Functionalities:
a. Unique word count: number of uniquely identified words
b. Count of distinct characters
c. Count of stop words
d. Count of system special characters
e. Tokenization: splitting text into individual tokens (typically words). The benefit of tokenization is that it gets the text into a format that's easier to convert to raw numbers, which can actually be used for processing. It's a natural first step when analyzing text data.
f. Number of hashtags (a new column will be created and the #tags stored there)
g. Number of numeric characters
h. Number of uppercase words (uppercase words are often used to convey sentiment)
i. Number of emojis
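Several of the basic statistics above can be computed with the standard library alone. The following is a minimal sketch (the function name and default stop-word list are hypothetical, not the package's API):

```python
def basic_stats(text, stop_words=frozenset({"a", "the", "is"})):
    """Compute a few of the basic text statistics (illustrative sketch)."""
    words = text.split()
    hashtags = [w for w in words if w.startswith("#")]   # kept aside, mirroring item f
    return {
        "unique_words": len(set(w.lower() for w in words)),
        "stop_words": sum(w.lower() in stop_words for w in words),
        "hashtags": hashtags,
        "numeric_chars": sum(c.isdigit() for c in text),
        # len > 1 skips single letters like "I" and "A"
        "uppercase_words": sum(w.isupper() and len(w) > 1 for w in words),
    }
```

For example, `basic_stats("WOW the match was GREAT #cricket 2024")` reports two uppercase words, one stop word, one hashtag, and four numeric characters.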
2. Pre-Processing Functionalities:
a. Lower casing: conversion of text to lowercase.
b. Punctuation removal (remember, hashtags are stored in a separate column by the previous function)
c. Stop words removal: Stop words removal can be easily done by removing words that are in a pre-defined list. An important thing to note is that there is no universal list of stop words. As such, the list is often created from scratch and tailored to the application being worked on.
d. Standardizing text using a look-up dictionary
e. Spelling correction: spelling corrections are suggested based on a standard dictionary.
f. Topic modeling: essentially a form of dimensionality reduction, since we're reducing a large amount of text data down to a much smaller number of topics. Topic modeling can be useful in a number of data science scenarios.
g. Frequent words: based on user input 'n', a word cloud of the n most frequently used words is given as output.
h. Rare words: based on user input 'n', a word cloud of the n rarest words is given as output.
i. Replacing emojis with text
j. Stemming: the process of reducing words to their root form. The purpose is to map words that are spelled slightly differently due to context, but have the same meaning, onto the same token for processing. Different stemming algorithms are offered as optional arguments from which the user can choose.
k. Lemmatization: the process of grouping together the different inflected forms of a word so they can be analysed as a single item. Lemmatization is similar to stemming, but it brings context to the words, linking words with similar meaning to one dictionary form.
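The frequent-words and rare-words functionalities (items g and h) both reduce to counting tokens before rendering the word cloud. A minimal sketch of that counting step, with a hypothetical helper name, using `collections.Counter`:

```python
from collections import Counter

def frequent_and_rare(tokens, n=2):
    """Return the n most frequent and n rarest tokens (word-cloud input sketch)."""
    counts = Counter(tokens)
    most = [w for w, _ in counts.most_common(n)]
    # most_common() sorts descending, so a reversed tail slice yields the least common.
    rare = [w for w, _ in counts.most_common()[:-n - 1:-1]]
    return most, rare
```

In the package these token lists would then be fed to a word-cloud renderer; the counting logic itself stays the same for both the frequent and rare cases.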