NEO Share

Sharing The Latest Tech News

Introduction to NLP — Package Creation for Pre Processing

February 20, 2021 by systems

To preprocess your text simply means to bring it into a form that is predictable and analyzable for your task. A task here is a combination of approach and domain. For example, extracting top keywords with TF-IDF (approach) from tweets (domain) is a task. Within the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology, preprocessing falls under data preparation, one of its six phases.

In NLP, text preprocessing is the first step in building a model. The main text preprocessing steps are:

  1. Tokenization
  2. Lower casing
  3. Stop words removal
  4. Stemming
  5. Lemmatization
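
As a rough illustration of the first three steps, here is a minimal sketch that uses only the Python standard library. The function names `tokenize` and `preprocess` and the tiny stop word list are assumptions made for this example, not part of the package described here. Stemming and lemmatization are left out because they usually rely on a library such as NLTK.

```python
import re

# Assumption: a tiny illustrative stop word list; real projects tailor their own.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "for"}

def tokenize(text):
    """Split text into word tokens made of letters, digits, and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)

def preprocess(text, stop_words=STOP_WORDS):
    """Tokenize, lowercase, and drop stop words, in that order."""
    tokens = tokenize(text)
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in stop_words]

print(preprocess("The cats are sitting in the garden"))
# → ['cats', 'sitting', 'garden']
```

Lowercasing happens before the stop word check so that "The" and "the" are treated the same way.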

As part of this package, we aimed to make data cleansing and exploratory analysis of text data easier for users by bundling various functionalities into a single function. The idea is to extend these functionalities to perform sentiment analysis on any given text data.

1. Basic Functionalities:

a. Unique words count: Number of distinct words in the text.

b. Different Characters count

c. Count of stop words

d. Count of special characters

e. Tokenization: The benefit of Tokenization is that it gets the text into a format that’s easier to convert to raw numbers, which can actually be used for processing. It’s a natural first step when analyzing text data.

f. Number of hashtags (a new column is created to store the hashtags)

g. Number of numeric characters

h. Number of uppercase words (uppercase words are often used to express strong sentiment)

i. Number of emojis
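
The counts listed above could be computed along the following lines. This is a sketch under stated assumptions rather than the package's actual code: the function name `basic_stats`, the stop word list, the reading of "different characters" as distinct non-space characters, the use of ASCII punctuation as "special characters", and the approximate emoji ranges are all choices made for illustration.

```python
import re
import string

# Assumption: a tiny illustrative stop word list.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "for"}

# Assumption: rough ranges covering common emojis, not the full Unicode emoji set.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def basic_stats(text, stop_words=STOP_WORDS):
    """Return simple descriptive counts for one piece of text."""
    # Strip surrounding punctuation so "phone!!!" counts as "phone".
    words = [w.strip(".,!?;:\"'()") for w in text.split()]
    words = [w for w in words if w]
    hashtags = re.findall(r"#\w+", text)
    return {
        "unique_words": len({w.lower() for w in words}),
        # Assumption: "different characters" means distinct non-space characters.
        "distinct_characters": len({c for c in text if not c.isspace()}),
        "stop_words": sum(w.lower() in stop_words for w in words),
        # Assumption: "special characters" means ASCII punctuation.
        "special_characters": sum(c in string.punctuation for c in text),
        "hashtags": hashtags,  # kept separately, like the extra column
        "hashtag_count": len(hashtags),
        "numeric_characters": sum(c.isdigit() for c in text),
        # Single letters such as "I" are not counted as uppercase words.
        "uppercase_words": sum(w.isupper() and len(w) > 1 for w in words),
        "emojis": len(EMOJI_PATTERN.findall(text)),
    }

stats = basic_stats("I LOVE the new phone!!! #happy #tech 2021 \U0001F600")
print(stats["hashtags"], stats["uppercase_words"], stats["emojis"])
# → ['#happy', '#tech'] 1 1
```

Applied to a DataFrame column, each key of the returned dictionary would become one new column.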

2. Pre-Processing Functionalities:

a. Lower casing: Converts the text to lowercase.

b. Punctuation removal (remember, hashtags are already stored in a separate column by the previous function)

c. Stop words removal: Stop words are removed by filtering out words that appear in a predefined list. An important thing to note is that there is no universal list of stop words, so the list is often built from scratch and tailored to the application being worked on.

d. Standardizing text using a lookup dictionary

e. Spelling correction: Corrections are suggested based on a standard dictionary.

f. Topic Modeling: Essentially, it’s a form of dimensionality reduction, since we’re reducing a large amount of text data down to a much smaller number of topics. Topic modeling can be useful in a number of data science scenarios.

g. Frequent words: Based on the user input n, a word cloud of the n most frequently used words is produced as output.

h. Rare words: Based on the user input n, a word cloud of the n rarest words is produced as output.

i. Replacing emojis with text

j. Stemming: Stemming is the process of reducing words to their root form. The purpose is to map words that are spelled slightly differently due to context, but share the same meaning, onto the same token for processing. Different stemming algorithms are offered as optional arguments from which the user can choose.

k. Lemmatization: Lemmatization is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. It is similar to stemming but takes context into account, such as a word's part of speech, so it maps related forms to a single dictionary word (the lemma).
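
Some of the steps above, namely lower casing, punctuation removal, lookup-based standardization, and the frequent and rare word counts that would feed a word cloud, can be sketched with the standard library alone. The names `standardize`, `frequent_words`, and `rare_words`, along with the small lookup table, are hypothetical and chosen only for this example. The word cloud rendering is not shown.

```python
import re
from collections import Counter

# Assumption: a hypothetical lookup table mapping informal spellings to standard words.
LOOKUP = {"u": "you", "gr8": "great", "luv": "love", "thx": "thanks"}

def standardize(text, lookup=LOOKUP):
    """Lowercase, strip punctuation, and replace tokens found in the lookup table."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return [lookup.get(tok, tok) for tok in text.split()]

def frequent_words(tokens, n):
    """Return the n most common tokens with their counts."""
    return Counter(tokens).most_common(n)

def rare_words(tokens, n):
    """Return the n least common tokens with their counts."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda kv: kv[1])[:n]

tokens = standardize("Thx, u are gr8! Gr8 phone, gr8 camera, luv the camera.")
print(frequent_words(tokens, 2))
# → [('great', 3), ('camera', 2)]
```

The resulting (word, count) pairs could then be passed to a plotting or word cloud library of the user's choice.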

Filed Under: Machine Learning

Copyright © 2025 NEO Share
