FlashText: A Better Alternative to Regex for NLP Tasks
Natural Language Processing (NLP) is a subfield of artificial intelligence concerned with the interactions between computers and natural human languages. NLP involves text processing, text analysis, applying machine learning algorithms to text and speech, and more.
Text processing is a key element in the pipeline of any NLP or text-based data science project. Regular expressions are used for a variety of purposes, such as feature extraction, string replacement, and other string manipulations. Regular expressions, also known as regex, are a tool available in many programming languages and in many Python libraries.
A regex is a sequence of characters that defines a pattern. The pattern is matched against substrings of a given string, and the matches can then be used to search, extract, substitute, or perform other string operations.
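A small sketch of these operations with Python's built-in re module may make this concrete. The sample sentence and the patterns are made up for illustration only.

```python
import re

# Illustrative sample text (an assumption, not taken from the article's data).
text = "NLP pipelines clean text, tokenize text, and then model text."

# Search: locate the first word that starts with "token".
first_match = re.search(r"token\w*", text)
print(first_match.group())  # tokenize

# Extract: collect every "<word> text" pair in the sentence.
pairs = re.findall(r"\b\w+ text\b", text)
print(pairs)  # ['clean text', 'tokenize text', 'model text']

# Substitute: replace the whole word "text" wherever it occurs.
replaced = re.sub(r"\btext\b", "documents", text)
print(replaced)
```

The \b anchors restrict matches to whole words, which is the same kind of matching a keyword tool needs to do.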
FlashText is an open-source Python library that can be used to replace or extract keywords in text. NLP projects involve several text processing tasks where word replacement and extraction are required, and the FlashText library lets developers perform keyword extraction and replacement effectively.
Installation:
The FlashText library can be installed from PyPI:
pip install flashtext
Usage:
The FlashText library has a limited scope: it is restricted to extracting keywords, replacing keywords, getting extra information about extracted keywords, and removing keywords. In the sample notebook below, you can find code snippets that calculate and compare benchmark numbers between FlashText and regex for extracting and replacing keywords from a text taken from Wikipedia.
Keyword extraction and replacement are performed using regex and the FlashText library on a text document of around 500 words taken from the Wikipedia page on machine learning.
You can compare the benchmark timings of the two libraries for the two tasks, keyword extraction and replacement. Because the tasks were run on a short text of around 500 words, the difference in timings is very small and the performance of the two is practically indistinguishable.
The plot below shows the time taken for a replace operation with 1,000 keywords on a text document of 10,000 tokens. In this setting, FlashText is about 28x faster than regex.