Polyglot: A Python Package for NLP Operations
Natural Language Processing (NLP) aims to make human (natural) language understandable to machines. It covers tasks such as text analysis, text mining, sentiment analysis, and polarity analysis. Several Python packages make these operations easy.
Each NLP package offers its own set of features, which makes it easier for the end user to perform text analysis and other NLP operations. In this series of articles, we will explore different NLP packages for Python and what each of them can do.
In this article, we will discuss Polyglot, an open-source Python package for manipulating text and extracting useful information from it. Its range of features makes it a convenient, easy-to-use alternative to other NLP libraries. Here we will walk through its main functionalities and how to use them.
Let’s get started.
To get started, we first need to install polyglot and its dependencies. For this article we will be using Google Colab; the commands below install polyglot and the packages it depends on.
!pip3 install polyglot
!pip3 install pyicu
!pip3 install pycld2
!pip3 install morfessor
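Before moving on, it can help to confirm that the four packages can actually be found in the current runtime. The snippet below is a minimal sketch that uses only the standard library. `missing_modules` is a hypothetical helper name introduced here for illustration, and the sketch assumes that PyICU is imported under the module name `icu`.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the module names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Import names of the packages installed above (assumption: PyICU imports as `icu`).
required = ["polyglot", "icu", "pycld2", "morfessor"]
print("Missing modules:", missing_modules(required) or "none")
```

If the list is not empty, re-running the matching install command is usually the first thing to try.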
After installing these libraries, we also need to download the polyglot models that will be used in this article.
!polyglot download embeddings2.en
!polyglot download pos2.en
!polyglot download ner2.en
!polyglot download morph2.en
!polyglot download sentiment2.en
!polyglot download transliteration2.hi
The next step is to import the required libraries and functionalities of polyglot that we will explore in this article.
import polyglot
from polyglot.detect import Detector
from polyglot.text import Text, Word
from polyglot.mapping import Embedding
from polyglot.transliteration import Transliterator
Let us start by exploring some of the NLP functionalities that are provided by polyglot, but before that let us input some sample data that we will be working on.
sample_text = '''Piyush is an Aspiring Data Scientist and is working hard to get there. He stood Kaggle grandmaster 4 year consistently. His goal is to work for Google.'''
1. Language Detection
Polyglot’s language detector can easily identify the language in which the text is written.
#Language detection
detector = Detector(sample_text)
print(detector.language)
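The object returned by `detector.language` carries more than a printable summary. The sketch below assumes it exposes `name`, `code`, and `confidence` attributes, and formats them into a single line. `describe_language` is a hypothetical helper, and the stand-in object uses made-up values so that the snippet runs without polyglot installed.

```python
from types import SimpleNamespace


def describe_language(lang):
    """Format an object that exposes `name`, `code` and `confidence` attributes."""
    return f"{lang.name} ({lang.code}), confidence {lang.confidence:.1f}"


# Stand-in with illustrative values, not real detector output.
# With polyglot installed: describe_language(Detector(sample_text).language)
example = SimpleNamespace(name="English", code="en", confidence=98.0)
print(describe_language(example))
```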
2. Sentences and Words
To split a text into words or sentences, we can wrap it in polyglot's Text object and read its words and sentences attributes.
#Tokenize
text = Text(sample_text)
text.words
text.sentences
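Once the text is tokenized, a common next step is to count how often each word occurs. The following is a minimal sketch, assuming the tokens behave like plain strings. `top_words` is a hypothetical helper, and the token list is hand-written to stand in for `text.words`.

```python
from collections import Counter


def top_words(tokens, n=3):
    """Return the `n` most frequent alphabetic tokens, compared case-insensitively."""
    counts = Counter(str(tok).lower() for tok in tokens if str(tok).isalpha())
    return counts.most_common(n)


# Hand-written tokens standing in for `text.words`.
tokens = ["His", "goal", "is", "to", "work", "for", "Google", ".",
          "He", "is", "working", "hard", "."]
print(top_words(tokens, n=1))
```

With polyglot loaded, the same helper could be called as `top_words(text.words)`.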
3. POS Tagging
Part-of-speech (POS) tagging is an important NLP operation that labels each word in the text with its grammatical category, such as noun, verb, or adjective.
#POS tagging
text.pos_tags
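The tagger output is easier to scan when words are grouped under their tag. The sketch below assumes `text.pos_tags` yields `(word, tag)` pairs. `group_by_tag` is a hypothetical helper, and the pairs are hand-written stand-ins, so the tags a real model assigns may differ.

```python
from collections import defaultdict


def group_by_tag(tagged_words):
    """Collect the words of (word, tag) pairs into a dict keyed by tag."""
    grouped = defaultdict(list)
    for word, tag in tagged_words:
        grouped[tag].append(str(word))
    return dict(grouped)


# Hand-written pairs standing in for `text.pos_tags`.
pairs = [("goal", "NOUN"), ("work", "VERB"), ("Google", "PROPN"), ("scientist", "NOUN")]
print(group_by_tag(pairs))
```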
4. Named Entity Recognition
Named Entity Recognition (NER) identifies the persons, organizations, and locations, if any, that are mentioned in the text.
#Named entity extraction
text.entities
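To make the result easier to read, the entities can be grouped by type. This is a sketch under two assumptions: that each entity carries a tag of `I-PER`, `I-ORG`, or `I-LOC`, and that it can be iterated to get its words. `entities_by_type` and `LABELS` are hypothetical names, and the sample pairs are hand-written rather than real model output.

```python
# Assumed tag names mapped to readable labels.
LABELS = {"I-PER": "person", "I-ORG": "organization", "I-LOC": "location"}


def entities_by_type(entities):
    """Group (tag, words) pairs into a dict of readable label -> entity strings."""
    result = {}
    for tag, words in entities:
        label = LABELS.get(tag, tag)  # keep unknown tags unchanged
        result.setdefault(label, []).append(" ".join(str(w) for w in words))
    return result


# Hand-written pairs standing in for ((e.tag, e) for e in text.entities).
sample = [("I-PER", ["Piyush"]), ("I-ORG", ["Google"])]
print(entities_by_type(sample))
```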
5. Morphological Analysis
#Morphological Analysis
words = ["programming", "parallel", "inevitable", "handsome"]
for w in words:
    w = Word(w, language="en")
    print(w, w.morphemes)
6. Sentiment Analysis
We can analyze the sentiment of a sentence by looking at the polarity of each word in it.
#Sentiment analysis
text = Text("Himanshu is a good programmer.")
for w in text.words:
    print(w, w.polarity)
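The loop above reports polarity word by word. One simple way to turn those values into a single sentence-level score is to average the non-neutral words. The sketch below assumes each word's polarity is -1, 0, or +1. `sentence_polarity` is a hypothetical helper, and the averaging rule is one reasonable choice, not the only one.

```python
def sentence_polarity(word_polarities):
    """Average the non-zero polarities; return 0.0 when every word is neutral."""
    scored = [p for p in word_polarities if p != 0]
    return sum(scored) / len(scored) if scored else 0.0


# Illustrative values standing in for (w.polarity for w in text.words):
# one positive word, the rest neutral.
print(sentence_polarity([0, 0, 0, 1, 0, 0]))
```

Ignoring neutral words keeps long sentences with a single opinion word from being diluted toward zero.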
7. Transliteration
We can transliterate text, that is, rewrite it in the script of another language.
#Transliteration
transliterator = Transliterator(source_lang="en", target_lang="hi")
new_text = ""
for i in "Piyush Ingale".split():
    new_text = new_text + " " + transliterator.transliterate(i)
new_text
This is how you can explore the different features of Polyglot on text datasets without any hassle.
Go ahead and try this with different text datasets. If you run into any difficulty, you can post it in the responses section.
This post is in collaboration with Piyush Ingale
Thanks for reading! If you want to get in touch with me, feel free to reach me on hmix13@gmail.com or my LinkedIn Profile. You can view my Github profile for different data science projects and packages tutorials. Also, feel free to explore my profile and read different articles I have written related to Data Science.