The goal of stemming and lemmatization is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form.
Stemming refers to a heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time. It often includes the removal of derivational affixes.
Errors in Stemming:
There are two main kinds of error in stemming: over-stemming and under-stemming.
Over-stemming occurs when a larger part of a word is chopped off than required, so that two or more words are incorrectly reduced to the same stem when they should have been reduced to different stems. For example, a stemmer may reduce both university and universe to univers.
Under-stemming occurs when two or more words are reduced to different stems when they should have been reduced to the same stem. For example, consider the words “data” and “datum.” A stemmer may reduce these words to dat and datu respectively, which is wrong: both should be reduced to the same stem, dat.
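To make the two error types concrete, here is a minimal sketch of a toy suffix-stripping stemmer. The SUFFIXES list and the toy_stem function are made up for this illustration and are far cruder than a real stemmer such as Porter's. They only show how blind suffix removal can produce both kinds of error.

```python
# Hypothetical suffix list, checked in order; not taken from any real stemmer.
SUFFIXES = ("ing", "ity", "es", "ed", "s", "e")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


# Over-stemming: two unrelated words collapse to the same stem.
print(toy_stem("university"), toy_stem("universe"))  # univers univers

# Under-stemming: two related words keep different stems.
print(toy_stem("data"), toy_stem("datum"))  # data datum
```

The exact stems a real library returns depend on its rule set, so the output above describes only this toy function.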
Applications of stemming are:
- Stemming is used in information retrieval systems such as search engines.
- It is used to determine domain vocabularies in domain analysis.
Lemmatization refers to doing things properly, using a vocabulary and morphological analysis of words. It normally aims to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
Example: better => good
Text preprocessing can include both stemming and lemmatization. Lemmatization is often preferred because it performs a morphological analysis of the words, whereas stemming only strips affixes.
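The idea of "using a vocabulary" can be sketched with a plain lookup table. The LEMMA_TABLE entries and the toy_lemmatize function below are hypothetical and cover only a handful of words. A real lemmatizer relies on a full dictionary such as WordNet plus morphological rules.

```python
# Hypothetical miniature vocabulary: (word, part of speech) -> lemma.
LEMMA_TABLE = {
    ("better", "a"): "good",
    ("playing", "v"): "play",
    ("stripes", "n"): "stripe",
    ("stripes", "v"): "strip",
}


def toy_lemmatize(word, pos):
    """Look the word up in the vocabulary; fall back to the word itself."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(toy_lemmatize("better", "a"))   # good
print(toy_lemmatize("stripes", "v"))  # strip
print(toy_lemmatize("stripes", "n"))  # stripe
print(toy_lemmatize("gensim", "n"))   # gensim (unknown words are returned unchanged)
```

Note how the part of speech changes the result for "stripes". A suffix-chopping stemmer has no access to that information.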
Applications of lemmatization are:
- Used in comprehensive retrieval systems such as search engines.
- Used in compact indexing.
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

def stem_and_lemma(stemmer, lemma, word, pos):
    """Print the result of stemming and lemmatization of the word."""
    print("Stemmer:", stemmer.stem(word))
    print("Lemmatizer:", lemma.lemmatize(word, pos))

ps = PorterStemmer()
lemma = WordNetLemmatizer()
stem_and_lemma(ps, lemma, word="stripes", pos='v')
stem_and_lemma(ps, lemma, word="playing", pos='v')

sample_text = "The striped bats are hanging on their feet"  # placeholder; the original sample text is not shown
word_list = word_tokenize(sample_text)
print(' '.join([ps.stem(w) for w in word_list]))
print(' '.join([lemma.lemmatize(w) for w in word_list]))
Machine learning models cannot work with raw text directly; the text must first be converted into vectors of numbers. This step is called feature extraction.
Bag-of-words is a popular and simple feature extraction technique used when working with text. It describes the occurrence of each word within a document.
To build the model, design a vocabulary of known words (tokens) and associate a unique index with each word in the vocabulary.
The model records whether (and how often) a known word occurs in a document, but not where in the document it occurs.
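The two steps above can be sketched in plain Python. The tokenize helper and the two example sentences are assumptions made for this illustration. The regular expression simply keeps runs of letters and is much simpler than a production tokenizer.

```python
import re

documents = [
    "John likes to watch movies, especially horror movies.",
    "Mary likes movies too.",
]


def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Step 1: design the vocabulary and give every known word a unique index.
vocabulary = sorted({token for doc in documents for token in tokenize(doc)})
word_index = {word: i for i, word in enumerate(vocabulary)}


def bag_of_words(text):
    """Count how often each known word occurs; word order is discarded."""
    vector = [0] * len(word_index)
    for token in tokenize(text):
        if token in word_index:
            vector[word_index[token]] += 1
    return vector


# Step 2: represent every document as a vector of word counts.
vectors = [bag_of_words(doc) for doc in documents]
print(vocabulary)
print(vectors[0])  # [1, 1, 1, 1, 0, 2, 1, 0, 1]
print(vectors[1])  # [0, 0, 0, 1, 1, 1, 0, 1, 0]
```

Both vectors have the same length as the vocabulary, and "movies" counts twice in the first document regardless of where it appears.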
Designing the Vocabulary
When the vocabulary size increases, the length of each document's vector representation also increases.
In some cases we have a huge amount of data, and the vector that represents a document might have thousands or millions of elements. Furthermore, each document may contain only a few of the known words in the vocabulary.
Therefore, the vector representations will contain a lot of zeros. Vectors that consist mostly of zeros are called sparse vectors. Stored as ordinary dense arrays, they require more memory and computational resources.
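One way to see why sparsity matters is to store only the non-zero entries. The sketch below uses a plain dictionary for this; the helper names and the 10,000-word vocabulary size are hypothetical and chosen only for illustration.

```python
def to_sparse(dense):
    """Keep only the non-zero entries as {index: count}."""
    return {i: value for i, value in enumerate(dense) if value != 0}


def to_dense(sparse, length):
    """Rebuild the full vector from its non-zero entries."""
    vector = [0] * length
    for i, value in sparse.items():
        vector[i] = value
    return vector


# Hypothetical document vector over a 10,000-word vocabulary
# in which only two known words occur.
dense = [0] * 10_000
dense[42] = 2
dense[977] = 1

sparse = to_sparse(dense)
print(sparse)                  # {42: 2, 977: 1}
print(len(dense), len(sparse))  # 10000 2
```

Numerical libraries apply the same idea with dedicated sparse matrix types. For instance, scikit-learn's CountVectorizer returns a SciPy sparse matrix, which is why it has to be converted with toarray() before being displayed as a table.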
We can decrease the number of known words in a bag-of-words model to reduce the required memory and computational resources. The following text cleaning techniques can be applied before creating the bag-of-words model:
· Ignoring the case of the words
· Ignoring punctuation
· Removing the stop words from our documents
· Reducing the words to their base form (Text Lemmatization and Stemming)
· Fixing misspelled words
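A minimal sketch of the first three cleaning steps follows, assuming a tiny hand-written stop-word list. STOP_WORDS and the clean function are hypothetical names for this illustration; real stop-word lists are much longer, and stemming or spelling correction would be further steps that are not shown here.

```python
import string

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"to", "too", "the", "a", "an", "and"}


def clean(text):
    """Lowercase, strip punctuation, and drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [token for token in text.split() if token not in STOP_WORDS]


raw = "John likes to watch Movies, especially horror movies."

raw_vocab = set(raw.split())     # "Movies," and "movies." count as different words
clean_vocab = set(clean(raw))    # both collapse to "movies"; "to" is dropped

print(len(raw_vocab), len(clean_vocab))  # 8 6
print(clean(raw))
```

Even on a single sentence the vocabulary shrinks from eight entries to six, which directly shortens every document vector.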
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

documents = ["John likes to watch movies, especially horor movies.", "Mary likes movies too."]
# Design the vocabulary
count_vectorizer = CountVectorizer()
# Create the bag-of-words model
bag_of_words = count_vectorizer.fit_transform(documents)
# Convert into a pandas DataFrame (get_feature_names_out needs scikit-learn >= 1.0)
feature_names = count_vectorizer.get_feature_names_out()
pd.DataFrame(bag_of_words.toarray(), columns=feature_names)
In the context of text corpora, an n-gram is a contiguous sequence of words.
The ‘n’ in ‘n-gram’ refers to the number of words grouped together: a unigram is one word, a bigram is two words, and so on.
Consider this example: The quick brown fox jumped over the lazy dog.
- The bigrams would be: the quick, quick brown, brown fox, …, i.e., every two consecutive words (or tokens).
- The trigrams would be: the quick brown, quick brown fox, brown fox jumped, …, i.e., every three consecutive words.
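from nltk import ngrams
text = 'The quick brown fox jumped over the lazy dog'
# Print every bigram (n = 2) of the whitespace-separated tokens
print(list(ngrams(text.split(), 2)))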
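The same grouping can also be sketched without NLTK, which makes the sliding-window idea explicit. The make_ngrams function is a hypothetical helper written for this illustration, and it assumes simple whitespace tokenization.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "The quick brown fox jumped over the lazy dog".lower().split()

bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(bigrams[:3])   # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
print(trigrams[:3])  # [('the', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox', 'jumped')]
print(len(bigrams), len(trigrams))  # 8 7
```

A sentence of nine tokens yields eight bigrams and seven trigrams, since each window starts one position later than the previous one.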
The problem with scoring raw word frequency is that the most frequent words in a document get the highest scores. These frequent words may not carry as much information as rarer, domain-specific words.
The TF-IDF approach fixes this by penalizing words that are frequent across all the documents.
Term frequency-inverse document frequency (TF-IDF) is a statistical measure used to evaluate the importance of a word to a document in a collection or corpus.
Term Frequency (TF): a score of how frequently the word occurs in the current document.
Inverse Document Frequency (IDF): a score of how rare the word is across documents.
The TF-IDF score for a given term is the product of these two scores.
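Several variants of the formula exist. One common formulation, stated here as an assumption rather than as the exact definition any particular library uses, takes tf(t, d) as the frequency of term t in document d, N as the number of documents in the corpus, and df(t) as the number of documents that contain t:

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
```

A term that appears in every document has df(t) = N, so its idf is log 1 = 0 and its TF-IDF score vanishes, which is exactly the penalty described above.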
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

documents = ["John likes to watch movies, especially horor movies.", "Mary likes movies too."]
# TF-IDF model
tfidf_vectorizer = TfidfVectorizer()
values = tfidf_vectorizer.fit_transform(documents)
# Convert into a pandas DataFrame (get_feature_names was removed in scikit-learn 1.2)
feature_names = tfidf_vectorizer.get_feature_names_out()
pd.DataFrame(values.toarray(), columns=feature_names)
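To see where such scores come from, here is a minimal hand-rolled sketch that uses the plain variant tf × log(N / df), with tf taken as the term count divided by the document length. The function names and the simple tokenizer are assumptions for this illustration. TfidfVectorizer applies a smoothed IDF and L2 normalization by default, so its numbers will differ from these.

```python
import math
import re

documents = [
    "John likes to watch movies, especially horror movies.",
    "Mary likes movies too.",
]


def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


corpus = [tokenize(doc) for doc in documents]


def tf(term, tokens):
    """Share of the document's tokens that are this term."""
    return tokens.count(term) / len(tokens)


def idf(term, corpus):
    """log(N / df); assumes the term occurs in at least one document."""
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / df)


def tfidf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)


# "movies" is the most frequent word in document 0 but occurs in every document.
print(tfidf("movies", corpus[0], corpus))            # 0.0
# "horror" occurs once, and only in document 0.
print(round(tfidf("horror", corpus[0], corpus), 4))  # 0.0866
```

The most frequent word ends up with the lowest score because it does nothing to distinguish one document from the other.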
Word embedding is a language modeling technique used for mapping words to vectors of real numbers. It represents words or phrases in a vector space with several dimensions.
Word2Vec is one of the most popular techniques for learning word embeddings using a shallow neural network. It was developed in 2013 by a team led by Tomas Mikolov at Google.
Word2Vec consists of models for generating word embeddings. These models are shallow, two-layer neural networks with one input layer, one hidden layer, and one output layer.
Word2Vec uses two architectures:
- CBOW (Continuous Bag of Words)
- Skip-gram
- The CBOW model predicts the current word given the context words within a specific window.
- The input layer contains the context words and the output layer contains the current word.
- The hidden layer contains the number of dimensions in which we want to represent the current word present at the output layer.
- The skip-gram model predicts the surrounding context words given the current word.
- The input layer contains the current word and the output layer contains the context words.
- The hidden layer contains the number of dimensions in which we want to represent the current word present at the input layer.
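The difference between the two architectures is easiest to see in the training examples they are fed. The sketch below only generates (input, target) pairs from a window and does not train a network. The function names and the four-word sentence are hypothetical choices for this illustration.

```python
def context_window(tokens, i, window):
    """Words within `window` positions of index i, excluding tokens[i] itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_pairs(tokens, window):
    """CBOW: the context words are the input, the current word is the target."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_pairs(tokens, window):
    """Skip-gram: the current word is the input, each context word is a target."""
    return [(tokens[i], context)
            for i in range(len(tokens))
            for context in context_window(tokens, i, window)]


tokens = ["the", "quick", "brown", "fox"]

print(cbow_pairs(tokens, window=1))
# [(['quick'], 'the'), (['the', 'brown'], 'quick'), (['quick', 'fox'], 'brown'), (['brown'], 'fox')]

print(skipgram_pairs(tokens, window=1))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```

CBOW produces one example per word with the whole context as input, while skip-gram produces one example per (word, context word) combination.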
import re
import urllib.request

import bs4 as bs
import nltk
from nltk.corpus import stopwords
from gensim.models import Word2Vec

def pre_processing(url):
    """Extract text from the URL and perform tokenization and data cleaning."""
    scrapped_data = urllib.request.urlopen(url)
    article = scrapped_data.read()
    parsed_article = bs.BeautifulSoup(article, 'lxml')
    paragraphs = parsed_article.find_all('p')
    article_text = ""
    for p in paragraphs:
        article_text += p.text
    # Cleaning the text: lowercase, keep letters only, collapse whitespace
    processed_article = article_text.lower()
    processed_article = re.sub('[^a-zA-Z]', ' ', processed_article)
    processed_article = re.sub(r'\s+', ' ', processed_article)
    # Preparing the dataset
    all_sentences = nltk.sent_tokenize(processed_article)
    all_words = [nltk.word_tokenize(sent) for sent in all_sentences]
    # Removing stop words (the set is built once rather than once per word)
    stop_words = set(stopwords.words('english'))
    for i in range(len(all_words)):
        all_words[i] = [w for w in all_words[i] if w not in stop_words]
    return all_words

data = pre_processing('https://en.wikipedia.org/wiki/Machine_learning')

# Word2Vec (vector_size is the gensim >= 4.0 name for the older size parameter)
model1 = Word2Vec(data, min_count=2, vector_size=100, window=15)
similar = model1.wv.most_similar('ml')
for i in similar: