
With Python and TensorFlow, we can use padding to make sentence fragmentation and analysis easier at the NLP stage. So how?
NLP requires us to do a lot of similar things in a certain order. These become very practical thanks to the TensorFlow and Keras libraries, which can be easily integrated into a project. So the first thing will be to import them. In fact, our job is simply to use the existing APIs, because the wheel has already been invented!
NLP is a set of operations designed to help the machine encode a sentence structure meaningfully with the help of vectors. Starting out from these basics, we can get by with standard approaches for small-scale and similar jobs. However, as the data grows, the need for some patches inevitably emerges.
What do we mean?
In other words, once you get into the work, numbering sentences is indeed very easy when their basic expressions are similar. For example, take the following:
'I like this game'
'You like tennis play'
'i like as you'
Although there are 12 words in total in these 3 sentences, only 8 unique ones remain when we remove the duplicates. So it is easy to number them with a tokenizer in the NLP process, before turning them into vectors. Here is the code:
First, we need to import the TensorFlow libraries we will use.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
Then, as a continuation of our code, we assign our example above to a variable.
sentences = [
    'I like this game',
    'You like tennis play',
    'i like as you'
]
And now we start the numbering process with the tokenizer variable.
tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
Let's look at what this code did:
First of all, we have a few sample sentences whose words are known, or at least countable; it is obvious that we will number the words, and indeed we already counted them at the beginning. When working with larger data, however, num_words matters: since the goal is to optimize training time, care should be taken when choosing this number, because it caps how many of the most frequent words are actually used. Here we pass 100 to num_words simply as a comfortable default for an example this small.
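Before moving to the second observation, one detail about num_words is worth knowing: it does not shrink word_index itself, it only limits which numbers are kept when texts are later converted to sequences (we will meet texts_to_sequences shortly). A minimal sketch, with a deliberately tiny value just for illustration:
demo = Tokenizer(num_words = 3)
demo.fit_on_texts(['I like this game', 'You like tennis play', 'i like as you'])
print(len(demo.word_index))                           # 8 -- every word still gets a number
print(demo.texts_to_sequences(['I like this game']))  # [[2, 1]] -- only numbers below 3 survive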
In our second observation, fit_on_texts takes the data in the sentences variable and starts encoding its content.
Finally, word_index gives us a dictionary of key-value pairs, in which each unique word (the key) is assigned its own number (the value). So now let's see our result with print:
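print(word_index)
Output:
{'like': 1, 'i': 2, 'you': 3, 'this': 4, 'game': 5, 'tennis': 6, 'play': 7, 'as': 8}
(Keras orders the words by frequency, breaking ties by order of first appearance, which is why 'like' gets number 1.)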
We can see that each repeated word is assigned a single unique number. This is like extracting the unique items of a list with Python's set type, except that the dictionary comes in key & value format. Understanding this is important at the outset.
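To make the analogy concrete, here is a tiny sketch (the words list is just our three sentences flattened):
words = ['i', 'like', 'this', 'game', 'you', 'like', 'tennis', 'play', 'i', 'like', 'as', 'you']
print(set(words))            # the unique words, but unordered and unnumbered
print(tokenizer.word_index)  # the unique words, each mapped to its own number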
We are moving to the point where the patch is needed
What we simply wanted to show above is how quickly small systems can be built when sentences are similar. However, this NLP setup is still too small for real work.
So let's mix things up a little more. Our aim now is to create vector sentence structures out of the words we have expressed as numbers. For this, we can move on to the neural-network creation and training-data preparation phase with texts_to_sequences. Similar processes for fitting different sizes are also used when training models on images. Here we stand at an important point regarding understanding and using the right API.
Here again, let's proceed by extending the examples above. The first three sample sentences consisted of the same number of similar words. Now let's have a look at this:
Sentence to add: ‘Which game do you like really?’
And we can create the sequences variable as follows, letting the Tokenizer encode the string expressions in the text:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I like this game',
    'You like tennis play',
    'i like as you',
    'Which game do you like really?'
]

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)

print(word_index)
print(sequences)
Output:
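{'like': 1, 'you': 2, 'i': 3, 'game': 4, 'this': 5, 'tennis': 6, 'play': 7, 'as': 8, 'which': 9, 'do': 10, 'really': 11}
[[3, 1, 5, 4], [2, 1, 6, 7], [3, 1, 8, 2], [9, 4, 10, 2, 1, 11]]
Note that the numbers differ slightly from the three-sentence run: adding the fourth sentence changed the word frequencies, so the ordering changed with them.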
As you can see, using the words from word_index as keys and their numbers as values, our new sentence could be encoded exactly like the first 3 sentences. The words shared between the texts were recognized by the machine, coded, and arranged into arrays.
At this stage, we can say this is indispensable for any sentiment analysis study: if the words we want to extract from a text could not be identified by the same numbers and the same place in the sequence, everything would be much less meaningful.
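As a quick sanity check, we can decode a sequence back into words by reversing the dictionary. A small sketch, building the reverse mapping ourselves:
index_word = {value: key for key, value in word_index.items()}
print(' '.join(index_word[i] for i in sequences[3]))
Output:
which game do you like really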
Now let's run some new test data against the numbered words:
test = [
    'i really like this game',
    'your husband likes this tennis'
]

test_sequences = tokenizer.texts_to_sequences(test)
print(test_sequences)
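Since this tokenizer was fitted without any special handling for unknown words, words it has never seen ('your', 'husband', 'likes') are simply skipped, so the output should look like this:
[[3, 11, 1, 5, 4], [5, 6]]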
We have enough numbered words for the first sentence, but we cannot say the same for the second sentence. Well, here’s our question about this:
- “Are we going to take the second sentence in this form or ignore it?”
Yes, we will ignore something, but we will artificially fill the gaps. Whenever we encounter a value our system has not seen, instead of leaving it blank, we will insert a special value in its place.
So what is '<OOV>' among the parameters? A special token performs this function; it is specified with oov_token="<OOV>" in the parameters of the Tokenizer.
tokenizer = Tokenizer(num_words = 100, oov_token = "<OOV>")
tokenizer.fit_on_texts(sentences)
Thus, our tokenizer variable has gained its new feature; note that it has to be fitted on the sentences again, which is why fit_on_texts appears once more above. The same test then gives this output:
test_sequences = tokenizer.texts_to_sequences(test)
print(test_sequences)
Output:
[[4, 12, 2, 6, 5], [1, 1, 1, 6, 7]]
As you can see, '<OOV>' was assigned the number 1, and it shifted all the other existing numbers up as well.
As a continuation of this process, padding is the tool that fills the empty spaces in the arrays, and it is used to create vectors of the same size. For this, it must first be imported from the library:
from tensorflow.keras.preprocessing.sequence import pad_sequences
Now let's take a look at how the 4 sentences from the example at the beginning are displayed:
sentences = [
    'I like this game',
    'You like tennis play',
    'i like as you',
    'Which game do you like really?'
]

# re-encode with the OOV-aware tokenizer we fitted above
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences)
print(word_index)
print(padded)
Output:
{'<OOV>': 1, 'like': 2, 'you': 3, 'i': 4, 'game': 5, 'this': 6, 'tennis': 7, 'play': 8, 'as': 9, 'which': 10, 'do': 11, 'really': 12}

[[ 0  0  4  2  6  5]    # i like this game
 [ 0  0  3  2  7  8]    # you like tennis play
 [ 0  0  4  2  9  3]    # i like as you
 [10  5 11  3  2 12]]   # which game do you like really
As a result, we reached our goal: the necessary assignments were made to express the sentences in the list numerically, and we were able to create a matrix in which every row has the same length.
Of course thanks to padding!
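One last note: by default, pad_sequences puts the zeros in front and pads up to the longest sentence, as we saw. A few parameters let us control this; a minimal sketch, with values chosen purely for illustration:
# padding='post' puts the zeros after the sentence instead of before,
# maxlen caps the row length, and truncating decides which end gets cut
padded_post = pad_sequences(sequences, padding='post', maxlen=5, truncating='post')
print(padded_post)
Output:
[[ 4  2  6  5  0]
 [ 3  2  7  8  0]
 [ 4  2  9  3  0]
 [10  5 11  3  2]]
Which end you pad or truncate can matter for the model that will consume these vectors, so it is worth choosing consciously rather than relying on the defaults.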