Explaining the working of regular expressions in Python
When writing regular expressions (regex) in Python, we usually prefix the pattern string with the letter r. In this tutorial, we will understand the reason behind using it by answering the following questions:
- What are escape sequences?
- How does the Python interpreter interpret escape sequences with or without the letter r?
- How do regular expressions work in Python?
- Why is the letter r important in regular expressions?
An escape sequence is a sequence of characters that does not represent itself when used in a text definition. It gets translated to some other character or sequence of characters that is otherwise difficult to write in a programming language. For example, in Python, the sequence \n represents a new line, and \t represents a tab. Both \n and \t are escape sequences.
The list of standard escape sequences understood by the Python interpreter and their associated meanings are as follows:
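As a quick check, a few of the standard escape sequences from the Python language reference can be inspected directly. This is only a minimal sketch covering three of them; the variable names are illustrative and not part of the original tutorial.

```python
# Each escape sequence is typed as two characters in the source code,
# but the Python interpreter stores it as a single character.
newline = "\n"     # line feed
tab = "\t"         # horizontal tab
backslash = "\\"   # one literal backslash

for name, ch in [("newline", newline), ("tab", tab), ("backslash", backslash)]:
    # len() shows the stored length, ord() shows the character's code point
    print(name, len(ch), ord(ch))
```

Each line of output reports a length of 1, which shows that the two typed characters have been translated into one.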
To understand how the Python interpreter treats escape sequences, let us have a look at the following example:
#### Sample Text Definition
text_1 = "My name is Ujjwal Dalmia.\nI love learning and teaching the Python language"
print(text_1)
#### Sample Output
My name is Ujjwal Dalmia.
I love learning and teaching the Python language
#### Sample Text Definition
text_2 = "My name is Ujjwal Dalmia.\sI love learning and teaching the Python language"
print(text_2)
#### Sample Output
My name is Ujjwal Dalmia.\sI love learning and teaching the Python language
In text_1 above, the example uses the \n character set, whereas text_2 uses \s. From the escape sequences table shared in section 1, we can see that \n is part of the standard escape sequence set in Python, whereas \s is not. Therefore, when we print both variables, the escape sequence \n is interpreted as a new line character by the Python interpreter, whereas \s is left as it is. (Recent Python versions also emit a warning for unrecognized escape sequences such as \s, but the resulting string is the same.) Note that the definitions of both text_1 and text_2 do not include the letter r.
Let us take a step further and include the letter r in the text definition.
#### Sample Text Definition (with letter "r")
text_1 = r"My name is Ujjwal Dalmia.\nI love learning and teaching the Python language"
print(text_1)
#### Sample Output
My name is Ujjwal Dalmia.\nI love learning and teaching the Python language
#### Sample Text Definition (with letter "r")
text_2 = r"My name is Ujjwal Dalmia.\sI love learning and teaching the Python language"
print(text_2)
#### Sample Output
My name is Ujjwal Dalmia.\sI love learning and teaching the Python language
The inclusion of the letter r had no impact on text_2 because \s is not part of the standard escape sequence set in Python. For text_1, however, the Python interpreter did not convert \n into the new line character. This is because the presence of the letter r has turned the text into a raw string. In simple terms, the letter r has instructed the Python interpreter to leave the escape sequence as it is.
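The difference between a normal string and a raw string can also be verified by comparing their contents. The snippet below is a small sketch under the assumption that a shortened version of the sample text is enough to make the point; the names normal and raw are illustrative.

```python
normal = "Dalmia.\nI love"   # \n becomes one new line character
raw = r"Dalmia.\nI love"     # \n stays as backslash + n

# The raw string is one character longer, because it keeps both
# the backslash and the letter n.
print(len(normal), len(raw))

print("\n" in normal)              # True: a real new line is present
print("\n" in raw)                 # False: no new line character here
print(raw == "Dalmia.\\nI love")   # True: same as escaping the backslash by hand
```

The last line shows that a raw string is only a more convenient way of writing the same characters; it is not a different type of string.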
To understand how regular expressions work in Python, we will use the sub() function from the re package, which replaces the parts of the old text that match a regular expression pattern with the new text. Let us understand this with an example:
#### Importing the re package
import re
#### Using the sub function
re.sub("\ts","s", "\tsing")
#### Sample Output
'sing'
In this example, we are trying to replace the letter s preceded by a tab with the standalone letter s. One can see from the output that the text \tsing converts to sing. Let us refer to the below flow chart to understand how the sub() function produced the desired result. In the flow chart, we refer to \ts as the regex, the letter s as the new text, and \tsing as the old text.
Explanation
In the previous example, we used the character set \t, which is part of the standard escape list in Python. Therefore, in the first step, the Python interpreter replaced the escape sequence with a tab in both the regex and the old text. Since the regex pattern matched the input text in the last step, the substitution took place.
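The steps described above can be traced in code. This is a minimal sketch; the variable names pattern, old_text, and result are illustrative and simply mirror the terms used in the explanation.

```python
import re

pattern = "\ts"      # step 1: Python turns \t into one tab character
old_text = "\tsing"  # the same happens in the old text

print(len(pattern))             # 2: a tab followed by s
print(pattern[0] == chr(9))     # True: the first character is a real tab

# The regex engine matches the tab literally, so "tab + s" becomes "s".
result = re.sub(pattern, "s", old_text)
print(result)                   # sing
```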
In the next example, we will use a different character set, \s, that is not a part of the standard escape list in Python.
#### Importing the re package
import re
#### Using the sub function (this time with a non-standard escape sequence)
re.sub("\ss","s", "\ssing")
#### Sample Output
"\ssing"
In this example, we are trying to replace any instance of the letter s preceded by \s with the standalone letter s. It is evident that there was no change in the input text, and the output remained the same as the old text. Again, in the flow chart, we refer to \ss as the regex, s as the new text, and \ssing as the old text. Let us understand the reason behind this behavior from the below flowchart:
Explanation
In step 1, since \s is not a standard escape sequence, the Python interpreter modified neither the regular expression nor the old text and left them as they are. In step 2, since \s is a regex metacharacter that matches a whitespace character, the regex interpreter reads \ss as a whitespace character followed by s. Because the old text contains no whitespace followed by s, there was no match, and hence the old text remained the same.
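The following sketch reproduces this behavior. As an assumption made only for this snippet, the backslashes are written as an explicit "\\" so that no warning about an unrecognized escape sequence is raised; the resulting strings are the same as the ones in the example above.

```python
import re

# "\\ss" holds the three characters backslash, s, s, which is also what
# Python keeps when it leaves the unrecognized escape \s untouched.
pattern = "\\ss"
old_text = "\\ssing"

# The regex engine reads \s as "any whitespace character", so the pattern
# means "whitespace followed by s". The old text has no whitespace.
unchanged = re.sub(pattern, "s", old_text)
print(unchanged == old_text)    # True: nothing was replaced

# The same pattern does match when the text really starts with a space.
changed = re.sub(pattern, "s", " sing")
print(changed)                  # sing
```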
The two learnings we can draw from this section are:
- Escape sequences in the old text are evaluated only by the Python interpreter, whereas the regular expression is evaluated first by the Python interpreter and then by the regex interpreter. Therefore, for the old text, the outcome of step 1 is its final version, and for the regex, it is the outcome of step 2. The new text is also handled by the Python interpreter first; in these examples it contains no backslashes, so step 1 gives its final version too.
- When the texts and the regex pattern contain only standard Python escape sequences, we get our desired results. When the pattern also contains regex metacharacters, the results might not be as expected.
From the 2nd example of the previous section, we saw that the regex failed to deliver the expected result. To find the right solution, let us work our way backward.
Explanation
To substitute \ss in the old text with the letter s, we expect the regex pattern at step 3 to match the text we want to replace.
To achieve this, we need the regex pattern to be \\ss when it reaches step 2. When the regex interpreter encounters this pattern, it will convert the double backslash to a single literal backslash, and the output of step 2 will be \ss.
Finally, to ensure that the regex entering step 2 is \\ss, we pass \\\\ss at step 1. This works because a double backslash is a standard escape sequence in Python and, as per the table in section 1, the Python interpreter converts each double backslash to a single one, turning \\\\ss into \\ss.
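The intermediate results of this backward reasoning can be inspected one step at a time. This is a minimal sketch; pattern and target are illustrative names, and re.fullmatch is used here only to confirm what the pattern matches.

```python
import re

pattern = "\\\\ss"      # what we type in the source code
print(pattern)          # step 1 result handed to the regex engine: \\ss
print(len(pattern))     # 4 characters: two backslashes, then s, s

# Step 2: the regex engine reads \\ as one literal backslash, so the
# pattern matches exactly the three characters backslash, s, s.
target = "\\ss"         # backslash, s, s (written with an explicit \\)
print(re.fullmatch(pattern, target) is not None)   # True
```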
Therefore, the solution code to the problem mentioned above is as follows:
#### Importing the re package
import re
#### Using the sub function with the modified regex
re.sub("\\\\ss","s", "\ssing")
#### Sample output
'sing'
We now have the solution to our problem, but the question that remains is: where did we use the letter r? The regular expression arrived at in the previous discussion is a candidate solution. For simple regex requirements, one can work with the above approach, but consider a scenario where the regular expression requires multiple metacharacters and standard escape sequences. It would require us:
- To first differentiate between standard and non-standard escape sequences
- Then, appropriately place the right number of backslashes every time we encounter an escape sequence or metacharacters.
In such cumbersome scenarios, taking the below approach helps:
The only change we have made here is to replace the four backslashes with two, preceded by the letter r. This ensures that in step 1, the Python interpreter treats the regular expression as a raw string and leaves it as it is. Converting the regex to a raw string ensures the following:
- We are free from the worry of remembering the list of Python standard escape sequences.
- We do not have to double the backslashes again for Python; we write only the backslashes that the regex interpreter itself needs.
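That the raw string and the four-backslash version describe the same pattern can be checked directly. This is a short sketch with illustrative variable names.

```python
plain = "\\\\ss"   # four backslashes, halved by the Python interpreter
raw = r"\\ss"      # two backslashes, left untouched because of the r

print(plain == raw)   # True: the regex engine receives the same pattern
print(raw)            # \\ss
```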
Given the above, our final and most appropriate solution is as follows:
#### Importing the re package
import re
#### Using the sub function with the modified regex
re.sub(r"\\ss","s", "\ssing")
#### Sample Output
'sing'
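The benefit of the letter r is easiest to see when a sequence means one thing to Python and another to the regex interpreter. The example below is not from the original tutorial; it is a sketch using \b, which is the backspace character as a standard Python escape sequence but a word boundary as a regex metacharacter. The sample text is made up for illustration.

```python
import re

text = "cat concat cat"

# Without r, Python converts \b to a backspace character in step 1, so the
# regex engine searches for backspace + cat + backspace and finds nothing.
without_r = re.findall("\bcat\b", text)
print(without_r)   # []

# With r, the regex engine receives \b unchanged and treats it as a word
# boundary, so only the standalone words are found.
with_r = re.findall(r"\bcat\b", text)
print(with_r)      # ['cat', 'cat']
```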
Watch out for this letter r whenever writing your next regular expression. I hope that this tutorial gave you a good insight into the working of the regular expression.
HAPPY LEARNING ! ! ! !