By Sivasai Yadav Mudugandla
Most useful string methods for a Data Scientist during data preprocessing.
In simple words, Machine Learning is training/teaching algorithms with historical data to predict output on unseen data. Much of the time, that data comes in the form of text. When working with text data, it helps to be familiar with Python's built-in string methods to make life easier.
In this post, I’ll talk about some of the string methods that I personally found very useful while handling text data.
split() breaks the string into substrings at each occurrence of a separator. It returns a list of the resulting words.
similar functions : rsplit()
Syntax
str.split(sep=None, maxsplit=-1)
- sep – separator used to break the string into words, uses white space as the default separator
- maxsplit – maximum number of splits to perform (the list will have at most maxsplit+1 elements); defaults to -1 (no limit on the number of splits) if not specified.
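As a quick illustration of sep and maxsplit, here is a minimal sketch. The sample record is made up for this example and is not taken from any real dataset.

```python
# Hypothetical comma-separated record, invented for illustration.
line = "alice,30,london,engineer"

# Split on every comma.
fields = line.split(",")          # ['alice', '30', 'london', 'engineer']

# Limit to 2 splits, so the list has at most 3 elements.
first_two = line.split(",", 2)    # ['alice', '30', 'london,engineer']

# rsplit() counts the splits from the right-hand side instead.
last_one = line.rsplit(",", 1)    # ['alice,30,london', 'engineer']

# With no separator, runs of whitespace are treated as a single separator
# and leading/trailing whitespace is ignored.
words = "  hello   world ".split()  # ['hello', 'world']
```

Note that split(",") and split() behave differently on repeated separators: only the no-argument form collapses consecutive whitespace.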
strip() removes leading (beginning) and trailing (ending) whitespace from the string, including spaces, tabs and newlines.
similar functions : rstrip() & lstrip()
Syntax
str.strip([chars])
- chars – set of characters to be removed from both ends; by default, whitespace is removed.
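A small sketch of strip() and its siblings follows. The raw values are invented for illustration. One point worth seeing in code is that chars is treated as a set of characters, not as a prefix or suffix.

```python
# Hypothetical messy value, as it might come from a scraped file.
raw = "  \tNew York\n"

clean = raw.strip()    # 'New York'       (both ends)
left = raw.lstrip()    # 'New York\n'     (leading only)
right = raw.rstrip()   # '  \tNew York'   (trailing only)

# Passing chars removes any of those characters from both ends.
price = "$$19.99$".strip("$")    # '19.99'

# chars is a set: every leading/trailing 'x' or 'y' is removed,
# in any order, until a character outside the set is reached.
mixed = "xxhixyx".strip("xy")    # 'hi'
```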
replace() returns a copy of the string in which occurrences of the old substring are replaced with the new one.
Syntax
str.replace(old, new[, count])
- old – old substring to look for
- new – new substring to replace the old substring with
- count – maximum number of occurrences of the old substring to replace; if omitted, all occurrences are replaced.
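The effect of count is easiest to see side by side. This is a minimal sketch, assuming a made-up string in which "N/A" marks missing values.

```python
# Hypothetical raw text with a placeholder for missing values.
text = "N/A, 12, N/A, 7, N/A"

# Without count, every occurrence is replaced.
all_replaced = text.replace("N/A", "0")     # '0, 12, 0, 7, 0'

# With count=1, only the first occurrence is replaced.
first_only = text.replace("N/A", "0", 1)    # '0, 12, N/A, 7, N/A'

# Replacing with an empty string deletes the substring,
# e.g. to drop thousands separators before converting to int.
no_commas = "1,234,567".replace(",", "")    # '1234567'
as_number = int(no_commas)                  # 1234567
```

Strings are immutable, so replace() always returns a new string and leaves text unchanged.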
join() is used to concatenate/join the strings in an iterable with a string separator.
Syntax
str.join(iterable)
- iterable – any iterable of strings, such as a list, tuple or string.
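Here is a short sketch of join() with a few separators. The token lists are invented for illustration. The string that join() is called on is the separator, and every item of the iterable must already be a string.

```python
# Hypothetical tokens, e.g. the output of an earlier split().
tokens = ["machine", "learning", "rocks"]

sentence = " ".join(tokens)           # 'machine learning rocks'
csv_row = ",".join(["a", "b", "c"])   # 'a,b,c'

# A string is itself an iterable of characters.
spaced = "-".join("abc")              # 'a-b-c'

# Non-string items raise TypeError, so convert them first.
numbers = [1, 2, 3]
joined_numbers = ", ".join(str(n) for n in numbers)   # '1, 2, 3'
```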
lower() converts all the characters of a string to lowercase.
similar functions : upper()
Syntax
str.lower()
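A typical preprocessing use is normalizing category labels that differ only in case. The labels below are hypothetical.

```python
# Hypothetical categorical values with inconsistent casing.
labels = ["Yes", "YES", "yes", "No"]

# Lowercasing collapses the variants into one spelling each.
normalized = [label.lower() for label in labels]   # ['yes', 'yes', 'yes', 'no']
unique = sorted(set(normalized))                   # ['no', 'yes']

# upper() is the mirror image.
shout = "warning".upper()                          # 'WARNING'
```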
count() returns the number of non-overlapping occurrences of a substring in a string.
Syntax
str.count(sub[, start[, end]])
- sub – substring to search for
- start – starting index to search the substring in the given string, default index is 0.
- end – index at which to stop searching (exclusive), default is the end of the string.
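The sketch below shows count() with and without the optional range. The review text is made up for illustration.

```python
# Hypothetical review text.
review = "good food, good service, not so good parking"

total = review.count("good")               # 3

# Start searching at index 5, which skips the first "good".
after_first = review.count("good", 5)      # 2

# Search only within review[0:10], i.e. "good food,".
in_prefix = review.count("good", 0, 10)    # 1

# Occurrences are counted without overlap.
pairs = "aaaa".count("aa")                 # 2
```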
isdigit() returns True if the string is non-empty and all of its characters are digits; it returns False if at least one character is not a digit or if the string is empty.
NOTE: isdigit() is very useful in ML preprocessing to check whether the values in a DataFrame column consist only of digits. Sometimes, you may find special characters (' ', ?, etc.) in place of values.
Syntax
str.isdigit()
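As a rough sketch of that kind of check, the plain-Python example below flags entries that are not purely digits. The list stands in for a hypothetical column of string values. It also shows that decimal points and minus signs make isdigit() return False.

```python
# Hypothetical column values read in as strings.
values = ["42", "17", "?", " ", "3.5", "-8", ""]

flags = [v.isdigit() for v in values]
# [True, True, False, False, False, False, False]

# Keep only the entries that are safe to pass to int().
clean = [int(v) for v in values if v.isdigit()]       # [42, 17]

# Collect the suspicious entries for inspection.
suspicious = [v for v in values if not v.isdigit()]   # ['?', ' ', '3.5', '-8', '']
```

Because "3.5" and "-8" are rejected, this check suits columns of non-negative integers; floats and signed numbers need a different test.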
casefold() is used for caseless matching. It is similar to lower(), but more aggressive because it is intended to remove all case distinctions in a string.
Syntax
str.casefold()
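The difference from lower() shows up with characters such as the German "ß". A minimal sketch, using two spellings of the same word chosen for illustration:

```python
# Two spellings of the same German word.
german = "Straße"
upper_variant = "STRASSE"

lowered = german.lower()      # 'straße'  (ß is left as it is)
folded = german.casefold()    # 'strasse' (ß is folded to ss)

# lower() fails to match the two spellings; casefold() succeeds.
match_with_lower = german.lower() == upper_variant.lower()            # False
match_with_casefold = german.casefold() == upper_variant.casefold()   # True
```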
find() returns the position/index of the first occurrence of the specified substring in the given string; it returns -1 if the substring is not found.
similar functions : rfind()
Syntax
str.find(sub[, start[, end]])
- sub – substring to search for in the given string
- start, end – range(starting & ending index) to search the substring within.
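A brief sketch of find() and rfind() on a made-up email address, including the -1 result that should be checked before using the index for slicing:

```python
# Hypothetical email address, invented for illustration.
email = "user.name@example.com"

at = email.find("@")              # 9
first_dot = email.find(".")       # 4
last_dot = email.rfind(".")       # 17  (searches from the right)
dot_after_5 = email.find(".", 5)  # 17  (search starts at index 5)
missing = email.find("#")         # -1  (not found)

# Check for -1 before slicing, since -1 is also a valid index.
domain = email[at + 1:] if at != -1 else ""   # 'example.com'
```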