9 Most Useful String Methods for a Data Scientist

Most useful string methods for a Data Scientist during data preprocessing.

In simple terms, Machine Learning means training algorithms on historical data so they can predict outputs for unseen data. Very often, that data comes in the form of text. When working with text data, it helps to be familiar with Python's built-in string methods.

In this post, I’ll talk about some of the string methods that I personally found very useful while handling text data.

split() breaks a string into pieces wherever the given separator occurs and returns those pieces as a list.

similar functions : rsplit()

Syntax

str.split(sep=None, maxsplit=-1)

sep – separator used to break the string apart; if omitted (or None), runs of consecutive whitespace act as a single separator and leading/trailing whitespace is ignored.
maxsplit – maximum number of splits to perform (the list will have at most maxsplit+1 elements); the default of -1 means no limit on the number of splits.
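
As a quick illustration of the parameters above, here is a minimal sketch. The sample strings are made up for the example.

```python
# A made-up comma-separated record, used only for illustration
line = "id,name,city,country"

fields = line.split(",")          # split on every comma
first_split = line.split(",", 1)  # at most one split, counted from the left
last_split = line.rsplit(",", 1)  # at most one split, counted from the right

# With no separator, runs of whitespace count as one separator
words = "  a  b   c ".split()

print(fields)       # ['id', 'name', 'city', 'country']
print(first_split)  # ['id', 'name,city,country']
print(last_split)   # ['id,name,city', 'country']
print(words)        # ['a', 'b', 'c']
```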

strip() removes leading (beginning) and trailing (ending) whitespace from the string. Characters in the middle of the string are left untouched.

similar functions : rstrip() & lstrip()

Syntax

str.strip([chars])

chars – set of characters to be removed from both ends (treated as individual characters, not as a prefix or suffix); by default, whitespace is removed.
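
A small sketch of how this might look when cleaning raw values. The inputs are invented for the example.

```python
raw = "  \tNew York\n"
clean = raw.strip()             # drops spaces, tabs and newlines at both ends

price = "$$19.99$".strip("$")   # chars is a set of characters to trim

left_only = "  x  ".lstrip()    # trims the left side only
right_only = "  x  ".rstrip()   # trims the right side only

print(repr(clean))       # 'New York'
print(repr(price))       # '19.99'
print(repr(left_only))   # 'x  '
print(repr(right_only))  # '  x'
```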

replace() returns a copy of the string in which every occurrence of the old substring is replaced with the new one.

Syntax

str.replace(old, new[, count])

old – old substring to look for
new – new substring to replace the old substring with
count – maximum number of occurrences of the old substring to replace, starting from the left; if omitted, all occurrences are replaced.
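
For example, a rough sketch of stripping thousands separators from a numeric string (the value is made up):

```python
text = "1,234,567"

no_commas = text.replace(",", "")      # replace every comma
first_only = text.replace(",", "", 1)  # replace only the first comma

print(no_commas)       # 1234567
print(first_only)      # 1234,567
print(int(no_commas))  # 1234567
```

Strings are immutable, so replace() returns a new string and leaves text unchanged.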

join() is used to concatenate/join the strings in an iterable with a string separator.

Syntax

str.join(iterable)

iterable – any iterable of strings, such as a list, tuple or string; a TypeError is raised if it contains non-string values.
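
A minimal sketch with invented tokens. Note that the separator is the string the method is called on.

```python
tokens = ["data", "science", "rocks"]
sentence = " ".join(tokens)

# Non-string values must be converted first, otherwise join() raises TypeError
csv_row = ",".join(str(v) for v in [1, 2.5, "x"])

print(sentence)  # data science rocks
print(csv_row)   # 1,2.5,x
```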

lower() converts all the characters of a string to lowercase.

similar functions : upper()

Syntax

str.lower()
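
One place this tends to help is normalizing category labels that differ only in case. A small sketch with made-up labels:

```python
labels = ["Yes", "YES", "yes", "No"]

normalized = [s.lower() for s in labels]
unique = sorted(set(normalized))  # duplicates collapse after lowercasing

print(normalized)     # ['yes', 'yes', 'yes', 'no']
print(unique)         # ['no', 'yes']
print("usa".upper())  # USA
```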

count() returns the number of non-overlapping occurrences of a substring in a string.

Syntax

str.count(sub[, start[, end]])

sub – substring to search for
start – index at which the search begins; the default is 0.
end – index at which the search stops (exclusive); the default is the end of the string.
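
A short sketch, using an invented review sentence:

```python
review = "good food, good service, not so good parking"

total = review.count("good")         # search the whole string
after_10 = review.count("good", 10)  # start searching at index 10

# Occurrences are counted without overlapping
overlap = "aaaa".count("aa")

print(total)     # 3
print(after_10)  # 2
print(overlap)   # 2
```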

isdigit() returns True if all the characters in the given string are digits and the string is not empty; it returns False if at least one character is not a digit.

NOTE: isdigit() is very useful in ML preprocessing to check whether the values in a DataFrame column are numeric. Sometimes you will find placeholder characters (' ', '?', etc.) in place of values. Keep in mind that signs and decimal points are not digits, so strings such as '-5' and '3.14' also return False.

Syntax

str.isdigit()
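
A minimal sketch of filtering a column of raw values. The list below stands in for a column and is made up for the example.

```python
values = ["42", "7", "?", " ", "3.14", "-5", ""]

flags = [v.isdigit() for v in values]
clean = [int(v) for v in values if v.isdigit()]  # keep only pure-digit strings

print(flags)  # [True, True, False, False, False, False, False]
print(clean)  # [42, 7]
```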

casefold() is used for caseless matching. It is similar to lower(), but more aggressive because it is intended to remove all case distinctions in a string.

Syntax

str.casefold()
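
The difference from lower() shows up with characters such as the German 'ß', which casefold() converts to 'ss'. A small sketch:

```python
a = "Straße"
b = "STRASSE"

same_lower = a.lower() == b.lower()        # 'straße' vs 'strasse'
same_fold = a.casefold() == b.casefold()   # both become 'strasse'

print(same_lower)  # False
print(same_fold)   # True
```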

find() returns the position/index of the first occurrence of the specified substring in the given string, returns -1 if the substring is not found.

similar functions : rfind()

Syntax

str.find(sub[, start[, end]])

sub – substring to search for in the given string
start, end – range (starting and ending index) within which to search for the substring; the returned index is always relative to the full string.
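
As a rough example, find() can be used to locate a delimiter and slice around it. The email address is invented.

```python
email = "jane.doe@example.com"

at = email.find("@")           # index of the first '@'
domain = email[at + 1:]        # everything after it
missing = email.find("#")      # -1 because '#' does not occur
last_dot = email.rfind(".")    # search from the right
next_dot = email.find(".", 5)  # first '.' at or after index 5

print(at)        # 8
print(domain)    # example.com
print(missing)   # -1
print(last_dot)  # 16
print(next_dot)  # 16
```

Because -1 is also a valid index in Python, it is worth checking the result before slicing with it.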
