When working with real data, the biggest problems usually lie in data pre-processing, and record matching is one of the most common challenges analysts face. For instance, George Washington and G Washington obviously refer to one person, the first President of the United States, yet to a computer they are two different strings: we are dealing with duplicate data. Luckily, researchers have developed probabilistic data matching algorithms, better known as fuzzy matching.
Research reveals that 94% of businesses admit to having duplicate data, and the majority of these duplicates are non-exact matches and therefore usually remain undetected.
— Data Ladder
What is fuzzy string matching?
Probabilistic data matching, often referred to as fuzzy string matching, is a computational technique for finding strings that match a pattern approximately rather than exactly. The word probabilistic indicates that the output is a probability (a value between 0 and 1, or a percentage of similarity) rather than a binary match/no-match answer.
There are many ways to perform fuzzy string matching, but the performance of most of them, Levenshtein distance for instance, takes a big hit as the size of the data increases. The reason is that each record is compared against every other record in the data, so as the data grows, the time taken to perform the matching grows quadratically. This is known as quadratic time complexity: the running time of the algorithm is proportional to the square of the size of the input data.
“As the data grows, so does the need for speed”
How do TF-IDF and KNN accelerate the computation time?
Term Frequency — Inverse Document Frequency (TF-IDF) is a Natural Language Processing (NLP) technique for measuring how important a word is to a document in a collection of documents. For example, how important is the word “flower” to each document in the collection? The TF-IDF value increases proportionally with the number of times a word appears in a document (the term frequency) and decreases with the number of documents that contain the word (the inverse document frequency). The IDF portion accounts for the fact that some words are common in general; the word “the”, for instance, carries almost no information.
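To make the intuition concrete, here is a small sketch using scikit-learn’s TfidfVectorizer (the toy documents are invented for illustration): a word that appears in every document, like “the”, ends up with a lower weight than a rarer word like “flower”.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three toy documents: "the" occurs in all of them, "flower" in two, "garden" in one
docs = ['the flower garden', 'the red flower', 'the old house']

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs).toarray()

# Weights of each term inside the first document
idx = vectorizer.vocabulary_
w = tfidf[0]

# "the" (in every document) is weighted lower than "flower" (in two documents),
# which in turn is weighted lower than "garden" (in only one document)
print(w[idx['the']] < w[idx['flower']] < w[idx['garden']])  # → True
```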
For fuzzy matching, TF-IDF is applied to n-grams of characters rather than to whole words, so that misspellings and typos still produce overlapping n-grams. For instance, the word independence is chunked into the following forms depending on the value of n.
1-gram
['independence']
2-gram
['in','nd','de','ep','pe','en','nd','de','en','nc','ce']
3-gram
['ind','nde','dep','epe','pen','end','nde','den','enc','nce']
These n-grams are created from the list of strings that will be used for string matching.
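The chunking above can be sketched in a few lines of Python (the helper name ngrams is my own):

```python
def ngrams(string, n=3):
    """Split a string into overlapping character n-grams."""
    string = string.lower()
    return [string[i:i + n] for i in range(len(string) - n + 1)]

print(ngrams('independence', n=2))
# → ['in', 'nd', 'de', 'ep', 'pe', 'en', 'nd', 'de', 'en', 'nc', 'ce']
print(ngrams('independence', n=3))
# → ['ind', 'nde', 'dep', 'epe', 'pen', 'end', 'nde', 'den', 'enc', 'nce']
```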
The idea is that TF-IDF produces a vector representation of each string, and the most similar candidate is then selected using K-Nearest Neighbors with cosine similarity instead of Levenshtein distance (which counts insertions, deletions, and substitutions).
Cosine similarity is a measure of similarity that can be used to compare documents or to rank documents with respect to a given vector of query words. Let x and y be two vectors to compare. Using the cosine measure as a similarity function, we have

sim(x, y) = (x · y) / (||x|| ||y||)

where x · y is the dot product of the two vectors and ||x|| is the Euclidean norm of x. For the non-negative TF-IDF vectors used here, the result ranges from 0 (no similarity) to 1 (identical).
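As a quick sanity check, the measure can be computed directly with NumPy (a minimal sketch; this helper is for illustration, not part of the matching pipeline itself):

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between vectors x and y: x.y / (||x|| * ||y||)."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

print(cosine_similarity([1, 1, 0], [1, 1, 0]))  # identical vectors → 1.0
print(cosine_similarity([1, 0, 0], [0, 1, 0]))  # orthogonal vectors → 0.0
```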
After that, the best-matched candidate is mapped back to the clean or master data, together with its similarity score against the master record.
Let’s practice using Python
For this example, I will demonstrate the TF-IDF fuzzy string matching approach by matching Room Type from the Room Type Kaggle data. Honestly, I chose this data because (1) it is quite small, with only 103 rows, and (2) its columns represent a messy string and a master string to compare against. In real cases, the two data sets (messy and clean) can come from different sources. First, run the commands below to create a fuzzy string matching function using TF-IDF and KNN.
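Here is a minimal sketch of what such a fuzzy_tf_idf function could look like, built on scikit-learn’s TfidfVectorizer and NearestNeighbors; the implementation details (3-grams, the extra Similarity column) are my own choices, and mapping_df is accepted only to mirror the call signature used below:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

def ngrams(string, n=3):
    """Split a string into overlapping character n-grams."""
    string = str(string).lower()
    return [string[i:i + n] for i in range(len(string) - n + 1)]

def fuzzy_tf_idf(df, column, clean, mapping_df, col):
    """Match the messy strings in df[column] against the clean master list
    using TF-IDF vectors of character 3-grams and a 1-nearest-neighbor
    search with the cosine metric."""
    master = clean.drop_duplicates().reset_index(drop=True)
    # Vectorize the master list with character n-gram TF-IDF
    vectorizer = TfidfVectorizer(analyzer=ngrams)
    tfidf = vectorizer.fit_transform(master)
    # Fit 1-NN on the master vectors; cosine distance = 1 - cosine similarity
    nn = NearestNeighbors(n_neighbors=1, metric='cosine').fit(tfidf)
    distances, indices = nn.kneighbors(vectorizer.transform(df[column].astype(str)))
    # Attach the best master match and its similarity score
    result = df.copy()
    result[col] = master.iloc[indices.flatten()].values
    result['Similarity'] = (1 - distances.flatten()).round(3)
    return result
```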
Store the Room Type data in your working folder so that it can be loaded into Jupyter easily.
# Import modules for timing and data handling
import time
import pandas as pd

# Load the data
df = pd.read_csv('room_type.csv')

# Check the dimension of data
print('Dimension of data: {} rows and {} columns'.format(len(df), len(df.columns)))

# Print the first 5 rows
df.head()
In this case, Expedia will be the messy data and Booking.com the clean or master data. To make this concrete, I will run the code and show the result.
# Run the fuzzy string matching algorithm
start = time.time()
df_result = (df.pipe(fuzzy_tf_idf,             # Function and messy data
                     column='Expedia',         # Messy column in data
                     clean=df['Booking.com'],  # Master data (list)
                     mapping_df=df,            # Master data
                     col='Result')             # Can be customized
             )
end = time.time()

# Print the computation time
print('Fuzzy string matching in {} seconds'.format(end - start))

# View the result of fuzzy string matching
df_result.head()
The algorithm needs only 0.050 seconds to compare the 103 rows of messy data against the 103 rows of master data. That is fast, and the approach scales to larger data of your own. But how long does the computation take when we use the Levenshtein distance? We must compare the two to make sure that TF-IDF and K-Nearest Neighbors really improve the computation time.
Comparison with the Levenshtein distance
For this tutorial, I have made a function that takes the same input and returns the same output as the previous function (fuzzy string matching with TF-IDF and K-Nearest Neighbors). First, run the following code to implement fuzzy string matching with Levenshtein distance.
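As with the TF-IDF version, here is a minimal sketch of what such a stringMatching function could look like, using a plain dynamic-programming Levenshtein distance and a brute-force scan over the master list (the fuzzywuzzy module would serve the same purpose; the helper names and details are my own):

```python
import pandas as pd

def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions),
    computed with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def stringMatching(df, column, clean, mapping_df, col):
    """Brute-force matcher: compare every messy string in df[column] against
    every master string - the quadratic behavior discussed above."""
    master = clean.drop_duplicates().tolist()
    result = df.copy()
    result[col] = [min(master, key=lambda c: levenshtein(str(s).lower(), c.lower()))
                   for s in df[column]]
    return result
```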
Now, it’s time to implement the fuzzy string matching algorithm with Levenshtein distance!
# Run the fuzzy string matching algorithm
start = time.time()
df_result = (df.pipe(stringMatching,           # Function and messy data
                     column='Expedia',         # Messy column in data
                     clean=df['Booking.com'],  # Master data (list)
                     mapping_df=df,            # Master data
                     col='Result')             # Can be customized
             )
end = time.time()

# Print the computation time
print('Fuzzy string matching in {} seconds'.format(end - start))

# View the result of fuzzy string matching
df_result.head()
Compared to the fuzzy string matching algorithm with TF-IDF and KNN, the Levenshtein distance needs 1.216 seconds, or 24.32 times longer. It is important to note that this gap widens as the data grows. A more detailed comparison of the computation time of TF-IDF and Levenshtein-based matching can be found in [2].
Conclusion
For small data, the Levenshtein distance (e.g. via the fuzzywuzzy module) is a good choice for fuzzy string matching. However, as the size of the data grows, so does the computation time. An alternative approach is the NLP technique of TF-IDF combined with K-Nearest Neighbors and n-grams to find matching strings. FAISS and HNSW are approximate nearest-neighbor algorithms that are also worth trying to speed up the neighbor search further.
References
[1] J. Han, M. Kamber, J. Pei. Data Mining: Concepts and Techniques (2012). Elsevier Inc.
[2] C. van den Berg. Super Fast String Matching in Python (2019). https://bergvca.github.io/2017/10/14/super-fast-string-matching.html.
[3] E. Elahi. Fuzzy Matching 101: Cleaning and Linking Messy Data Across the Enterprise (2019). https://dataladder.com/fuzzy-matching-101/.
[4] A. Pearl. Fuzzy String Matching with TF-IDF (2019). https://adrianpearl.github.io/fuzzystring.html.