In this article, I will walk you through my thoughts on how to build a fuzzy search algorithm. One very practical use case: given a brand name such as ‘Amazon’, we want the algorithm to return alternative strings such as ‘AMZ’, ‘AMZN’ or ‘AMZN MKTP’.
The article follows this outline:
- Fuzzy search with FuzzyWuzzy
- Fuzzy search with HMNI
- Fuzzy search with an integrated algorithm
- Return an alternative names table
FuzzyWuzzy is a great Python library that can be used to complete a fuzzy search job. Essentially, it uses Levenshtein distance to calculate the difference (distance) between sequences.
According to Wikipedia, the Levenshtein distance is a metric that evaluates the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. This means the evaluation metric within FuzzyWuzzy can be very good at fuzzy-searching misspelled words and capturing the longest common subsequence between inputs.
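To make the metric concrete, here is a minimal pure-Python sketch of the Levenshtein distance using dynamic programming (the helper name `levenshtein` is mine, not part of FuzzyWuzzy, which computes this internally):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # prev[j] holds the distance between a[:i-1] and b[:j] (previous row)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein('kitten', 'sitting'))  # → 3
print(levenshtein('AMAZON', 'AMZ'))      # → 3 (drop one 'A', 'O', 'N')
```

Notice that the abbreviation ‘AMZ’ is already three edits away from ‘AMAZON’, which hints at the limitation discussed next.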
But in some cases, for instance the abbreviation of brand names, knowing the difference at the character level alone is probably not enough. It would also make sense to know the phonetic and semantic difference before returning the most similar name matches.
Therefore, I would like to introduce another library called HMNI, which can help us check the phonetic similarity between inputs. But first, let me build a sample dataset for a proper test.
Starting with the FuzzyWuzzy library, to install it, we could run the following commands:
# Using PIP via PyPI
pip install fuzzywuzzy

# Or the following to install python-Levenshtein too
pip install fuzzywuzzy[speedup]
Then we will continue to create a sample dataset for our test.
# Sample Dataset
import pandas as pd

df = pd.DataFrame(index=['AMAZON', 'Netflix', 'PayPal', 'Apple', 'Spotify', 'Apple', 'Facebook', 'Google', 'Starbucks'],
                  columns=['AMZ', 'PP', 'FACEBK', 'SPTF*', 'APPL', 'STARBK', 'GG', 'Starbucks TORONTO', 'NFLIX'])

# Print
df
We set the row index to the full brand names and the column names to the potential abbreviations of these brands.
Now, we can define a function that takes two strings as input and returns a similarity score as output. In FuzzyWuzzy, we can use the function fuzz.ratio() to compute the similarity score between two inputs.
# Customized similarity function with FuzzyWuzzy
from fuzzywuzzy import fuzz

def similarity_fuzzy(word1, word2):
    score = fuzz.ratio(word1, word2)
    d = score / 100
    return d
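As a side note, when python-Levenshtein is not installed, fuzz.ratio falls back to Python's standard difflib. So a stdlib-only approximation of similarity_fuzzy (no third-party installs needed; the name `similarity_stdlib` is mine) looks like this:

```python
from difflib import SequenceMatcher

def similarity_stdlib(word1, word2):
    # ratio() returns 2*M / T, where M is the total size of matched
    # blocks and T is the combined length of both strings
    return SequenceMatcher(None, word1, word2).ratio()

score = similarity_stdlib('AMAZON', 'AMZ')
print(round(score, 2))  # 'AM' + 'Z' match: 2*3 / (6+3) ≈ 0.67
```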
Now, we just need to implement the similarity_fuzzy function on each pair of the row index and column name to replace these NaN values with similarity scores.
from tqdm import tqdm

for i in tqdm(range(len(df.index))):    # loop over rows
    for j in range(len(df.columns)):    # loop over columns
        df.loc[df.index[i], df.columns[j]] = similarity_fuzzy(str(df.index[i]), str(df.columns[j]))

df
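If you prefer a more pandas-idiomatic way to build such a score matrix, a nested list comprehension works too. A minimal sketch with a toy stand-in scorer (`toy_similarity` is a hypothetical placeholder for similarity_fuzzy, so the example runs without FuzzyWuzzy):

```python
import pandas as pd

def toy_similarity(a: str, b: str) -> float:
    # Stand-in scorer: fraction of positions that match, case-insensitively
    n = sum(1 for x, y in zip(a.upper(), b.upper()) if x == y)
    return n / max(len(a), len(b))

brands = ['AMAZON', 'PayPal']
abbrevs = ['AMZ', 'PP']
scores = pd.DataFrame(
    [[toy_similarity(brand, abbr) for abbr in abbrevs] for brand in brands],
    index=brands, columns=abbrevs,
)
print(scores)
```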
As we can see, FuzzyWuzzy doesn’t perform very well at finding the correct abbreviations for the input brand names. The main reason, I think, is that the abbreviations have lost a lot of characters, so a metric based on Levenshtein distance is not the optimal solution for this case.
And this is where I think a new perspective, computing phonetic similarity, can be a great help!
Generally, HMNI is a library that follows the cognitive process of applying soft-logic to approximate the spelling and phonetic (sound) characteristics.
To test fuzzy name matching performance with HMNI, we can follow the same steps as for FuzzyWuzzy.
To install HMNI:
# Using PIP via PyPI
pip install hmni
To initialize a Matcher object:
import hmni
matcher = hmni.Matcher(model='latin')
To customize our similarity function:
def similarity_hmni(word1, word2):
    d = matcher.similarity(word1, word2)
    return d
Then, we can test the similarity_hmni function on the same sample dataset to compare the performance.
The difference between FuzzyWuzzy and HMNI is very obvious. HMNI seems to be better at finding the abbreviations for the brand inputs based on their phonetic characteristics.
But that doesn’t mean HMNI has no disadvantages. For instance, looking at ‘PayPal’ and ‘Apple’, we find that HMNI tends to be bad at separating these two brands, since the similarity scores are 0.71 & 0.41 and 0.66 & 0.94 respectively. This may cause confusion if we add more inputs to the dataset. Also, for the exact match between ‘Starbucks’ and ‘Starbucks TORONTO’, HMNI should be much more confident about its prediction, but it only returns 0.5.
This probably means we should consider integrating both dimensions, phonetic similarity and Levenshtein distance, to achieve an optimal balance.
My solution to this is simple: combine the two functions into one new function, and adjust the weights to determine the final output scores.
def similarity_calculator(word1, word2):
    score_1 = fuzz.ratio(word1, word2)          # score from FuzzyWuzzy
    score_2 = matcher.similarity(word1, word2)  # score from HMNI
    score_1 = score_1 / 100
    score = 0.2 * score_1 + 0.8 * score_2       # customize your own weights
    return score
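The 0.2/0.8 blend generalizes naturally to any number of scorers. A small sketch (`combined_score` and the toy scorers are my names, standing in for fuzz.ratio/100 and matcher.similarity so the example runs without either library):

```python
def combined_score(word1, word2, scorers, weights):
    """Weighted average of similarity scorers, each returning a value in [0, 1]."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights should sum to 1"
    return sum(w * f(word1, word2) for f, w in zip(scorers, weights))

# Toy scorers with fixed outputs, for illustration only
char_scorer = lambda a, b: 0.5       # pretend character-level score
phonetic_scorer = lambda a, b: 0.9   # pretend phonetic score

result = combined_score('AMAZON', 'AMZ', [char_scorer, phonetic_scorer], [0.2, 0.8])
print(round(result, 2))  # 0.2*0.5 + 0.8*0.9 = 0.82
```

Keeping the weights as parameters makes it easy to tune the balance on your own data.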
By integrating these two functions, we seem to reach a balance where we can set 60% as a threshold to separate matches from non-matches. Starbucks is the exception; for big brand names like this, we should probably search for matches directly, e.g. with Python’s .find() string method.
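Applying the 60% threshold then reduces to a simple filter. A minimal sketch, assuming we already have a row of scores for one brand (`best_match` is my helper, and the numbers below are illustrative, not real outputs):

```python
def best_match(scores: dict, threshold: float = 0.6):
    """Return the highest-scoring candidate above the threshold, else None."""
    name, score = max(scores.items(), key=lambda kv: kv[1])
    return name if score >= threshold else None

amazon_scores = {'AMZ': 0.82, 'PP': 0.12, 'NFLIX': 0.05}  # illustrative scores
print(best_match(amazon_scores))       # → 'AMZ'
print(best_match({'GG': 0.4}, 0.6))    # → None (below threshold)
```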
Now, the remaining work is to create a table of top alternative name matches for the brand inputs. For the sample dataset, I choose to return only the top match with the highest similarity score. The code looks like the following:
# Return a new column 'Max' that contains the alternative name for each brand input
df['Max'] = df.astype(float).idxmax(axis=1)

# Create a new dataframe 'result' to display only the input & output columns
result = pd.DataFrame(list(df.index), columns=['Input'])
result['Alternative Names'] = list(df.Max)
result
Your final output will look like the table above.
Thanks for reading. I hope this article is helpful for anyone looking for some guidance on how to build a fuzzy search algorithm with machine learning.
I think this article could be a great start for you to develop your own fuzzy search algorithm 🙂
Thanks Again!
See you in the next article ~