LSH is a technique used to find similar documents in a large corpus. Here we use “documents” in a broad sense to represent any data that can be represented as a set, such as a shopping basket containing a set of grocery items.
There are different LSH functions associated with different distance measures. LSH functions are used to group similar items into the same buckets. Before we can do that, we first need to define “similar” in terms of the distance measure under consideration. That’s why LSH functions are always associated with a distance measure.
Loosely speaking, for a given distance measure, we choose a LSH function such that similar items are more likely to group together after hashing.
Min hashing functions for Jaccard distance
The min hashing functions convert a set into a signature. The length of the signature is equal to the number of permutations on the rows.
The probability that the minhash function for a random permutation of rows produce the same value for two sets are equal to the Jaccard similarity of those two sets.
Another way of interpretating this, is that we also use the minhashing output as a probability estimation that two sets are identical.
With the signatures of two sets, we can simply calculate the fraction of elements which are equal at the same location, this fraction serves as a good approximation to the Jaccard Similarity of the two sets.
Using the datasketch package:
from datasketch import MinHashdata1 = ['minhash', 'is', 'a', 'probabilistic', 'data', 'structure', 'for',
'estimating', 'the', 'similarity', 'between', 'datasets']
data2 = ['minhash', 'is', 'a', 'probability', 'data', 'structure', 'for',
'estimating', 'the', 'similarity', 'between', 'documents']
m1, m2 = MinHash(), MinHash()
for d in data1:
m1.update(d.encode('utf8'))
for d in data2:
m2.update(d.encode('utf8'))
print("Estimated Jaccard for data1 and data2 is", m1.jaccard(m2))
s1 = set(data1)
s2 = set(data2)
actual_jaccard = float(len(s1.intersection(s2)))/float(len(s1.union(s2)))
print("Actual Jaccard for data1 and data2 is", actual_jaccard)
We can verify that the results produced by MinHashing is the same as a rudimentary Jaccard similarity.
Alternatively, one can also verify the results by accessing the signature by m1.digest()
:
num_of_common = np.argwhere(m1.digest()==m2.digest()).flatten()
len(num_of_common)/128 # 128 is the default number of perms
In summary, we can use the minhashing functions associated with Jaccard distance to group those similar items (similar in terms of the distance metric under consideration, in here it is Jaccard).
Bit sampling for Hamming distance
We can use bit sampling as a LSH function accompanying Hamming distance.
See Mining of massive dataset Section 3.7.
Or this nice post: https://randorithms.com/2019/09/19/Visual-LSH.html
Random projection for cosine distance
Similar to minhash for Jaccard similarity, we can use random projections to approximate the cosine similarity.
By randomly choosing hyperplanes defined by a unit vector, we can use monte carlo methods to estimate the similarity. Wikipedia provides a good explanation on this.
One can also see Mining of massive dataset Section 3.7.
LSH functions for Euclidean space
This topic is a little bit more complicated, see this nice post: https://randorithms.com/2019/09/19/Visual-LSH.html
- A visual and intuitive introduction: https://randorithms.com/2019/09/19/Visual-LSH.html
- On random projection for cosine distance: https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection