Prerequisites : Knowledge of K-Nearest Neighbor (KNN) algorithm
Curse of Dimensionality
The dimension of a data set corresponds to the number of attributes/features that exist in a data set. A data set with a large number of attributes, generally of the order of hundred or more, is referred to as high dimensional data. The difficulties related to training machine learning models due to high dimensional data is referred to as ‘Curse of Dimensionality’.
There are two popular aspects of curse of dimensionality; ‘data sparsity’ and ‘distance concentration’, which are discussed in the following sections.
Machine learning models are trained to predict the outcome for input data, with high accuracy. The input data is split in such a way that part of the data is used for training the model, and a part of the data is used to evaluate how the model performs on unseen data.
A generalized model’s prediction accuracy on the unseen data should be very close to its accuracy on the training data. As the number of dimensions increases, the number of training samples required to generalize a model also increase phenomenally. In reality, the training samples available for building the model may not capture all possible combinations because some combinations occur more often than others. This aspect is defined as data sparsity in high dimensional data. When our model has been trained with sparse data, it learns from the frequently occurring combinations, and in real-time, it may not predict the outcome accurately when fed with less frequently occurring combinations.
Distance concentration refers to the problem of “all the pairwise distances between different samples in space converging to the same value as the dimensionality of the data increases”. The nearest neighbor algorithm is a popular algorithm that uses distance-based metrics to identify proximity of the samples. Due to distance concentration, the concept of proximity of the samples may not be qualitatively relevant in higher dimensions (see below graph).
The graphs show a density plot of the distances between the points and the probability of frequency of occurrence of the distance, as the dimensions increase. We see that the density is approximately uniform for one dimension. As the number of dimensions increases, we see that the spread of the frequency plot decreases, which indicates that distances between different samples or points tend towards a single value. This indicates that the metrics used in algorithms such as KNN that work for lower dimensions may not work for higher dimensions.
Approximate Nearest Neighbor Search
In the Approximate neighbor search, we relax our goal of exactly finding the closest point. The basic idea is that you allow the algorithm to return sufficiently near neighbors (perhaps not the nearest neighbor); in doing so, you reduce complexity.
“pj’ is the nearest point to “q”. In the approximate nearest neighbor problem, any point within the radius “cr” is accepted.
Intrinsic Dimensionality of Data
In higher dimensions we should be able to design algorithms that are modified according to their intrinsic (complexity of the data set) rather than ambient (host space) dimension. The concept of “intrinsic dimensionality” refers to the true dimensionality of data, which is closely related to doubling dimension(used in mathematical contexts). The doubling dimension of a metric space 𝑋 is the smallest positive integer 𝑘 such that every ball of 𝑋 can be covered by 2𝑘 balls of half the radius. This concept is used to develop ANN algorithms.
Among ANN algorithms proposed recently, perhaps the most popular is Locality Sensitive Hashing (LSH).
Hashing in simple words is the transformation of a string of characters into a usually shorter fixed-length value (also called “key”) that represents the original string. Hashing is used to index and retrieve items in a database as it is faster to find the item using the shorter hashed key than to find it using the original value.
Locality Sensitive Hashing (LSH)
Locality Sensitive Hashing (LSH) is a method which is used for determining which items in a given set are similar. Rather than using the naive approach of comparing all pairs of items within a set, items are hashed into buckets, such that similar items will be more likely to hash into the same buckets. The number of comparisons needed will be reduced; only the items within anyone bucket will be compared, and this in turn reduces the complexity of the algorithm. The main application of LSH is to provide a method for efficient approximate nearest neighbor search through probabilistic dimension reduction of high-dimensional data.
LSH has some huge advantages. It is simple. You just compute the hash for all points in your database, then make a hash table from them. To query, just compute the hash of the query point, then retrieve all points in the same bin from the hash table. LSH primarily speeds up the recommendation process compared to more traditional recommendation engines.
Python Implementation of LSH
LSH depends on the following libraries:
- Redis (if persistency through Redis is needed)
- bit array (if hamming distance is used as distance function)
To create 6-bit hashes for input data of 8 dimensions:
>>> from lshash import LSHash
>>> lsh = LSHash(6, 8)
[((1, 2, 3, 4, 5, 6, 7, 8), 1.0),
((2, 3, 4, 5, 6, 7, 8, 9), 11)]
To initialize LSH instance:
LSHash(hash_size, input_dim, num_of_hashtables=1, storage=None, matrices_filename=None, overwrite=False)
hash_size: The length of the resulting binary hash.
input_dim: The dimension of the input vector.
num_hashtables = 1: (optional) The number of hash tables used for multiple look-ups.
storage = None: (optional) Specify the name of the storage to be used for the index storage. Options include “redis”.
matrices_filename = None: (optional) Specify the path to the .npz file random matrices are stored or to be stored if the file does not exist yet
overwrite = False: (optional) Whether to overwrite the matrices file if it already exist
input_point: The input data point is an array or tuple of numbers of input_dim.
extra_data = None: (optional) Extra data to be added along with the input_point.
lsh.query(query_point, num_results=None, distance_func="euclidean"):
query_point: The query data point is an array or tuple of numbers of input_dim.
num_results = None: (optional) The number of query results to return in ranked order. By default all results will be returned.
distance_func = "euclidean": (optional) Distance function to use to rank the candidates. By default euclidean distance function will be used.
For further reference: https://www.mit.edu/~andoni/LSH/