Choosing Parameters for Locality-Sensitive Hashing

Introduction

Almost as ubiquitous as the dot product is the need for an efficient method for finding vectors in a set having a large (near 1) dot product with a specified vector. The random projection locality-sensitive hash (LSH) function family is the set of functions \(f_v, v \in S^n\), where \(S^n\) is the \(n\)-dimensional sphere on which the vectors in our set lie. The function \(f_v\) maps a vector \(w\) to the "hash value" \(\text{sign}(v \cdot w)\). Given a query \(q \in S^n\), the vectors matched are those \(w\) in our set for which \(f_v(w) = f_v(q)\). The probability of matching a vector \(w\) is \[p(\theta) = 1 - \frac{\theta}{\pi},\] where \(\theta\) is the angle between \(w\) and \(q\) and the randomness is over the choice of \(v\).
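This probability is easy to check empirically. The following sketch (pure Python; the dimension, seed, and trial count are arbitrary choices) draws random hyperplane normals \(v\) and compares the empirical match rate with \(1 - \theta/\pi\):

```python
import math
import random

random.seed(0)

def rand_unit(dim):
    """A uniformly random unit vector (Gaussian components, normalized)."""
    v = [random.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

dim = 20
w, q = rand_unit(dim), rand_unit(dim)
theta = math.acos(max(-1.0, min(1.0, dot(w, q))))

# Estimate Pr[sign(v . w) == sign(v . q)] over random hyperplane normals v.
# (Normalizing v is unnecessary since the sign is scale-invariant.)
trials = 50_000
hits = sum(
    (dot(v, w) >= 0) == (dot(v, q) >= 0)
    for v in (rand_unit(dim) for _ in range(trials))
)
print(hits / trials, 1 - theta / math.pi)  # these should agree closely
```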

Given any LSH function family, there are two ways to tweak the matching probability. One way is to run some \(k\) hash functions "in series": choose \(k\) random hash functions \(f_1, \ldots, f_k\) and map a vector \(w\) to the combined hash value \((f_1(w), \ldots, f_k(w))\). The vectors matching a query \(q\) are those that all hash functions \(f_1, \ldots, f_k\) map to the same value as \(q\)—in other words, the hash functions are combined with AND logic. The resulting "in series" function is itself an LSH function.

The other approach is to run some \(n\) hash functions "in parallel": choose \(n\) random hash functions \(f^1, \ldots, f^n\) that each map a vector \(w\) to its own hash value \(f^1(w), \ldots, f^n(w)\). With this approach, the vectors matching a query \(q\) are those that at least one hash function \(f^1, \ldots, f^n\) maps to the same value as \(q\)—in other words, the hash functions are combined with OR logic.

One can combine the above approaches by starting with an LSH function family, applying the "in series" approach with \(k\) hash functions, then applying the "in parallel" approach \(n\) times to the resulting "in series" functions, for a total of \(kn\) hash functions. The resulting approach requires time and space \(\Theta(kn)\), and the probability of query \(q\) matching vector \(w\) is \[p(k, n, \theta) = 1 - \left(1 - p(\theta)^k\right)^n, \label{pkn}\tag{1}\] where \(\theta\) is the angle between \(w\) and \(q\).
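Equation \ref{pkn} is simple to evaluate. A small illustrative helper (assuming the random projection family, where \(p(\theta) = 1 - \theta/\pi\); the function name and parameter values are arbitrary):

```python
import math

def match_probability(k: float, n: float, theta: float) -> float:
    """Equation (1): probability that the AND-of-k, OR-of-n scheme matches
    a vector at angle theta from the query, assuming the random projection
    family where p(theta) = 1 - theta / pi."""
    p = 1 - theta / math.pi
    return 1 - (1 - p ** k) ** n

# Larger k sharpens the cutoff; larger n shifts it toward wider angles.
for theta in (0.2, 1.0, 2.0):
    print(theta, match_probability(10, 50, theta))
```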

Parameters

The "combined" approach above requires choosing two parameters: the "in series" parameter \(k\) and the "in parallel" parameter \(n\). For a given application, how do we choose these parameters? Intuitively, two degrees of freedom admit two constraints. A tangible constraint is to require the probability of matching a vector at a certain angle \(\theta\) from the query be a certain value \(p\). We can supply two such constraints by specifying two angles \(\theta_1, \theta_2\) and their respective probabilities \(p_1, p_2\). These constraints give us the simultaneous equations: \[p(k, n, \theta_i) = p_i \label{constraints}\tag{2}\] for \(i=1\) and \(i=2\).

While in practice \(k\) and \(n\) must be integers, we relax this constraint so that they may take any positive real values. Given this relaxation, it's not immediately clear whether these simultaneous equations always admit a solution in \(k\) and \(n\) and when they do, whether the solution is unique or how to find it.

Note that the problem formulation above and solution below only depend on \(p(\theta)\) being a decreasing function, so the solution offered applies to any LSH method that satisfies this modest requirement.

Solution

Equation \ref{pkn} gives \[\log(1-p(k, n, \theta)) = n \log\left(1 - p(\theta)^k\right). \label{logpkn}\tag{3}\] Equation \ref{logpkn} and the simultaneous equations \ref{constraints} then give \[\frac{\log(1-p_1)}{\log(1-p_2)} = \frac{\log\left(1 - p(\theta_1)^k\right)}{\log\left(1 - p(\theta_2)^k\right)}. \label{logratio}\tag{4}\]

The left hand side of \ref{logratio} doesn't depend on \(k\) or \(n\), and the right hand side depends only on \(k\). Choose \(p_1, p_2, \theta_1, \theta_2\) so that \(\theta_1 < \theta_2\) and \(p_1 > p_2\) (closer vectors should match with higher probability). Lemma 1 below shows that a solution to \ref{logratio} exists, because \(p_1 > p_2\) places the left hand side of \ref{logratio} in the interval \((1, \infty)\). Lemma 2 below shows that the right hand side of \ref{logratio} is increasing in \(k\), so \(k\) is unique and can be found numerically by binary search. Then equation \ref{logpkn} applied to either equation in \ref{constraints} can be used to find \(n\), which clearly must be unique, establishing uniqueness of the overall solution.

For the following lemmas, let \(a, b \in (0, 1)\), \(a > b\) and consider the expression \[\frac{\log\left(1-a^k\right)}{\log\left(1-b^k\right)}. \label{logratioab}\tag{5}\] Also note that \[\frac{d}{dk} \log(1-c^k) = \frac{-c^k \log c}{1 - c^k} = \frac{\log c}{1 - c^{-k}}. \label{dlogc}\tag{6}\]

Lemma 1

As \(k\) ranges over the positive reals, \ref{logratioab} takes on every value in \((1, \infty)\).

Proof: Consider \(\lim_{k \rightarrow 0} \ref{logratioab}\) and \(\lim_{k \rightarrow \infty} \ref{logratioab}\). Both are indeterminate forms, and in both cases the approach is the same, applying L'Hôpital's rule twice: \[\begin{align} \lim_k \frac{\log\left(1-a^k\right)}{\log\left(1-b^k\right)} & = \lim_k \frac{\frac{\log a}{1 - a^{-k}}}{\frac{\log b}{1 - b^{-k}}} \text{ (by \ref{dlogc})}\\ & = \frac{\log a}{\log b} \lim_k \frac{1 - b^{-k}}{1 - a^{-k}} \\ & = \frac{\log a}{\log b} \lim_k \frac{b^{-k} \log b}{a^{-k} \log a} \\ & = \lim_k \left(\frac{a}{b}\right)^k. \end{align}\] Now we see that as \(k \rightarrow 0\) the limit is 1 and as \(k \rightarrow \infty\) the limit is \(\infty\). By continuity of \ref{logratioab} on \((0, \infty)\) and the intermediate value theorem, the lemma is proven.

Lemma 2

For \(k > 0\), \ref{logratioab} is increasing in \(k\).

Proof: By \ref{dlogc}, \[\frac{d}{dk} \ref{logratioab} = \frac{\frac{\log a \log(1-b^k)}{1-a^{-k}} - \frac{\log b \log(1-a^k)}{1-b^{-k}}}{\left(\log(1-b^k)\right)^2}. \label{ddk}\tag{7}\]

\ref{logratioab} is increasing iff \(\ref{ddk} > 0\), iff the numerator of \ref{ddk} is positive, iff (multiplying by \((1-a^{-k})(1-b^{-k}) > 0\) and dividing by \(\log a \log b > 0\)) \[\frac{\log(1-b^k)(1-b^{-k})}{\log b} > \frac{\log(1-a^k)(1-a^{-k})}{\log a},\] which holds if \[\frac{\log(1-x^k)(1-x^{-k})}{\log x} \label{xeqn}\tag{8}\] is decreasing in \(x\) for \(x \in (0, 1)\), since \(b < a\).

\ref{xeqn} is decreasing iff \begin{align} 0 & > \frac{d}{dx} \ref{xeqn} \\ & = -\frac{1}{1-x^k} kx^{k-1} \frac{1-x^{-k}}{\log x} + \log(1-x^k) \frac{kx^{-k-1}}{\log x} + \log(1-x^k) (1-x^{-k}) \frac{-1}{x(\log x)^2} \\ & = \frac{k}{x \log x} + \frac{\log(1-x^k)}{\log x} \left(k x^{-k-1} - \frac{1-x^{-k}}{x \log x}\right) \\ & = \frac{k}{x \log x} + \frac{\log(1-x^k)}{x \log x} \left(k x^{-k} - \frac{1-x^{-k}}{\log x}\right) \\ & = \frac{1}{x \log x} \left( k + \frac{\log(1-x^k)}{\log x} \left(k x^{-k} \log x + x^{-k} - 1\right) \right). \end{align}

As \(x \log x < 0\), the derivative \(\frac{d}{dx} \ref{xeqn}\) is negative iff \[k + \frac{\log(1-x^k)}{\log x} \left(k x^{-k} \log x + x^{-k} - 1\right) > 0.\] Substituting \(y = x^k\) (so \(y \in (0, 1)\), \(x^{-k} = 1/y\) and \(\log x = \log(y)/k\)), the left hand side becomes \[k \left(1 + \frac{\log(1-y)}{y} + \frac{(1-y)\log(1-y)}{y \log y}\right).\] Multiplying by \(\frac{y \log y}{k} < 0\) and flipping the inequality, we need \[y \log y + (1-y)\log(1-y) + \log y \log(1-y) < 0,\] and dividing by \(\log y \log(1-y) > 0\) (a product of two negative numbers), this is equivalent to \[\frac{1-y}{-\log y} + \frac{y}{-\log(1-y)} > 1.\] The two terms on the left hand side are the logarithmic means of the pairs \((1, y)\) and \((1, 1-y)\). Since the logarithmic mean exceeds the geometric mean, the left hand side exceeds \(\sqrt{y} + \sqrt{1-y} \ge 1\), so the inequality is satisfied.
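The algebra in this proof is easy to get wrong, so here is a quick numeric sanity check (a sketch; the grid resolution and the tested values of \(k\) are arbitrary) that \ref{xeqn} is indeed decreasing in \(x\):

```python
import math

def xeqn(x: float, k: float) -> float:
    """Expression (8): log(1 - x^k) * (1 - x^(-k)) / log(x).

    log1p is used for accuracy when x^k is tiny."""
    return math.log1p(-(x ** k)) * (1 - x ** (-k)) / math.log(x)

# Expression (8) should be strictly decreasing in x on (0, 1)
# for every fixed k > 0.
for k in (0.5, 1.0, 2.0, 10.0):
    values = [xeqn(i / 1000, k) for i in range(1, 1000)]
    assert all(u > v for u, v in zip(values, values[1:])), k
print("decreasing on the grid for all tested k")
```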

Summary

The following algorithm finds \(k\) and \(n\) such that for chosen \(p_1, p_2, \theta_1, \theta_2\), equations \ref{constraints} hold. Perform binary search on \(k\) to solve equation \ref{logratio}. (To find initial bounds, try exponential decay and growth on \(k\) until the left hand side of \ref{logratio} falls within the interval of right hand side values.) Then set: \[n = \frac{\log(1 - p_1)}{\log\left(1 - p(\theta_1)^k\right)}.\]
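A sketch of the full procedure in Python (the function name, tolerance, and example inputs are illustrative choices, and the random projection family's \(p(\theta) = 1 - \theta/\pi\) is assumed):

```python
import math

def solve_parameters(theta1, theta2, p1, p2, tol=1e-12):
    """Find real-valued k and n satisfying equations (2).

    Requires theta1 < theta2 and p1 > p2. Assumes the random projection
    family, where p(theta) = 1 - theta / pi.
    """
    a = 1 - theta1 / math.pi  # single-hash match probability at theta1
    b = 1 - theta2 / math.pi  # single-hash match probability at theta2
    target = math.log(1 - p1) / math.log(1 - p2)  # left hand side of (4)

    def rhs(k):  # right hand side of (4); increasing in k by Lemma 2
        return math.log1p(-(a ** k)) / math.log1p(-(b ** k))

    # Exponential search for bracketing bounds, then binary search.
    lo = hi = 1.0
    while rhs(lo) > target:
        lo /= 2
    while rhs(hi) < target:
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if rhs(mid) < target else (lo, mid)
    k = (lo + hi) / 2
    n = math.log(1 - p1) / math.log1p(-(a ** k))
    return k, n

# Example: match at angle pi/8 with probability 0.9, pi/4 with 0.1.
k, n = solve_parameters(math.pi / 8, math.pi / 4, 0.9, 0.1)
print(k, n)
```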

Finally, note that \(k\) and \(n\) must be integers. Both may be rounded after they are calculated, or \(k\) may be rounded first, \(n\) recomputed from the rounded \(k\), and then \(n\) rounded as well. In my experience, rounding causes negligible violation of equations \ref{constraints}, so it is not of much concern.

Code

An implementation of this algorithm is available here.