Choosing Parameters for Locality-Sensitive Hashing

Introduction

Almost as ubiquitous as the dot product is the need for an efficient method for finding vectors in a set having a large (near 1) dot product with a specified vector. The random projection locality-sensitive hash (LSH) function family is the set of functions \(f_v, v \in S^n\), where \(S^n\) is the \(n\)-dimensional sphere on which the vectors in our set lie. The function \(f_v\) maps a vector \(w\) to the "hash value" \(\text{sign}(v \cdot w)\). Given a query \(q \in S^n\), the vectors matched are those \(w\) in our set for which \(f_v(w) = f_v(q)\). The probability of matching a vector \(w\) is \[p(\theta) = 1 - \frac{\theta}{\pi},\] where \(\theta\) is the angle between \(w\) and \(q\) and the randomness is over the choice of \(v\).
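This probability is easy to check empirically. The following sketch (pure Python; the dimension, seed, and trial count are arbitrary choices) draws random hyperplane normals \(v\) and compares the empirical match rate with \(1 - \theta/\pi\):

```python
import math
import random

random.seed(0)

def rand_unit(dim):
    """A uniformly random unit vector (Gaussian components, normalized)."""
    v = [random.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

dim = 20
w, q = rand_unit(dim), rand_unit(dim)
theta = math.acos(max(-1.0, min(1.0, dot(w, q))))

# Estimate Pr[sign(v . w) == sign(v . q)] over random hyperplane normals v.
# (Normalizing v is unnecessary since the sign is scale-invariant.)
trials = 50_000
hits = sum(
    (dot(v, w) >= 0) == (dot(v, q) >= 0)
    for v in (rand_unit(dim) for _ in range(trials))
)
print(hits / trials, 1 - theta / math.pi)  # these should agree closely
```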

Given any LSH function family, there are two ways to tweak the matching probability. One way is to run some \(k\) hash functions "in series": choose \(k\) random hash functions \(f_1, \ldots, f_k\) and map a vector \(w\) to the combined hash value \((f_1(w), \ldots, f_k(w))\). The vectors matching a query \(q\) are those that all hash functions \(f_1, \ldots, f_k\) map to the same value as \(q\)—in other words, the hash functions are combined with AND logic. The resulting "in series" function is itself an LSH function.

The other approach is to run some \(n\) hash functions "in parallel": choose \(n\) random hash functions \(f^1, \ldots, f^n\) that each map a vector \(w\) to its own hash value \(f^1(w), \ldots, f^n(w)\). With this approach, the vectors matching a query \(q\) are those that at least one hash function \(f^1, \ldots, f^n\) maps to the same value as \(q\)—in other words, the hash functions are combined with OR logic.

One can combine the above approaches by starting with an LSH function family, applying the "in series" approach with \(k\) hash functions, then applying the "in parallel" approach \(n\) times to the resulting "in series" functions, for a total of \(kn\) hash functions. The resulting approach requires time and space \(\Theta(kn)\), and the probability of query \(q\) matching vector \(w\) is \[p(k, n, \theta) = 1 - \left(1 - p(\theta)^k\right)^n, \label{pkn}\tag{1}\] where \(\theta\) is the angle between \(w\) and \(q\).
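Equation \ref{pkn} is simple to evaluate. A small illustrative helper (assuming the random projection family, where \(p(\theta) = 1 - \theta/\pi\); the function name and parameter values are arbitrary):

```python
import math

def match_probability(k: float, n: float, theta: float) -> float:
    """Equation (1): probability that the AND-of-k, OR-of-n scheme matches
    a vector at angle theta from the query, assuming the random projection
    family where p(theta) = 1 - theta / pi."""
    p = 1 - theta / math.pi
    return 1 - (1 - p ** k) ** n

# Larger k sharpens the cutoff; larger n shifts it toward wider angles.
for theta in (0.2, 1.0, 2.0):
    print(theta, match_probability(10, 50, theta))
```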

Parameters

The "combined" approach above requires choosing two parameters: the "in series" parameter \(k\) and the "in parallel" parameter \(n\). For a given application, how do we choose these parameters? Intuitively, two degrees of freedom admit two constraints. A tangible constraint is to require the probability of matching a vector at a certain angle \(\theta\) from the query be a certain value \(p\). We can supply two such constraints by specifying two angles \(\theta_1, \theta_2\) and their respective probabilities \(p_1, p_2\). These constraints give us the simultaneous equations: \[p(k, n, \theta_i) = p_i \label{constraints}\tag{2}\] for \(i=1\) and \(i=2\).

While in practice \(k\) and \(n\) must be integers, we relax this constraint so that they may take any positive real values. Given this relaxation, it's not immediately clear whether these simultaneous equations always admit a solution in \(k\) and \(n\) and when they do, whether the solution is unique or how to find it.

Note that the problem formulation above and solution below only depend on \(p(\theta)\) being a decreasing function, so the solution offered applies to any LSH method that satisfies this modest requirement.

Solution

Equation \ref{pkn} gives \[\log(1-p(k, n, \theta)) = n \log\left(1 - p(\theta)^k\right). \label{logpkn}\tag{3}\] Equation \ref{logpkn} and the simultaneous equations \ref{constraints} then give \[\frac{\log(1-p_1)}{\log(1-p_2)} = \frac{\log\left(1 - p(\theta_1)^k\right)}{\log\left(1 - p(\theta_2)^k\right)}. \label{logratio}\tag{4}\]

The left hand side of \ref{logratio} doesn't depend on \(k\) or \(n\), and the right hand side depends only on \(k\). Choose \(p_1, p_2, \theta_1, \theta_2\) so that \(\theta_1 < \theta_2\) and \(p_1 > p_2\) (closer vectors should match with higher probability). Lemma 1 below shows that a solution to \ref{logratio} exists, because \(p_1 > p_2\) places the left hand side of \ref{logratio} in the interval \((1, \infty)\). Lemma 2 below shows that the right hand side of \ref{logratio} is increasing in \(k\), so \(k\) is unique and can be found numerically by binary search. Then equation \ref{logpkn} applied to either equation in \ref{constraints} can be used to find \(n\), which clearly must be unique, establishing uniqueness of the overall solution.

For the following lemmas, let \(a, b \in (0, 1)\), \(a > b\) and consider the expression \[\frac{\log\left(1-a^k\right)}{\log\left(1-b^k\right)}. \label{logratioab}\tag{5}\] Also note that \[\frac{d}{dk} \log(1-c^k) = \frac{-c^k \log c}{1 - c^k} = \frac{\log c}{1 - c^{-k}}. \label{dlogc}\tag{6}\]

Lemma 1

As \(k\) ranges over the positive reals, \ref{logratioab} takes on every value in \((1, \infty)\).

Proof: Consider \(\lim_{k \rightarrow 0} \ref{logratioab}\) and \(\lim_{k \rightarrow \infty} \ref{logratioab}\). Both are indeterminate forms, and in both cases the approach is the same, applying L'Hôpital's rule twice: \[\begin{align} \lim_k \frac{\log\left(1-a^k\right)}{\log\left(1-b^k\right)} & = \lim_k \frac{\frac{\log a}{1 - a^{-k}}}{\frac{\log b}{1 - b^{-k}}} \text{ (by \ref{dlogc})}\\ & = \frac{\log a}{\log b} \lim_k \frac{1 - b^{-k}}{1 - a^{-k}} \\ & = \frac{\log a}{\log b} \lim_k \frac{b^{-k} \log b}{a^{-k} \log a} \\ & = \lim_k \left(\frac{a}{b}\right)^k. \end{align}\] Now we see that as \(k \rightarrow 0\) the limit is 1 and as \(k \rightarrow \infty\) the limit is \(\infty\). By continuity of \ref{logratioab} on \((0, \infty)\) and the intermediate value theorem, the lemma is proven.

Lemma 2

For \(k > 0\), \ref{logratioab} is increasing in \(k\).

Proof: By \ref{dlogc}, \[\frac{d}{dk} \ref{logratioab} = \frac{\frac{\log a \log(1-b^k)}{1-a^{-k}} - \frac{\log b \log(1-a^k)}{1-b^{-k}}}{\left(\log(1-b^k)\right)^2}. \label{ddk}\tag{7}\]

\ref{logratioab} is increasing iff \(\ref{ddk} > 0\), iff the numerator of \ref{ddk} is positive, iff (multiplying by \((1-a^{-k})(1-b^{-k}) > 0\) and dividing by \(\log a \log b > 0\)) \[\frac{\log(1-b^k)(1-b^{-k})}{\log b} > \frac{\log(1-a^k)(1-a^{-k})}{\log a},\] which holds if \[\frac{\log(1-x^k)(1-x^{-k})}{\log x} \label{xeqn}\tag{8}\] is decreasing in \(x\) for \(x \in (0, 1)\), since \(b < a\).

\ref{xeqn} is decreasing iff \begin{align} 0 & > \frac{d}{dx} \ref{xeqn} \\ & = -\frac{1}{1-x^k} kx^{k-1} \frac{1-x^{-k}}{\log x} + \log(1-x^k) \frac{kx^{-k-1}}{\log x} + \log(1-x^k) (1-x^{-k}) \frac{-1}{x(\log x)^2} \\ & = \frac{k}{x \log x} + \frac{\log(1-x^k)}{\log x} \left(k x^{-k-1} - \frac{1-x^{-k}}{x \log x}\right) \\ & = \frac{k}{x \log x} + \frac{\log(1-x^k)}{x \log x} \left(k x^{-k} - \frac{1-x^{-k}}{\log x}\right) \\ & = \frac{1}{x \log x} \left( k + \frac{\log(1-x^k)}{\log x} \left(k x^{-k} \log x + x^{-k} - 1\right) \right). \end{align}

As \(x \log x < 0\), the derivative \(\frac{d}{dx} \ref{xeqn}\) is negative iff \[k + \frac{\log(1-x^k)}{\log x} \left(k x^{-k} \log x + x^{-k} - 1\right) > 0.\] Substituting \(y = x^k\) (so \(y \in (0, 1)\), \(x^{-k} = 1/y\) and \(\log x = \log(y)/k\)), the left hand side becomes \[k \left(1 + \frac{\log(1-y)}{y} + \frac{(1-y)\log(1-y)}{y \log y}\right).\] Multiplying by \(\frac{y \log y}{k} < 0\) and flipping the inequality, we need \[y \log y + (1-y)\log(1-y) + \log y \log(1-y) < 0,\] and dividing by \(\log y \log(1-y) > 0\) (a product of two negative numbers), this is equivalent to \[\frac{1-y}{-\log y} + \frac{y}{-\log(1-y)} > 1.\] The two terms on the left hand side are the logarithmic means of the pairs \((1, y)\) and \((1, 1-y)\). Since the logarithmic mean exceeds the geometric mean, the left hand side exceeds \(\sqrt{y} + \sqrt{1-y} \ge 1\), so the inequality is satisfied.
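The algebra in this proof is easy to get wrong, so here is a quick numeric sanity check (a sketch; the grid resolution and the tested values of \(k\) are arbitrary) that \ref{xeqn} is indeed decreasing in \(x\):

```python
import math

def xeqn(x: float, k: float) -> float:
    """Expression (8): log(1 - x^k) * (1 - x^(-k)) / log(x).

    log1p is used for accuracy when x^k is tiny."""
    return math.log1p(-(x ** k)) * (1 - x ** (-k)) / math.log(x)

# Expression (8) should be strictly decreasing in x on (0, 1)
# for every fixed k > 0.
for k in (0.5, 1.0, 2.0, 10.0):
    values = [xeqn(i / 1000, k) for i in range(1, 1000)]
    assert all(u > v for u, v in zip(values, values[1:])), k
print("decreasing on the grid for all tested k")
```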

Summary

The following algorithm finds \(k\) and \(n\) such that for chosen \(p_1, p_2, \theta_1, \theta_2\), equations \ref{constraints} hold. Perform binary search on \(k\) to solve equation \ref{logratio}. (To find initial bounds, try exponential decay and growth on \(k\) until the left hand side of \ref{logratio} falls within the interval of right hand side values.) Then set: \[n = \frac{\log(1 - p_1)}{\log\left(1 - p(\theta_1)^k\right)}.\]
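A sketch of the full procedure in Python (the function name, tolerance, and example inputs are illustrative choices, and the random projection family's \(p(\theta) = 1 - \theta/\pi\) is assumed):

```python
import math

def solve_parameters(theta1, theta2, p1, p2, tol=1e-12):
    """Find real-valued k and n satisfying equations (2).

    Requires theta1 < theta2 and p1 > p2. Assumes the random projection
    family, where p(theta) = 1 - theta / pi.
    """
    a = 1 - theta1 / math.pi  # single-hash match probability at theta1
    b = 1 - theta2 / math.pi  # single-hash match probability at theta2
    target = math.log(1 - p1) / math.log(1 - p2)  # left hand side of (4)

    def rhs(k):  # right hand side of (4); increasing in k by Lemma 2
        return math.log1p(-(a ** k)) / math.log1p(-(b ** k))

    # Exponential search for bracketing bounds, then binary search.
    lo = hi = 1.0
    while rhs(lo) > target:
        lo /= 2
    while rhs(hi) < target:
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if rhs(mid) < target else (lo, mid)
    k = (lo + hi) / 2
    n = math.log(1 - p1) / math.log1p(-(a ** k))
    return k, n

# Example: match at angle pi/8 with probability 0.9, pi/4 with 0.1.
k, n = solve_parameters(math.pi / 8, math.pi / 4, 0.9, 0.1)
print(k, n)
```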

Finally, note that \(k\) and \(n\) must be integers. Both may be rounded after they are calculated, or \(k\) may be rounded first, \(n\) recomputed from the rounded \(k\), and then \(n\) rounded as well. In my experience, rounding causes negligible violation of equations \ref{constraints}, so it is not of much concern.

Code

An implementation of this algorithm is available here.