A hash function maps a key to an index in the [[hash table]]. The hash function can be computed very efficiently (in constant time).
When the hash function maps two distinct keys to the same index, a **collision** occurs. Different implementations handle collisions in different ways.
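For instance, reducing [[Python]]'s built-in `hash()` modulo the table size gives a minimal, illustrative hash function; the table size and keys below are arbitrary choices:

```python
m = 8                       # number of slots, arbitrary for illustration

def index_of(key):
    return hash(key) % m    # built-in hash(), reduced to a slot index

# With few slots, distinct keys can land on the same index: a collision.
# (Python salts string hashes per process, so the indices vary per run.)
print({k: index_of(k) for k in ["apple", "pear", "plum", "fig", "kiwi"]})
```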
Many languages and runtimes implement hash functions.
Hash functions are also important in cryptography, for example SHA256.
If an adversary has prior knowledge of the hash function, they can find keys that all collide into the same slot. This is called an algorithmic complexity attack[^1]. One defense is to select from a family of hash functions at random to prevent adversaries from gaining prior knowledge of the hash function used.
## chaining
One way to handle collisions is through chaining. In chaining, each slot holds a list that can store multiple key-value pairs. To find a key, simply search the list at the index given by the hash function of the key. In the worst case, all elements collide into the same index, and the [[complexity]] is $O(n)$ for find, insert, and delete operations. The average case complexity is the number of keys in the table divided by the number of slots, $n/m$, where $n$ is the number of keys and $m$ is the number of slots. This ratio is called the **load factor**.
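A minimal chaining sketch in [[Python]] (class and method names are illustrative):

```python
class ChainedHashTable:
    def __init__(self, m=16):
        self.m = m
        self.slots = [[] for _ in range(m)]   # one list (chain) per slot

    def _index(self, key):
        return hash(key) % self.m

    def insert(self, key, value):
        chain = self.slots[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                      # key already present: update
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def find(self, key):
        for k, v in self.slots[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

    def delete(self, key):
        chain = self.slots[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return
        raise KeyError(key)
```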
## open address hashing
In open address hashing, each slot of the hash table stores at most one key-value pair, and there are no chains. In the case of a collision, a **probe sequence** is used to find an alternative slot to store the key. If no free slot can be found, rehash the table.
To look up a key, follow the same probe sequence starting from the index given by the key's hash. You can stop looking when you reach an empty slot (the key is not present). Approaches like quadratic probing and double hashing can be used to avoid the clustering problem, where keys bunch together in runs of occupied slots.
To delete a key, simply replace it with a special deletion marker (a tombstone) rather than emptying the slot, so probe sequences that pass through the slot while searching for keys displaced by collisions are not cut short.
Open address hashing is used by [[Python]] to implement native dictionaries.
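A sketch of open addressing with linear probing and a tombstone marker; for simplicity it assumes `insert` is never called with a key already in the table, and its probe sequence is much simpler than the one CPython's dict actually uses:

```python
EMPTY, DELETED = object(), object()      # sentinels: never-used vs. tombstone

class OpenAddressTable:
    def __init__(self, m=16):
        self.m = m
        self.keys = [EMPTY] * m
        self.values = [None] * m

    def _probe(self, key):
        # Linear probe sequence: h(k), h(k)+1, h(k)+2, ... (mod m).
        i = hash(key) % self.m
        for _ in range(self.m):
            yield i
            i = (i + 1) % self.m

    def insert(self, key, value):
        for i in self._probe(key):
            if self.keys[i] is EMPTY or self.keys[i] is DELETED:
                self.keys[i], self.values[i] = key, value
                return
        raise RuntimeError("table full: rehash needed")

    def find(self, key):
        for i in self._probe(key):
            if self.keys[i] is EMPTY:
                break                            # never-used slot: not present
            if self.keys[i] is not DELETED and self.keys[i] == key:
                return self.values[i]
        raise KeyError(key)

    def delete(self, key):
        for i in self._probe(key):
            if self.keys[i] is EMPTY:
                break
            if self.keys[i] is not DELETED and self.keys[i] == key:
                self.keys[i] = DELETED           # tombstone keeps chains intact
                return
        raise KeyError(key)
```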
## universal hash function
Randomly selecting a hash function from a family of hash functions each time a hash table is created is referred to as universal hashing. The universal property states that for any two keys $k_1 \ne k_2$, the probability of a collision is $P(h(k_1) = h(k_2)) \le 1/m$. In other words, the average case complexity under a universal hash function is approximately equal to the load factor.
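A classic universal family for integer keys is the Carter-Wegman construction $h_{a,b}(k) = ((ak + b) \bmod p) \bmod m$ with $p$ a prime larger than any key; a sketch:

```python
import random

def make_universal_hash(m, p=2_305_843_009_213_693_951):   # p = 2^61 - 1, prime
    # Draw a and b at random: a fresh hash function per table.
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % m

h = make_universal_hash(m=64)
print(h(12345), h(67890))
```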
## perfect hashing
Perfect hashing is a hashing scheme using a universal hash function and a two-level hash table. Instead of chaining, collisions are handled by moving to a second-level hash table. Technically the second-level hash table is the "perfect" hash table.
The conditions under which perfect hashing is available are strict: all keys need to be known up front and you cannot insert or delete new keys. However, under these conditions perfect hashing is very efficient.
1. Choose a hash function at random from a universal family
2. Create a hash table with $Kn^2$ slots, where $K$ is a constant
3. Insert every key into the hash table
4. If any collision happens: abort and redo the procedure with a new hash function (see the sketch below)
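A sketch of the procedure, reusing `make_universal_hash` from the universal hashing sketch above. With $m = Kn^2$ slots, the universal property bounds the probability of any collision by $\binom{n}{2}/m < 1/(2K)$, so only a constant number of retries is expected:

```python
def build_perfect_table(keys, K=2):
    n = len(keys)
    m = K * n * n                        # step 2: quadratic number of slots
    while True:
        h = make_universal_hash(m)       # step 1: random hash function
        table = [None] * m
        for k in keys:
            i = h(k)
            if table[i] is not None:     # step 4: collision, abort and redo
                break
            table[i] = k                 # step 3: insert every key
        else:
            return table, h              # no collisions: a perfect table
```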
## cuckoo hashing
Cuckoo hashing is a clever implementation of a hash table with guaranteed constant time for lookup and deletion. Cuckoo hashing uses two hash tables, and each key is guaranteed to be in one of exactly two places: its slot in the first table or its slot in the second. Lookup therefore inspects at most two slots.
On insert, if the key's slot in the first hash table is occupied, check its slot in the second hash table. If that is also occupied, displace the resident key with the new key; the displaced key then moves to its slot in the other table, possibly displacing another key, and so on until the **displacement chain** ends at a free slot. If the displacement chain goes on for too long (such as more than $\log n$ steps), rehash and select new hash functions.
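A sketch with integer keys, again borrowing `make_universal_hash` from the universal hashing sketch; the chain cap and growth factor are arbitrary illustrative choices:

```python
import math

class CuckooTable:
    def __init__(self, m=16):
        self.m = m
        self.tables = [[None] * m, [None] * m]
        self.hashes = [make_universal_hash(m), make_universal_hash(m)]

    def find(self, key):
        # A key can only live in one of two slots: O(1) worst-case lookup.
        return any(self.tables[t][self.hashes[t](key)] == key for t in (0, 1))

    def delete(self, key):
        for t in (0, 1):                 # also O(1): only two places to check
            i = self.hashes[t](key)
            if self.tables[t][i] == key:
                self.tables[t][i] = None
                return

    def insert(self, key):
        limit = max(1, int(math.log2(self.m)))     # cap the displacement chain
        t = 0
        for _ in range(2 * limit):
            i = self.hashes[t](key)
            # Place the key; evict whatever was resident in the slot.
            key, self.tables[t][i] = self.tables[t][i], key
            if key is None:
                return                             # slot was free: done
            t = 1 - t            # evicted key tries its slot in the other table
        # Displacement chain too long: grow, pick new functions, reinsert.
        keys = [k for tbl in self.tables for k in tbl if k is not None] + [key]
        self.__init__(2 * self.m)
        for k in keys:
            self.insert(k)
```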
## Bloom filters
Bloom filters are a fast **set** data structure based on hashing. Bloom filters support insertion and membership checks. Membership checks are approximate: false positives are possible but unlikely, while false negatives are impossible.
Using $k$ randomly chosen hash functions, set the bits of a bit string at the positions each hash function maps the key to. These positions form the **signature** of the key. Check membership by testing whether all bits of the key's signature are set. It is possible for two keys to produce the same signature, but choosing $m$ and $k$ well keeps the probability of this low.
The probability of a false positive, where we assume that the status of each bit is independent, is
$(1-e^{-kn/m})^k$
where $n$ is the number of elements, $m$ is the bit vector size, and $k$ is the number of hash functions used.
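A sketch using $k$ salted copies of Python's built-in `hash()` as a stand-in for independent random hash functions:

```python
import math
import random

class BloomFilter:
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = [False] * m
        self.salts = [random.random() for _ in range(k)]  # one per hash function

    def _signature(self, key):
        # The k bit positions the hash functions map the key to.
        return [hash((salt, key)) % self.m for salt in self.salts]

    def insert(self, key):
        for i in self._signature(key):
            self.bits[i] = True

    def might_contain(self, key):
        # True may be a false positive; False is always correct.
        return all(self.bits[i] for i in self._signature(key))

def false_positive_rate(n, m, k):
    return (1 - math.exp(-k * n / m)) ** k   # the formula above
```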
Bloom filters are used in web servers to identify when content is stored in the cache versus when it must be looked up on the server.
## count-min sketches
Count-min sketches maintain an approximate count of the number of occurrences of each element in an input (e.g., the count of each word in a text).
Count-min sketches utilize a hash function to map items to counters. When an item is encountered, its counter is incremented. Where there is a collision, the count will be artificially inflated. We can use multiple banks of counters, each using a different hash function, and report the minimum count across banks for an item: since collisions can only inflate a counter, the minimum is the tightest overestimate. The probability of inflation above some threshold can be fixed by specifying the number of banks and the number of counters in each bank.
To calculate the number of counters $m$ and banks $k$ for an accuracy of $\epsilon$ and confidence of $\delta$
$m = \frac{e}{\epsilon}, \quad k = -\ln(1-\delta)$
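A sketch that sizes itself from $\epsilon$ and $\delta$ using the formulas above, with salted copies of `hash()` again standing in for independent hash functions:

```python
import math
import random

class CountMinSketch:
    def __init__(self, epsilon=0.01, delta=0.99):
        self.m = math.ceil(math.e / epsilon)        # counters per bank
        self.k = math.ceil(-math.log(1 - delta))    # number of banks
        self.banks = [[0] * self.m for _ in range(self.k)]
        self.salts = [random.random() for _ in range(self.k)]

    def add(self, item):
        for bank, salt in zip(self.banks, self.salts):
            bank[hash((salt, item)) % self.m] += 1

    def count(self, item):
        # Collisions only ever inflate counters, so the minimum across
        # banks is the tightest overestimate of the true count.
        return min(bank[hash((salt, item)) % self.m]
                   for bank, salt in zip(self.banks, self.salts))
```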
## hash functions in cryptography
### Merkle trees
Merkle trees are hash-based tree structures that underlie blockchain technology and Bitcoin.
## hash functions in string matching
Does a given pattern $P$ of size $m$ occur in a string $S$ of size $n$? This problem is found in text editors, basic search, genetics, and many other contexts.
A naive algorithm would traverse the string and check each candidate position character by character, which runs in $\Theta(m(n-m+1))$ and can be very expensive for large patterns or large texts.
Hash functions can speed up pattern matching by first comparing the hash of the pattern with the hash of the substring being inspected, and only comparing character by character when the hashes match. By itself this does not change the complexity of search, but it can be employed creatively, as in Rabin-Karp.
### Rabin-Karp
Rabin-Karp uses a rolling hash function that can be updated in constant time as the window slides by one character. With a low probability of collision, search can be implemented in $\Theta(m + n)$ expected time.
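A sketch with a polynomial rolling hash modulo a large prime; the base and modulus are arbitrary illustrative choices:

```python
def rabin_karp(pattern, text, base=256, mod=1_000_000_007):
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)             # weight of the outgoing character
    p_hash = t_hash = 0
    for i in range(m):                       # hashes of pattern, first window
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    matches = []
    for i in range(n - m + 1):
        # Compare characters only when the hashes agree.
        if p_hash == t_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:                        # roll the window one step right
            t_hash = ((t_hash - ord(text[i]) * high) * base
                      + ord(text[i + m])) % mod
    return matches

print(rabin_karp("abra", "abracadabra"))     # [0, 7]
```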
## hash functions in common substring matching
Does the same substring of size $m$ appear in two inputs? This is used in comparing texts, for example the commit diffs in [[git]], plagiarism detection, and other applications.
A naive algorithm would take each length-$m$ substring of one input as a pattern and check it against the other input with an efficient string matching algorithm, one pattern at a time.
A better implementation is to hash all substrings of size $m$ in the first input into a hash table; a perfect hash table can be used, as all keys are known up front. Then scan the second input and look up each of its length-$m$ substrings. If there are no spurious collisions, this runs in $\Theta(n_1 + n_2 + m)$. Otherwise in $\Theta(n_1 + m \times (n_2 - m + 1))$.
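A sketch where Python's built-in dict plays the role of the hash table; hashing each slice from scratch is simple, but a rolling hash (as in Rabin-Karp) would avoid the repeated work:

```python
def common_substrings(s1, s2, m):
    # Map each length-m substring of s1 to its first position.
    seen = {}
    for i in range(len(s1) - m + 1):
        seen.setdefault(s1[i:i + m], i)
    # Report (position in s1, position in s2) for each shared substring.
    return [(seen[s2[j:j + m]], j)
            for j in range(len(s2) - m + 1) if s2[j:j + m] in seen]

print(common_substrings("the quick fox", "a quick dog", 6))  # [(3, 1), (4, 2)]
```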
[^1]: Scott A. Crosby and Dan S. Wallach, [Denial of Service via Algorithmic Complexity Attacks](https://www.usenix.org/conference/12th-usenix-security-symposium/denial-service-algorithmic-complexity-attacks), Proceedings of the 12th USENIX Security Symposium, August 2003.