This article explores the Karp Rabin algorithm implemented in the Rolling Hash data structure. The article aims to show how rolling hashes can speed up inefficient ‘String matching’ problem.
Karp Rabin
The Rabin Karp algorithm is an example of a Rolling Hash algorithm that can be used in string matching. The ‘Rolling’ comes from the adding and subtracting of a character in the rolling hash window. Each character in string is evaluated only once.
The Rolling hash is an abstract data type that supports:
- hash function
- append(value)
- skip(value)
- Internal list.
The Strings to be matched are stored as Characters in a list. Each character is treated as integers with their values depending on the encoding used. For example, a string with an ASCII encoding represents each item in the list as a number in its base (ASCII has base 256). This means ‘A’ as 65, B as 66, etc…
Implementation
- Hash function
The hash function used in Karp Rabin’s Rolling hash is:
where …
- C is the- base value for the character
- k is the length of the target substring.
- base is usually the prime number > 256 since ASCII = 256
The efficiency in the Rolling hash comes from not needing to re-compute the hash for the entire length of string k. This efficiency is possible because the rehashing involves removing the most significant digit and add the new least significant digit for the hash value without the need to recompute the hash for the full length of string k.
Example 1
An array of [1,2,3,4,5,6], and using a hash window of length k of 3.
Rolling Hash for 3 items | Note | |
---|---|---|
hash(123 ) |
123 |
|
remove(1 ) |
Remove most significant digit. | 23 |
add(4 ) |
Add least significant digit. | (123 – 1 * 100) * 10 + 4 = 234 |
Note: 100 (most significant base) in (1 * 100) comes from 2 ^ 10 (base 10 numbers).
1 | function rollingHash(data = []) { |
Example 2
String value of abcdef
matching bc
. Therefore k=2
Rolling Hash for 2 chars | |
---|---|
hash(ab ) |
(97 * 257^2-1) + (98 * 257^1) => 24,929 + 25,186 = 50,115 |
remove(a ) |
(97 * 257^2-1) = 24,929 |
add(c ) |
50,115 - (97 * 257) + (99 * 257) |
50,115 - 24929 + (99 * 257) | |
25,186 + 25,443 = 50,629 |
The below shows the hash calculated for bc
, this hash value (50,629) matches the hash generated using rolling hashing.
String | |
---|---|
bc |
(98 * 257^2-1) + (99 * 257^1) = 25,186 + 25,443 = 50,629 |
1 | // Note the base is very important as it affects the correctness of the hash |
Time complexity
Assuming the text is length n and the length of the word is m, the average and best-case running time of Rabin-Karp is O(n + m).
The worst-case time is O(nm), this occurs when all characters of pattern and text are same as the hash values causing lots of hash collisions. (essentially doing a character by character comparison)
Source - Karp Rabin
Source - Karp Rabin Q&A
Source - Rolling hash