From Algorithms and Data Structures for Massive Datasets by Dzejla Medjedovic, Emin Tahirovic, and Ines Dedovic This article covers: • Learning what Bloom filters are, why and when they are useful • Understanding how Bloom filters work • Configuring a Bloom filter in a practical setting • Exploring the interplay between Bloom filter parameters |
Take 40% off Algorithms and Data Structures for Massive Datasets by entering fccmedjedovic into the discount code box at checkout at manning.com.
Bloom filters seem to be all the rage these days. Most self-respecting industry blogs have articles fleshing out how Bloom filters enhance the performance in their infrastructure, and there are dozens of Bloom filter implementations floating around in various programming languages, each touting its own benefits. Bloom filters are also interesting to computer science researchers, who have, in past decade, designed many modifications and alternatives to the basic data structure, enhancing its various aspects. A skeptic and a curmudgeon in you might ask himself: What’s with all the hype?
The large part of the reason behind Bloom filter popularity is that they have that combination of being a fairly simple data structure to design and implement that is also very useful in many contexts. They were invented in 1970s by Burton Bloom[1],[2] but they only really “bloomed” in the last few decades with the onslaught of large amounts of data in various domains, and the need to tame and compress such huge datasets.
One simple way to think about Bloom filters is that they support insert and lookup in the same way the hash tables do, but using very little space, i.e., one byte per item or less. This is a significant saving when you have many items and each item takes up, say 8 bytes.
Bloom filters do not store the items themselves and they use less space than the lower theoretical limit required to store the data correctly, and therefore, they exhibit an error rate. They have false positives but they do not have false negatives, and the one-sidedness of this error can be turned to our benefit. When the Bloom filter reports the item as Found/Present
, there is a small chance that it is not telling the truth, but when it reports the item as Not Found/Not Present
, we know it’s telling the truth. So, in the context where the query answer is expected to be Not Present
, Bloom filters offer great accuracy most of the time plus space-saving benefits.
For instance, this is how Bloom filters are used in Google’s Webtable[3] and Apache Cassandra[4] that are among the most widely used distributed storage systems designed to handle massive amounts of data. Namely, these systems organize their data into a number of tables called Sorted String Tables (SSTs) that reside on the disk and are structured as key-value maps. In Webtable, keys might be website names, and values might be website attributes or contents. In Cassandra, the type of data depends on what system is using it, so for example, for Twitter, a key might be a User ID, and the value could be user’s tweets.
When users query for data, a problem arises because we do not know which table contains the desired result. To help locate the right table without checking explicitly on the disk, we maintain a dedicated Bloom filter in RAM for each of the tables, and use them to route the query to the correct table, in the way described in Figure 1:
Figure 1: Bloom filters in distributed storage systems. In this example, we have 50 sorted string tables (SSTs) on disk, and each table has a dedicated Bloom filter that can fit into RAM due to its much smaller size. When a user does a lookup, the lookup first checks the Bloom filters. In this example, the first Bloom filter that reports the item as Present is Bloom filter No.3. Then we go ahead and check in the SST3 on disk whether the item is present. In this case, it was a false alarm. We continue checking until another Bloom filter reports Present. Bloom filter No.50 reports present, we go to the disk and actually locate and return the requested item.
Bloom filters are most useful when they are strategically placed in high-ingestion systems, in parts of the application where they can prevent expensive disk seeks. For example, having an application perform a lookup of an element in a large table on a disk can easily bring down the throughput of an application from hundreds of thousands ops/sec to only a couple of thousands ops/sec. Instead, if we place a Bloom filter in RAM to serve the lookups, this will deem the disk seek unnecessary except when the Bloom filter reports the key as Present
. This way the Bloom filter can remove disk bottlenecks and help the application maintain consistently high throughput across its different components.
In this article, you will learn how Bloom filters work and when to use them, with various practical scenarios. You will also learn how to configure the parameters of the Bloom filter for your particular application: there is an interesting interplay between the space (m), number of elements (n), number of hash functions (k), and the false positive rate (f). For readers who like a challenge, we will spend some time understanding where the formulas relating the important parameters of the Bloom filter come from and exploring whether one can do better than a Bloom filter.
In that light, we will spend a substantial amount of time exploring an interesting new type of a compact hash table called a quotient filter[5] that is functionally similar to the Bloom filter, and also offers many other advantages.
How It Works
A Bloom filter has two main components:
- A bit array
A[0..m-1]
with all slots initially set to 0; and - k independent hash functions h_{1}, h_{2},…, h_{k}, each mapping keys uniformly randomly onto a range [0,m−1]
Insert
To insert an item x into the Bloom filter, we first compute the k hash functions on x, and for each resulting hash, set the corresponding slot of A
to 1 (see pseudocode and Figure 2 below):
Bloom_insert(x): for i ← 1 to k A[h_{i}(x)] ←1
Figure 2: Example of insert into Bloom filter. In this example, an initially empty Bloom filter has m=8, and k=2 (two hash functions). To insert an element x, we first compute the two hashes on x, the first one of which generates 1 and the second one generates 5. We proceed to set A[1] and A[5] to 1. To insert y, we also compute the hashes and similarly, set positions A[4] and A[6] to 1.
Lookup
Similarly to insert, lookup computes k hash functions on x, and the first time one of the corresponding slots of A equal to 0, the lookup reports the item as Not Present, otherwise it reports the item as Present (pseudocode below):
Bloom_lookup(x): for i ← 1 to k if(A[h_{i}(x)] = 0) return NOT PRESENT return PRESENT
Here is an example of a Bloom filter lookup (Figure 3):
Figure 3: Example of a lookup on a Bloom filter. We take the resulting Bloom filter from Figure 2, where we inserted elements x and y. To do a lookup on x, we compute the hashes (which are the same as in the case of an insert), and we return Found/Present, as both bits in corresponding locations equal 1. Then we do a lookup of an element z, which we never inserted, and its hashes are respectively 4 and 5, and the bits at locations A[4] and A[5] equal 1, thus we again return Found/Present. This is an example of a false positive, where two other items together set the bits of the third item to 1. An example of a negative (negative is always true), would be if we did a lookup on an element w, whose hashes are 2 and 5, (0 and 1), or 0 and 3 (0 and 0). If the Bloom filter reports an element as Not Found/Not Present, then we can be sure that this element was never inserted into a Bloom filter.
Asymptotically, the insert operation on the Bloom filter costs O(k). Considering that the number of hash functions rarely goes above 12, this is a constant-time operation. The lookup might also need O(k), in case the operation has to check all the bits, but most unsuccessful lookups will give up way before; later we will see that on average, an unsuccessful lookup takes about 1-2 probes before giving up.
Use Cases
In this section, we will see more applications of Bloom filters to distributed networks: Squid network proxy and Bitcoin mobile app.
Bloom Filters in Networks: Squid
Squid is a web proxy cache—a server that acts as a proxy between the client and other servers when the client requests a webpage, file, etc. Web proxies use caches to reduce web traffic, which means they maintain a local copy of recently-accessed links, in case they are requested again, and this usually enhances performance significantly. One of the protocols[6] designed suggests that a web proxy locally keeps a Bloom filter for each of its neighboring servers’ cache contents. This way when a proxy is looking for a webpage, it first checks its local cache. If a cache miss occurs locally, the proxy checks all its Bloom filters to see whether any of them contain the desired webpage, and if yes, it tries to fetch the webpage from the neighbor associated with that Bloom filter instead of directly fetching the page from the Web.
Squid implements this functionality and it calls Bloom filters Cache Digests[7] (see Figure 4.) Because data is highly dynamic in the network scenario, and Bloom filters are only occasionally broadcasted between proxies, false negatives can arise.
Figure 4: Usage of Bloom filter in Squid web proxy. Web proxies keep the copies of recently-accessed web pages, but also keep the record of recently accessed web pages of their neighbors by having each proxy occasionally broadcast the Bloom filter of their own cache. In this example, a user requests a web page x, and a web proxy A cannot find it in its own cache, so it queries the Bloom filters of B, C and D. The Bloom filter of D reports Found/Present for x, so the request is forwarded to D. Note that, because Bloom filters are not always up-to-date, and the network environment is highly dynamic, by the time we get to the right proxy, the cache might have deleted the resource that we are looking for. Also, false negatives may arise due to the gap in the broadcasting times.
Bitcoin mobile app
Peer-to-peer networks use Bloom filters to communicate data, and a well-known example of that is Bitcoin. An important feature of Bitcoin is ensuring transparency between clients, i.e., each node should be able to see everyone’s transactions. However, for nodes that are operating from a smartphone or a similar device of limited memory and bandwidth, keeping the copy of all transactions is highly impractical. This is why Bitcoin offers the option of simplified payment verification (SPV), where a node can choose to be a light node by advertising a list of transactions it is interested in. This is in contrast to full nodes that contain all the data (Figure 5):
Figure 5: In Bitcoin, light clients can broadcast what transactions they are interested in, and thereby block the deluge of updates from the network.
Light nodes compute and transmit a Bloom filter of the list of transactions they are interested in to the full nodes. This way, before a full node sends information about a transaction to the light node, it first checks its Bloom filter to see whether a node is interested in it. If the false positive occurs, the light node can discard the information upon its arrival.[8]
Configuring a Bloom filter for your application
When using an existing implementation of a Bloom filter, the constructor will allow you to set a couple of parameters and do the rest on its own. For example,
bloom = BloomFilter(max_elements, fp_rate)
allows the user to set the maximum number of elements and the desired false positive rate, and the constructor does the job of setting other parameters (size of the filter and the number of hash functions). Similarly, we can have:
bloom = BloomFilter(fp_rate, bits_per_element)
that allows a user to set the desired false positive and how many bits per element they are willing to spend, and the number of elements inserted and the number of hash functions are additionally set, or simply,
bloom = BloomFilter(),
where the implementation sets parameters to the default values. In the rest of the section, we outline the main formulas relating important parameters of the Bloom filter, which you will need if you decide to implement your own Bloom filter, or to understand how the existing implementations optimally configure the Bloom filter. We will use following notation for the four parameters of the Bloom filter:
- f = the false positive rate
- m = number of bits in a Bloom filter
- n = number of elements to insert
- k = number of hash functions
The formula that determines the false positive rate as a function of other three parameters is as follows (Formula 1):
First let’s reason this formula visually. Figure 6 below shows the plot of f as a function of k for different choices of m/n (bits per element). In many real-life applications, fixing bits-per-element ratio is meaningful because we often have an idea of how many bits we can spend per element. Common values for the bits-per-element ratio are between 6 and 14, and such ratios allow us fairly low false positive rates as shown in the graph below:
Figure 6: The plot relating the number of hash functions (k) and the false positive rate (f) in a Bloom filter. The graph shows the false positive rate for a fixed bits-per-element ratio (m/n), different curves corresponding to different ratios. Starting from the top to bottom, we have m/n=6, 8, 10, 12, 14. As the amount of allowed space per element increases (going from top to bottom), given the same number of hash functions, the false positive rate drops. Also, the curves show the trend that increasing k up until some point (going from left to right), for a fixed m/n, reduces the error, but after some point, increasing k increases the error rate. Note that the curves are fairly smooth, and for example, when m/n=8, i.e., we are willing to spend 1 byte per element, if we use anywhere between 4 and 8 hash functions, the false positive rate will not go above 3%, even though the optimal choice of k is between 5 and 6.
While increasing m or reducing n drops the false positive rate, i.e., more bits per element results in the overall lower false positive curve, the graph also shows the two-fold effect that k has on false positives: up to a certain point, increasing k helps reduce false positives, but there is a point at which it starts to worsen it; this is because having more hash functions allows a lookup more chance to find a zero, but also on an insert, sets more bits to 1. The minimum for each curve is the sweet spot that is the optimal k for a particular bits-per-element (which we get by doing a derivative on Formula 1 with respect to k), and it happens at (Formula 2):
For example, when m/n = 8, k_{opt} = 5.545. We can use this formula to optimally configure the Bloom filter, and an interesting consequence of choosing parameters this way is that in such a Bloom filter, the false positive rate turns out to be (Formula 3):
This is particularly convenient, considering that a false positive occurs when a lookup encounters a cell whose value is 1 k times in a row, which means that in an optimally filled Bloom filter the probability of a bit being equal to 1 is ½.
Keep in mind that these calculations assume k is a real number, but our k has to be an integer. So if Formula 2 produces a non-integer, and we need to choose one of the two neighboring integers, then Formula 3 is also not an exact false positive rate anymore. The only correct formula to plug into is Formula 1, but even with Formula 3, we will not make too grave of a mistake. Often it is better to choose the smaller of the two possible values of k, because it reduces the amount of computation we need to do.
Examples
Here we show some examples of how to configure Bloom filters in different situations.
Example 1. Calculating f from m, n, and k
You are trying to analyze the false positive rate of an already existing Bloom filter that has been acting up. The filter capacity is 3MB, and over time it ended up storing 10^{7} elements (~2.5 bits per element) and it uses 2 hash functions.
Answer for Example 1:
Using Formula 1, we obtain the following:
Example 2. Calculating f and k from n and m
Consider you wish to build a Bloom filter for n = 10^{6} elements, and you have about 1MB available for it (m = 8 ∗ 10^{6} bits). Find the optimal false positive rate and determine the number of hash functions.
From Formula 2, the ideal number of hash functions should be k ≈ 0.693 ∗ 8 ∗ 10^{6} ⁄ 10^{6} = 5.544. Formula 3 tells us that the false positive rate is f ≈ (1⁄2)^{5.544} ≈ 0.0214, but we need a legal value of k. In this situation, we might choose k = 5 or k = 6. In both cases, we will still obtain 2% false positive rate.
Answer for Example 2:
Consider re-doing the Example 2 where the dataset becomes 100 times larger, and false positive rate is kept fixed: if we do the math, we will see that we will also require approximately 100 times larger Bloom filter. Therefore, Bloom filters grow linearly with the size of the dataset and even though they are intended to be a small signature of the original data, they can also grow large enough to spill over to SSD/disk.
A bit of theory
First let’s see where the main formula for the Bloom filter false positive rate (Formula 1) comes from, as Formulas 2 and 3 are the consequence of minimizing f in Formula 1 with respect to k. For this analysis, we assume that hash functions are independent (the results of one hash function do not in any way affect the results of any other hash function) and that each function maps keys uniformly randomly over the range [0… m − 1].
If t is the fraction of bits that are still 0 after all n insertions took place, and k is the number of hash functions, then the probability f of a false positive is:
f = (1 − t)^{k}
considering that we need to get k 1s in order to report Present. It is impossible to know beforehand what t will be, because it depends on the outcome of hashing, but we can work with probability p of a bit being equal to 0 after all inserts took place, i.e.:
p = Prob(a fixed bit equals 0 after n inserts)
The value of p will in the probabilistic sense, translate to the percentage of 0s in the filter. Now we derive the value of p to be equal the following expression:
To understand why this is true, let’s start from the empty Bloom filter. Right after the first hash function h_{1} has set one bit to 1, the probability that a fixed bit in the Bloom filter equals 1 is 1⁄m, and the probability that it equals 0 is accordingly 1 − 1⁄m. After all the hashes of the first insert finished setting bits to 1, the probability that the fixed bit still equals zero is (1−1⁄m)^{k}, and after we finished inserting the entire dataset of size n, this probability is (1−1⁄m)^{nk} . The approximation (1−1⁄x)^{x} ≈ e then further gives p ≈ e^{–nk/m}.
If we just replace t from the earlier expression f = (1 − t )^{k} with our new value of p, we will obtain Formula 1. But to make this replacement kosher, we first have to prove that p is a random variable that is very stably concentrated around its mean, and this can be proved using Chernoff bounds. This means that it is exponentially unlikely that p will differ substantially from t, so it is safe to replace one with the other, thus giving Formula 1.
Can we do better?
Bloom filter packs the space really well but are there, or are there better data structures? In other words, for the same amount of space, can we achieve a better false positive rate than the Bloom filter? To answer this question, we need to derive a lower bound that relates the space in the Bloom filter (m) with the false positive rate (f). This lower bound (available in some more theoretical resources on the subject) tells us that the amount of space the Bloom filter uses is 1.44x away from the minimum. There are, in fact, data structures that are closer to this lower bound than Bloom filter, but some of them are very complex to understand and implement.
Further reading: Bloom filter adaptations and alternatives
The basic Bloom filter data structure leaves a lot to be desired, and computer scientists have developed various modified versions of Bloom filters that address its various inefficiencies. For example, the standard Bloom filter does not handle deletions. There is a version of Bloom filter called counting Bloom filter[9] that uses counters instead of individual bits in the cells. The insert operation in the counting Bloom filter increments the respective counters, and the delete operation decrements the corresponding counters. Counting Bloom filters use more space and can also lead to false negatives, when, for example, we repeatedly delete the same element thereby bringing down some other elements’ counters to zero.
Another issue with Bloom filters is their inability to be efficiently scaled. One of the problems with scaling in the way we are used to with hash tables, by rehashing and re-inserting, is that we do not store the items nor the fingerprints in the Bloom filter, so the original keys are effectively lost and rehashing is not an option.
Also, Bloom filters are vulnerable when the queries are not drawn uniformly and randomly. Queries in real-life scenarios are rarely uniform random. Instead, many queries follow the Zipfian distribution, where a small number of elements is queried a large number of times, and a large number of elements is queried only once or twice. This pattern of queries can increase our effective false positive rate, if one of our “hot” elements, i.e., the elements queried often, results in the false positive. A modification to the Bloom filter called weighted Bloom filter[10] addresses this issue by devoting more hashes to the “hot” elements, thus reducing the chance of the false positive on those elements. There are also new adaptations of Bloom filters that are adaptive, i.e. upon the discovery of a false positive, they attempt to correct it.[11]
The other vein of research has been focused on designing data structures functionally similar to the Bloom filter, but their design has been based on particular types of compact hash tables. Quotient filters are a viable alternative to Bloom filters, but they deserve their own article (we cover them extensively in the book).
Summary
- Bloom filters have been widely applied in the context of distributed databases, networks, bioinformatics, and other domains where regular hash tables are too space-consuming.
- Bloom filters trade accuracy for the savings in space, and there is a relationship between the space, false positive rate, the number of elements and the number of hash functions in the Bloom filter.
- Bloom filters do not meet the space vs. accuracy lower bound, but they are simpler to implement than more space-efficient alternatives, and have been adapted over time to deal with deletes, different query distributions, etc.
That’s all for this article.
If you want to learn more about the book, check it out on browser-based liveBook platform here.
[1] B. H. Bloom, “Space/Time Trade-Offs in Hash Coding with Allowable Errors,” Communications of the ACM, vol. 13, no. 7, pp. 422-426, 1970.
[2] A. Broder and M. Mitzenmacher, “Network Applications of Bloom Filters: A Survey,” in Internet Mathematics, 2002, pp. 636-646.
[3] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes and R. E. Gruber, “Bigtable: A Distributed Storage System for Structured Data,” ACM Trans. Comput. Syst., vol. 26, no. 2, pp. 4:1-4:26, 2008.
[4] S. Lebresne, “The Apache Cassandra Storage Engine,” 2012. [Online]. Available: https://2012.nosql-matters.org/cgn/wp-content/uploads/2012/06/Sylvain_Lebresne-Cassandra_Storage_Engine.pdf. [Accessed 03 04 2016].
[5] M. A. Bender, M. Farach-Colton, R. Johnson, R. Kraner, B. C. Kuszmaul, D. Medjedovic, P. Montes, P. Shetty, R. P. Spillane and E. Zadok, “Don’t Thrash: How to Cache Your Hash on Flash,” in Proceedings of the VLDB Endowment (PVLDB), Vol. 5, No. 11, pp. 1627-1637 (2012), 2012.
[6] L. Fan, P. Cao, J. Almeida and A. Z. Broder, “Summary Cache: A Scalable Wide-area Web Cache Sharing Protocol,” IEEE/ACM Trans. Netw., vol. 8, no. 3, pp. 281-293, 2000.
[7] Squid, “Squid Cache Wiki,” [Online]. Available: http://wiki.squid-cache.org/SquidFaq/AboutSquid. [Accessed 19 03 2016].
[8] A. Gervais, S. Capkun, G. O. Karame and D. Gruber, “On the privacy provisions of Bloom filters in lightweight bitcoin,” in Proceedings of the 30th Annual Computer Security Applications Conference (ACSAC 2014), New Orleans, LA, 2014.
[9] L. Fan, P. Cao, J. Almeida and A. Z. Broder, “Summary Cache: A Scalable Wide-area Web Cache Sharing Protocol,” IEEE/ACM Trans. Netw., vol. 8, no. 3, pp. 281-293, 2000.
[10] J. Bruck, J. Gao and A. (. Jiang, “Weighted Bloom Filter,” in IEEE International Symposium on Information Theory, 2006.
[11] M. A. Bender, M. Farach-Colton, M. Goswami, R. Johnson, S. McCauley and S. Singh, “Bloom Filters, Adaptivity, and the Dictionary Problem,” in IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), 2018.