1133. Data Structure - Bloom Filter
Bloom Filter


Bloom filter and its implementation.

1. Bloom Filter

Bloom filter is a data structure designed to rapidly and memory-efficiently tell whether an element is present in a set.

The cost paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set.

The base data structure of a Bloom filter is a Bit Vector. image

2. Implementation

Use an integer array as bit vector. Each cell has the value either 0 or 1. One key will be hashed by three hash functions, then it will be stored into the array.

public class BloomFilter {
    private int capacity;
    private int[] array;
    public BloomFilter(int capacity) {
        this.capacity = capacity;
        array = new int[capacity];
    }

    public void add(String key) {
        int first = hash_function1(key);
        int second = hash_function2(key);
        int third = hash_function3(key);
        array[first%capacity] = 1;
        array[second%capacity] = 1;
        array[third%capacity] = 1;
    }

    public boolean contains(String key) {
        int first = hash_function1(key);
        int second = hash_function2(key);
        int third = hash_function3(key);

        int firstIndex = array[first % capacity];
        if (firstIndex == 0) {
            return false;
        }

        int secondIndex = array[second % capacity];
        if (secondIndex == 0) {
            return false;
        }

        int thirdIndex = array[third % capacity];
        if (thirdIndex == 0) {
            return false;
        }

        return true;
    }

    private int hash_function1(String key) {
        int hash = 0;
        for (int i = 0; i < key.length(); ++i) {
            hash = 33 * hash + key.charAt(i);
        }

        return Math.abs(hash);
    }

    private int hash_function2(String key) {
        final int p = 16777619;
        int hash = (int) 2166136261L;
        for (int i = 0; i < key.length(); ++i) {
            hash = (hash ^ key.charAt(i)) * p;
        }
        hash += hash << 13;
        hash ^= hash >> 7;
        hash += hash << 3;
        hash ^= hash >> 17;
        hash += hash << 5;
        return Math.abs(hash);
    }

    private int hash_function3(String key) {
        int hash, i;
        for (hash = 0, i = 0; i < key.length(); ++i) {
            hash += key.charAt(i);
            hash += (hash << 10);
            hash ^= (hash >> 6);
        }
        hash += hash << 13;
        hash ^= hash >> 11;
        hash += hash << 15;
        return Math.abs(hash);
    }
}

3. Performance Test

3.1 HashSet

We define a hashset with capacity 100,000,000 and insert 10,000,000 entries into the set. Then, we assert several elements to test whether they are in the set.

public class HashSetTest {
    @Test
    public void testHashSet() {
        int capacity = 100000000;
        int count = capacity / 10;
        long start = System.currentTimeMillis();
        Set<Integer> set = new HashSet<>(capacity);
        for (int i = 0; i < count; i++) {
            set.add(i);
        }
        assertTrue(set.contains(1));
        assertTrue(set.contains(2));
        assertTrue(set.contains(3));
        assertTrue(set.contains(999999));
        assertFalse(set.contains(10000001));
        assertFalse(set.contains(400230340));

        long end = System.currentTimeMillis();
        System.out.println("Executed Time: " + (end - start));
    }
}

It takes more than 6 seconds to execute the test.

Executed Time: 6442

3.2 Bloom Filter

Create a similar test with the Bloom Filter structure we built previously.

class BloomFilterTest {
    @Test
    public void testBloomFilter() {
        int capacity = 100000000;
        int count = capacity / 10;
        long start = System.currentTimeMillis();
        BloomFilter bloomFilter = new BloomFilter(capacity);
        for (int i = 0; i < count; i++) {
            bloomFilter.add(i + "");
        }
        assertTrue(bloomFilter.contains(1 + ""));
        assertTrue(bloomFilter.contains(2 + ""));
        assertTrue(bloomFilter.contains(3 + ""));
        assertTrue(bloomFilter.contains(999999 + ""));
        assertFalse(bloomFilter.contains(10000001 + ""));
        assertFalse(bloomFilter.contains(400230340 + ""));

        long end = System.currentTimeMillis();
        System.out.println("Executed Time: " + (end - start));

    }
}

It takes only about 1 second to complete the test, which is much faster than the hash set.

Executed Time: 1171

4. Source Files

5. References