Probablistic Data Structures

Probabilistic Data Structures

What?

Allows to do things in-memory at scale, which otherwise is not possible inside MEMORY.
Have small footprint.
Need to accept a predictable level of inaccuracy.

Types

Bloom Filters - are for set membership.
That is check if value is present or not.
Count-Min Sketch -
Does 2 things - check if present, and if yes, keeps track of count of occurrences.
In normal scenarios, we can use hash map; Key-value pair;
Issue - more JVM heap size and more time;
Soln - count-min sketch;
Count + minimum + sketch (i.e. rough idea);
Hashing function must all be different. That is each hash function shall give diff output for same input.
So if 3 functions, then 3 diff output for same input value.

When initialising a sketch, a couple of values to be considered -
* Epsilon - accepted error added to counts with each item; How wrong those counts could be?
* Delta - probability that estimate is outside accepted error; what is the probability of the wrong counts being in tolerable range?

These 2 variables help get us assess what is the Width and Depth of Count Min Sketch.

Width = math.ceil(2/epsilon)
Depth - math.ceil(-math.log(1-delta)/math.log(2))

We can use stream library for this.

In count-min sketch, unlike bloom filter focus is on less memory consumption and the runtime isn’t sufficient.

Eg - if you aren’t caring about count, but only looking at values like what is top 10 things bought today etc, this count min sketch is useful.

Use cases -
* Any kind of freq tracking.
* NLP
* Extension - heavy hitters(twitter trends), range query(Solr).

HyperLogLog (27.31 min) -
It is for cardinality.
For example in my list of things, how many unique items are there, what are unique visitors on my site.

In normal scenario this can be done via Set.
In HyperLogLog -