Count Min Sketch in Pythonic Art

Mohammed Zaheeruddin Malick
2 min readJan 21, 2020

Constant time probabilistic data structure for counting streaming events, in Python.

Count Min Sketch is a fast and highly space efficient probabilistic data structure that serves as a frequency table of events in a stream, Similar to Bloom Filter, Count Min Sketch utilizes fast non-cryptographic hash functions (k) for mapping events to frequencies. Count Min Sketch guarantees zero under-counting and tunable probability over-counting, The accuracy of the sketch is a function of the number of hash functions (k) and the number of slots per hash function (m).

This code illustrates an example use case for count min sketch which is to count the viewers watching all sci-fi movies available on any streaming service.

Part II of the 2020 weekend coding series on probabilistic data structures.

Part I: https://medium.com/@mohammedzu/bloom-filters-in-pythonic-art-4226657db71a

Errata in the hand-written notes below: In the table H3(3) is 4 and Min(H1…K(‘Lion’)) is 4 as well, not 6, which is the frequency of the event ‘Lion’, ‘Lion’ and ‘Cat’ have collisions at H1(1), H2(2) and H3(4)

Try running the program multiple times for instances of over-counting.

$time python3 count_min_sketch.py

Trial: 1/3 [Initializing...]
Filter Size: 10977 Hash Functions: 3 Unique Items: 173
Loose Limit for Events: 1295
Generating Random Click Stream and counting... Please wait
Stream Size: 224035
Validating Counter Sketch...
Over-counted (0/173): {}

Trial: 2/3 [Initializing...]
Filter Size: 10977 Hash Functions: 3 Unique Items: 173
Loose Limit for Events: 1295
Generating Random Click Stream and counting... Please wait
Stream Size: 224035
Validating Counter Sketch...
Over-counted (1/173): {'Resident Evil (2002)': {'True Count': 452, 'Min Count': 487, 'Count Vector': [487, 1969, 1461]}}

Trial: 3/3 [Initializing...]
Filter Size: 10977 Hash Functions: 3 Unique Items: 173
Loose Limit for Events: 1295
Generating Random Click Stream and counting... Please wait
Stream Size: 224035
Validating Counter Sketch...
Over-counted (1/173): {'A Scanner Darkly (2006)': {'True Count': 1324, 'Min Count': 1959, 'Count Vector': [8671, 1959, 2316]}}

Accuracy over 3 Trials: 0.9961464354527938

real 0m10.525s
user 0m10.417s
sys 0m0.050s

--

--