Count Min Sketch in Pythonic Art

Mohammed Zaheeruddin Malick
2 min readJan 21, 2020

Constant time probabilistic data structure for counting streaming events, in Python.

Count Min Sketch is a fast and highly space efficient probabilistic data structure that serves as a frequency table of events in a stream, Similar to Bloom Filter, Count Min Sketch utilizes fast non-cryptographic hash functions (k) for mapping events to frequencies. Count Min Sketch guarantees zero under-counting and tunable probability over-counting, The accuracy of the sketch is a function of the number of hash functions (k) and the number of slots per hash function (m).

This code illustrates an example use case for count min sketch which is to count the viewers watching all sci-fi movies available on any streaming service.

Part II of the 2020 weekend coding series on probabilistic data structures.

Part I: https://medium.com/@mohammedzu/bloom-filters-in-pythonic-art-4226657db71a

Errata in the hand-written notes below: In the table H3(3) is 4 and Min(H1…K(‘Lion’)) is 4 as well, not 6, which is the frequency of the event ‘Lion’, ‘Lion’ and ‘Cat’ have collisions at H1(1), H2(2) and H3(4)

Try running the program multiple times for instances of over-counting.

$time python3 count_min_sketch.py

Trial: 1/3 [Initializing...]
Filter Size: 10977 Hash Functions: 3 Unique Items: 173
Loose Limit for Events: 1295
Generating Random Click Stream and counting... Please wait
Stream Size: 224035
Validating Counter Sketch...
Over-counted (0/173): {}

Trial: 2/3 [Initializing...]
Filter Size: 10977 Hash Functions: 3 Unique Items: 173
Loose Limit for Events: 1295
Generating Random Click Stream and counting... Please wait
Stream Size: 224035
Validating Counter Sketch...
Over-counted (1/173): {'Resident Evil (2002)': {'True Count': 452, 'Min Count': 487, 'Count Vector': [487, 1969, 1461]}}

Trial: 3/3 [Initializing...]
Filter Size: 10977 Hash Functions: 3 Unique Items: 173
Loose Limit for Events: 1295
Generating Random Click Stream and counting... Please wait
Stream Size: 224035
Validating Counter Sketch...
Over-counted (1/173): {'A Scanner Darkly (2006)': {'True Count': 1324, 'Min Count': 1959, 'Count Vector': [8671, 1959, 2316]}}

Accuracy over 3 Trials: 0.9961464354527938

real 0m10.525s
user 0m10.417s
sys 0m0.050s

Free

Distraction-free reading. No ads.

Organize your knowledge with lists and highlights.

Tell your story. Find your audience.

Membership

Read member-only stories

Support writers you read most

Earn money for your writing

Listen to audio narrations

Read offline with the Medium app

Mohammed Zaheeruddin Malick
Mohammed Zaheeruddin Malick

Written by Mohammed Zaheeruddin Malick

Tireless Learner | Geek | Father of two budding Women leaders!

Responses (1)

Write a response

hello ,i am new in this I want to ask how can i use this to count flow in a binary dataset file
Count exactly how many packets for every five tuple flow (Flow size counting).
Count approximately every flow size both amount of memory. data file is a…

--