type
status
date
slug
summary
tags
category
icon
password
A probabilistic sub-linear space streaming algorithm which can be used to summarize a data stream in many different ways.
we will not be storing the complete values of data stream , instead we will use a matrix to compute the frequency, where number of rows would be number of Hash functions we are using and columns would be number of out come of the HashFunctions.(one ans every time, not multiple ans in one output)
it guarantees to give the exact frequency or more.
Frequency Estimation
A frequency estimator is a data structure that supports
- increment(x), which increments the number of times that we’ve seen x
- estimate(x), which returns an estimate of how many times we’ve seen x.
CM Sketch
Count-Min sketch is mostly used to find Heavy Hitters in a data set, that is, all elements whose frequency > T
H1(A) = 1, H2(A) = 3, H3(A) = 1, H4(A)=2
H1(B)= 3 ,H2(B)= 5 , H3(B) = 3 , H4(B) =1
H1(D)= 2 ,H2(D)= 1 , H3(D) = 4 , H4(D) =4
用多个hash结果组成一个pattern,我们只记录count。当然这个pattern存在被污染的风险
统计网络数据流中某个元素出现的频率,反应数据流的特征。并不实际的存储数据流中的元素,只存储他们的计数(元素多种多样,collosion很正常)
Why it works
It’s proven in the paper by Muthukrishnan and Cormode that for a sketch of size with total count , it follows that any estimate has error at most , with probability at least . So setting the parameters and large enough allows us to achieve very high accuracy while using relatively little space.
Top-K优化
通过partiton,减治法,随机算法
Reference
- 别的Sketch:
- https://zhuanlan.zhihu.com/p/644830210
- https://zaoxing.github.io/papers/2019/SIGCOMM19_NitroSketch.pdf
- Counting Bloom Filter:https://cloud.tencent.com/developer/article/1136056
- 支持删除,通过用cnt代替1 bit,存储更多的信息
- Advanced DS:
Implementation
- URL:/article/1d25e01b-bd7d-4fbf-b8e8-0af58c92437b
- Copyright:All articles in this blog, except for special statements, adopt BY-NC-SA agreement. Please indicate the source!