場景
- 爬蟲時判斷某個URL是否已經(jīng)被爬取過
- 黑名單過濾
- 防止緩存穿透
- ...
實現(xiàn)原理
定義一個長度為m的bit型數(shù)組flag[] (用來添加元素以及判斷元素是否存在,因為Integer的最大值為2147483647,所以m取該值即可)
定義n個不同的hash函數(shù)(在添加元素時,需要設(shè)置flag[]哪些位為1; 在判斷元素是否存在時,需要取flag[]哪些位來判斷)
添加某個元素時,通過n個hash函數(shù)算出該元素的n個hash值(整型值),把flag[]對應(yīng)的位置1
判斷某個元素是否存在時,通過n個hash函數(shù)算出該元素的n個hash值(整型值),在flag[]取出對應(yīng)的值,只要有一個不為1 ,即可判斷為不存在.否則就任務(wù)元素存在
布隆過濾器優(yōu)缺點
優(yōu)點: 大大節(jié)省空間
場景: 在10億數(shù)據(jù)中判斷某個數(shù)據(jù)是否存在
如果使用HashSet/HashMap來實現(xiàn)的話
查找的時間復(fù)雜度是O(1),但是我們來算一下存儲空間,Hash值為Integer類型,占四個字節(jié),那10億條數(shù)據(jù)占用的空間就是:10億*4/1024/1024/1024約等于3.7G......這個實現(xiàn)方案很明顯不現(xiàn)實
如果使用布隆過濾器實現(xiàn)
占用的空間大約為2147483647/8/1024/1024=256M
缺點: 誤差
由上面的分析可知, hash函數(shù)是存在hash沖突的, 所以布隆過濾器是會有誤判的情況.
表現(xiàn)為:
如果某條記錄被判斷為不存在,則該記錄必然不存在
如果某條記錄被判斷存在,則該記錄可能會不存在
代碼實現(xiàn)
JDK版
public class BloomFilterDemo {
private static final int insertions = 1000000;
@Test
public void bfTest1(){
//初始化一個存儲string數(shù)據(jù)的布隆過濾器息楔,初始化大小100w,不能設(shè)置為0
BloomFilter<String> bf = BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8), insertions,0.001);
//初始化一個存儲string數(shù)據(jù)的set涮拗,初始化大小100w
Set<String> sets = new HashSet<>(insertions);
//初始化一個存儲string數(shù)據(jù)的set奏司,初始化大小100w
List<String> lists = new ArrayList<>(insertions);
//向三個容器初始化100萬個隨機并且唯一的字符串---初始化操作
for (int i = 0; i < insertions; i++) {
String uuid = UUID.randomUUID().toString();
bf.put(uuid);
sets.add(uuid);
lists.add(uuid);
}
//布隆過濾器錯誤判斷的次數(shù)
int wrong = 0;
//布隆過濾器正確判斷的次數(shù)
int right = 0;
for (int i = 0; i < 10000; i++) {
//按照一定比例選擇bf中肯定存在的字符串
String test = i%100==0?lists.get(i/100):UUID.randomUUID().toString();
if(bf.mightContain(test)){
if(sets.contains(test)){
right ++;
}else{
wrong ++;
}
}
}
//100
System.out.println("=================right====================="+right);
System.out.println("=================wrong====================="+wrong);
}
@Test
public void bfTest2() {
//預(yù)計要插入多少數(shù)據(jù)
int size = 1000000;
//期望的誤判率
double fpp = 0.01;
BloomFilter<Integer> bloomFilter = BloomFilter.create(Funnels.integerFunnel(), size, fpp);
//插入數(shù)據(jù)
for (int i = 0; i < 1000000; i++) {
bloomFilter.put(i);
}
int count = 0;
for (int i = 0; i < 1000000; i++) {
if (!bloomFilter.mightContain(i)) {
count++;
System.out.println(i + "誤判了");
}
}
for (int i = 1000000; i < 2000000; i++) {
if (bloomFilter.mightContain(i)) {
count++;
System.out.println(i + "誤判了");
}
}
System.out.println("總共的誤判數(shù):" + count);
}
}
redis版
/**
*
* 原理是和 JDK 自帶的BloomFilter類似的,我們看add方法,它先 Redis 緩存中是否有指定 key(如:orderBloomFilter) 的值理澎,如果沒有狞洋,則在 offset = 0 處,添加一個值為false颜武,即為 0;
* 然后調(diào)用createHashes(byte[] data, int hashes)方法拖吼,根據(jù)字節(jié)數(shù)組的內(nèi)容生成digest鳞上,并將結(jié)果分割成 4 字節(jié)的整數(shù)并存儲在數(shù)組中,數(shù)組中的整數(shù)可以理解為每次hash所得的hashcode的值吊档。
* 最后篙议,遍歷hashcode數(shù)組,將hashcode%sizeOfBloomFilter取模所得下標(biāo)所對應(yīng)的值設(shè)為true怠硼,即為 1鬼贱。
*
* contains方法,同樣調(diào)用createHashes(byte[] data, int hashes)得到字節(jié)數(shù)組內(nèi)容所對應(yīng)的hashcode數(shù)組香璃。
* 遍歷hashcode數(shù)組这难,如果有一個hashcode所對應(yīng)的下標(biāo)的值不為1,則該數(shù)據(jù)不存在增显。反之雁佳,只有所有的hashcode所對應(yīng)的下標(biāo)的值都為1脐帝,才能說明該數(shù)據(jù)已經(jīng)存在。
*
* @author: ZENGQINGXUN178
* @Date: 2019-8-27 10:30
* @Description:
*/
@Component
public class BloomFilterService<E> {
@Autowired
private JedisCluster jedisCluster;
/**
* total length of the Bloom filter
*/
private static int sizeOfBloomFilter;
/**
* number of hash functions
*/
private static int numberOfHashFunctions;
/**
* 誤差率
*/
private static final double falsePositiveProbability = 0.01;
/**
* 預(yù)計容量
*/
private static final int expectedNumberOfElements = 10000;
private static String bloom_name = "bloom_name_1";
/**
* encoding used for storing hash values as strings
*/
private final Charset charset = Charset.forName("UTF-8");
/**
* MD5 gives good enough accuracy in most circumstances. Change to SHA1 if it's needed
*/
private static final String hashName = "MD5";
private static final MessageDigest digestFunction;
/**
* The digest method is reused between instances
*/
static {
MessageDigest messageDigest;
try {
messageDigest = java.security.MessageDigest.getInstance(hashName);
} catch (NoSuchAlgorithmException e) {
messageDigest = null;
}
digestFunction = messageDigest;
// numberOfHashFunctions = ceil(-ln(falsePositiveProbability)/ln2)
numberOfHashFunctions = (int) Math.ceil(-(Math.log(falsePositiveProbability) / Math.log(2)));
// sizeOfBloomFilter = ceil(numberOfHashFunctions*expectedNumberOfElements/ln2)
sizeOfBloomFilter = (int) Math.ceil(numberOfHashFunctions * expectedNumberOfElements / Math.log(2));
}
@PostConstruct
public void init(){
// 1. 獲取數(shù)據(jù)
List<String> datas = new ArrayList<>();
for (int i = 0; i < expectedNumberOfElements; i++) {
datas.add(i + "");
}
if (jedisCluster.get(bloom_name) == null) {
jedisCluster.setbit(bloom_name, 0, false);
}
// 2. 把數(shù)據(jù)放進bloomFilter
JedisClusterPipeline pipelined = JedisClusterPipeline.pipelined(jedisCluster);
try {
datas.forEach(e -> add(e.getBytes(charset),pipelined));
}finally {
pipelined.syncAndReturnAll();
}
}
/**
* Adds an object to the Bloom filter. The output from the object's
* toString() method is used as input to the hash functions.
*
* @param element is an element to register in the Bloom filter.
*/
public void add(E element) {
add(element.toString().getBytes(charset));
}
private void add(byte[] bytes) {
int[] hashes = createHashes(bytes, numberOfHashFunctions);
for (int hash : hashes) {
jedisCluster.setbit(bloom_name, Math.abs(hash % sizeOfBloomFilter), true);
}
}
/**
* Adds an array of bytes to the Bloom filter.
*
* @param bytes array of bytes to add to the Bloom filter.
*/
private void add(byte[] bytes,JedisClusterPipeline pipelined) {
int[] hashes = createHashes(bytes, numberOfHashFunctions);
for (int hash : hashes) {
pipelined.setbit(bloom_name, Math.abs(hash % sizeOfBloomFilter), true);
}
}
/**
* Adds all elements from a Collection to the Bloom filter.
*
* @param c Collection of elements.
*/
public void addAll(Collection<? extends E> c) {
for (E element : c) {
add(element);
}
}
/**
* Returns true if the element could have been inserted into the Bloom filter.
* Use getFalsePositiveProbability() to calculate the probability of this
* being correct.
*
* @param element element to check.
* @return true if the element could have been inserted into the Bloom filter.
*/
public boolean contains(E element) {
return contains(element.toString().getBytes(charset));
}
public int failCount(List<E> elements) {
JedisClusterPipeline pipelined = JedisClusterPipeline.pipelined(jedisCluster);
int count = 0;
try {
for (E e : elements) {
if (!contains(e.toString().getBytes(charset), pipelined)) {
count++;
}
}
}finally {
pipelined.close();
}
return count;
}
/**
* Returns true if the array of bytes could have been inserted into the Bloom filter.
* Use getFalsePositiveProbability() to calculate the probability of this
* being correct.
*
* @param bytes array of bytes to check.
* @return true if the array could have been inserted into the Bloom filter.
*/
private boolean contains(byte[] bytes) {
int[] hashes = createHashes(bytes, numberOfHashFunctions);
for (int hash : hashes) {
if (!jedisCluster.getbit(bloom_name, Math.abs(hash % sizeOfBloomFilter))) {
return false;
}
}
return true;
}
private boolean contains(byte[] bytes,JedisClusterPipeline pipelined) {
int[] hashes = createHashes(bytes, numberOfHashFunctions);
for (int hash : hashes) {
pipelined.getbit(bloom_name, Math.abs(hash % sizeOfBloomFilter));
}
List<Object> objects = pipelined.syncAndReturnAll();
Optional<Object> any = objects.stream().filter(Boolean.FALSE::equals).findAny();
return any.isPresent();
}
/**
* Returns true if all the elements of a Collection could have been inserted
* into the Bloom filter. Use getFalsePositiveProbability() to calculate the
* probability of this being correct.
*
* @param c elements to check.
* @return true if all the elements in c could have been inserted into the Bloom filter.
*/
public boolean containsAll(Collection<? extends E> c) {
for (E element : c) {
if (!contains(element)) {
return false;
}
}
return true;
}
/**
* 根據(jù)字節(jié)數(shù)組的內(nèi)容生成摘要糖权,并將結(jié)果分割成 4 字節(jié)的整數(shù)并存儲在數(shù)組中堵腹。
* 調(diào)用 digest 函數(shù),直到生成所需的 int 數(shù)星澳。每次生成摘要之前疚顷,都需要加鹽,并且 salt++禁偎。
*
* @param data specifies input data.
* @param hashes number of hashes/int's to produce.
* @return array of int-sized hashes
*/
private static int[] createHashes(byte[] data, int hashes) {
int[] result = new int[hashes];
int k = 0;
byte salt = 0;
while (k < hashes) {
byte[] digest;
synchronized (digestFunction) {
digestFunction.update(salt);
salt++;
digest = digestFunction.digest(data);
}
for (int i = 0; i < digest.length / 4 && k < hashes; i++) {
int h = 0;
for (int j = (i * 4); j < (i * 4) + 4; j++) {
h <<= 8;
h |= ((int) digest[j]) & 0xFF;
}
result[k] = h;
k++;
}
}
return result;
}
/**
* Calculates a hash code for this class.
*
* @return hash code representing the contents of an instance of this class.
*/
@Override
public int hashCode() {
int hash = 7;
hash = 61 * hash + sizeOfBloomFilter;
hash = 61 * hash + expectedNumberOfElements;
hash = 61 * hash + numberOfHashFunctions;
return hash;
}
}