开发: C++知识库 Java知识库 JavaScript Python PHP知识库人工智能区块链大数据移动开发嵌入式开发工具数据结构与算法开发测试游戏开发网络协议系统运维
教程: HTML教程 CSS教程 JavaScript教程 Go语言教程 JQuery教程 VUE教程 VUE3教程 Bootstrap教程 SQL数据库教程 C语言教程 C++教程 Java教程 Python教程 Python3教程 C#教程
数码: 电脑笔记本显卡显示器固态硬盘硬盘耳机手机 iphone vivo oppo 小米华为单反装机图拉丁

-> 数据结构与算法 -> 数据结构进阶---哈希 -> 正文阅读

[数据结构与算法]数据结构进阶---哈希

●🧑个人主页:你帅你先说.
●📃欢迎点赞👍关注💡收藏💖
●📖既选择了远方，便只顾风雨兼程。
●🤟欢迎大家有问题随时私信我！
●🧐版权：本文由[你帅你先说.]原创，CSDN首发，侵权必究。

📌📌📌为您导航📌📌📌

1.unordered系列关联式容器
2.哈希
3.unordered_set和unordered_map模拟实现
- 3.1unordered_map模拟实现
- 3.2unordered_set模拟实现
4.位图
- 4.1经典面试题
- 4.2位图应用

1.unordered系列关联式容器

在C++98中，STL提供了底层为红黑树结构的一系列关联式容器，在查询时效率可达到 $log_2N$ ，即最差情况下需要比较红黑树的高度次，当树中的节点非常多时，查询效率也不理想。最好的查询是，进行很少的比较次数就能够将元素找到，因此在C++11中，STL又提供了4个unordered系列的关联式容器，这四个容器与红黑树结构的关联式容器使用方式基本类似，只是其底层结构不同，本文中只对unordered_map和unordered_set进行介绍，unordered_multimap和unordered_multiset。

unordered_set使用

#include <iostream>
#include <unordered_set>
int main()
{
	//去重
	unordered_set<int> us;
	us.insert(5);
	us.insert(1);
	us.insert(5);
	us.insert(3);
	us.insert(7);
	us.insert(2);

	unordered_set<int>::iterator it = us.begin();
	while (it != us.end())
	{
		cout << *it << " ";
		++it;
	}
	cout << endl;

	for (auto e : us)
	{
		cout << e << " ";
	}
	cout << endl;

	unordered_multiset<int> ums;
	ums.insert(5);
	ums.insert(1);
	ums.insert(5);
	ums.insert(3);
	ums.insert(7);
	ums.insert(2);
	for (auto e : ums)
	{
		cout << e << " ";
	}
	cout << endl;
}

我们发现，它的使用和set保持一致，区别在哪呢？区别在于set底层是用哈希封装的，在效率上更胜一筹，同样地，unordered_map也类似。

2.哈希

2.1哈希概念

顺序结构以及平衡树中，元素关键码与其存储位置之间没有对应的关系，因此在查找一个元素时，必须要经过关键码的多次比较。顺序查找时间复杂度为O(N)，平衡树中为树的高度，即O( $log_2N$ )，搜索的效率取决于搜索过程中元素的比较次数。
理想的搜索方法：可以不经过任何比较，一次直接从表中得到要搜索的元素。如果构造一种存储结构，通过某种函数(hashFunc)使元素的存储位置与它的关键码之间能够建立一一映射的关系，那么在查找时通过该函数可以很快找到该元素。
当向该结构中：

插入元素
根据待插入元素的关键码，以此函数计算出该元素的存储位置并按此位置进行存放。
搜索元素
对元素的关键码进行同样的计算，把求得的函数值当做元素的存储位置，在结构中按此位置取元素比
较，若关键码相等，则搜索成功。

该方式即为哈希(散列)方法，哈希方法中使用的转换函数称为哈希(散列)函数，构造出来的结构称为哈希表(Hash Table)(或者称散列表)。

在这里插入图片描述

2.2哈希冲突

上面介绍的方法有一个问题，若集合中有一个数据11，那应该存在哪里？下标为1的位置已经有数据1了。

不同关键字通过相同哈希哈数计算出相同的哈希地址，该种现象称为哈希冲突或哈希碰撞。 在接下来的方法中会解决这些冲突。

2.3哈希冲突解决

解决哈希冲突两种常见的方法是：闭散列和开散列。

2.3.1闭散列

闭散列：也叫开放定址法，当发生哈希冲突时，如果哈希表未被装满，说明在哈希表中必然还有空位置，那么可以把key存放到冲突位置中的“下一个” 空位置中去。

2.3.1.1线性探测

从发生冲突的位置开始，依次向后探测，直到寻找到下一个空位置为止。
在这里插入图片描述
接下来用代码来实现这一数据结构。
结构定义

enum Status
{
	EXIST,
	EMPTY,
	DELETE
};
template<class K, class V>
struct HashData
{
	pair<K, V> _kv;
	Status _status = EMPTY;
};
template<class K, class V, class HashFunc = Hash<K>>
class HashTable
{
private:
	vector<HashData<K, V>> _tables;
	size_t _n = 0;	// 有效数据个数
}

查找

HashData<K, V>* Find(const K& key)
{
	if (_tables.size() == 0)
	{
		return nullptr;
	}

	size_t start = key % _tables.size();//注意，这里不能模上capacity，capacity的空间不是都能访问的。
	size_t i = 0;
	size_t index = start;
	// 线性探测 or 二次探测
	while (_tables[index]._status != EMPTY)
	{
		if (_tables[index]._kv.first == key && _tables[index]._status == EXIST)
		{
			return &_tables[index];
		}

		++i;
		index = start + i;

		index %= _tables.size();
	}

	return nullptr;
}

插入

bool Insert(const pair<K, V>& kv)
{
	HashData<K, V>* ret = Find(kv.first);
	if (ret)
	{
		return false;
	}
	// 这里涉及负载因子的概念，负载因子 = 当前数据个数 / 总大小
	// 负载因子到0.7，就扩容
	// 负载因子越小，冲突概率越低，效率越高，空间浪费越多
	// 负载因子越大，冲突概率越高，效率越低，空间浪费越少
	if (_tables.size() == 0 || _n * 10 / _tables.size() >= 7)
	{
		// 扩容
		size_t newSize = _tables.size() == 0 ? 10 : _tables.size() * 2;
		HashTable<K, V> newHT;
		newHT._tables.resize(newSize);
		for (size_t i = 0; i < _tables.size(); ++i)
		{
			if (_tables[i]._status == EXIST)
			{
				newHT.Insert(_tables[i]._kv);
			}
		}

		_tables.swap(newHT._tables);
	}
	size_t start = kv.first % _tables.size();
	size_t i = 0;
	size_t index = start;
	// 线性探测
	while (_tables[index]._status == EXIST)
	{
		++i;
		index = start + i;

		index %= _tables.size();
	}

	_tables[index]._kv = kv;
	_tables[index]._status = EXIST;
	++_n;

	return true;
}

删除

bool Erase(const K& key)
{
	HashData<K, V>* ret = Find(key);
	if (ret == nullptr)
	{
		return false;
	}
	else
	{
		--_n;
		ret->_status = DELETE;
		return true;
	}
}

写到这里有一个非常严重的问题没有解决，如果是两个string类型这段代码还能正常运行吗？显然不行，字符串不能进行取模操作，所以针对这种情况我们需要特殊处理。

template<class K>
struct Hash
{
	size_t operator()(const K& key)
	{
		return key;
	}
};
// 针对字符串的情况做特殊处理
template<>
struct Hash < string >
{
	size_t operator()(const string& s)
	{
		// BKDR
		size_t value = 0;
		for (auto ch : s)
		{
			value *= 31;
			value += ch;
		}
		return value;
	}
};

此时插入、查找也需要修改。

HashData<K, V>* Find(const K& key)
{
	if (_tables.size() == 0)
	{
		return nullptr;
	}

	HashFunc hf;
	size_t start = hf(key) % _tables.size();
	size_t i = 0;
	size_t index = start;
	// 线性探测
	while (_tables[index]._status != EMPTY)
	{
		if (_tables[index]._kv.first == key && _tables[index]._status == EXIST)
		{
			return &_tables[index];
		}

		++i;
		index = start + i;

		index %= _tables.size();
	}

	return nullptr;
}

bool Insert(const pair<K, V>& kv)
{
	HashData<K, V>* ret = Find(kv.first);
	if (ret)
	{
		return false;
	}
	// 这里涉及负载因子的概念，负载因子 = 当前数据个数 / 总大小
	// 负载因子到0.7，就扩容
	// 负载因子越小，冲突概率越低，效率越高，空间浪费越多
	// 负载因子越大，冲突概率越高，效率越低，空间浪费越少
	if (_tables.size() == 0 || _n * 10 / _tables.size() >= 7)
	{
		// 扩容
		size_t newSize = _tables.size() == 0 ? 10 : _tables.size() * 2;
		HashTable<K, V, HashFunc> newHT;
		newHT._tables.resize(newSize);
		for (size_t i = 0; i < _tables.size(); ++i)
		{
			if (_tables[i]._status == EXIST)
			{
				newHT.Insert(_tables[i]._kv);
			}
		}

		_tables.swap(newHT._tables);
	}

	HashFunc hf;
	size_t start = hf(kv.first) % _tables.size();
	size_t i = 0;
	size_t index = start;
	// 线性探测
	while (_tables[index]._status == EXIST)
	{
		++i;
		index = start + i;

		index %= _tables.size();
	}

	_tables[index]._kv = kv;
	_tables[index]._status = EXIST;
	++_n;

	return true;
}

完整代码:

#include <vector>
#include <string>

template<class K>
struct Hash
{
	size_t operator()(const K& key)
	{
		return key;
	}
};

// 特化
template<>
struct Hash < string >
{
	size_t operator()(const string& s)
	{
		// BKDR
		size_t value = 0;
		for (auto ch : s)
		{
			value *= 31;
			value += ch;
		}
		return value;
	}
};


namespace CloseHash
{
	enum Status
	{
		EXIST,
		EMPTY,
		DELETE
	};

	template<class K, class V>
	struct HashData
	{
		pair<K, V> _kv;
		Status _status = EMPTY;
	};

	template<class K, class V, class HashFunc = Hash<K>>
	class HashTable
	{
	public:
		bool Erase(const K& key)
		{
			HashData<K, V>* ret = Find(key);
			if (ret == nullptr)
			{
				return false;
			}
			else
			{
				--_n;
				ret->_status = DELETE;
				return true;
			}
		}

		HashData<K, V>* Find(const K& key)
		{
			if (_tables.size() == 0)
			{
				return nullptr;
			}

			HashFunc hf;
			size_t start = hf(key) % _tables.size();//注意，这里不能模上capacity,capacity的空间不是都能访问的。
			size_t i = 0;
			size_t index = start;
			// 线性探测 or 二次探测
			while (_tables[index]._status != EMPTY)
			{
				if (_tables[index]._kv.first == key && _tables[index]._status == EXIST)
				{
					return &_tables[index];
				}

				++i;
				index = start + i;

				index %= _tables.size();
			}

			return nullptr;
		}

		bool Insert(const pair<K, V>& kv)
		{
			HashData<K, V>* ret = Find(kv.first);
			if (ret)
			{
				return false;
			}
			// 这里涉及负载因子的概念，负载因子 = 当前数据个数 / 总大小
			// 负载因子到0.7，就扩容
			// 负载因子越小，冲突概率越低，效率越高，空间浪费越多
			// 负载因子越大，冲突概率越高，效率越低，空间浪费越少
			if (_tables.size() == 0 || _n * 10 / _tables.size() >= 7)
			{
				// 扩容
				size_t newSize = _tables.size() == 0 ? 10 : _tables.size() * 2;
				HashTable<K, V, HashFunc> newHT;
				newHT._tables.resize(newSize);
				for (size_t i = 0; i < _tables.size(); ++i)
				{
					if (_tables[i]._status == EXIST)
					{
						newHT.Insert(_tables[i]._kv);
					}
				}

				_tables.swap(newHT._tables);
			}

			HashFunc hf;
			size_t start = hf(kv.first) % _tables.size();
			size_t i = 0;
			size_t index = start;
			// 线性探测 or 二次探测
			while (_tables[index]._status == EXIST)
			{
				++i;
				//index = start + i*i;
				index = start + i;

				index %= _tables.size();
			}

			_tables[index]._kv = kv;
			_tables[index]._status = EXIST;
			++_n;
			return true;
		}

	private:
		vector<HashData<K, V>> _tables;
		size_t _n = 0;	// 有效数据个数
	};
}

2.3.2开散列

开散列法又叫链地址法(开链法)，首先对关键码集合用散列函数计算散列地址，具有相同地址的关键码归于同一子集合，每一个子集合称为一个桶，各个桶中的元素通过一个单链表链接起来，各链表的头结点存储在哈希表中。

结构定义

template<class T>
struct HashNode
{
	T _data;
	HashNode<T>* _next;

	HashNode(const T& data)
		:_data(data)
		, _next(nullptr)
	{}
};

查找

Node* Find(const K& key)
{
	if (_tables.empty())
	{
		return nullptr;
	}

	HashFunc hf;
	size_t index = hf(key) % _tables.size();
	Node* cur = _tables[index];
	KeyOfT kot;
	while (cur)
	{
		if (kot(cur->_data) == key)
		{
			return cur;
		}
		else
		{
			cur = cur->_next;
		}
	}

	return nullptr;
}

插入

bool Insert(const T& data)
{
	Node* ret = Find(kv.first);
	if (ret)
		return false;

	HashFunc hf;
	// 负载因子 == 1时扩容
	if (_n == _tables.size())
	{
		size_t newSize = _tables.size() == 0 ? 10 : _tables.size() * 2;
		vector<Node*> newTables;
		newTables.resize(newSize);
		for (size_t i = 0; i < _tables.size(); ++i)
		{
			Node* cur = _tables[i];
			while (cur)
			{
				Node* next = cur->_next;

				size_t index = hf(cur->_data) % newTables.size();
				// 头插
				cur->_next = newTables[index];
				newTables[index] = cur;

				cur = next;
			}

			_tables[i] = nullptr;
		}

		_tables.swap(newTables);
	}


	size_t index = hf(kv.first) % _tables.size();
	Node* newnode = new Node(data);
	// 头插
	newnode->_next = _tables[index];
	_tables[index] = newnode;

	++_n;
	return true;
}

删除

bool Erase(const K& key)
{
	if (_tables.empty())
	{
		return false;
	}
	HashFunc hf;
	size_t index = hf(key) % _tables.size();
	Node* prev = nullptr;
	Node* cur = _tables[index];
	KeyOfT kot;
	while (cur)
	{
		if (kot(cur->_data) == key)
		{
			if (prev == nullptr) // 头删
			{
				_tables[index] = cur->_next;
			}
			else // 中间删除
			{
				prev->_next = cur->_next;
			}

			--_n;

			delete cur;

			return true;
		}
		else
		{
			prev = cur;
			cur = cur->_next;
		}
	}

	return false;
}

到这里开散列的基本框架也就完成了。
但还没完，我们还需要实现它的迭代器，以及用模板封装。
完整代码:

namespace LinkHash
{
	template<class T>
	struct HashNode
	{
		T _data;
		HashNode<T>* _next;

		HashNode(const T& data)
			:_data(data)
			, _next(nullptr)
		{}
	};

	template<class K, class T, class KeyOfT, class HashFunc>
	class HashTable;

	template<class K, class T, class Ref, class Ptr, class KeyOfT, class HashFunc>
	struct __HTIterator
	{
		typedef HashNode<T> Node;
		typedef __HTIterator<K, T, Ref, Ptr, KeyOfT, HashFunc> Self;

		Node* _node;
		HashTable<K, T, KeyOfT, HashFunc>* _pht;

		__HTIterator(Node* node, HashTable<K, T, KeyOfT, HashFunc>* pht)
			:_node(node)
			, _pht(pht)
		{}

		Ref operator*()
		{
			return _node->_data;
		}

		Ptr operator->()
		{
			return &_node->_data;
		}

		Self& operator++()
		{
			if (_node->_next)
			{
				_node = _node->_next;
			}
			else
			{
				KeyOfT kot;
				HashFunc hf;
				size_t index = hf(kot(_node->_data)) % _pht->_tables.size();
				++index;
				// 找下一个不为空的桶
				while (index < _pht->_tables.size())
				{
					if (_pht->_tables[index])
					{
						break;
					}
					else
					{
						++index;
					}
				}

				// 表走完了，都没有找到下一个桶
				if (index == _pht->_tables.size())
				{
					_node = nullptr;
				}
				else
				{
					_node = _pht->_tables[index];
				}
			}

			return *this;
		}

		bool operator!=(const Self& s) const
		{
			return _node != s._node;
		}

		bool operator==(const Self& s) const
		{
			return _node == s._node;
		}
	};

	template<class K, class T, class KeyOfT, class HashFunc>
	class HashTable
	{
		typedef HashNode<T> Node;

		template<class K, class T, class Ref, class Ptr, class KeyOfT, class HashFunc>
		friend struct __HTIterator;
		typedef HashTable<K, T, KeyOfT, HashFunc> Self;
	public:
		typedef __HTIterator<K, T, T&, T*, KeyOfT, HashFunc> iterator;
		// 显示指定编译器去生成默认构造函数
		HashTable() = default;

		HashTable(const Self& ht)
		{
			_tables.resize(ht._tables.size());
			for (size_t i = 0; i < ht._tables.size(); ++i)
			{
				Node* cur = ht._tables[i];
				while (cur)
				{
					Node* copy = new Node(cur->_data);
					copy->_next = _tables[i];
					_tables[i] = copy;

					cur = cur->_next;
				}
			}
		}

		// ht1 = ht2
		Self& operator=(Self copy)
		{
			swap(_n, copy._n);
			_tables.swap(copy._tables);
			return *this;
		}

		~HashTable()
		{
			for (size_t i = 0; i < _tables.size(); ++i)
			{
				Node* cur = _tables[i];
				while (cur)
				{
					Node* next = cur->_next;
					delete cur;
					cur = next;
				}
				_tables[i] = nullptr;
			}
		}

		iterator begin()
		{
			for (size_t i = 0; i < _tables.size(); ++i)
			{
				if (_tables[i])
				{
					return iterator(_tables[i], this);
				}
			}

			return end();
		}

		iterator end()
		{
			return iterator(nullptr, this);
		}

		bool Erase(const K& key)
		{
			if (_tables.empty())
			{
				return false;
			}

			HashFunc hf;
			size_t index = hf(key) % _tables.size();
			Node* prev = nullptr;
			Node* cur = _tables[index];
			KeyOfT kot;
			while (cur)
			{
				if (kot(cur->_data) == key)
				{
					if (prev == nullptr) // 头删
					{
						_tables[index] = cur->_next;
					}
					else // 中间删除
					{
						prev->_next = cur->_next;
					}

					--_n;

					delete cur;

					return true;
				}
				else
				{
					prev = cur;
					cur = cur->_next;
				}
			}

			return false;
		}

		iterator Find(const K& key)
		{
			if (_tables.empty())
			{
				return end();
			}

			HashFunc hf;
			size_t index = hf(key) % _tables.size();
			Node* cur = _tables[index];
			KeyOfT kot;
			while (cur)
			{
				if (kot(cur->_data) == key)
				{
					return iterator(cur, this);
				}
				else
				{
					cur = cur->_next;
				}
			}

			return end();
		}


		pair<iterator, bool> Insert(const T& data)
		{
			KeyOfT kot;
			iterator ret = Find(kot(data));
			if (ret != end())
				return make_pair(ret, false);

			HashFunc hf;
			// 负载因子 == 1时扩容
			if (_n == _tables.size())
			{
				size_t newSize = _tables.size() == 0 ? 10 :_tables.size()* 2;
				if(newSize > _tables.size())
				{
					vector<Node*> newTables;
					newTables.resize(newSize);
					for (size_t i = 0; i < _tables.size(); ++i)
					{
						Node* cur = _tables[i];
						while (cur)
						{
							Node* next = cur->_next;
	
							size_t index = hf(kot(cur->_data)) % newTables.size();
							// 头插
							cur->_next = newTables[index];
							newTables[index] = cur;
	
							cur = next;
						}
	
						_tables[i] = nullptr;
					}
	
					_tables.swap(newTables);
				}
				
			}


			size_t index = hf(kot(data)) % _tables.size();
			Node* newnode = new Node(data);
			// 头插
			newnode->_next = _tables[index];
			_tables[index] = newnode;

			++_n;
			return make_pair(iterator(newnode, this), false);;
		}

	private:
		vector<Node*> _tables;

		size_t _n = 0;  // 有效数据的个数
	};
}

3.unordered_set和unordered_map模拟实现

有了前面哈希表的铺垫，接下来的模拟实现就非常容易了。

3.1unordered_map模拟实现

#include "HashTable.h"
namespace bit
{
	template<class K, class V, class hash = Hash<K>>
	class unordered_map
	{
		struct MapKeyOfT
		{
			const K& operator()(const pair<K, V>& kv)
			{
				return kv.first;
			}
		};
	public:
		typedef typename LinkHash::HashTable<K, pair<K, V>, MapKeyOfT, hash>::iterator iterator;
		iterator begin()
		{
			return _ht.begin();
		}

		iterator end()
		{
			return _ht.end();
		}

		V& operator[](const K& key)
		{
			auto ret = _ht.Insert(make_pair(key, V()));
			return ret.first->second;
		}

		pair<iterator, bool> insert(const pair<K, V>& kv)
		{
			return _ht.Insert(kv);
		}

	private:
		LinkHash::HashTable<K, pair<K, V>, MapKeyOfT, hash> _ht;
	};
}

3.2unordered_set模拟实现

#pragma once
#include "HashTable.h"
namespace bit
{
	template<class K, class hash = Hash<K>>
	class unordered_set
	{
		struct SetKeyOfT
		{
			const K& operator()(const K& key)
			{
				return key;
			}
		};
	public:
		typedef typename LinkHash::HashTable<K, K, SetKeyOfT, hash>::iterator iterator;
		iterator begin()
		{
			return _ht.begin();
		}

		iterator end()
		{
			return _ht.end();
		}

		pair<iterator, bool> insert(const K& key)
		{
			return _ht.Insert(key);
		}
	private:
		LinkHash::HashTable<K, K, SetKeyOfT, hash> _ht;
	};
}

4.位图

4.1经典面试题

我们来看一道腾讯的面试题
给40亿个不重复的无符号整数，没排过序。给一个无符号整数，如何快速判断一个数是否在这40亿个数中。
思路一:
将数据存储起来，然后遍历，时间复杂度为O(N)。这个方法有一个弊端，非常耗费空间，我们可以来算算，一个整型4字节，40亿个整型就是160亿字节，换算成GB，大概需要15G的内存，这显然不可行。
思路二:
先排序，再用二分查找，时间复杂度(O( $Nlog_2N$ ))，再利用二分查找: O( $log_2N$ )，这个方法比刚刚稍微优化一点，但还不是最佳。
思路三:
利用位图去存储，根据数据是否在给定的整形数据中，结果是在或者不在，刚好是两种状态，那么可以使用一个二进制比特位来代表数据是否存在的信息，如果二进制比特位为1，代表存在，为0代表不存在。
注意，我们在分配空间时，不是分配40亿个整型的空间，是按整型的范围来分配的，即最大为 $2^{32}$ ，1个字节有8个bit，所以 $2^{32}$ 还要除8，当然可能会出现这种情况，15除8等于1，所以最终还需要再加上1。
接下来我们来模拟实现一下。
结构定义

template<size_t N>
class bitset
{
public:
	bitset()
	{
		_bits.resize(N / 8 + 1);
	}
private:
	std::vector<char> _bits;
};

接下来就是把数据存入，即设置对应的位置为1。

void set(size_t x)
{
	size_t i = x / 8;
	size_t j = x % 8;

	_bits[i] |= (1 << j);
}

举个例子来解释一下这个位操作。假设存入15，
00000000 刚开始还没存入数据，所以是0
10000000 00000001向左移7位
10000000 按位或的结果
除了存入数据，还需要删除数据。

void reset(size_t x)
{
	size_t i = x / 8;
	size_t j = x % 8;

	_bits[i] &= (~(1 << j));
}

还是以刚刚15位例，
10000000 15的位图
011111111 1向左移7位按位取反
00000000 按位与的结果
最后还需要一个检验某个数在不在这个位图里的函数

bool test(size_t x)
{
	size_t i = x / 8;
	size_t j = x % 8;

	return _bits[i] & (1 << j);
}

这个相信大家很容易看懂，相当于在对应位置用1按位与，若为1则表示存在，反之则不存在。
完整代码:

template<size_t N>
class bitset
{
public:
	bitset()
	{
		_bits.resize(N / 8 + 1, 0);
	}

	void set(size_t x)
	{
		size_t i = x / 8;
		size_t j = x % 8;

		_bits[i] |= (1 << j);
	}

	void reset(size_t x)
	{
		size_t i = x / 8;
		size_t j = x % 8;

		_bits[i] &= (~(1 << j));
	}

	bool test(size_t x)
	{
		size_t i = x / 8;
		size_t j = x % 8;

		return _bits[i] & (1 << j);
	}

private:
	std::vector<char> _bits;
};

4.2位图应用

我们再来看看与位图相关的三道题

1. 给定100亿个整数，设计算法找到只出现一次的整数？
2. 给两个文件，分别有100亿个整数，我们只有1G内存，如何找到两个文件交集？
3. 位图应用变形：1个文件有100亿个int，1G内存，设计算法找到出现次数不超过2次的所有整数

题目一:
1.给定100亿个整数，设计算法找到只出现一次的整数？
这道题实际上就是要想办法区分这三种状态
1.出现0次
2.出现1次
3.出现2次及以上
1个bit可以标识两种状态，2个bit可以标识四种状态，所以我们只需要用两个容器来存储即可。

template<size_t N>
class TwoBitSet
{
public:
	void Set(size_t x)
	{
		if (!_bs1.test(x) && !_bs2.test(x)) // 00 -> 01
		{
			_bs2.set(x);
		}
		else if (!_bs1.test(x) && _bs2.test(x)) // 01 -> 10
		{
			_bs1.set(x);
			_bs2.reset(x);
		}
	}
	void PrintOnceNum()
	{
		for (size_t i = 0; i < N; ++i)
		{
			if (!_bs1.test(i) && _bs2.test(i)) // 01
			{
				cout << i << endl;
			}
		}
	}
private:
	bit::bitset<N> _bs1;
	bit::bitset<N> _bs2;
};

题目二:
2.给两个文件，分别有100亿个整数，我们只有1G内存，如何找到两个文件交集？
思路一: 先把第一个文件中的整数set到一个位图，读取第二个文件中的整数判断在不在位图。
这种思路有一个缺陷，就是会把重复值给找出来，最后还得去重，所以不可取。
思路二:
分别把两个文件中的数据set到bs1、bs2中，接下来有两种处理方法:
1.遍历bs2中的值，看在不在bs1中
2.bs1中的值依次跟bs2中的值按位与。

题目三:
3.1个文件有100亿个int，1G内存，设计算法找到出现次数不超过2次的所有整数
搞懂了第一题，这题实际上就很容易了，就是第一题的变形。不超过两次即要标识4种状态。
1.出现0次 00
2.出现1次 01
3.出现2次 10
4.出现3次即以上 11
如果这题题目改成不超过6次，你应该就能明白用三个位图就可以解决了。

喜欢这篇文章的可以给个一键三连点赞👍关注💡收藏💖