Hnswlib
是一个强大的近邻搜索(ANN)库, 官方介绍 Header-only C++ HNSW implementation with python bindings, insertions and updates
. 热门的向量数据库Milvus底层的ANN库之一就是Hnswlib
, 为milvus提供HNSW检索。
HNSW 原理
HNSW 原理
将节点划分成不同层级,贪婪地遍历来自上层的元素,直到达到局部最小值,然后切换到下一层,以上一层中的局部最小值作为新元素重新开始遍历,直到遍历完最低一层。
安装使用
从源码安装:
apt-get install -y python-setuptools python-pip
git clone https://github.com/nmslib/hnswlib.git
cd hnswlib
pip install .
或者直接pip安装 pip install hnswlib
python 使用
import hnswlib
import numpy as np
dim = 16
num_elements = 10000
# Generating sample data
data = np.float32(np.random.random((num_elements, dim)))
# We split the data in two batches:
data1 = data[:num_elements // 2]
data2 = data[num_elements // 2:]
# Declaring index
p = hnswlib.Index(space='l2', dim=dim) # possible options are l2, cosine or ip
# Initializing index
# max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded
# during insertion of an element.
# The capacity can be increased by saving/loading the index, see below.
#
# ef_construction - controls index search speed/build speed tradeoff
#
# M - is tightly connected with internal dimensionality of the data. Strongly affects memory consumption (~M)
# Higher M leads to higher accuracy/run_time at fixed ef/efConstruction
p.init_index(max_elements=num_elements//2, ef_construction=100, M=16)
# Controlling the recall by setting ef:
# higher ef leads to better accuracy, but slower search
p.set_ef(10)
# Set number of threads used during batch search/construction
# By default using all available cores
p.set_num_threads(4)
print("Adding first batch of %d elements" % (len(data1)))
p.add_items(data1)
# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data1, k=1)
print("Recall for the first batch:", np.mean(labels.reshape(-1) == np.arange(len(data1))), "n")
# Serializing and deleting the index:
index_path='first_half.bin'
print("Saving index to '%s'" % index_path)
p.save_index("first_half.bin")
del p
# Re-initializing, loading the index
p = hnswlib.Index(space='l2', dim=dim) # the space can be changed - keeps the data, alters the distance function.
print("nLoading index from 'first_half.bin'n")
# Increase the total capacity (max_elements), so that it will handle the new data
p.load_index("first_half.bin", max_elements = num_elements)
print("Adding the second batch of %d elements" % (len(data2)))
p.add_items(data2)
# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data, k=1)
print("Recall for two batches:", np.mean(labels.reshape(-1) == np.arange(len(data))), "n")
依次介绍:
distances
支持三种距离算法, l2, ip内积,以及cos。
Distance | parameter | Equation |
---|---|---|
Squared L2 | ‘l2’ | d = sum((Ai-Bi)^2) |
Inner product | ‘ip’ | d = 1.0 – sum(Ai*Bi) |
Cosine similarity | ‘cosine’ | d = 1.0 – sum(AiBi) / sqrt(sum(AiAi) * sum(Bi*Bi)) |
API
定义 index
p = hnswlib.Index(space='l2', dim=dim) # possible options are l2, cosine or ip
space 指定Distance算法,dim是向量的维度。
初始化索引
p.init_index(max_elements=num_elements//2, ef_construction=100, M=16)
-
max_elements
– 最大容量 (capacity),如果插入数据超过容量会报异常,可以动态扩容 - ef_construction – 平衡索引构建速度和搜索准确率,ef_construction越大,准确率越高但是构建速度越慢。 ef_construction 提高并不能无限增加索引的质量,常见的 ef_constructio n 参数为 128。
- M – 表示在建表期间每个向量的边数目量,M会影响内存消耗,M越高,内存占用越大,准确率越高,同时构建速度越慢。通常建议设置在 8-32 之间。
添加数据与查询数据
# Controlling the recall by setting ef:
# higher ef leads to better accuracy, but slower search
p.set_ef(10)
# Set number of threads used during batch search/construction
# By default using all available cores
p.set_num_threads(4)
print("Adding first batch of %d elements" % (len(data1)))
p.add_items(data1)
# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data1, k=1)
print("Recall for the first batch:", np.mean(labels.reshape(-1) == np.arange(len(data1))), "n")
-
p.set_ef(10)
:设置搜索时的最大近邻数量(ef),即在构建索引时最多保留多少个近邻。较高的ef值会导致更好的准确率,但搜索速度会变慢。 -
p.set_num_threads(4)
:设置在批量搜索和构建索引过程中使用的线程数。默认情况下,使用所有可用的核心。 -
p.add_items(data1)
:将数据添加到索引中。 -
labels, distances = p.knn_query(data1, k=1)
:对数据中的每个元素进行查询,找到与其最近的邻居,返回邻居的标签和距离。
保持与加载索引
# Serializing and deleting the index:
index_path='first_half.bin'
print("Saving index to '%s'" % index_path)
p.save_index("first_half.bin")
del p
# Re-initializing, loading the index
p = hnswlib.Index(space='l2', dim=dim) # the space can be changed - keeps the data, alters the distance function.
print("nLoading index from 'first_half.bin'n")
# Increase the total capacity (max_elements), so that it will handle the new data
p.load_index("first_half.bin", max_elements = num_elements)
print("Adding the second batch of %d elements" % (len(data2)))
p.add_items(data2)
# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data, k=1)
print("Recall for two batches:", np.mean(labels.reshape(-1) == np.arange(len(data))), "n")
- 通过
save_index
保存索引 - 然后
load_index
重新加载索引,只要未超过max_elements
,可以再次add_items
C++使用
官方提供了C++ 例子,创建索引、插入元素、搜索和序列化
#include "../../hnswlib/hnswlib.h"
int main() {
int dim = 16; // Dimension of the elements
int max_elements = 10000; // Maximum number of elements, should be known beforehand
int M = 16; // Tightly connected with internal dimensionality of the data
// strongly affects the memory consumption
int ef_construction = 200; // Controls index search speed/build speed tradeoff
// Initing index
hnswlib::L2Space space(dim);
hnswlib::HierarchicalNSW* alg_hnsw = new hnswlib::HierarchicalNSW(&space, max_elements, M, ef_construction);
// Generate random data
std::mt19937 rng;
rng.seed(47);
std::uniform_real_distribution distrib_real;
float* data = new float[dim * max_elements];
for (int i = 0; i addPoint(data + i * dim, i);
}
// Query the elements for themselves and measure recall
float correct = 0;
for (int i = 0; i > result = alg_hnsw->searchKnn(data + i * dim, 1);
hnswlib::labeltype label = result.top().second;
if (label == i) correct++;
}
float recall = correct / max_elements;
std::cout saveIndex(hnsw_path);
delete alg_hnsw;
// Deserialize index and check recall
alg_hnsw = new hnswli服务器托管网b::HierarchicalNSW(&space, hnsw_path);
correct = 0;
for (int i = 0; i > result = alg_hnsw->searchKnn(data + i * dim, 1);
hnswlib::labeltype label = result.top().second;
if (label == i) correct++;
}
recall = (float)correct / max_elements;
std::cout
Milvus 使用
milvus 通过cgo调用knowhere,knowhere是一个向量检索的抽象封装,集成了FAISS, HNSW等开源ANN库。
knowhere 是直接将hnswlib代码引入,使用hnswlib的代码在
https://github.com/zilliztech/knowhere/blob/main/src/index/hnsw/hnsw.cc
主要是基于hnswlib的C接口,实现HnswIndexNode
namespace knowhere {
class HnswIndexNode : public IndexNode {
public:
HnswIndexNode(const int32_t& /*version*/, const Object& object) : index_(nullptr) {
search_pool_ = ThreadPool::GetGlobalSearchThreadPool();
}
Status
Train(const DataSet& dataset, const Config& cfg) override {
auto rows = dataset.GetRows();
auto dim = dataset.GetDim();
auto hnsw_cfg = static_cast(cfg);
hnswlib::SpaceInterface* space = nullptr;
if (IsMetricType(hnsw_cfg.metric_type.value(), metric::L2)) {
space = new (std::nothrow) hnswlib::L2Space(dim);
} else if (IsMetricType(hnsw_cfg.metric_type.value(), metric::IP)) {
space = new (std::nothrow) hnswlib::InnerProductSpace(dim);
} else if (IsMetricType(hnsw_cfg.metric_type.value(), metric::COSINE)) {
space = new (std::nothrow) hnswlib::CosineSpace(dim);
} else if (IsMetricType(hnsw_cfg.metric_type.value(), metr服务器托管网ic::HAMMING)) {
space = new (std::nothrow) hnswlib::HammingSpace(dim);
} else if (IsMetricType(hnsw_cfg.metric_type.value(), metric::JACCARD)) {
space = new (std::nothrow) hnswlib::JaccardSpace(dim);
} else {
LOG_KNOWHERE_WARNING_ (space, rows, hnsw_cfg.M.value(), hnsw_cfg.efConstruction.value());
if (index == nullptr) {
LOG_KNOWHERE_WARNING_ index_) {
delete this->index_;
LOG_KNOWHERE_WARNING_ index_ = index;
return Status::success;
}
Status
Add(const DataSet& dataset, const Config& cfg) override {
// ...
std::atomic counter{0};
uint64_t one_tenth_row = rows / 10;
for (int i = 1; i push([&, idx = i]() {
index_->addPoint(((const char*)tensor + index_->data_size_ * idx), idx);
uint64_t added = counter.fetch_add(1);
if (added % one_tenth_row == 0) {
LOG_KNOWHERE_INFO_
其他实现
- Go实现:https://github.com/Bithack/go-hnsw
- Java实现:https://github.com/jelmerk/hnswlib
- 使用Java Native Access的Java绑定:https://github.com/stepstone-tech/hnswlib-jna
服务器托管,北京服务器托管,服务器租用 http://www.fwqtg.net
机房租用,北京机房租用,IDC机房托管, http://www.fwqtg.net
目录 1、前言 免责声明 2、我这里已有的 GT 高速接口解决方案 3、GTH 全网最细解读 GTH 基本结构 GTH 发送和接收处理流程 GTH 的参考时钟 GTH 发送接口 GTH 接收接口 GTH IP核调用和使用 4、设计思路框架 视频源选择 sili…