Hnswlib 介绍与入门使用 - 服务器托管|北京服务器租用|机房托管租用|IDC托管租用|机房机柜带宽租用-价格及费用咨询

Hnswlib是一个强大的近邻搜索（ANN)库，官方介绍 Header-only C++ HNSW implementation with python bindings, insertions and updates. 热门的向量数据库Milvus底层的ANN库之一就是Hnswlib, 为milvus提供HNSW检索。

HNSW 原理

将节点划分成不同层级，贪婪地遍历来自上层的元素，直到达到局部最小值，然后切换到下一层，以上一层中的局部最小值作为新元素重新开始遍历，直到遍历完最低一层。

安装使用

从源码安装：

apt-get install -y python-setuptools python-pip
git clone https://github.com/nmslib/hnswlib.git
cd hnswlib
pip install .

或者直接pip安装 pip install hnswlib

python 使用

import hnswlib
import numpy as np

dim = 16
num_elements = 10000

# Generating sample data
data = np.float32(np.random.random((num_elements, dim)))

# We split the data in two batches:
data1 = data[:num_elements // 2]
data2 = data[num_elements // 2:]

# Declaring index
p = hnswlib.Index(space='l2', dim=dim)  # possible options are l2, cosine or ip

# Initializing index
# max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded
# during insertion of an element.
# The capacity can be increased by saving/loading the index, see below.
#
# ef_construction - controls index search speed/build speed tradeoff
#
# M - is tightly connected with internal dimensionality of the data. Strongly affects memory consumption (~M)
# Higher M leads to higher accuracy/run_time at fixed ef/efConstruction

p.init_index(max_elements=num_elements//2, ef_construction=100, M=16)

# Controlling the recall by setting ef:
# higher ef leads to better accuracy, but slower search
p.set_ef(10)

# Set number of threads used during batch search/construction
# By default using all available cores
p.set_num_threads(4)

print("Adding first batch of %d elements" % (len(data1)))
p.add_items(data1)

# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data1, k=1)
print("Recall for the first batch:", np.mean(labels.reshape(-1) == np.arange(len(data1))), "n")

# Serializing and deleting the index:
index_path='first_half.bin'
print("Saving index to '%s'" % index_path)
p.save_index("first_half.bin")
del p

# Re-initializing, loading the index
p = hnswlib.Index(space='l2', dim=dim)  # the space can be changed - keeps the data, alters the distance function.

print("nLoading index from 'first_half.bin'n")

# Increase the total capacity (max_elements), so that it will handle the new data
p.load_index("first_half.bin", max_elements = num_elements)

print("Adding the second batch of %d elements" % (len(data2)))
p.add_items(data2)

# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data, k=1)
print("Recall for two batches:", np.mean(labels.reshape(-1) == np.arange(len(data))), "n")

依次介绍：

distances

支持三种距离算法， l2, ip内积，以及cos。

Distance	parameter	Equation
Squared L2	‘l2’	d = sum((Ai-Bi)^2)
Inner product	‘ip’	d = 1.0 – sum(Ai*Bi)
Cosine similarity	‘cosine’	d = 1.0 – sum(AiBi) / sqrt(sum(AiAi) * sum(Bi*Bi))

API

定义 index

p = hnswlib.Index(space='l2', dim=dim)  # possible options are l2, cosine or ip

space 指定Distance算法，dim是向量的维度。

初始化索引

p.init_index(max_elements=num_elements//2, ef_construction=100, M=16)

max_elements – 最大容量 (capacity)，如果插入数据超过容量会报异常，可以动态扩容
ef_construction – 平衡索引构建速度和搜索准确率，ef_construction越大，准确率越高但是构建速度越慢。 ef_construction 提高并不能无限增加索引的质量，常见的 ef_constructio n 参数为 128。
M – 表示在建表期间每个向量的边数目量，M会影响内存消耗，M越高，内存占用越大，准确率越高，同时构建速度越慢。通常建议设置在 8-32 之间。

添加数据与查询数据

# Controlling the recall by setting ef:
# higher ef leads to better accuracy, but slower search
p.set_ef(10)

# Set number of threads used during batch search/construction
# By default using all available cores
p.set_num_threads(4)

print("Adding first batch of %d elements" % (len(data1)))
p.add_items(data1)

# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data1, k=1)
print("Recall for the first batch:", np.mean(labels.reshape(-1) == np.arange(len(data1))), "n")

p.set_ef(10)：设置搜索时的最大近邻数量（ef），即在构建索引时最多保留多少个近邻。较高的ef值会导致更好的准确率，但搜索速度会变慢。
p.set_num_threads(4)：设置在批量搜索和构建索引过程中使用的线程数。默认情况下，使用所有可用的核心。
p.add_items(data1)：将数据添加到索引中。
labels, distances = p.knn_query(data1, k=1)：对数据中的每个元素进行查询，找到与其最近的邻居，返回邻居的标签和距离。

保持与加载索引


# Serializing and deleting the index:
index_path='first_half.bin'
print("Saving index to '%s'" % index_path)
p.save_index("first_half.bin")
del p

# Re-initializing, loading the index
p = hnswlib.Index(space='l2', dim=dim)  # the space can be changed - keeps the data, alters the distance function.

print("nLoading index from 'first_half.bin'n")

# Increase the total capacity (max_elements), so that it will handle the new data
p.load_index("first_half.bin", max_elements = num_elements)

print("Adding the second batch of %d elements" % (len(data2)))
p.add_items(data2)

# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data, k=1)
print("Recall for two batches:", np.mean(labels.reshape(-1) == np.arange(len(data))), "n")

通过save_index保存索引
然后load_index重新加载索引，只要未超过max_elements，可以再次add_items

C++使用

官方提供了C++ 例子，创建索引、插入元素、搜索和序列化

#include "../../hnswlib/hnswlib.h"


int main() {
    int dim = 16;               // Dimension of the elements
    int max_elements = 10000;   // Maximum number of elements, should be known beforehand
    int M = 16;                 // Tightly connected with internal dimensionality of the data
                                // strongly affects the memory consumption
    int ef_construction = 200;  // Controls index search speed/build speed tradeoff

    // Initing index
    hnswlib::L2Space space(dim);
    hnswlib::HierarchicalNSW* alg_hnsw = new hnswlib::HierarchicalNSW(&space, max_elements, M, ef_construction);

    // Generate random data
    std::mt19937 rng;
    rng.seed(47);
    std::uniform_real_distribution distrib_real;
    float* data = new float[dim * max_elements];
    for (int i = 0; i addPoint(data + i * dim, i);
    }

    // Query the elements for themselves and measure recall
    float correct = 0;
    for (int i = 0; i > result = alg_hnsw->searchKnn(data + i * dim, 1);
        hnswlib::labeltype label = result.top().second;
        if (label == i) correct++;
    }
    float recall = correct / max_elements;
    std::cout saveIndex(hnsw_path);
    delete alg_hnsw;

    // Deserialize index and check recall
    alg_hnsw = new hnswli服务器托管网b::HierarchicalNSW(&space, hnsw_path);
    correct = 0;
    for (int i = 0; i > result = alg_hnsw->searchKnn(data + i * dim, 1);
        hnswlib::labeltype label = result.top().second;
        if (label == i) correct++;
    }
    recall = (float)correct / max_elements;
    std::cout

Milvus 使用

milvus 通过cgo调用knowhere，knowhere是一个向量检索的抽象封装，集成了FAISS, HNSW等开源ANN库。

knowhere 是直接将hnswlib代码引入，使用hnswlib的代码在
https://github.com/zilliztech/knowhere/blob/main/src/index/hnsw/hnsw.cc

主要是基于hnswlib的C接口，实现HnswIndexNode

namespace knowhere {
class HnswIndexNode : public IndexNode {
 public:
    HnswIndexNode(const int32_t& /*version*/, const Object& object) : index_(nullptr) {
        search_pool_ = ThreadPool::GetGlobalSearchThreadPool();
    }

    Status
    Train(const DataSet& dataset, const Config& cfg) override {
        auto rows = dataset.GetRows();
        auto dim = dataset.GetDim();
        auto hnsw_cfg = static_cast(cfg);
        hnswlib::SpaceInterface* space = nullptr;
        if (IsMetricType(hnsw_cfg.metric_type.value(), metric::L2)) {
            space = new (std::nothrow) hnswlib::L2Space(dim);
        } else if (IsMetricType(hnsw_cfg.metric_type.value(), metric::IP)) {
            space = new (std::nothrow) hnswlib::InnerProductSpace(dim);
        } else if (IsMetricType(hnsw_cfg.metric_type.value(), metric::COSINE)) {
            space = new (std::nothrow) hnswlib::CosineSpace(dim);
        } else if (IsMetricType(hnsw_cfg.metric_type.value(), metr服务器托管网ic::HAMMING)) {
            space = new (std::nothrow) hnswlib::HammingSpace(dim);
        } else if (IsMetricType(hnsw_cfg.metric_type.value(), metric::JACCARD)) {
            space = new (std::nothrow) hnswlib::JaccardSpace(dim);
        } else {
            LOG_KNOWHERE_WARNING_ (space, rows, hnsw_cfg.M.value(), hnsw_cfg.efConstruction.value());
        if (index == nullptr) {
            LOG_KNOWHERE_WARNING_ index_) {
            delete this->index_;
            LOG_KNOWHERE_WARNING_ index_ = index;
        return Status::success;
    }

    Status
    Add(const DataSet& dataset, const Config& cfg) override {
        
		// ...
     
        std::atomic counter{0};
        uint64_t one_tenth_row = rows / 10;
        for (int i = 1; i push([&, idx = i]() {
                index_->addPoint(((const char*)tensor + index_->data_size_ * idx), idx);
                uint64_t added = counter.fetch_add(1);
                if (added % one_tenth_row == 0) {
                    LOG_KNOWHERE_INFO_

其他实现

Go实现：https://github.com/Bithack/go-hnsw
Java实现：https://github.com/jelmerk/hnswlib
使用Java Native Access的Java绑定：https://github.com/stepstone-tech/hnswlib-jna

服务器托管，北京服务器托管，服务器租用 http://www.fwqtg.net
机房租用，北京机房租用，IDC机房托管， http://www.fwqtg.net

相关推荐: FPGA GTH aurora 8b/10b编解码 PCIE 视频传输，提供2套工程源码加QT上位机源码和技术支持
1、前言
2、我这里已有的 GT 高速接口解决方案
3、GTH 全网最细解读
4、设计思路框架
5、vivado工程1–&g

目录 1、前言免责声明 2、我这里已有的 GT 高速接口解决方案 3、GTH 全网最细解读 GTH 基本结构 GTH 发送和接收处理流程 GTH 的参考时钟 GTH 发送接口 GTH 接收接口 GTH IP核调用和使用 4、设计思路框架视频源选择 sili…