[NLP Notes] Text Vectorization

Contents

Concepts
Hands-on Code
- Classic vectorization models
  - One-Hot encoding
  - Bag of Words (BOW)
  - TF-IDF
  - N-gram model
  - word2vec
  - Doc2Vec
  - Glove
- Neural-network-based tokenizers
  - Bert-Tokenizer
  - CodeBert-Tokenizer
  - Claude-Tokenizer

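In natural language processing, text vectorization (text embedding) is a key step: it converts text at the word, sentence, or document level into vector representations. A deep-learning vector representation is the numeric form, produced by an algorithm, that a computer can process.

Concepts

Reference article (reposted): 大模型开发 – 一文搞懂Embedding工作原理 (LLM Development: Understanding How Embeddings Work);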

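Depending on the level of text, vectorization methods include the following:

Word-level vectorization: converts a single word into a numeric vector.
- One-Hot Encoding: assigns each word a unique binary vector in which exactly one position is 1 and all others are 0.
- TF-IDF: builds word or document vectors from term frequency and inverse document frequency statistics.
- N-gram: builds vectors from the frequencies of n consecutive words.
- Word Embeddings: Word2Vec, GloVe, FastText, and similar models map each word to a dense real-valued vector, so that semantically related words get similar vectors.
Sentence-level vectorization: converts a whole sentence into a single numeric vector.
- Simple/weighted average: average the word vectors in the sentence, optionally weighted by term frequency.
- Recurrent neural network (RNN): builds the sentence representation by processing the words one after another.
- Convolutional neural network (CNN): uses convolutional layers to capture local features of the sentence, then builds the sentence representation.
- Self-attention (e.g., Transformer): models such as BERT compute self-attention over every token in the sentence to build the representation. Many current large models also ship with a trained tokenizer, which can be used directly to convert text into the token IDs the model consumes.
Document-level vectorization: converts a whole document (an article or a group of sentences) into a single numeric vector.
- Simple/weighted average: average the sentence vectors in the document, optionally weighted.
- Topic models (e.g., LDA): represent a document by its distribution over topics.
- Hierarchical models: e.g., Doc2Vec, which extends Word2Vec to produce a vector for a whole document.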
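
The simple-average approach listed above can be sketched in a few lines of plain Python. The word vectors here are made-up three-dimensional toy values and the helper name average_vector is hypothetical; real word vectors would come from a trained model such as Word2Vec.

```python
# Toy 3-dimensional word vectors; the values are made up for illustration.
word_vectors = {
    "cats": [1.0, 0.0, 2.0],
    "chase": [0.0, 2.0, 0.0],
    "mice": [2.0, 1.0, 1.0],
}


def average_vector(tokens, vectors):
    """Average the vectors of the tokens that have one; skip unknown tokens."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token has a vector")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


sentence_vector = average_vector(["cats", "chase", "mice"], word_vectors)
print(sentence_vector)  # [1.0, 1.0, 1.0]
```

The same function applied to sentence vectors instead of word vectors gives the document-level average.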

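Hands-on Code

Classic vectorization models

Reference link (reposted): NLP-(1)-文本向量化 (NLP (1): Text Vectorization);

One-Hot encoding

Each word is represented as a vector of n elements in which exactly one element is 1 and the rest are 0. The position of the 1 differs from word to word, and n is the number of distinct words in the whole corpus. One-hot encoding has two drawbacks: it completely severs the relationships between words, and on a large corpus each vector becomes very long and sparse and consumes a lot of memory.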

# Import Keras's vocabulary mapper, Tokenizer
from tensorflow.keras.preprocessing.text import Tokenizer
# Assume vocab is the set of all distinct words in the corpus
vocab = {"我", "爱", "北京", "天安门", "升国旗"}
# Instantiate a vocabulary mapper
t = Tokenizer(num_words=None, char_level=False)
# Fit the mapper on the existing text data
t.fit_on_texts(vocab)

for token in vocab:
    zero_list = [0]*len(vocab)
    # Map the token to its index; indices are natural numbers starting from 1
    # The result looks like [[2]], so [0][0] extracts the number
    token_index = t.texts_to_sequences([token])[0][0] - 1
    zero_list[token_index] = 1
    print(token, "one-hot encoding:", zero_list)
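
As a cross-check that does not depend on Keras, the same encoding can be sketched in plain Python. The helper name one_hot_table and the use of sorted order for the indices are assumptions made for this illustration; they are not part of the original code, so the position of the 1 may differ from the Keras output.

```python
def one_hot_table(vocab):
    """Map each word to a one-hot list; index order follows the sorted vocabulary."""
    index = {word: i for i, word in enumerate(sorted(vocab))}
    table = {}
    for word, i in index.items():
        vec = [0] * len(index)
        vec[i] = 1
        table[word] = vec
    return table


table = one_hot_table({"我", "爱", "北京", "天安门", "升国旗"})
for word, vec in table.items():
    print(word, vec)
```

Every vector has length 5 (the vocabulary size) and contains a single 1, which is exactly the sparsity problem described above.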

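Bag of Words (BOW)

The bag-of-words approach tokenizes an article, counts how many times each word occurs, and guesses the gist of the text from the top-ranked words. The steps are:

Tokenization: split the article into individual words and collect them into a vocabulary or dictionary. English is usually split on whitespace or periods; Chinese needs a dedicated tool such as jieba.

Preprocessing: lemmatize the words and convert them to lowercase. Both steps keep variants of the same word from being counted separately.

Stop-word removal: forms of "be", auxiliary verbs, prepositions, articles, and other words that carry little meaning are called stop words. They occur in large numbers and must be removed, otherwise they dominate the counts.

Term-frequency counting: count how many times each word occurs in the article and sort from highest to lowest.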
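
The steps above produce word counts. To use those counts as a feature vector, each document is mapped onto a fixed vocabulary. The following is a minimal sketch with two made-up documents; the helper name bow_vector is hypothetical, and stop-word removal and lemmatization are left out to keep it short.

```python
from collections import Counter


def bow_vector(tokens, vocabulary):
    """Return the count of each vocabulary word in tokens, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


docs = [
    "the store sells snacks and the store sells drinks".split(),
    "snacks and drinks".split(),
]
# Vocabulary = sorted distinct words over all documents.
vocabulary = sorted({word for doc in docs for word in doc})
vectors = [bow_vector(doc, vocabulary) for doc in docs]
print(vocabulary)  # ['and', 'drinks', 'sells', 'snacks', 'store', 'the']
for vec in vectors:
    print(vec)     # [1, 1, 2, 1, 2, 2] then [1, 1, 0, 1, 0, 0]
```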

# coding=utf-8
import collections

stop_words = ['\n', 'or', 'are', 'they', 'i', 'some', 'by', '—',
              'even', 'the', 'to', 'a', 'and', 'of', 'in', 'on', 'for',
              'that', 'with', 'is', 'as', 'could', 'its', 'this', 'other',
              'an', 'have', 'more', 'at', "don’t", 'can', 'only', 'most']

maxlen = 1000
word_freqs = collections.Counter()
# word_freqs = {}
# print(word_freqs)
with open('../data/NLP_data/news.txt', 'r', encoding='utf8') as f:
    for line in f:
        words = line.lower().split(' ')
        if len(words) > maxlen:
            maxlen = len(words)

        for word in words:
            if not (word in stop_words):
                word_freqs[word] += 1
                # Term-frequency counting (manual alternative to Counter)
                # count = word_freqs.get(word, 0)
                # print(count)
                # word_freqs[word] = count + 1

# print(word_freqs)
print(word_freqs.most_common(20))

# Sort the dictionary by value
# a1 = sorted(word_freqs.items(), key=lambda x: x[1], reverse=True)
# print(a1[:20])
"""
[('stores', 15), ('convenience', 14), ('korean', 6), ('these', 6), ('one', 6), ('it’s', 6), ('from', 5), ('my', 5), ('you', 5), ('their', 5), ('just', 5), ('has', 5), ('new', 4), ('do', 4), ('also', 4), ('which', 4), ('find', 4), ('would', 4), ('like', 4), ('up', 4)]
"""

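TF-IDF

BOW is simple and works reasonably well, but it has a weakness: some words are not stop words yet occur frequently without mattering to the text as a whole, such as "only" and "most", so they do little to convey the gist. TF-IDF was proposed as an improvement. It gives lower scores to words that occur across many documents: if "only" appears in every document, its TF-IDF score will be very low.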

# Matching question-answer pairs with TF-IDF
# coding=utf-8
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np

corpus = [
    'This is the first document.',
    'This is the second document.',
    'And the third document.',
    'Is this the first document?'
]

vectorizer = CountVectorizer()
x = vectorizer.fit_transform(corpus)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
word = vectorizer.get_feature_names_out()
print('Vocabulary:', word)

print(x.toarray())

# TF-IDF transformation
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(x)
print(np.around(tfidf.toarray(), 4))

from sklearn.metrics.pairwise import cosine_similarity
# Compare the last sentence with each of the other sentences
print(cosine_similarity(tfidf[-1], tfidf[:-1], dense_output=False))
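
For readers who want to see what the similarity call computes, here is a minimal plain-Python sketch of cosine similarity between two vectors. The function name cosine and the convention of returning 0.0 for an all-zero vector are choices made for this illustration, not sklearn's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


print(cosine([1, 0, 1], [1, 0, 1]))  # same direction: approximately 1
print(cosine([1, 0, 0], [0, 1, 0]))  # orthogonal: 0.0
```

Because TF-IDF vectors have no negative entries, the similarity between two documents always falls between 0 and 1.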

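Note that sklearn's TF-IDF formula differs slightly from the textbook one. With its default settings (smooth_idf=True, norm='l2'), TfidfTransformer computes

idf(t) = ln((1 + n) / (1 + df(t))) + 1

where n is the number of documents and df(t) is the number of documents containing term t, and then scales each document vector to unit L2 norm.

Complete code for a manual TF-IDF implementation:

Note: 1 is added to both the numerator and the denominator for smoothing, and each vector is normalized by dividing by the square root of the sum of its squared entries.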
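
To see why the smoothed form matters, consider a word that occurs in every document of a four-document corpus. The sketch below compares an unsmoothed variant, ln(n / (df + 1)), with the smoothed form ln((1 + n) / (1 + df)) + 1. The function names and the sample counts are chosen for illustration.

```python
import math

n_docs = 4   # documents in the corpus
df_is = 4    # a word such as "is" that appears in all four documents
df_the = 1   # a word that appears in only one document


def idf_unsmoothed(n, df):
    # Unsmoothed variant: ln(n / (df + 1)); negative when df == n.
    return math.log(n / (df + 1))


def idf_smoothed(n, df):
    # Smoothed variant: ln((1 + n) / (1 + df)) + 1; never below 1.
    return math.log((1 + n) / (1 + df)) + 1


tf_is = 1 / 6  # the word occurs once in a six-word document
print(tf_is * idf_unsmoothed(n_docs, df_is))  # negative score
print(tf_is * idf_smoothed(n_docs, df_is))    # positive score (1/6)
```

With the unsmoothed variant a word present in every document gets a negative weight, while the smoothed form keeps every weight positive and still ranks rarer words higher.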

# coding=utf-8
import math
import numpy

corpus = [
    "what is the weather like today",
    "what is for dinner tonight",
    "this is a question worth pondering",
    "it is a beautiful day today"
]
words = []
# Tokenize the corpus
for i in corpus:
    words.append(i.split())


# Count term frequencies per document
def Counter(word_list):
    wordcount = []
    for i in word_list:
        count = {}
        for j in i:
            if not count.get(j):
                count.update({j: 1})
            elif count.get(j):
                count[j] += 1
        wordcount.append(count)
    return wordcount


wordcount = Counter(words)

print(wordcount)


# Compute TF (word is the term being scored; word_list is the term-count dict of the document that contains it)
def tf(word, word_list):
    return word_list.get(word) / sum(word_list.values())


# Count the sentences that contain the word
def count_sentence(word, wordcount):
    return sum(1 for i in wordcount if i.get(word))


# Compute IDF
def idf(word, wordcount):
    # Unsmoothed variants (math.log and numpy.log are both natural logarithms):
    # return math.log(len(wordcount) / (count_sentence(word, wordcount) + 1))
    # return numpy.log(len(wordcount) / (count_sentence(word, wordcount) + 1))
    # Smoothed variant matching sklearn (natural logarithm)
    return math.log((1 + len(wordcount)) / (count_sentence(word, wordcount) + 1)) + 1


# Compute TF-IDF
def tfidf(word, word_list, wordcount):
    # print(word, idf(word, wordcount))
    return tf(word, word_list) * idf(word, wordcount)


p = 1

for i in wordcount:
    tf_idfs = 0
    print("part:{}".format(p))
    p = p + 1
    for j, k in i.items():
        print("word: {} ---- TF-IDF:{}".format(j, tfidf(j, i, wordcount)))

        # Accumulate squared scores for L2 normalization
        tf_idfs += (tfidf(j, i, wordcount) ** 2)

    tf_idfs = tf_idfs ** 0.5
    print(tf_idfs)

    for j, k in i.items():
        print("normalized: word: {} ---- TF-IDF:{}".format(j, tfidf(j, i, wordcount) / tf_idfs))

    # break

"""

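(This output comes from the unsmoothed variant math.log(len(wordcount) / (count_sentence(word, wordcount) + 1)), which is why "is", present in every document, scores negative.)

part:1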
word: what ---- TF-IDF:0.04794701207529681
word: is ---- TF-IDF:-0.03719059188570162
word: the ---- TF-IDF:0.11552453009332421
word: weather ---- TF-IDF:0.11552453009332421
word: like ---- TF-IDF:0.11552453009332421
word: today ---- TF-IDF:0.04794701207529681
part:2
word: what ---- TF-IDF:0.05753641449035617
word: is ---- TF-IDF:-0.044628710262841945
word: for ---- TF-IDF:0.13862943611198905
word: dinner ---- TF-IDF:0.13862943611198905
word: tonight ---- TF-IDF:0.13862943611198905
part:3
word: this ---- TF-IDF:0.11552453009332421
word: is ---- TF-IDF:-0.03719059188570162
word: a ---- TF-IDF:0.04794701207529681
word: question ---- TF-IDF:0.11552453009332421
word: worth ---- TF-IDF:0.11552453009332421
word: pondering ---- TF-IDF:0.11552453009332421
part:4
word: it ---- TF-IDF:0.11552453009332421
word: is ---- TF-IDF:-0.03719059188570162
word: a ---- TF-IDF:0.04794701207529681
word: beautiful ---- TF-IDF:0.11552453009332421
word: day ---- TF-IDF:0.11552453009332421
word: today ---- TF-IDF:0.04794701207529681

"""

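N-gram model

Given a text sequence, the n-gram features are the co-occurrences of n adjacent words or characters. The most common are bi-gram and tri-gram features, with n equal to 2 and 3 respectively.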

# n is usually 2 or 3 in n-gram features; 3 is used here as an example
ngram_range = 3


def create_ngram_set(input_list):
    """
    description: extract all n-gram features from a list of integers
    :param input_list: input list of integers, e.g. a sequence of word indices,
                       each value in the range [1, 25000]
    :return: the set of n-gram features

    eg (with ngram_range = 3):
    # >>> create_ngram_set([1, 4, 9, 4, 1, 4])
    {(1, 4, 9), (4, 9, 4), (9, 4, 1), (4, 1, 4)}
    """
    return set(zip(*[input_list[i:] for i in range(ngram_range)]))


if __name__ == '__main__':
    input_list = [1, 3, 2, 1, 5, 3]
    res = create_ngram_set(input_list)
    print(res)
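
The function above works on integer indices and a fixed n. As a complementary sketch, the same idea can be applied directly to word tokens with n passed as a parameter. The helper name ngrams and the sample sentence are made up for illustration; unlike the set-based version, this one keeps order and duplicates.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "I love natural language processing".split()
print(ngrams(tokens, 2))  # four bi-grams
print(ngrams(tokens, 3))  # three tri-grams
```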

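word2vec

BOW and TF-IDF only look at how often words occur in a document and ignore the fact that language has context. To capture that context, a Google research team proposed the word-vector model Word2vec, which represents each word through its surrounding context and converts it into a vector. This is word embedding. Unlike the sparse vectors produced by TF-IDF, word embeddings are dense.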

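The two approaches to training word vectors:
- CBOW (Continuous Bag of Words): predicts the center word from its surrounding context words.
- Skip-gram: predicts the surrounding context words from the center word.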
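
Word2Vec's two training schemes, CBOW and skip-gram, differ in which side of the (context, center word) relation is predicted. The sketch below only generates the training examples each scheme would see for one tiny sentence; it does not train anything. The function names and the window size are assumptions for illustration.

```python
def skipgram_pairs(tokens, window):
    """(center, context) pairs: skip-gram predicts each context word from the center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def cbow_examples(tokens, window):
    """(context_words, center) examples: CBOW predicts the center from its context."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


tokens = ["cat", "say", "meow"]
print(skipgram_pairs(tokens, window=1))
# [('cat', 'say'), ('say', 'cat'), ('say', 'meow'), ('meow', 'say')]
print(cbow_examples(tokens, window=1))
# [(['say'], 'cat'), (['cat', 'meow'], 'say'), (['say'], 'meow')]
```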

# coding=utf-8
import gzip
import gensim

from gensim.test.utils import common_texts
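# vector_size: dimensionality of the word vectors; window: context length on each side
# min_count: minimum number of occurrences for a word; workers: number of threads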
model_simple = gensim.models.Word2Vec(sentences=common_texts, window=1,
                                      min_count=1, workers=4)
# Returns the number of effective words and the total number of words processed
print(model_simple.train([["hello", "world", "michael"]], total_examples=1, epochs=2))

sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]

model_simple = gensim.models.Word2Vec(min_count=1)
model_simple.build_vocab(sentences)  # build the vocabulary
print(model_simple.train(sentences, total_examples=model_simple.corpus_count
                         , epochs=model_simple.epochs))


# Load the OpinRank corpus: reviews of cars and hotels
data_file="../nlp-in-practice-master/word2vec/reviews_data.txt.gz"

with gzip.open (data_file, 'rb') as f:
    for i,line in enumerate (f):
        print(line)
        break


# Read the OpinRank corpus and preprocess it
def read_input(input_file):
    with gzip.open (input_file, 'rb') as f:
        for i, line in enumerate (f):
            # Preprocessing
            yield gensim.utils.simple_preprocess(line)

# Load the OpinRank corpus and tokenize it
documents = list(read_input(data_file))
# print(documents)


print(len(documents))

# Train the Word2Vec model (about 10 minutes)
model = gensim.models.Word2Vec(documents,
                               vector_size=150, window=10,
                               min_count=2, workers=10)
print(model.train(documents, total_examples=len(documents), epochs=10))


# Words similar to "dirty"
w1 = "dirty"
print(model.wv.most_similar(positive=w1))
# positive: words the results should be similar to


# Words similar to "polite"
w1 = ["polite"]
print(model.wv.most_similar(positive=w1, topn=6))
# topn: list only the top n results


# Words similar to "france"
w1 = ["france"]
print(model.wv.most_similar(positive=w1, topn=6))
# topn: list only the top n results


# Words similar to "bed", "sheet", "pillow" and dissimilar to "couch"
w1 = ["bed",'sheet','pillow']
w2 = ['couch']
print(model.wv.most_similar(positive=w1, negative=w2, topn=10))
# negative: words the results should be dissimilar to

# Compare the similarity of two words
print(model.wv.similarity(w1="dirty", w2="smelly"))
print(model.wv.similarity(w1="dirty", w2="dirty"))

print(model.wv.similarity(w1="dirty", w2="clean"))

# Pick the word that does not fit with the others
print(model.wv.doesnt_match(["cat", "dog", "france"]))

# Keyword extraction
# (gensim.summarization was removed in gensim 4.0, so this part needs gensim 3.8.x)
# https://radimrehurek.com/gensim_3.8.3/summarization/keywords.html
# from gensim.summarization import keywords


# # Test text
# text = '''Challenges in natural language processing frequently involve
# speech recognition, natural language understanding, natural language
# generation (frequently from formal, machine-readable logical forms),
# connecting language and machine perception, dialog systems, or some
# combination thereof.'''

# Keyword extraction
# print(''.join(keywords(text)))

Doc2Vec

The Doc2vec model was inspired by Word2Vec. The word vectors that Word2Vec learns carry word meaning, and Doc2vec builds on the same structure, so it overcomes the bag-of-words model's lack of semantics. Treating each sentence as a training sample, Doc2vec, like Word2Vec, has two training modes: the Distributed Memory Model of Paragraph Vectors (PV-DM), which resembles Word2Vec's CBOW model, and the Distributed Bag of Words version of Paragraph Vector (PV-DBOW), which resembles Word2Vec's Skip-gram model.

# coding=utf-8
import numpy as np
import nltk
import gensim
from gensim.models import word2vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.metrics.pairwise import cosine_similarity

with open('../data/FAQ/starbucks_faq.txt', 'r', encoding='utf8') as f:
    corpus = f.readlines()

print(corpus)

MAX_WORDS_A_LINE = 30
import string

print(string.punctuation)

# The NLTK stop-word list must be downloaded once: nltk.download('stopwords')
stopword_list = set(nltk.corpus.stopwords.words('english')
                    + list(string.punctuation) + ['\n'])


# Tokenization function
def tokenize(text, stopwords, max_len=MAX_WORDS_A_LINE):
    return [token for token in gensim.utils.simple_preprocess(text
                                                              , max_len=max_len) if token not in stopwords]


# Tokenize
document_tokens = []  # cleaned tokens
for line in corpus:
    document_tokens.append(tokenize(line, stopword_list))

# Convert to Gensim's tagged-document format
tagged_corpus = [TaggedDocument(doc, [i]) for i, doc in
                 enumerate(document_tokens)]

# Train the Doc2Vec model
model_d2v = Doc2Vec(tagged_corpus, vector_size=MAX_WORDS_A_LINE, epochs=200)
model_d2v.train(tagged_corpus, total_examples=model_d2v.corpus_count,
                      epochs=model_d2v.epochs)

# Test
questions = []
for i in range(len(document_tokens)):
    questions.append(model_d2v.infer_vector(document_tokens[i]))
questions = np.array(questions)
# print(questions.shape)

# Test query
# text = "find allergen information"
# text = "mobile pay"
text = "verification code"
filtered_tokens = tokenize(text, stopword_list)
# print(filtered_tokens)

# Compare sentence similarity
similarity = cosine_similarity(model_d2v.infer_vector(
    filtered_tokens).reshape(1, -1), questions, dense_output=False)

# Pick the top 10
top_n = np.argsort(np.array(similarity[0]))[::-1][:10]
print(f'Top 10 indices: {top_n}\n')
for i in top_n:
    print(round(similarity[0][i], 4), corpus[i].rstrip('\n'))

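Glove

GloVe is another word-embedding model, proposed at Stanford University. Its authors argued that Word2vec only samples words inside a sliding window, so it neither takes the global probability distribution into account nor captures corpus-wide information. They therefore proposed a word co-occurrence matrix that considers the probability of words appearing together, addressing both Word2vec's purely local view and the sparse vector space of BOW.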
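
To make the co-occurrence matrix concrete, the sketch below counts how often two words appear near each other across a tiny corpus. It only builds raw counts and does not train any embeddings; the function name cooccurrence, the window size, and the sample sentences are assumptions for illustration.

```python
from collections import defaultdict


def cooccurrence(sentences, window):
    """Count how often two words appear within `window` positions of each other."""
    counts = defaultdict(int)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(i + 1, min(len(tokens), i + window + 1)):
                # Store the count symmetrically so counts[(a, b)] == counts[(b, a)].
                counts[(word, tokens[j])] += 1
                counts[(tokens[j], word)] += 1
    return counts


sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
counts = cooccurrence(sentences, window=2)
print(counts[("cat", "say")], counts[("say", "meow")], counts[("cat", "dog")])  # 1 1 0
```

Unlike a sliding-window sampler, these counts are accumulated over the whole corpus before any training happens, which is the global information the paragraph above refers to.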

# coding=utf-8
# Load the required packages
import numpy as np

# Load the GloVe word-vector file glove.6B.300d.txt
"""
https://github.com/stanfordnlp/GloVe
"""
embeddings_dict = {}
with open("../data/glove/glove.6B.300d.txt", 'r', encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], "float32")
        embeddings_dict[word] = vector

# Look up one word (love) to get its GloVe vector
# print(embeddings_dict['love'])

# Vocabulary size
# print(len(embeddings_dict.keys()))

# Measure similarity with Euclidean distance
from scipy.spatial.distance import euclidean


def find_closest_embeddings(embedding):
    return sorted(embeddings_dict.keys(),
                  key=lambda word: euclidean(embeddings_dict[word], embedding))


print(find_closest_embeddings(embeddings_dict["king"])[1:10])

# Pick 100 arbitrary words
# words = list(embeddings_dict.keys())[100:200]
# print(words)
words = find_closest_embeddings(embeddings_dict["king"])[1:10]

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

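# Reduce to two features with t-SNE
# (perplexity must be smaller than the number of samples, which is 9 here)
tsne = TSNE(n_components=2, perplexity=5)
vectors = np.array([embeddings_dict[word] for word in words])
Y = tsne.fit_transform(vectors)

# Draw a scatter plot to inspect word similarity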
plt.figure(figsize=(12, 8))
plt.axis('off')
plt.scatter(Y[:, 0], Y[:, 1])
for label, x, y in zip(words, Y[:, 0], Y[:, 1]):
    plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords="offset points")

plt.show()

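Neural-network-based tokenizers

Tools of this kind are trained on massive corpora and are convenient and powerful: huggingface-tokenizer. Only three are listed below for reference. When choosing a tokenizer, check whether the corpus the model was trained on matches your target task, and pick accordingly.

Bert-Tokenizer

BERT tokenizer: BertTokenizer的使用方法(超详细) (How to Use BertTokenizer, in Detail);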

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert_pretrain')
sents = [
    '人工智能是计算机科学的一个分支。',
    '它企图了解智能的实质。',
    '人工智能是一门极富挑战性的科学。',
]
token = tokenizer.tokenize(sents[0])
print(token)
ids = tokenizer.convert_tokens_to_ids(token)
print(ids)
ids_encode = tokenizer.encode(sents[0])
print(ids_encode)
token_encode = tokenizer.convert_ids_to_tokens(ids_encode)
print(token_encode)
# Output:
#['人', '工', '智', '能', '是', '计', '算', '机', '科', '学', '的', '一', '个', '分', '支', '。']
#[8, 35, 826, 52, 10, 159, 559, 98, 147, 18, 5, 7, 27, 59, 414, 12043]
#[1, 8, 35, 826, 52, 10, 159, 559, 98, 147, 18, 5, 7, 27, 59, 414, 12043, 2]
#['[CLS]', '人', '工', '智', '能', '是', '计', '算', '机', '科', '学', '的', '一', '个', '分', '支', '。', '[SEP]']
ids = tokenizer.encode_plus(sents[0])
print(ids)
# {'input_ids': [1, 8, 35, 826, 52, 10, 159, 559, 98, 147, 18, 5, 7, 27, 59, 414, 12043, 2], 
#'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
#'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
out = tokenizer.encode_plus(
    text=sents[0],
    text_pair=sents[1],

    # Truncate when the input is longer than max_length
    truncation=True,

    # Always pad to max_length
    padding='max_length',
    max_length=30,
    add_special_tokens=True,

    # One of tf, pt, np; the default returns lists
    return_tensors=None,

    # Return token_type_ids
    return_token_type_ids=True,

    # Return attention_mask
    return_attention_mask=True,

    # Return special_tokens_mask, which flags the special tokens
    return_special_tokens_mask=True,

    # Return offset_mapping, the start and end position of each token; only BertTokenizerFast supports this
    #return_offsets_mapping=True,

    # Return length
    return_length=True,
)

for k, v in out.items():
    print(k, ':', v)
#input_ids : [1, 8, 35, 826, 52, 10, 159, 559, 98, 147, 18, 5, 7, 27, 59, 414, 12043, 2, 380, 258, 429, 15, 273, 826, 52, 5, 79, 207, 12043, 2]
#token_type_ids : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
#special_tokens_mask : [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
#attention_mask : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
#length : 30
ids = tokenizer.batch_encode_plus([x for x in sents])
print(ids)
# {
#'input_ids': [[1, 8, 35, 826, 52, 10, 159, 559, 98, 147, 18, 5, 7, 27, 59, 414, 12043, 2], [1, 380, 258, 429, 15, 273, 826, 52, 5, 79, 207, 12043, 2], [1, 8, 35, 826, 52, 10, 7, 232, 456, 595, 1373, 267, 92, 5, 147, 18, 12043, 2]], 
#'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
#'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
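
The padding, truncation, and attention_mask options used above can be illustrated with a small stand-alone sketch. The function pad_or_truncate and the pad id 0 are assumptions for illustration. A real tokenizer also keeps special tokens such as [SEP] in place when truncating, which this sketch does not.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Return (ids, attention_mask), both exactly max_length long."""
    ids = list(input_ids[:max_length])          # drop anything past max_length
    mask = [1] * len(ids)                       # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding


ids, mask = pad_or_truncate([1, 8, 35, 2], max_length=6)
print(ids)   # [1, 8, 35, 2, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

The attention mask lets the model ignore padded positions, which is why batch_encode_plus without padding returns sequences and masks of different lengths.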

CodeBert-Tokenizer

CodeBert: CodeBert, whose corpus focuses more on code understanding;

from transformers import AutoTokenizer, AutoModel
import torch
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
nl_tokens=tokenizer.tokenize("return maximum value")
# ['return', 'maximum', 'value']
code_tokens=tokenizer.tokenize("def max(a,b): if a>b: return a else return b")
# ['def', 'max', '(', 'a', ',', 'b', '):', 'if', 'a', '>', 'b', ':', 'return', 'a', 'else', 'return', 'b']
tokens=[tokenizer.cls_token]+nl_tokens+[tokenizer.sep_token]+code_tokens+[tokenizer.eos_token]
# ['<s>', 'return', 'maximum', 'value', '</s>', 'def', 'max', '(', 'a', ',', 'b', '):', 'if', 'a', '>', 'b', ':', 'return', 'a', 'else', 'return', 'b', '</s>']
tokens_ids=tokenizer.convert_tokens_to_ids(tokens)
# [0, 30921, 4532, 923, 2, 9232, 19220, 1640, 102, 6, 428, 3256, 114, 10, 15698, 428, 35, 671, 10, 1493, 671, 741, 2]
context_embeddings=model(torch.tensor(tokens_ids)[None,:])[0]
# torch.Size([1, 23, 768])
# tensor([[-0.1423,  0.3766,  0.0443,  ..., -0.2513, -0.3099,  0.3183],
#        [-0.5739,  0.1333,  0.2314,  ..., -0.1240, -0.1219,  0.2033],
#        [-0.1579,  0.1335,  0.0291,  ...,  0.2340, -0.8801,  0.6216],
#        ...,
#        [-0.4042,  0.2284,  0.5241,  ..., -0.2046, -0.2419,  0.7031],
#        [-0.3894,  0.4603,  0.4797,  ..., -0.3335, -0.6049,  0.4730],
#        [-0.1433,  0.3785,  0.0450,  ..., -0.2527, -0.3121,  0.3207]],
#       grad_fn=<...>)

Claude-Tokenizer

Tokenizer for the Claude large language model: Claude-Tokenizer. Its corpus covers multiple natural languages as well as programming languages.

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained('Xenova/claude-tokenizer')
assert tokenizer.encode('hello world') == [9381, 2253]
