Text2vec

text2vec, Text to Vector.

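A text vector representation toolkit that converts text into vector matrices, the first step in processing text computationally.

text2vec implements several text representation and text similarity models, including Word2Vec, RankBM25, BERT, Sentence-BERT, and CoSENT, and compares their performance on the text semantic matching (similarity computation) task.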

Guide

Feature
Evaluation
Install
Usage
Contact
Reference

Feature

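Text vector representation models

Word2Vec: retrieves word vectors from the large-scale, high-quality Chinese word embeddings open-sourced by Tencent AI Lab (a lightweight edition of the 8-million-word Chinese embedding set; file name: light_Tencent_AILab_ChineseEmbedding.bin, password: tawe). This project implements a word2vec sentence representation by averaging the word vectors.
SBERT (Sentence-BERT): a sentence embedding model that balances accuracy and efficiency. Training supervises a classification head on top of the encoder; at prediction time, text matching takes the cosine of the two sentence vectors directly. This project reproduces Sentence-BERT training and prediction in PyTorch.
CoSENT (Cosine Sentence): proposes a ranking loss that brings training closer to prediction; it converges faster and performs better than Sentence-BERT. This project implements CoSENT training and prediction in PyTorch.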

Evaluation

Text matching

Evaluation results on the English matching dataset:

Arch	Backbone	Model Name	English-STS-B
GloVe	glove	Avg_word_embeddings_glove_6B_300d	61.77
BERT	bert-base-uncased	BERT-base-cls	20.29
BERT	bert-base-uncased	BERT-base-first_last_avg	59.04
BERT	bert-base-uncased	BERT-base-first_last_avg-whiten(NLI)	63.65
SBERT	sentence-transformers/bert-base-nli-mean-tokens	SBERT-base-nli-cls	73.65
SBERT	sentence-transformers/bert-base-nli-mean-tokens	SBERT-base-nli-first_last_avg	77.96
SBERT	xlm-roberta-base	paraphrase-multilingual-MiniLM-L12-v2	84.42
CoSENT	bert-base-uncased	CoSENT-base-first_last_avg	69.93
CoSENT	sentence-transformers/bert-base-nli-mean-tokens	CoSENT-base-nli-first_last_avg	79.68

Evaluation results on the Chinese matching datasets:

Arch	Backbone	Model Name	ATEC	BQ	LCQMC	PAWSX	STS-B	Avg	QPS
CoSENT	hfl/chinese-macbert-base	CoSENT-macbert-base	50.39	72.93	79.17	60.86	80.51	68.77	3008
CoSENT	Langboat/mengzi-bert-base	CoSENT-mengzi-base	50.52	72.27	78.69	12.89	80.15	58.90	2502
CoSENT	bert-base-chinese	CoSENT-bert-base	49.74	72.38	78.69	60.00	80.14	68.19	2653
SBERT	bert-base-chinese	SBERT-bert-base	46.36	70.36	78.72	46.86	66.41	61.74	3365
SBERT	hfl/chinese-macbert-base	SBERT-macbert-base	47.28	68.63	79.42	55.59	64.82	63.15	2948
CoSENT	hfl/chinese-roberta-wwm-ext	CoSENT-roberta-ext	50.81	71.45	79.31	61.56	81.13	68.85	-
SBERT	hfl/chinese-roberta-wwm-ext	SBERT-roberta-ext	48.29	69.99	79.22	44.10	72.42	62.80	-

Chinese matching evaluation results for the models released by this project:

Arch	Backbone	Model Name	ATEC	BQ	LCQMC	PAWSX	STS-B	Avg	QPS
Word2Vec	word2vec	w2v-light-tencent-chinese	20.00	31.49	59.46	2.57	55.78	33.86	23769
SBERT	xlm-roberta-base	paraphrase-multilingual-MiniLM-L12-v2	18.42	38.52	63.96	10.14	78.90	41.99	3138
CoSENT	hfl/chinese-macbert-base	shibing624/text2vec-base-chinese	31.93	42.67	70.16	17.21	79.30	48.25	3008
CoSENT	hfl/chinese-lert-large	GanymedeNil/text2vec-large-chinese	-	-	-	-	-	-	-

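Notes:

All scores are Spearman correlation coefficients.
Each result comes from training on that dataset's train split only and evaluating on its test split; no external data was used.
shibing624/text2vec-base-chinese was trained with the CoSENT method on Chinese STS-B data, starting from MacBERT, and reaches SOTA on the Chinese STS-B test set. Run examples/training_sup_text_matching_model.py to reproduce the result. The model files are uploaded to the HuggingFace model hub as shibing624/text2vec-base-chinese; it is the recommended model for Chinese semantic matching.
SBERT-macbert-base was trained with the SBERT method; run examples/training_sup_text_matching_model.py to reproduce the result.
paraphrase-multilingual-MiniLM-L12-v2 refers to sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2, trained with SBERT. It is the multilingual version of paraphrase-MiniLM-L12-v2 and supports Chinese, English, and other languages.
w2v-light-tencent-chinese is the Word2Vec model built from the Tencent word vectors. It loads on CPU and suits Chinese literal (surface-form) matching and cold-start situations with little data.
All pretrained models can be loaded through transformers, e.g. MacBERT: --model_name hfl/chinese-macbert-base, or a RoBERTa model: --model_name uer/roberta-medium-wwm-chinese-cluecorpussmall
Download links for the Chinese matching datasets are given below.
Experiments on the Chinese matching tasks show the best pooling is first_last_avg, i.e. EncoderType.FIRST_LAST_AVG in SentenceModel; its predictions differ only slightly from EncoderType.MEAN.
QPS was measured on a Tesla V100 GPU with 32 GB of memory.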
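Since every score in the tables is a Spearman coefficient, a small sketch of how that metric can be computed may help when reading them. It is a standard-library illustration under stated assumptions, not the project's evaluation code: the function names `rank`, `pearson`, and `spearman` and the toy labels and scores are all hypothetical. The tables appear to report the coefficient multiplied by 100.

```python
def rank(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


# Toy data (hypothetical): gold labels and predicted cosine scores
gold_labels = [0, 1, 1, 0, 1]
predicted_scores = [0.10, 0.90, 0.40, 0.35, 0.80]
print(round(spearman(gold_labels, predicted_scores), 4))
```

Because only the ranks matter, the metric rewards a model for ordering sentence pairs correctly, regardless of the absolute values of its cosine scores.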

Demo

Official Demo: https://www.mulanai.com/product/short_text_sim/

HuggingFace Demo: https://huggingface.co/spaces/shibing624/text2vec

Run examples/gradio_demo.py to see the demo locally:

python examples/gradio_demo.py

Install

pip install torch # conda install pytorch
pip install -U text2vec

or install from source:

git clone https://github.com/shibing624/text2vec.git
cd text2vec
pip install torch # conda install pytorch
pip install -r requirements.txt
pip install --no-deps .

Usage

Text vector representation

Compute text vectors with a pretrained model:

>>> from text2vec import SentenceModel
>>> m = SentenceModel()
>>> m.encode("如何更换花呗绑定银行卡")
Embedding shape: (768,)

example: examples/computing_embeddings_demo.py

import sys

sys.path.append('..')
from text2vec import SentenceModel
from text2vec import Word2Vec


def compute_emb(model):
    # Embed a list of sentences
    sentences = [
        '卡',
        '银行卡',
        '如何更换花呗绑定银行卡',
        '花呗更改绑定银行卡',
        'This framework generates embeddings for each input sentence',
        'Sentences are passed as a list of string.',
        'The quick brown fox jumps over the lazy dog.'
    ]
    sentence_embeddings = model.encode(sentences)
    print(type(sentence_embeddings), sentence_embeddings.shape)

    # The result is a list of sentence embeddings as numpy arrays
    for sentence, embedding in zip(sentences, sentence_embeddings):
        print("Sentence:", sentence)
        print("Embedding shape:", embedding.shape)
        print("Embedding head:", embedding[:10])
        print()


if __name__ == "__main__":
    # Chinese sentence embedding model (CoSENT); recommended for Chinese semantic matching, supports further fine-tuning
    t2v_model = SentenceModel("shibing624/text2vec-base-chinese")
    compute_emb(t2v_model)

    # Multilingual sentence embedding model (Sentence-BERT); recommended for English semantic matching, supports further fine-tuning
    sbert_model = SentenceModel("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
    compute_emb(sbert_model)

    # Chinese word embedding model (word2vec); suited to Chinese literal matching and cold start
    w2v_model = Word2Vec("w2v-light-tencent-chinese")
    compute_emb(w2v_model)

output:

<class 'numpy.ndarray'> (7, 768)
Sentence: 卡
Embedding shape: (768,)

Sentence: 银行卡
Embedding shape: (768,)
 ...

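The returned embeddings are a numpy.ndarray of shape (sentences_size, model_embedding_size). Pick any one of the three models; the first is recommended.
shibing624/text2vec-base-chinese was trained with the CoSENT method on the Chinese STS-B dataset and is uploaded to the HuggingFace model hub as shibing624/text2vec-base-chinese. It is the default model of text2vec.SentenceModel. Call it as in the example above, or through the transformers library as shown below; the model downloads automatically to ~/.cache/huggingface/transformers
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 is the multilingual Sentence-BERT sentence embedding model, suited to paraphrase identification and text matching. It can be loaded through either text2vec.SentenceModel or the sentence-transformers library.
w2v-light-tencent-chinese is a Word2Vec model loaded through gensim. It uses the Tencent word vectors Tencent_AILab_ChineseEmbedding.tar.gz to look up the vector of each word; the sentence vector is the average of its word vectors. The model downloads automatically to ~/.text2vec/datasets/light_Tencent_AILab_ChineseEmbedding.bin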
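The word-averaging step behind the Word2Vec sentence vector can be sketched in a few lines. This is a minimal illustration, not the library's code: the helper `average_word_vectors`, the toy 3-dimensional vectors, and the choice to skip out-of-vocabulary tokens are all assumptions made for the example (the real vectors are 200-dimensional, and the tokenization used by text2vec is not shown here).

```python
def average_word_vectors(tokens, word_vectors, dim):
    """Average the vectors of known tokens; unknown tokens are skipped (an assumption)."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        # No known token: fall back to a zero vector
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy vocabulary (hypothetical values)
toy_vectors = {"花呗": [1.0, 0.0, 2.0], "银行卡": [3.0, 2.0, 0.0]}
tokens = ["花呗", "更改", "绑定", "银行卡"]
print(average_word_vectors(tokens, toy_vectors, dim=3))  # [2.0, 1.0, 1.0]
```

Averaging discards word order, which is one reason this representation fits literal matching better than fine-grained semantic matching.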

Usage (HuggingFace Transformers)

Without text2vec, you can use the model like this:

First, you pass your input through the transformer model, then you have to apply the right pooling-operation on-top of the contextualized word embeddings.

example: examples/use_origin_transformers_demo.py

import os
import torch
from transformers import AutoTokenizer, AutoModel

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"


# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('shibing624/text2vec-base-chinese')
model = AutoModel.from_pretrained('shibing624/text2vec-base-chinese')
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)

Usage (sentence-transformers)

sentence-transformers is a popular library to compute dense vector representations for sentences.

Install sentence-transformers:

pip install -U sentence-transformers

Then load model and predict:

from sentence_transformers import SentenceTransformer

m = SentenceTransformer("shibing624/text2vec-base-chinese")
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']

sentence_embeddings = m.encode(sentences)
print("Sentence embeddings:")
print(sentence_embeddings)

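`Word2Vec` word vectors

Two sets of Word2Vec word vectors are provided; pick either one:

Lightweight Tencent word vectors, Baidu Netdisk (password: tawe) or Google Drive: a 111 MB binary file containing a reduced set of 143,613 high-frequency words; each vector is still 200-dimensional, as in the original. Running the program downloads it automatically to ~/.text2vec/datasets/light_Tencent_AILab_ChineseEmbedding.bin
Tencent word vectors, official full release (6.78 GB): place the file at ~/.text2vec/datasets/Tencent_AILab_ChineseEmbedding.txt. Tencent word vectors home page: https://ai.tencent.com/ailab/nlp/zh/index.html; download page: https://ai.tencent.com/ailab/nlp/en/download.html. For more, see the Tencent word vectors introduction wiki.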

Downstream tasks

1. Sentence similarity

example: examples/semantic_text_similarity_demo.py

import sys

sys.path.append('..')
from text2vec import Similarity

# Two lists of sentences
sentences1 = ['如何更换花呗绑定银行卡',
              'The cat sits outside',
              'A man is playing guitar',
              'The new movie is awesome']

sentences2 = ['花呗更改绑定银行卡',
              'The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']

sim_model = Similarity()
for i in range(len(sentences1)):
    for j in range(len(sentences2)):
        score = sim_model.get_score(sentences1[i], sentences2[j])
        print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[j], score))

output:

如何更换花呗绑定银行卡 		 花呗更改绑定银行卡 		 Score: 0.9477
如何更换花呗绑定银行卡 		 The dog plays in the garden 		 Score: -0.1748
如何更换花呗绑定银行卡 		 A woman watches TV 		 Score: -0.0839
如何更换花呗绑定银行卡 		 The new movie is so great 		 Score: -0.0044
The cat sits outside 		 花呗更改绑定银行卡 		 Score: -0.0097
The cat sits outside 		 The dog plays in the garden 		 Score: 0.1908
The cat sits outside 		 A woman watches TV 		 Score: -0.0203
The cat sits outside 		 The new movie is so great 		 Score: 0.0302
A man is playing guitar 		 花呗更改绑定银行卡 		 Score: -0.0010
A man is playing guitar 		 The dog plays in the garden 		 Score: 0.1062
A man is playing guitar 		 A woman watches TV 		 Score: 0.0055
A man is playing guitar 		 The new movie is so great 		 Score: 0.0097
The new movie is awesome 		 花呗更改绑定银行卡 		 Score: 0.0302
The new movie is awesome 		 The dog plays in the garden 		 Score: -0.0160
The new movie is awesome 		 A woman watches TV 		 Score: 0.1321
The new movie is awesome 		 The new movie is so great 		 Score: 0.9591

The sentence cosine similarity score lies in [-1, 1]; the larger the value, the more similar the sentences.
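To make the [-1, 1] range concrete, here is a minimal standard-library sketch of cosine similarity between two vectors. The function name `cosine_similarity` and the zero-vector convention (returning 0.0) are assumptions for illustration; this is not the code inside `Similarity.get_score`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal: 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction: -1.0
```

The score depends only on direction, not on vector length, so scaling an embedding leaves its similarity to other embeddings unchanged.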

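2. Text matching search

Finds the texts in a candidate document set most similar to a query; commonly used for question matching in QA scenarios, similar-text retrieval, and related tasks.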

example: examples/semantic_search_demo.py

import sys

sys.path.append('..')
from text2vec import SentenceModel, cos_sim, semantic_search

embedder = SentenceModel()

# Corpus with example sentences
corpus = [
    '花呗更改绑定银行卡',
    '我什么时候开通了花呗',
    'A man is eating food.',
    'A man is eating a piece of bread.',
    'The girl is carrying a baby.',
    'A man is riding a horse.',
    'A woman is playing violin.',
    'Two men pushed carts through the woods.',
    'A man is riding a white horse on an enclosed ground.',
    'A monkey is playing drums.',
    'A cheetah is running behind its prey.'
]
corpus_embeddings = embedder.encode(corpus)

# Query sentences:
queries = [
    '如何更换花呗绑定银行卡',
    'A man is eating pasta.',
    'Someone in a gorilla costume is playing a set of drums.',
    'A cheetah chases prey on across a field.']

for query in queries:
    query_embedding = embedder.encode(query)
    hits = semantic_search(query_embedding, corpus_embeddings, top_k=5)
    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")
    hits = hits[0]  # Get the hits for the first query
    for hit in hits:
        print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))

output:

Query: 如何更换花呗绑定银行卡
Top 5 most similar sentences in corpus:
花呗更改绑定银行卡 (Score: 0.9477)
我什么时候开通了花呗 (Score: 0.3635)
A man is eating food. (Score: 0.0321)
A man is riding a horse. (Score: 0.0228)
Two men pushed carts through the woods. (Score: 0.0090)

======================
Query: A man is eating pasta.
Top 5 most similar sentences in corpus:
A man is eating food. (Score: 0.6734)
A man is eating a piece of bread. (Score: 0.4269)
A man is riding a horse. (Score: 0.2086)
A man is riding a white horse on an enclosed ground. (Score: 0.1020)
A cheetah is running behind its prey. (Score: 0.0566)

======================
Query: Someone in a gorilla costume is playing a set of drums.
Top 5 most similar sentences in corpus:
A monkey is playing drums. (Score: 0.8167)
A cheetah is running behind its prey. (Score: 0.2720)
A woman is playing violin. (Score: 0.1721)
A man is riding a horse. (Score: 0.1291)
A man is riding a white horse on an enclosed ground. (Score: 0.1213)

======================
Query: A cheetah chases prey on across a field.
Top 5 most similar sentences in corpus:
A cheetah is running behind its prey. (Score: 0.9147)
A monkey is playing drums. (Score: 0.2655)
A man is riding a horse. (Score: 0.1933)
A man is riding a white horse on an enclosed ground. (Score: 0.1733)
A man is eating food. (Score: 0.0329)
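Conceptually, the search above scores every corpus vector against the query and keeps the highest-scoring ones. The sketch below shows that idea with the standard library only. It is not the implementation of `semantic_search`: the helper names `cosine` and `top_k_search` and the toy 2-dimensional vectors are hypothetical, and the returned dictionaries merely mirror the `corpus_id` and `score` keys used in the example.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_search(query_vec, corpus_vecs, top_k=5):
    """Return the top_k corpus entries by cosine similarity, best first."""
    scored = (
        {"corpus_id": i, "score": cosine(query_vec, vec)}
        for i, vec in enumerate(corpus_vecs)
    )
    return heapq.nlargest(top_k, scored, key=lambda hit: hit["score"])


# Toy embeddings (hypothetical values)
corpus_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for hit in top_k_search([1.0, 0.0], corpus_vecs, top_k=2):
    print(hit["corpus_id"], round(hit["score"], 4))
```

This brute-force scan is linear in the corpus size; for large corpora a vector index is the usual replacement.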

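Downstream task support library

similarities library [recommended]

For text similarity and text matching search, the similarities library is recommended. It is compatible with the Word2vec, SBERT, and CoSENT semantic matching models released by this project, and also supports literal-level similarity and matching search algorithms for both text and images.

Install: pip install -U similarities

Sentence similarity: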

from similarities import Similarity

m = Similarity()
r = m.similarity('如何更换花呗绑定银行卡', '花呗更改绑定银行卡')
print(f"similarity score: {float(r)}")  # similarity score: 0.855146050453186

Models

CoSENT model

CoSENT (Cosine Sentence) text matching model: a sentence embedding scheme that improves on Sentence-BERT with CosineRankLoss.
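The ranking idea can be sketched as follows: whenever sentence pair A is labeled more similar than pair B, the loss penalizes the model if the predicted cosine of B is not clearly below that of A. The function below is a plain-Python illustration of that loss, log(1 + Σ exp(scale · (cos_B − cos_A))), and is not the project's PyTorch implementation. The name `cosent_loss` is hypothetical, and the scale of 20 is an assumed default.

```python
import math


def cosent_loss(cos_scores, labels, scale=20.0):
    """Ranking loss over sentence pairs.

    cos_scores[i] is the predicted cosine of pair i, labels[i] its gold similarity.
    """
    # The leading 0.0 contributes exp(0) = 1, i.e. the "1 +" inside the log
    terms = [0.0]
    for cos_hi, label_hi in zip(cos_scores, labels):
        for cos_lo, label_lo in zip(cos_scores, labels):
            if label_hi > label_lo:
                # Penalize when the less similar pair scores close to or above the more similar one
                terms.append(scale * (cos_lo - cos_hi))
    # Numerically stable log-sum-exp
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


print(cosent_loss([0.9, 0.1], [1, 0]))  # correctly ordered pairs: loss near 0
print(cosent_loss([0.1, 0.9], [1, 0]))  # wrongly ordered pairs: large loss
```

Because the loss only compares cosines across pairs, training optimizes the same quantity (cosine ordering) that prediction and the Spearman evaluation use.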

Network structure:

Training:

Inference:

CoSENT supervised model

Train and predict with the CoSENT model:

Train and evaluate CoSENT on the Chinese STS-B dataset

example: examples/training_sup_text_matching_model.py

cd examples
python training_sup_text_matching_model.py --model_arch cosent --do_train --do_predict --num_epochs 10 --model_name hfl/chinese-macbert-base --output_dir ./outputs/STS-B-cosent

Train and evaluate CoSENT on ATEC, the Ant Financial matching dataset

The following Chinese matching datasets are supported: 'ATEC', 'STS-B', 'BQ', 'LCQMC', 'PAWSX'. See HuggingFace datasets for details: https://huggingface.co/datasets/shibing624/nli_zh

python training_sup_text_matching_model.py --task_name ATEC --model_arch cosent --do_train --do_predict --num_epochs 10 --model_name hfl/chinese-macbert-base --output_dir ./outputs/ATEC-cosent

Train on your own Chinese dataset

example: examples/training_sup_text_matching_model_selfdata.py

python training_sup_text_matching_model_selfdata.py --do_train --do_predict

Train and evaluate CoSENT on the English STS-B dataset

example: examples/training_sup_text_matching_model_en.py

cd examples
python training_sup_text_matching_model_en.py --model_arch cosent --do_train --do_predict --num_epochs 10 --model_name bert-base-uncased  --output_dir ./outputs/STS-B-en-cosent

CoSENT unsupervised model

Train CoSENT on the English NLI dataset and evaluate on the STS-B test set

example: examples/training_unsup_text_matching_model_en.py

cd examples
python training_unsup_text_matching_model_en.py --model_arch cosent --do_train --do_predict --num_epochs 10 --model_name bert-base-uncased --output_dir ./outputs/STS-B-en-unsup-cosent

Sentence-BERT model

Sentence-BERT text matching model: a representation-based sentence embedding scheme.

Network structure:

Training:

Inference:

SentenceBERT supervised model

Train and evaluate SBERT on the Chinese STS-B dataset

example: examples/training_sup_text_matching_model.py

cd examples
python training_sup_text_matching_model.py --model_arch sentencebert --do_train --do_predict --num_epochs 10 --model_name hfl/chinese-macbert-base --output_dir ./outputs/STS-B-sbert

Train and evaluate SBERT on the English STS-B dataset

example: examples/training_sup_text_matching_model_en.py

cd examples
python training_sup_text_matching_model_en.py --model_arch sentencebert --do_train --do_predict --num_epochs 10 --model_name bert-base-uncased --output_dir ./outputs/STS-B-en-sbert

SentenceBERT unsupervised model

Train SBERT on the English NLI dataset and evaluate on the STS-B test set

example: examples/training_unsup_text_matching_model_en.py

cd examples
python training_unsup_text_matching_model_en.py --model_arch sentencebert --do_train --do_predict --num_epochs 10 --model_name bert-base-uncased --output_dir ./outputs/STS-B-en-unsup-sbert

BERT-Match model

BERT text matching model: the native BERT matching network structure, an interaction-based sentence matching model.

Network structure:

Training and inference:

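The training script is the same as above: examples/training_sup_text_matching_model.py.

Model distillation

Because models trained with text2vec can be loaded by the sentence-transformers library, its model distillation methods (distillation) are reused here.

Dimensionality reduction: see dimensionality_reduction.py, which applies PCA to the model's output embeddings. This reduces storage pressure on vector search databases such as milvus and can also slightly improve model quality.
Model distillation: see model_distillation.py, which distills a large teacher model into a student model with fewer layers. Trading off some accuracy, this can greatly increase prediction speed.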

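Model deployment

Two ways to deploy a model as a service are provided: 1) a gRPC service built on Jina [recommended]; 2) a native HTTP service built on FastAPI.

Jina service

Builds a high-performance service in client/server mode, with cloud-native Docker support, gRPC/HTTP/WebSocket, simultaneous prediction with multiple models, and multi-GPU processing.

Install: pip install jina
Start the service: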

example: examples/jina_server_demo.py

from jina import Flow

port = 50001
f = Flow(port=port).add(
    uses='jinahub://Text2vecEncoder',
    uses_with={'model_name': 'shibing624/text2vec-base-chinese'}
)

with f:
    # Block so the server keeps running until interrupted
    f.block()

This prediction method (executor) has been uploaded to JinaHub, which includes Docker and k8s deployment instructions.

Call the service:

from jina import Client
from docarray import Document, DocumentArray

port = 50001

c = Client(port=port)

data = ['如何更换花呗绑定银行卡',
        '花呗更改绑定银行卡']
print("data:", data)
print('data embs:')
r = c.post('/', inputs=DocumentArray([Document(text=text) for text in data]))
print(r.embeddings)

For batch calls, see examples/jina_client_demo.py

FastAPI service

Install: pip install fastapi uvicorn
Start the service:

example: examples/fastapi_server_demo.py

cd examples
python fastapi_server_demo.py

Call the service:

curl -X 'GET' \
  'http://0.0.0.0:8001/emb?q=hello' \
  -H 'accept: application/json'
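The same request can be issued from Python. The sketch below uses only the standard library and assumes the `/emb?q=...` endpoint and port shown in the curl call; the helper names `build_emb_url` and `fetch_embedding` are hypothetical, and the shape of the JSON response is not specified here, so the parsed body is returned as-is.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_emb_url(base_url, query):
    """Build the GET URL for the embedding endpoint, URL-encoding the query text."""
    return f"{base_url.rstrip('/')}/emb?{urlencode({'q': query})}"


def fetch_embedding(base_url, query, timeout=10.0):
    """Call the service and return the parsed JSON body (structure depends on the server)."""
    with urlopen(build_emb_url(base_url, query), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Only prints the URL; call fetch_embedding(...) once the server is running
    print(build_emb_url("http://0.0.0.0:8001", "hello"))
```

URL-encoding matters for Chinese queries, which must be percent-encoded before being placed in the query string.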

Datasets

The Chinese semantic matching datasets have been uploaded to HuggingFace datasets: https://huggingface.co/datasets/shibing624/nli_zh

Dataset usage example:

pip install datasets

from datasets import load_dataset

dataset = load_dataset("shibing624/nli_zh", "STS-B") # ATEC or BQ or LCQMC or PAWSX or STS-B
print(dataset)
print(dataset['test'][0])

output:

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label'],
        num_rows: 5231
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label'],
        num_rows: 1458
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label'],
        num_rows: 1361
    })
})
{'sentence1': '一个女孩在给她的头发做发型。', 'sentence2': '一个女孩在梳头。', 'label': 2}

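The common Chinese semantic matching datasets cover five tasks: ATEC, BQ, LCQMC, PAWSX, and STS-B. They can be downloaded from each dataset's own link or from Baidu Netdisk (extraction code: qkt6). There, the senteval_cn directory collects the evaluation datasets and senteval_cn.zip is an archive of the same directory; download either one.

Introduction to text vector methods

Question

How should text be represented as vectors? Which model works best for text matching?

The success of many NLP tasks depends on training high-quality, effective text representation vectors, especially semantic textual similarity (e.g. paraphrase detection, question-pair matching in QA) and dense text retrieval.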

Solution

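Traditional methods: feature-based matching

Extract lexical and topic-level features from the two texts with algorithms such as TF-IDF, BM25, Jaccard, SimHash, and LDA, then train a classifier with a machine learning model (LR, xgboost).
Pros: good interpretability
Cons: relies on hand-crafted features and generalizes only moderately; the limited number of features also caps model performance

Representative models:

BM25

BM25 computes a matching score from how well the fields of a candidate sentence cover the fields of the query; a higher score means the candidate matches the query better. It mainly addresses lexical-level similarity.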
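A compact sketch of BM25 scoring shows what "coverage of the query" means in practice. It implements a common Okapi BM25 variant with the widely used defaults k1 = 1.5 and b = 0.75 and a non-negative IDF; the function name `bm25_scores`, these parameter values, and the pre-tokenized toy corpus are assumptions for illustration, and the project's RankBM25 model may differ in its details.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score each tokenized document in corpus_tokens against the tokenized query."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Number of documents containing each term
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher weight; the "1 +" keeps the IDF non-negative
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term-frequency saturation with document-length normalization
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores


corpus = [
    ["花呗", "更改", "绑定", "银行卡"],
    ["我", "什么", "时候", "开通", "了", "花呗"],
    ["a", "man", "is", "eating", "food"],
]
query = ["如何", "更换", "花呗", "绑定", "银行卡"]
print(bm25_scores(query, corpus))
```

The first document shares three terms with the query and scores highest, while the third shares none and scores zero; synonyms such as 更换 and 更改 contribute nothing, which is the lexical limitation noted above.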

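Deep methods: representation-based matching

Representation-based matching first processes the two texts independently: a deep neural network encodes each one into a representation (embedding), and a similarity function is then applied to the two representations to obtain the similarity of the texts.
Pros: BERT-based models achieve good performance on text representation and text matching after supervised fine-tuning
Cons: sentence vectors derived from BERT itself (without fine-tuning, averaging all token vectors) are of low quality, even worse than GloVe, and therefore reflect the semantic similarity of two sentences poorly

The main reasons are:

1. BERT tends to encode all sentences into a small region of the space, so most sentence pairs get high similarity scores, even pairs that are semantically unrelated.

2. The clustering of BERT sentence vectors is related to high-frequency words in the sentence. Specifically, when a sentence vector is computed by averaging token vectors, the vectors of high-frequency words dominate it and obscure the sentence's actual meaning. Removing some high-frequency words when computing the sentence vector alleviates the clustering to some extent, but representation power drops.

Representative models:

Because BERT transformed NLP in 2018, models from before 2018 are not discussed or compared here (interested readers can refer to MatchZoo and MatchZoo-py, open-sourced by the Chinese Academy of Sciences).

This project therefore focuses on vector representation models that improve on native BERT and suit text matching: Sentence-BERT (2019), BERT-flow (2020), SimCSE (2021), and CoSENT (2022).

Deep methods: interaction-based matching

Interaction-based matching holds that computing similarity only at the final stage depends too heavily on the quality of the text representations and loses basic text features (lexical, syntactic, etc.). It therefore lets the text features interact as early as possible to capture more basic features, and computes the matching score at a higher layer from these basic matching features.
Pros: interaction-based matching models work end to end and perform well
Cons: these models (Cross-Encoder) take two sentences as input and output a similarity value for the pair. They produce no sentence embedding, and a single sentence cannot be fed to the model, so they are impractical for tasks that need text vector representations

Representative models:

Cross-Encoder is suited to the fine-ranking (re-ranking) stage of vector retrieval.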

Contact

Issues (suggestions):
Email me: xuming: [email protected]
WeChat: add my WeChat ID xuming624 with the note "name-company-NLP" to join the NLP discussion group.

Citation

If you use text2vec in your research, please cite it as follows:

APA:

Xu, M. Text2vec: Text to vector toolkit (Version 1.1.2) [Computer software]. https://github.com/shibing624/text2vec

BibTeX:

@misc{Text2vec,
  author = {Xu, Ming},
  title = {Text2vec: Text to vector toolkit},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/shibing624/text2vec}},
}

License

Licensed under The Apache License 2.0; free for commercial use. Please include a link to text2vec and the license in your product documentation.

Contribute

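The project code is still rough. If you improve the code, you are welcome to contribute it back to this project. Before submitting, note the following two points:

Add corresponding unit tests under tests
Run all unit tests with python -m pytest -v and make sure they all pass

Then submit your PR.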
