PaddlePaddle
diff --git a/LICENSE b/LICENSE (+1 -1)
diff --git a/README.md b/README.md (+189 -2)
diff --git a/examples/example.py b/examples/example.py (+38)
diff --git a/examples/faiss_example/index.py b/examples/faiss_example/index.py (+49)
diff --git a/examples/faiss_example/query.py b/examples/faiss_example/query.py (+27)
diff --git a/examples/faiss_example/requirements.txt b/examples/faiss_example/requirements.txt (+1)
@@ -1,4 +1,4 @@
-                                 Apache License
+Apache License
                            Version 2.0, January 2004
                         http://www.apache.org/licenses/
 
 
@@ -1,2 +1,189 @@
-# RocketQA
-RocketQA
+# RocketQA End-to-End QA-system Development Tool
+
+This repository provides a simple and efficient toolkit for running RocketQA models and building a Question Answering (QA) system.
+
+## RocketQA
+**RocketQA** is a series of dense retrieval models for Open-Domain QA. 
+
+Open-Domain QA aims to find answers to natural language questions in a large collection of documents. Common approaches have two stages: first, a dense retriever selects a few relevant contexts; then, a neural reader extracts the answer.
+
+RocketQA focuses on improving the dense context retrieval stage and proposes the following methods:
+#### 1. [RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/pdf/2010.08191.pdf)
+
+#### 2. [PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval](https://aclanthology.org/2021.findings-acl.191.pdf)
+
+#### 3. [RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking](https://arxiv.org/pdf/2110.07367.pdf)
+
+
+## Features
+* ***State-of-the-art***: RocketQA models achieve SOTA performance on the MS MARCO passage ranking dataset and the Natural Questions dataset.
+* ***First-Chinese-model***: RocketQA-zh is the first open-source Chinese dense retrieval model.
+* ***Easy-to-use***: both a Python installation package and a Docker environment are provided.
+* ***Solution-for-QA-system***: developers can build an end-to-end QA system with one line of code.
+  
+  
+
+## Installation
+
+### Install Python package
+First, install [PaddlePaddle](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html).
+```bash
+# GPU version:
+$ pip install paddlepaddle-gpu
+
+# CPU version:
+$ pip install paddlepaddle
+```
+
+Second, install the rocketqa package:
+```bash
+$ pip install rocketqa
+```
+
+NOTE: The RocketQA package requires Python 3.6+ and [PaddlePaddle](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html) 2.0+.
+
+### Download Docker environment
+
+```bash
+docker pull rocketqa/rocketqa
+
+docker run -it docker.io/rocketqa/rocketqa bash
+```
+
+  
+## API
+The RocketQA development tool supports two types of models: an ERNIE-based dual encoder for answer retrieval and an ERNIE-based cross encoder for answer re-ranking. It provides the following methods:
+
+#### [`rocketqa.available_models()`](https://github.com/PaddlePaddle/RocketQA/blob/3a99cf2720486df8cc54acc0e9ce4cbcee993413/rocketqa/rocketqa.py#L17)
+
+Returns the names of the available RocketQA models. 
+
+#### [`rocketqa.load_model(model, use_cuda=False, device_id=0, batch_size=1)`](https://github.com/PaddlePaddle/RocketQA/blob/3a99cf2720486df8cc54acc0e9ce4cbcee993413/rocketqa/rocketqa.py#L52)
+
+Returns the model specified by the `model` parameter. Both dual encoders and cross encoders are initialized by this method. `model` can be either a RocketQA model name returned by `available_models()` or the path to the config file of your own checkpoint.
+
+---
+
+The dual encoder returned by `load_model()` supports the following methods:
+
+#### [`model.encode_query(query: List[str])`](https://github.com/PaddlePaddle/RocketQA/blob/3a99cf2720486df8cc54acc0e9ce4cbcee993413/rocketqa/predict/dual_encoder.py#L126)
+
+Given a list of queries, returns their representation vectors encoded by the model.
+
+#### [`model.encode_para(para: List[str], title: List[str])`](https://github.com/PaddlePaddle/RocketQA/blob/3a99cf2720486df8cc54acc0e9ce4cbcee993413/rocketqa/predict/dual_encoder.py#L154)
+
+Given a list of passages and their corresponding titles (optional), returns their representation vectors encoded by the model.
+
+#### [`model.matching(query: List[str], para: List[str], title: List[str])`](https://github.com/PaddlePaddle/RocketQA/blob/3a99cf2720486df8cc54acc0e9ce4cbcee993413/rocketqa/predict/dual_encoder.py#L187)
+
+Given a list of queries and passages (and titles), returns their matching scores (the dot product of the two representation vectors).
+
+---
+
+The cross encoder returned by `load_model()` supports the following method:
+
+#### [`model.matching(query: List[str], para: List[str], title: List[str])`](https://github.com/PaddlePaddle/RocketQA/blob/3a99cf2720486df8cc54acc0e9ce4cbcee993413/rocketqa/predict/cross_encoder.py#L129)
+
+Given a list of queries and passages (and titles), returns their matching scores (the probability that the passage is the correct answer to the query).
+  
+  
+
+## Examples
+
+With the examples below, developers can run RocketQA models or their own checkpoints.
+
+### Run RocketQA Model
+To run a RocketQA model, set the `model` parameter of `load_model()` to a model name returned by `available_models()`.
+
+```python
+import rocketqa
+
+query_list = ["trigeminal definition"]
+para_list = [
+    "Definition of TRIGEMINAL. : of or relating to the trigeminal nerve.ADVERTISEMENT. of or relating to the trigeminal nerve. ADVERTISEMENT."]
+
+# init dual encoder
+dual_encoder = rocketqa.load_model(model="v1_marco_de", use_cuda=True, batch_size=16)
+
+# encode query & para
+q_embs = dual_encoder.encode_query(query=query_list)
+p_embs = dual_encoder.encode_para(para=para_list)
+# compute dot product of query representation and para representation
+dot_products = dual_encoder.matching(query=query_list, para=para_list)
+```
+
+### Run Self-developed Model
+To run your own checkpoint, write a config file and set the `model` parameter of `load_model()` to the path of that config file.
+
+```python
+import rocketqa
+
+query_list = ["交叉验证的作用"]
+title_list = ["交叉验证的介绍"]
+para_list = ["交叉验证(Cross-validation)主要用于建模应用中，例如PCR 、PLS回归建模中。在给定的建模样本中，拿出大部分样本进行建模型，留小部分样本用刚建立的模型进行预报，并求这小部分样本的预报误差，记录它们的平方加和。"]
+
+# conf
+ce_conf = {
+    "model": "./own_model/config.json",     # path of config file
+    "use_cuda": True,
+    "device_id": 0,
+    "batch_size": 16
+}
+
+# init cross encoder
+cross_encoder = rocketqa.load_model(**ce_conf)
+
+# compute matching score of query and para
+ranking_score = cross_encoder.matching(query=query_list, para=para_list, title=title_list)
+```
+
+The config file is in JSON format. All paths in it are relative to the config file.
+```json
+{
+    "model_type": "cross_encoder",
+    "max_seq_len": 160,
+    "model_conf_path": "en_large_config.json",
+    "model_vocab_path": "en_vocab.txt",
+    "model_checkpoint_path": "marco_cross_encoder_large",
+    "joint_training": 0
+}
+```
+  
+
+
+## Start your QA-System
+
+With the examples below, developers can build their own QA system.
+
+### Running with JINA
+```bash
+cd examples/jina_example/
+pip3 install -r requirements.txt
+
+# Index
+python3 app.py index
+
+# Search
+python3 app.py query
+```
+
+To learn more, see the [JINA example](https://github.com/PaddlePaddle/RocketQA/tree/main/examples/jina_example).
+
+
+
+### Running with Faiss
+
+```bash
+cd examples/faiss_example/
+pip3 install -r requirements.txt
+
+# Index
+python3 index.py ${language} ${data_file} ${index_file}
+
+# Start service
+python3 rocketqa_service.py ${language} ${data_file} ${index_file}
+
+# request
+python3 query.py
+```
+
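The README above describes the dual encoder's `matching()` score as the dot product of a query vector and a passage vector. The sketch below reproduces that arithmetic in plain Python on made-up three-dimensional vectors. It assumes queries and passages are paired by position, which is what the slicing in `examples/example.py` suggests. The helper names `dot` and `pairwise_matching` are hypothetical and are not part of the `rocketqa` package.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two vectors of equal dimension."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


def pairwise_matching(q_embs: Sequence[Sequence[float]],
                      p_embs: Sequence[Sequence[float]]) -> List[float]:
    """Score the i-th query against the i-th passage (assumed pairing)."""
    if len(q_embs) != len(p_embs):
        raise ValueError("need exactly one passage per query")
    return [dot(q, p) for q, p in zip(q_embs, p_embs)]


# Toy embeddings standing in for encoder output (illustrative values only).
q_embs = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.0]]
p_embs = [[3.0, 1.0, 1.0], [2.0, -2.0, 4.0]]

scores = pairwise_matching(q_embs, p_embs)
print(scores)  # [5.0, 0.0]
```

A higher dot product means the encoder places the query and the passage closer together. The cross encoder's score is a probability and cannot be reproduced from the two vectors alone.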
@@ -0,0 +1,38 @@
+import os
+import sys
+import rocketqa
+
+query_list = []
+para_list = []
+title_list = []
+marco_q_file = 'marco.q'
+for line in open(marco_q_file):
+    query_list.append(line.strip())
+
+marco_tp_file = 'marco.tp.1k'
+for line in open(marco_tp_file):
+    t, p = line.strip().split('\t')
+    para_list.append(p)
+    title_list.append(t)
+
+dual_encoder = rocketqa.load_model(model="v1_marco_de", use_cuda=True, device_id=0, batch_size=32)
+
+q_embs = dual_encoder.encode_query(query=query_list)
+for q in q_embs:
+    print (' '.join(str(ii) for ii in q))
+p_embs = dual_encoder.encode_para(para=para_list, title=title_list)
+for p in p_embs:
+    print (' '.join(str(ii) for ii in p))
+ips = dual_encoder.matching(query=query_list, \
+                            para=para_list[:len(query_list)], \
+                            title=title_list[:len(query_list)])
+for ip in ips:
+    print (ip)
+
+cross_encoder = rocketqa.load_model(model="v1_marco_ce", use_cuda=True, device_id=0, batch_size=32)
+ranking_score = cross_encoder.matching(query=query_list, \
+                                       para=para_list[:len(query_list)], \
+                                       title=title_list[:len(query_list)])
+for rs in ranking_score:
+    print (rs)
+
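`examples/example.py` reads `marco.tp.1k` as one `title<TAB>passage` pair per line and unpacks the result of `split('\t')` directly, so a blank line or a passage containing an extra tab raises `ValueError`. A more tolerant reader could look like the sketch below. The function name `read_title_para` and the policy of skipping lines without a tab are assumptions made for illustration, not behaviour of the original script.

```python
import io
from typing import Iterable, List, Tuple


def read_title_para(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Parse 'title<TAB>passage' lines into parallel title and passage lists."""
    titles: List[str] = []
    paras: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside the passage are kept.
        title, sep, para = line.partition("\t")
        if not sep:
            continue  # no tab at all: skip the line (assumed policy)
        titles.append(title.strip())
        paras.append(para.strip())
    return titles, paras


# In-memory stand-in for a data file such as marco.tp.1k.
sample = io.StringIO(
    "Title A\tFirst passage.\n"
    "\n"
    "broken line without a tab\n"
    "Title B\tSecond\tpassage.\n"
)
titles, paras = read_title_para(sample)
print(titles)  # ['Title A', 'Title B']
```

The same function accepts an open file handle, for example `with open(path, encoding="utf-8") as f: read_title_para(f)`.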
@@ -0,0 +1,49 @@
+import os
+import sys
+import faiss
+import rocketqa
+
+
+def build_index(encoder_conf, index_file_name, title_list, para_list):
+
+    dual_encoder = rocketqa.load_model(**encoder_conf)
+    para_embs = dual_encoder.encode_para(para=para_list, title=title_list)
+
+    indexer = faiss.IndexFlatIP(768)
+    indexer.add(para_embs.astype('float32'))
+    faiss.write_index(indexer, index_file_name)
+
+
+if __name__ == '__main__':
+    if len(sys.argv) != 4:
+        print ("USAGE: ")
+        print ("      python3 index.py ${language} ${data_file} ${index_file}")
+        print ("--For Example:")
+        print ("      python3 index.py zh ../marco.tp.1k marco_test.index")
+        exit()
+
+    language = sys.argv[1]
+    data_file = sys.argv[2]
+    index_file = sys.argv[3]
+    if language == 'zh':
+        model = 'zh_dureader_de'
+    elif language == 'en':
+        model = 'v1_marco_de'
+    else:
+        print ("illegal language, only [zh] and [en] are supported", file=sys.stderr)
+        sys.exit(1)
+
+    para_list = []
+    title_list = []
+    for line in open(data_file):
+        t, p = line.strip().split('\t')
+        para_list.append(p)
+        title_list.append(t)
+
+    de_conf = {
+            "model": model,
+            "use_cuda": True,
+            "device_id": 0,
+            "batch_size": 32
+    }
+    build_index(de_conf, index_file, title_list, para_list)
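`index.py` stores the passage embeddings in `faiss.IndexFlatIP`, a flat index that performs exact inner-product search. To show what a top-k lookup against such an index computes, here is a brute-force stand-in in pure Python. It is an illustrative sketch rather than the Faiss implementation, and the function name `search_inner_product` is hypothetical.

```python
import heapq
from typing import List, Sequence, Tuple


def search_inner_product(query: Sequence[float],
                         passages: Sequence[Sequence[float]],
                         k: int) -> List[Tuple[int, float]]:
    """Return (passage_index, score) for the k highest inner products."""
    scored = (
        (sum(a * b for a, b in zip(query, p)), i)
        for i, p in enumerate(passages)
    )
    # nlargest keeps only k candidates in memory and returns them best first.
    return [(i, score) for score, i in heapq.nlargest(k, scored)]


# Toy 2-dimensional "embeddings"; the script above uses 768 dimensions.
passage_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query_emb = [2.0, 1.0]

hits = search_inner_product(query_emb, passage_embs, k=2)
print(hits)  # [(2, 3.0), (0, 2.0)]
```

Because every passage is scored for every query, the cost grows linearly with the collection size. That is acceptable for a small file like the 1k-passage example.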
@@ -0,0 +1,27 @@
+import sys
+import requests
+import json
+
+SERVICE_ADD = 'http://localhost:8888/rocketqa'
+TOPK = 5
+
+while 1:
+    query = input("please input a query:\t")
+    if query.strip() == '':
+        break
+
+    input_data = {}
+    input_data['query'] = query
+    input_data['topk'] = TOPK
+    json_str = json.dumps(input_data)
+
+    result = requests.post(SERVICE_ADD, json=input_data)
+    res_json = json.loads(result.text)
+
+    print ("QUERY:\t" + query)
+    for i, answer in enumerate(res_json['answer'][:TOPK]):  # the service may return fewer than TOPK answers
+        title = answer['title']
+        para = answer['para']
+        score = answer['probability']
+        print ('{}'.format(i + 1) + '\t' + title + '\t' + para + '\t' + str(score))
+
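`query.py` expects the service to answer with a JSON object whose `answer` list holds entries with `title`, `para` and `probability` keys. The sketch below separates the formatting of that response from the network call so it can be exercised without a running service. The function name `format_answers` and the sample response are hypothetical; the response shape is inferred only from the fields that `query.py` reads.

```python
from typing import Any, Dict, List


def format_answers(query: str, response: Dict[str, Any], topk: int) -> List[str]:
    """Build the tab-separated output rows for one query."""
    rows = ["QUERY:\t" + query]
    # Slice instead of indexing so a short answer list cannot raise IndexError.
    answers = response.get("answer", [])[:topk]
    for rank, answer in enumerate(answers, start=1):
        rows.append("\t".join([
            str(rank),
            answer["title"],
            answer["para"],
            str(answer["probability"]),
        ]))
    return rows


# Made-up response with fewer answers than the requested top-k.
fake_response = {
    "answer": [
        {"title": "T1", "para": "P1", "probability": 0.9},
        {"title": "T2", "para": "P2", "probability": 0.4},
    ]
}
rows = format_answers("q", fake_response, topk=5)
print("\n".join(rows))
```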
@@ -0,0 +1 @@
+faiss-cpu