# longbench_en
LongBench is a benchmark for bilingual, multitask, and comprehensive assessment of the long-context understanding capabilities of large language models. This project tests the performance of the relevant models on the LongBench dataset.
Set up the environment according to `requirements.txt`; the dependencies are reproduced below:

```
datasets
tqdm
rouge
jieba
fuzzywuzzy
einops
torch>=2.0.1
transformers==4.37.2
```
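Assuming the `requirements.txt` shipped with this repository lists the packages above, a standard pip installation is sufficient:

```bash
pip install -r requirements.txt
```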
The inference script will automatically download the dataset from 🤗 Datasets.
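If you want to cache the dataset ahead of time (for example, on a machine with restricted network access), one option is the `huggingface-cli` tool from `huggingface_hub`; `THUDM/LongBench` is the official LongBench dataset repo on the Hugging Face Hub:

```bash
# Optional: pre-download the LongBench dataset into the local HF cache
huggingface-cli download THUDM/LongBench --repo-type dataset
```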
Run the following script:

```bash
model_path=path/to/chinese_mixtral
output_dir=path/to/output_dir
data_class=zh
with_inst="true" # or "false" or "auto"
max_length=32256

cd scripts/longbench
python pred_mixtral.py \
    --model_path ${model_path} \
    --predict_on ${data_class} \
    --output_dir ${output_dir} \
    --with_inst ${with_inst} \
    --max_length ${max_length} \
    --load_in_4bit \
    --use_flash_attention_2
```
- `--model_path ${model_path}`: Path to the model to be evaluated (the full Chinese-Mixtral or Chinese-Mixtral-Instruct model, not LoRA).
- `--predict_on ${data_class}`: The tasks to predict on. Possible values are `en`, `zh`, `code`, or a combination such as `en,zh,code`.
- `--output_dir ${output_dir}`: Output directory for the predictions and logs.
- `--max_length ${max_length}`: Maximum length of the instructions. Note that the lengths of the system prompt and the task-related prompt are not included.
- `--with_inst ${with_inst}`: Whether to use the system prompt and template of Chinese-Mixtral-Instruct when constructing the instructions:
  - `true`: use the system prompt and template on all tasks
  - `false`: use the system prompt and template on none of the tasks
  - `auto`: use the system prompt and template on some tasks (the default strategy of the official LongBench code)

  We suggest setting `--with_inst` to `false`.
- `--gpus ${gpus}`: Specify the GPUs to use with this argument, such as `0,1`.
- `--e`: Predict on the LongBench-E dataset (see the example invocation after this list). See the official LongBench documentation for details of LongBench-E.
- `--load_in_4bit`: Load the model in 4-bit quantization.
- `--use_flash_attention_2`: Use Flash Attention 2 to accelerate inference; otherwise SDPA is used for acceleration.
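For illustration, a hypothetical invocation combining the optional flags above, predicting on the English tasks of LongBench-E with GPUs 0 and 1 (variable values are placeholders carried over from the script above):

```bash
# Example: evaluate the English tasks of LongBench-E on GPUs 0 and 1
python pred_mixtral.py \
    --model_path ${model_path} \
    --predict_on en \
    --output_dir ${output_dir} \
    --with_inst false \
    --max_length 32256 \
    --gpus 0,1 \
    --e \
    --load_in_4bit \
    --use_flash_attention_2
```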
When the script has finished running, the prediction files are stored under `${output_dir}/pred/` or `${output_dir}/pred_e/` (depending on whether you are testing on LongBench-E). Run the following command to compute the metrics:

```bash
python eval.py --output_dir ${output_dir}
```
If testing on LongBench-E, provide `-e` when computing the metrics:

```bash
python eval.py --output_dir ${output_dir} -e
```
The results are stored in `${output_dir}/result.json` or `${output_dir}/pred_e/result.json`. For example, the results of Chinese-Mixtral-Instruct on the LongBench Chinese tasks (`--predict_on zh`) are:
```json
{
    "lsht": 42.0,
    "multifieldqa_zh": 50.28,
    "passage_retrieval_zh": 89.5,
    "vcsum": 16.41,
    "dureader": 34.15
}
```
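To sanity-check the metrics from the shell, the result file can be pretty-printed with Python's built-in `json.tool` (the path follows from the run above):

```bash
python -m json.tool ${output_dir}/result.json
```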