Hi, I am trying to reproduce the evaluation results and found a significant gap, as mentioned in a previous issue that was marked as completed. I suspect the gap is mainly caused by a mismatch between the default format of the RAG data provided in the README and the actual RAG data the authors used in the R1-Searcher paper. Could the authors share whether they have updated the actual knowledge base or the script for extracting abstracts?
By the way, here are my reproduced evaluation results on HotpotQA:

```
{'F1': 0.6034644757707149, 'EM': 0.48, 'CEM': 0.544, 'time_use_in_second': 83.04516339302063, 'time_use_in_minite': '1:23'}
```