# Evaluation of Large Language Models via Coupled Token Generation

This repository contains the code and data for the paper "Evaluation of Large Language Models via Coupled Token Generation"
by Nina Corvelo Benz, Stratis Tsirtsis, Eleni Straitouri, Ivi Chatzi, Ander Artola Velasco,
Suhas Thejaswi, and Manuel Gomez-Rodriguez.

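For context, coupled token generation samples from different models' next-token distributions using shared randomness, so that differences in the generated outputs reflect differences between the models rather than sampling noise. The snippet below is a minimal conceptual sketch of one standard way to couple samplers (inverse-CDF sampling with a shared uniform draw); it is an illustration only, not the repository's implementation:

```python
import random

def sample_inverse_cdf(probs, u):
    """Pick the token index whose cumulative-probability interval contains u."""
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point round-off

# Two hypothetical next-token distributions over a shared 3-token vocabulary.
model_a = [0.7, 0.2, 0.1]
model_b = [0.6, 0.3, 0.1]

u = random.Random(0).random()  # one shared uniform draw couples both samples
token_a = sample_inverse_cdf(model_a, u)
token_b = sample_inverse_cdf(model_b, u)
```

Because both models consume the same draw `u`, they tend to emit the same token wherever their distributions overlap, which is what makes coupled generation useful for controlled comparisons.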
## Dependencies

All experiments were performed using Python 3.11. To create a virtual environment and install the project dependencies, run the following commands:

```bash
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
```

## Code organization

The directory [data](data/) contains the data used for the experiments.

The directory [models](models/) contains the list of models used.

The directory [src](src/) contains the source code for the experiments.

The directory [scripts](scripts/) contains bash scripts that use the code under [src](src/) to run the experiments.

The directory [notebooks](notebooks/) contains Jupyter notebooks that produce the figures appearing in the paper.

The directory [figures](figures/) is used for saving the figures produced by the notebooks.

The directory [outputs](outputs/) is used for saving the outputs produced by the scripts.
## Instructions

### Downloading the models

Our experiments use LLMs from the Llama family.
Llama is a "gated" model, that is, you must accept its license before you can access it.
You can request access at [https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).
Once you have access, you can download any model in the Llama family.
Then, before running the scripts, you need to authenticate with your Hugging Face account by running `huggingface-cli login` in the terminal.
Each model will be downloaded to the [models](models/) folder the first time it is called from a script.

### Setting up

Before running the scripts, run `python3 src/merge_tokenizers.py` to set up the joint vocabulary.
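Coupled generation requires that all models draw tokens from a common vocabulary; the actual construction lives in `src/merge_tokenizers.py`. The snippet below is only an illustrative sketch of the general idea — a deterministic union of two token vocabularies — using made-up toy vocabularies rather than real tokenizers:

```python
def merge_vocabs(vocab_a, vocab_b):
    """Build a joint vocabulary: the union of both token sets, re-indexed deterministically."""
    merged_tokens = sorted(set(vocab_a) | set(vocab_b))
    return {token: index for index, token in enumerate(merged_tokens)}

# Toy vocabularies (token -> id); real ones would come from the models' tokenizers.
vocab_a = {"hello": 0, "world": 1, "!": 2}
vocab_b = {"hello": 0, "there": 1, "!": 2}
joint = merge_vocabs(vocab_a, vocab_b)
```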

### MMLU experiment

The final output files of the experiment are provided in the [outputs/mmlu](outputs/mmlu/) directory.
To reproduce the figures in the paper, you only need to run the [mmlu.ipynb](notebooks/mmlu.ipynb) notebook.

The script [mmlu.sh](scripts/mmlu.sh) produces the outputs of one LLM, using one seed, given the questions from the MMLU dataset as input prompts.
To reproduce all the outputs, run the script twice for each model (once for independent and once for coupled generation), using the seeds provided in the script.

### LMSYS experiment

The final output files of the experiment are provided in the [outputs/LMSYS](outputs/LMSYS/) directory.
To reproduce the figures in the paper, you only need to run the [lmsys.ipynb](notebooks/lmsys.ipynb) notebook.

The script [lmsys.sh](scripts/lmsys.sh) produces the outputs of one LLM, using one seed, given the questions in [data/processed/LMSYS/questions.json](data/processed/LMSYS/questions.json) as input prompts.
To reproduce all the outputs, run the script twice for each model (once for independent and once for coupled generation), using the seeds provided in the script.
The results of the pairwise comparisons of these outputs by GPT-4o-2024-11-20 are provided in the [outputs/LMSYS](outputs/LMSYS/) directory.

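To turn such pairwise judgments into a summary comparison between two models, one common statistic is the empirical win rate, with ties counted as half a win. This is a generic illustrative computation, not necessarily the exact aggregation used in the notebooks:

```python
def win_rate(judgments):
    """Fraction of pairwise comparisons won by model A; a tie counts as half a win."""
    score = {"A": 1.0, "tie": 0.5, "B": 0.0}
    return sum(score[j] for j in judgments) / len(judgments)

# Toy judgments, e.g. produced by a judge model comparing outputs of models A and B.
example = ["A", "B", "tie", "A"]
rate = win_rate(example)  # (1 + 0 + 0.5 + 1) / 4 = 0.625
```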
## Contact & attribution

If you have questions about the code, identify potential bugs, or would like us to include additional functionality, feel free to open an issue or contact [Ivi Chatzi](mailto:[email protected]) or [Eleni Straitouri](mailto:[email protected]).

If you use parts of the code in this repository for your own research, please consider citing:

```bibtex
@article{benz2025evaluation,
  title={Evaluation of Large Language Models via Coupled Token Generation},
  author={Nina Corvelo Benz and Stratis Tsirtsis and Eleni Straitouri and Ivi Chatzi and Ander Artola Velasco and Suhas Thejaswi and Manuel Gomez-Rodriguez},
  year={2025},
  journal={arXiv preprint arXiv:coming_soon}
}
```