This is a reproduction repository for analyzing the impact of under-specification on LLM behavior.
Refer to the paper for the full experiment setup.
We share all experiment configurations in data/configs, all prompts in data/prompts, all curated requirements in data/requirements, and the evaluation results here.
Download the evaluation data from here. Create three directories, data/results/commitpack, data/results/trip, and data/results/product, and uncompress the evaluation results into each directory.
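For example, assuming each dataset's results arrive as a separate archive (the archive names below are placeholders for whatever the download provides), the setup could look like:
# create one result directory per dataset
mkdir -p data/results/commitpack data/results/trip data/results/product
# uncompress each downloaded archive into its directory (archive names are placeholders)
tar -xzf commitpack_results.tar.gz -C data/results/commitpack
tar -xzf trip_results.tar.gz -C data/results/trip
tar -xzf product_results.tar.gz -C data/results/product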
Then run the steps in analysis-reproduction.ipynb to reproduce the analysis.
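Assuming Jupyter is available in the project environment (it may need to be added separately), the notebook can be opened from within the poetry environment set up below:
# open the analysis notebook (assumes Jupyter is installed in the poetry environment)
poetry run jupyter notebook analysis-reproduction.ipynb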
First, install the dependencies with poetry: poetry install.
Next, add your OpenAI key for running the OpenAI models and your Bedrock credentials for running the Llama3 models.
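How the keys are supplied depends on the client libraries; a common approach is to export them as environment variables before running the commands below. The exact variable names here are assumptions, not confirmed by this repository:
# OpenAI key (assumed variable name used by the OpenAI client)
export OPENAI_API_KEY="sk-..."
# AWS credentials for the Bedrock-hosted Llama3 models (assumed variable names)
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_DEFAULT_REGION="us-east-1"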
To run the main experiments, use
poetry run python3 run.py --config=data/configs/commitpack_main.yaml
poetry run python3 run.py --config=data/configs/trip_main.yaml
poetry run python3 run.py --config=data/configs/product_main.yaml
To run the fix experiments, use
poetry run python3 run.py --config=data/configs/commitpack_fix.yaml
poetry run python3 run.py --config=data/configs/trip_fix.yaml
poetry run python3 run.py --config=data/configs/product_fix.yaml
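The main and fix runs are independent, so they can also be launched in a single loop over the config files listed above, for example:
# run every main and fix config sequentially
for cfg in commitpack_main trip_main product_main commitpack_fix trip_fix product_fix; do
  poetry run python3 run.py --config=data/configs/${cfg}.yaml
done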
To rerun prompt optimization, use
poetry run python3 -m analysis.optimize --config=data/configs/commitpack_optimizer_gen.yaml
poetry run python3 -m analysis.optimize --config=data/configs/trip_optimizer_gen.yaml
poetry run python3 -m analysis.optimize --config=data/configs/product_optimizer_gen.yaml
To reuse the optimized prompts, use
poetry run python3 run.py --config=data/configs/commitpack_prioritize.yaml
poetry run python3 run.py --config=data/configs/trip_prioritize.yaml
poetry run python3 run.py --config=data/configs/product_prioritize.yaml
To generate new requirements, use
poetry run python3 -m analysis.elicitation
To generate new prompts, use
poetry run python3 -m analysis.prompt_gen
To generate new evaluators, use
poetry run python3 -m analysis.judge