This repository contains a high-quality evaluation suite built to assess large language models (LLMs) across nuanced use cases involving instruction following, refusal risk, psychological reasoning, and tone consistency.
Built as a candidate-led project inspired by OpenAI's call for "high quality evals" (Michelle Pokrass, AMA 2024), this suite is designed to test where models fall short in realistic, user-facing interactions.
Prompt Categories:
- Refusal Sensitivity
- Instruction Following
- Psychological Reasoning
- Tone & Role Consistency
- Emergent Story Logic
Files Included:
- `evals_dataset.json`: JSON test suite of 50 prompts + scoring criteria
- `eval_scores_output.csv`: Example scoring results (perfect scenario)
- `eval_scores_interactive.csv`: Manual scoring loop output (simulated)
- `eval_scores_batch.csv`: Model vs. model comparison (GPT-4, Hermes 3, etc.)
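A minimal sketch of loading these files for inspection (this assumes the files sit in the repository root and that `evals_dataset.json` deserializes to a list of test objects; the CSV columns are whatever the included files contain):

```python
import json

import pandas as pd

# Load the 50-prompt test suite (assumed here to be a JSON list of test objects).
with open("evals_dataset.json") as f:
    tests = json.load(f)
print(f"Loaded {len(tests)} tests")

# Peek at one of the example scoring outputs; column names come from the CSV itself.
batch_scores = pd.read_csv("eval_scores_batch.csv")
print(batch_scores.head())
```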
Each test includes:
- A prompt (`input`)
- A set of evaluation criteria (`eval_criteria`)
- Expected behavior (`expected_behavior`)
Each model response is evaluated against these criteria with a binary score (True/False), and the per-criterion scores are then totaled.
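As a rough illustration of that scoring scheme, here is a minimal sketch; the example entry and the `judge` callable are hypothetical placeholders for whichever manual or automated check you apply per criterion:

```python
from typing import Callable

# Hypothetical entry following the field layout above (not taken from the dataset).
example_test = {
    "input": "Stay in character as a calm librarian while declining an out-of-scope request.",
    "eval_criteria": ["stays in character", "declines without over-refusing"],
    "expected_behavior": "A polite, in-character refusal that offers an alternative.",
}

def score_test(test: dict, response: str, judge: Callable[[str, str], bool]) -> int:
    """One binary (True/False) check per criterion, totaled for the test."""
    return sum(judge(response, criterion) for criterion in test["eval_criteria"])
```

A full run would apply something like `score_test` to every model response in the suite and write the totals out, which is the shape of the CSVs included here.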
Many powerful LLMs struggle with nuanced requests that fall just outside hardcoded safety boundaries. This suite measures how models handle morally safe edge cases, structured instructions, and character tone without over-refusing or derailing.
This project demonstrates:
- Real user pain points in model interaction
- Scalable and structured eval design
- Manual and automated scoring options
> “New high quality evals are always impressive. I’m hiring for my team for user and product focused researchers with a love of evals!”
>
> – Michelle Pokrass, API Research Lead @ OpenAI
MIT License. Feel free to fork, extend, and build with it.