Awesome Deliberative Prompting

Note

Deliberative prompting, chain-of-thought, self-reflection and thinking have become mainstream techniques in AI. This archived, opinionated reading list documents the journey the community has taken to achieve this feat in less than four years, from the beginnings in 2021 to January 2025, when DeepSeek-R1 was released. Thanks for following.

How to ask Large Language Models (LLMs) to produce reliable reasoning and make reason-responsive decisions.

deliberation, n.

The action of thinking carefully about something, esp. in order to reach a decision; careful consideration; an act or instance of this. (OED)

Contents

  • Success Stories
  • Prompting Patterns and Strategies
    • Beyond "Let's think step by step"
    • Multi-Agent Deliberation
    • Reflection and Meta-Cognition
  • Text Generation Techniques
  • Self-Correction
  • Reasoning Analytics
  • Limitations, Failures, Puzzles
  • Datasets
  • Tools and Frameworks
  • Other Resources

Success Stories

Striking evidence for the effectiveness of deliberative prompting.

  • 🎓 One of the first attempts to elicit reasoning traces from LLMs to improve performance; includes experiments with GPT-2. "Thinking Aloud: Dynamic Context Generation Improves Zero-Shot Reasoning Performance of GPT-2." 2021-03-24. [>paper]
  • 🎓 The original "chain of thought" (CoT) paper, first to give clear evidence that deliberative prompting works. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." 2022-01-28. [>paper]
  • 🎓 Deliberative prompting improves the ability of Google's LLMs to solve unseen difficult problems, and instruction-finetuned (Flan-) models are much better at it.
    • "Scaling Instruction-Finetuned Language Models." 2022-12-06. [>paper]
    • "PaLM 2 Technical Report." 2023-05-17. [>paper]
  • 🎓 Deliberative prompting is highly effective for OpenAI's models (Text-Davinci-003, ChatGPT, GPT-4), increasing accuracy in many (yet not all) reasoning tasks in the AGIEval benchmark. "AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models." 2023-04-13. [>paper]
  • 🎓 Deliberative prompting unlocks latent cognitive skills and is more effective for bigger models. "Challenging BIG-Bench tasks and whether chain-of-thought can solve them." 2022-10-17. [>paper]
  • 🎓 Experimentally introducing errors in CoT reasoning traces decreases decision accuracy, which provides indirect evidence for the reason-responsiveness of LLMs. "Stress Testing Chain-of-Thought Prompting for Large Language Models." 2023-09-28. [>paper]
  • 🎓 Reasoning (about retrieval candidates) improves RAG. "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection." 2023-10-17. [>paper]
  • 🎓 Deliberative reading notes improve RAG. "Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models." 2023-11-15. [>paper]
  • 🎓 Good reasoning (CoT) causes good answers (i.e., LLMs are reason-responsive). "Causal Abstraction for Chain-of-Thought Reasoning in Arithmetic Word Problems." 2023-12-07. [>paper]
  • 🎓 Logical interpretation of internal layer-wise processing of reasoning tasks yields further evidence for reason-responsiveness. "Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models." 2023-12-07. [>paper]
  • 🎓 Reasoning about alternative drafts improves text generation. "Self-Evaluation Improves Selective Generation in Large Language Models." 2023-12-14. [>paper]
  • 🎓 CoT with carefully retrieved, diverse reasoning demonstrations boosts multi-modal LLMs. "Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models." 2023-12-04. [>paper]
  • 🎓 Effective multi-hop CoT for visual question answering. "II-MMR: Identifying and Improving Multi-modal Multi-hop Reasoning in Visual Question Answering." 2024-02-16. [>paper]
  • 🎓 👩‍💻 DPO on synthetic CoT traces increases the reason-responsiveness of small LLMs. "Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning." 2024-02-23. [>paper] [>code]
  • 🎓 The impressive DeepSeek-R1 demonstrates that LLMs can learn effective problem solving, reflection, self-validation and self-correction through RL alone. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." 2025-01-22. [>paper]

Prompting Patterns and Strategies

Prompting strategies and patterns to make LLMs deliberate.

Beyond "Let's think step by step"

Instructing LLMs to reason (in a specific way).

  • 🎓 Asking GPT-4 to provide both a correct and a wrong answer boosts accuracy. "Large Language Models are Contrastive Reasoners." 2024-03-13. [>paper]
  • 🔥🎓 Guided dynamic prompting increases GPT-4 CoT performance by up to 30 percentage points. "Structure Guided Prompt: Instructing Large Language Model in Multi-Step Reasoning by Exploring Graph Structure of the Text." 2024-02-20. [>paper]
  • 🎓 Letting LLMs choose and combine reasoning strategies is cost-efficient and improves performance. "SELF-DISCOVER: Large Language Models Self-Compose Reasoning Structures." 2024-02-06. [>paper]
  • 🎓 CoA: Produce an abstract reasoning trace first, and fill in the details (using tools) later. "Efficient Tool Use with Chain-of-Abstraction Reasoning." 2024-01-30. [>paper]
  • 🎓 Reason over and over again until a verification test is passed. "Plan, Verify and Switch: Integrated Reasoning with Diverse X-of-Thoughts." 2023-10-23. [>paper]
  • 🎓 Generate multiple diverse deliberations, then synthesize them into a single reasoning path. "Ask One More Time: Self-Agreement Improves Reasoning of Language Models in (Almost) All Scenarios." 2023-11-14. [>paper]
  • 🎓 Survey of CoT regarding task types, prompt designs, and reasoning quality metrics. "Towards Better Chain-of-Thought Prompting Strategies: A Survey." 2023-10-08. [>paper]
  • 🎓 Asking an LLM about a problem's broader context leads to better answers. "Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models." 2023-10-09. [>paper]
  • Weighing Pros and Cons: This universal deliberation paradigm can be implemented with LLMs (a minimal sketch follows this list).
    • 👩‍💻 A {{guidance}} program that does: 1. Identify Options → 2. Generate Pros and Cons → 3. Weigh Reasons → 4. Decide. [>code]
  • 🎓 👩‍💻 Plan-and-Solve Prompting. "Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models." 2023-05-06. [>paper] [>code]
  • 🎓 Note-Taking. "Learning to Reason and Memorize with Self-Notes." 2023-05-01. [>paper]
  • 🎓 Deliberate-then-Generate improves text quality. "Deliberate then Generate: Enhanced Prompting Framework for Text Generation." 2023-05-31. [>paper]
  • 🎓 Make the LLM spontaneously interleave reasoning and Q/A. "ReAct: Synergizing Reasoning and Acting in Language Models." 2022-10-06. [>paper]
  • 🎓 'Divide-and-conquer' instructions substantially outperform standard CoT. "Least-to-Most Prompting Enables Complex Reasoning in Large Language Models." 2022-05-21. [>paper]
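
A minimal sketch of the pros-and-cons pattern above, written as plain Python prompt chaining. The `complete()` helper is a hypothetical placeholder for whatever LLM client you use; it is not the API of the linked {{guidance}} program.

```python
# Minimal pros-and-cons deliberation pipeline (hypothetical sketch).

def complete(prompt: str) -> str:
    """Placeholder: route the prompt to an LLM and return its answer."""
    raise NotImplementedError("plug in your LLM API call here")

def weigh_pros_and_cons(problem: str) -> str:
    # 1. Identify options
    options = complete(f"List the available options for this decision:\n{problem}")
    # 2. Generate pros and cons
    pros_cons = complete(
        f"Decision problem: {problem}\nOptions:\n{options}\n"
        "For each option, list its pros and cons."
    )
    # 3. Weigh reasons
    weighed = complete(
        "Weigh the pros and cons against each other and note which reasons "
        f"carry the most weight:\n{pros_cons}"
    )
    # 4. Decide
    return complete(
        f"Based on this deliberation, state the final decision and briefly justify it:\n{weighed}"
    )
```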

Multi-Agent Deliberation

Let one (or several) LLMs simulate an open controversy. (A minimal debate-loop sketch follows the list.)

  • 🎓 👩‍💻 Carefully selected open LLMs that iteratively review and improve their answers outperform GPT-4o. "Mixture-of-Agents Enhances Large Language Model Capabilities." 2024-06-10. [>paper] [>code]
  • 🎓 More elaborate and costly multi-agent-system designs are typically more effective, according to this review: "Are we going MAD? Benchmarking Multi-Agent Debate between Language Models for Medical Q&A." 2023-11-19. [>paper]
  • 🎓 Systematic peer review works even better than multi-agent debate. "Towards Reasoning in Large Language Models via Multi-Agent Peer Review Collaboration." 2023-11-14. [>paper]
  • 🎓 Collective critique and reflection reduce factual hallucinations and toxicity. "N-Critics: Self-Refinement of Large Language Models with Ensemble of Critics." 2023-10-28. [>paper]
  • 🎓 👩‍💻 A Delphi process with diverse LLMs yields more truthful results than simple debating. "ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs." 2023-09-22. [>paper] [>code]
  • 🎓 Multi-agent debate increases cognitive diversity, which in turn increases performance. "Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate." 2023-05-30. [>paper]
  • 🎓 Leverage wisdom-of-the-crowd effects through debate simulation. "Improving Factuality and Reasoning in Language Models through Multiagent Debate." 2023-05-23. [>paper]
  • 🎓 👩‍💻 Emulate a Socratic dialogue to collaboratively solve problems with multiple AI agents. "The Socratic Method for Self-Discovery in Large Language Models." 2023-05-05. [>blog] [>code]
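
A minimal sketch of the debate-simulation idea explored in the papers above: several "agents" answer independently, then repeatedly revise their answers in light of the others' responses. The `complete()` helper is again a hypothetical placeholder for an LLM call.

```python
# Hypothetical multi-agent debate loop (one LLM playing several agents).

def complete(prompt: str) -> str:
    """Placeholder: route the prompt to an LLM and return its answer."""
    raise NotImplementedError("plug in your LLM API call here")

def debate(question: str, n_agents: int = 3, n_rounds: int = 2) -> str:
    # Round 0: independent answers.
    answers = [
        complete(f"Answer with careful step-by-step reasoning:\n{question}")
        for _ in range(n_agents)
    ]
    # Debate rounds: each agent revises given the others' answers.
    for _ in range(n_rounds):
        answers = [
            complete(
                f"Question: {question}\n"
                f"Your previous answer:\n{answers[i]}\n"
                "Other agents' answers:\n"
                + "\n---\n".join(a for j, a in enumerate(answers) if j != i)
                + "\nRevise your answer, adopting the strongest points of the others."
            )
            for i in range(n_agents)
        ]
    # Final synthesis into a consensus answer.
    return complete(
        "Synthesize a single consensus answer from these drafts:\n"
        + "\n---\n".join(answers)
    )
```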

Reflection and Meta-Cognition

Higher-order reasoning strategies that may improve first-order deliberation.

  • 🎓 👩‍💻 Keeping track of general insights gained from CoT problem solving improves future accuracy and efficiency. "Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models." 2024-06-06. [>paper] [>code]
  • 🎓 👩‍💻 Processing a task in function of its self-assessed difficulty boosts CoT effectiveness. "Divide and Conquer for Large Language Models Reasoning." 2024-01-10. [>paper] [>code]
  • 🎓 👩‍💻 Reflecting on the task allows the LLM to autogenerate more effective instructions, demonstrations, and reasoning traces. "Meta-CoT: Generalizable Chain-of-Thought Prompting in Mixed-task Scenarios with Large Language Models." 2023-10-11. [>paper] [>code]
  • 🎓 👩‍💻 An LLM-based AI instructor devises effective first-order CoT instructions (open-source models improve by up to 20%). "Agent Instructs Large Language Models to be General Zero-Shot Reasoners." 2023-10-05. [>paper] [>code]
  • 🎓 👩‍💻 Clarify→Judge→Evaluate→Confirm→Qualify paradigm. "Metacognitive Prompting Improves Understanding in Large Language Models." 2023-08-10. [>paper] [>code]
  • 🎓 👩‍💻 Find-then-simulate-an-expert-for-this-problem strategy. "Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm." 2021-02-15. [>paper] [>lmql]

Text Generation Techniques

Text generation techniques, which can be combined with prompting patterns and strategies.

  • 🔥🎓 Iterative revision of reasoning in light of previous CoT traces improves accuracy by 10-20%. "RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation." 2024-03-08. [>paper]
  • 🎓 Pipeline for self-generating and choosing effective CoT few-shot demonstrations. "Universal Self-adaptive Prompting." 2023-05-24. [>paper]
  • 🎓 More reasoning (i.e., longer reasoning traces) is better. "The Impact of Reasoning Step Length on Large Language Models." 2024-01-10. [>paper]
  • 🎓 Having (accordingly labeled) correct and erroneous (few-shot) reasoning demonstrations improves CoT. "Contrastive Chain-of-Thought Prompting." 2023-11-17. [>paper]
  • 🎓 Better problem-solving and deliberation through few-shot trial-and-error (in-context RL). "Reflexion: Language Agents with Verbal Reinforcement Learning." 2023-03-20. [>paper]
  • 🎓 External guides that constrain the generation of reasoning improve accuracy by up to 35% on selected tasks. "Certified Reasoning with Language Models." 2023-06-06. [>paper]
  • 🎓 👩‍💻 Highly effective beam search for generating complex, multi-step reasoning episodes (a minimal sketch follows this list). "Tree of Thoughts: Deliberate Problem Solving with Large Language Models." 2023-05-17. [>paper] [>code]
    • 👩‍💻 A minimalistic implementation of Tree-of-Thoughts as a plain prompt. [>code]
    • 👩‍💻 An experimental LMQL implementation of Tree-of-Thoughts. [>code]
  • 🎓 👩‍💻 LLM auto-generates diverse reasoning demonstrations to be used in deliberative prompting. "Automatic Chain of Thought Prompting in Large Language Models." 2022-10-07. [>paper] [>code]
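
A minimal sketch of the Tree-of-Thoughts-style beam search referenced above: sample several candidate next reasoning steps, self-evaluate each partial trace, and keep only the most promising ones. Both helpers are hypothetical placeholders, not the API of the linked implementations.

```python
# Hypothetical Tree-of-Thoughts-style beam search over reasoning steps.

def complete(prompt: str) -> str:
    """Placeholder: route the prompt to an LLM and return its continuation."""
    raise NotImplementedError("plug in your LLM API call here")

def score(problem: str, trace: str) -> float:
    """Self-evaluate a partial trace on a 0-10 scale (naive numeric parsing)."""
    reply = complete(
        f"Problem: {problem}\nPartial reasoning:\n{trace}\n"
        "On a scale from 0 to 10, how promising is this reasoning? "
        "Reply with a single number."
    )
    return float(reply.strip())

def tree_of_thoughts(problem: str, depth: int = 3, breadth: int = 3, beam: int = 2) -> str:
    traces = [""]  # start from an empty reasoning trace
    for _ in range(depth):
        # Expand: propose `breadth` candidate next steps per kept trace.
        candidates = [
            t + "\n" + complete(
                f"Problem: {problem}\nReasoning so far:\n{t}\n"
                "Propose the next reasoning step."
            )
            for t in traces
            for _ in range(breadth)
        ]
        # Prune: keep the `beam` highest-scoring partial traces.
        candidates.sort(key=lambda t: score(problem, t), reverse=True)
        traces = candidates[:beam]
    return complete(f"Problem: {problem}\nReasoning:\n{traces[0]}\nState the final answer.")
```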

Self-Correction

Let LLMs self-correct their deliberation. (A minimal verification-and-revision sketch follows the list.)

  • 🎓 Consistency between multiple CoT traces is an indicator of reasoning reliability, which can be exploited for self-checks / aggregation. "Can We Verify Step by Step for Incorrect Answer Detection?" 2024-02-16. [>paper]
  • 🎓 Turn LLMs into intrinsic self-checkers by appending self-correction steps to standard CoT traces for finetuning. "Small Language Model Can Self-correct." 2024-01-14. [>paper]
  • 🎓 Reinforced self-training improves retrieval-augmented multi-hop Q/A. "ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent." 2023-12-15. [>paper]
  • 🎓 Conditional self-correction depending on whether critical questions have been addressed in the reasoning trace. "The ART of LLM Refinement: Ask, Refine, and Trust." 2023-11-14. [>paper]
  • 🎓 Iteratively refining reasoning given diverse feedback increases accuracy by up to 10% (ChatGPT). "MAF: Multi-Aspect Feedback for Improving Reasoning in Large Language Models." 2023-10-19. [>paper]
  • 🎓 Instructing a model just to "review" its answer and "find problems" doesn't lead to effective self-correction. "Large Language Models Cannot Self-Correct Reasoning Yet." 2023-09-25. [>paper]
  • 🎓 LLMs can come up with, and address, critical questions to improve their drafts. "Chain-of-Verification Reduces Hallucination in Large Language Models." 2023-09-25. [>paper]
  • 🎓 LogiCoT: Self-check and revision after each CoT step improves performance (for selected tasks and models). "Enhancing Zero-Shot Chain-of-Thought Reasoning in Large Language Models through Logic." 2023-09-23. [>paper]
  • 🎓 Excellent review of self-correcting LLMs, with application to unfaithful reasoning. "Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies." 2023-08-06. [>paper]
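
A minimal sketch of the verification-and-revision pattern (in the spirit of Chain-of-Verification above): draft an answer, generate critical questions, answer them independently, then revise. `complete()` is a hypothetical placeholder for an LLM call.

```python
# Hypothetical chain-of-verification-style self-correction loop.

def complete(prompt: str) -> str:
    """Placeholder: route the prompt to an LLM and return its answer."""
    raise NotImplementedError("plug in your LLM API call here")

def verify_and_revise(question: str) -> str:
    draft = complete(f"Answer the question:\n{question}")
    # Generate targeted verification questions about the draft.
    checks = complete(
        f"Question: {question}\nDraft answer:\n{draft}\n"
        "List critical verification questions that would expose errors in the draft."
    )
    # Answer the verification questions independently of the draft.
    findings = complete(f"Answer each of these questions carefully:\n{checks}")
    # Revise the draft in light of the findings.
    return complete(
        f"Question: {question}\nDraft answer:\n{draft}\n"
        f"Verification findings:\n{findings}\n"
        "Produce a corrected final answer."
    )
```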

Reasoning Analytics

Methods for analysing LLM deliberation and assessing reasoning quality. (A minimal step-wise evaluator sketch follows the list.)

  • 🎓👩‍💻 Comprehensive LLM-based reasoning analytics that breaks texts down into individual reasons. "DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models." 2024-01-04. [>paper] [>code]
  • 🎓🤗 Highly performant, open LLM (T5-based) for inference verification. "Minds versus Machines: Rethinking Entailment Verification with Language Models." 2024-02-06. [>paper] [>model]
  • 🎓👩‍💻 Test dataset for CoT evaluators. "A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains." 2023-11-23. [>paper] [>dataset]
  • 🎓👩‍💻 Framework for evaluating reasoning chains by viewing them as informal proofs that derive the final answer. "ReCEval: Evaluating Reasoning Chains via Correctness and Informativeness." 2023-11-23. [>paper] [>code]
  • 🎓 GPT-4 is 5x better than GPT-3.5 at predicting whether math reasoning is correct. "Challenge LLMs to Reason About Reasoning: A Benchmark to Unveil Cognitive Depth in LLMs." 2023-12-28. [>paper]
  • 🎓 Minimalistic GPT-4 prompts for assessing reasoning quality. "SocREval: Large Language Models with the Socratic Method for Reference-Free Reasoning Evaluation." 2023-09-29. [>paper] [>code]
  • 🎓👩‍💻 Automatic, semantic-similarity-based metrics for assessing CoT traces (redundancy, faithfulness, consistency, etc.). "ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning." 2023-09-12. [>paper]
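
A minimal sketch of step-wise reasoning evaluation in the spirit of the verifier work above: split a CoT trace into steps and have a judge model label each one. The prompt wording and the `complete()` helper are assumptions, not any cited framework's API.

```python
# Hypothetical step-wise CoT evaluator: label each reasoning step.

def complete(prompt: str) -> str:
    """Placeholder: route the prompt to a judge LLM and return its verdict."""
    raise NotImplementedError("plug in your LLM API call here")

def evaluate_chain(problem: str, cot_trace: str) -> list[tuple[str, str]]:
    steps = [s.strip() for s in cot_trace.split("\n") if s.strip()]
    verdicts, context = [], ""
    for step in steps:
        verdict = complete(
            f"Problem: {problem}\nVerified steps so far:\n{context}\n"
            f"Next step:\n{step}\n"
            "Does this step follow from the problem and the previous steps? "
            "Reply 'valid' or 'invalid' with a one-sentence justification."
        )
        verdicts.append((step, verdict))
        context += step + "\n"
    return verdicts
```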

Limitations, Failures, Puzzles

Things that don't work, or are poorly understood.

  • 🎓 Structured generation risks degrading reasoning quality and CoT effectiveness. "Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models." 2024-08-05. [>paper]
  • 🎓 Filler tokens can be as effective as sound reasoning traces for eliciting correct answers. "Let's Think Dot by Dot: Hidden Computation in Transformer Language Models." 2024-04-24. [>paper]
  • 🔥🎓 Causal analysis shows that LLMs sometimes ignore CoT traces, but reason-responsiveness increases with model size and is shaped by fine-tuning. "LLMs with Chain-of-Thought Are Non-Causal Reasoners." 2024-02-25. [>paper]
  • 🎓 Bad reasoning may lead to correct conclusions, hence better methods for CoT evaluation are needed. "SCORE: A framework for Self-Contradictory Reasoning Evaluation." 2023-11-16. [>paper]
  • 🎓 LLMs may produce "encoded reasoning" that's unintelligible to humans, which may nullify any XAI gains from deliberative prompting. "Preventing Language Models From Hiding Their Reasoning." 2023-10-27. [>paper]
  • 🎓 LLMs judge and decide in function of the available arguments (reason-responsiveness), but are more strongly influenced by fallacious and deceptive reasons than by sound ones. "How susceptible are LLMs to Logical Fallacies?" 2023-08-18. [>paper]
  • 🎓 Incorrect reasoning improves answer accuracy (nearly) as much as correct reasoning does. "Invalid Logic, Equivalent Gains: The Bizarreness of Reasoning in Language Model Prompting." 2023-07-20. [>paper]
  • 🎓 Zero-shot CoT reasoning in sensitive domains increases an LLM's likelihood of producing harmful or undesirable output. "On Second Thought, Let's Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning." 2023-06-23. [>paper]
  • 🎓 LLMs may systematically fabricate erroneous CoT rationales for wrong answers, an NYU/Anthropic team finds. "Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting." 2023-05-07. [>paper]
  • 🎓 LLMs' practical deliberation is not robust, but easily led astray by re-wording scenarios. "Despite 'super-human' performance, current LLMs are unsuited for decisions about ethics and safety." 2022-12-13. [>paper]

Datasets

Datasets containing examples of deliberative prompting, potentially useful for training models / assessing their deliberation skills.

  • Instruction-following datasets augmented with "reasoning traces" generated by LLMs:
    • 🎓 ORCA - Microsoft's original paper. "Orca: Progressive Learning from Complex Explanation Traces of GPT-4." 2023-06-05. [>paper]
    • 👩‍💻 OpenOrca - Open-source replication of the ORCA datasets. [>dataset]
    • 👩‍💻 Dolphin - Open-source replication of the ORCA datasets. [>dataset]
    • 🎓 ORCA 2 - Improved Orca by Microsoft, e.g. with meta-reasoning. "Orca 2: Teaching Small Language Models How to Reason." 2023-11-18. [>paper]
  • 🎓👩‍💻 CoT Collection - 1.84 million reasoning traces for 1,060 tasks. "The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning." [>paper] [>code]
  • 👩‍💻 OASST1 - contains more than 200 instructions to generate pros and cons (according to nomic.ai's map). [>dataset]
  • 🎓 LegalBench - a benchmark for legal reasoning in LLMs. [>paper]
  • 🎓👩‍💻 ThoughtSource - an open resource for data and tools related to chain-of-thought reasoning in large language models. [>paper] [>code]
  • 🎓👩‍💻 Review with lots of pointers to CoT-relevant datasets. "Datasets for Large Language Models: A Comprehensive Survey." [>paper] [>code]
  • 👩‍💻 Maxime Labonne's LLM datasets list. [>github]

Tools and Frameworks

Tools and Frameworks to implement deliberative prompting.

  • 👩‍💻 LMQL - a programming language for language model interaction. [>site]
    • 👩‍💻 Interactive LMQL Playground. [>site]
    • 🎓 "Prompting Is Programming: A Query Language for Large Language Models." 2022-12-12. [>paper]
  • 👩‍💻 {{guidance}} - a language for controlling large language models. [>code]
  • 👩‍💻 outlines - a library for guided text generation. [>code]
  • 👩‍💻 DSPy - a programmatic interface to LLMs. [>code]
  • 👩‍💻 llm-reasoners - a library for advanced large language model reasoning. [>code]
  • 👩‍💻 ThinkGPT - framework and building blocks for chain-of-thought workflows. [>code]
  • 👩‍💻 LangChain - a Python library for building LLM chains and agents. [>code]
  • 👩‍💻 PromptBench - a unified library for evaluating LLMs, inter alia the effectiveness of CoT prompts. [>code]
  • 👩‍💻 SymbolicAI - a library for compositional differentiable programming with LLMs. [>code]

Other Resources

More awesome and useful material.

  • 📚 Survey of Autonomous LLM Agents (continuously updated). [>site]
  • 👩‍💻 LLM Dashboard - explore task-specific reasoning performance of open LLMs. [>app]
  • 📚 Prompt Engineering Guide set up by DAIR. [>site]
  • 📚 ATLAS - principles and benchmark for systematic prompting. [>code]
  • 📚 Deliberative Prompting Guide set up by Logikon. [>site]
  • 📚 Arguing with Arguments - recent and wonderful piece by H. Siegel discussing what it actually means to evaluate an argument. [>paper]
