Awesome Trustworthy Deep Learning

"The deployment of deep learning in real-world systems calls for a set of complementary technologies that will ensure that deep learning is trustworthy" (Nicolas Papernot). The list covers topics in emerging research areas including, but not limited to, out-of-distribution generalization, adversarial examples, backdoor attacks, model inversion attacks, and machine unlearning.

Updated daily from arXiv. This preview README only includes papers submitted to arXiv within the last year. More papers can be found here: 📂 [Full List].


Table of Contents

Paper List

Survey

📂 [Full List of Survey].

Out-of-Distribution Generalization

📂 [Full List of Out-of-Distribution Generalization].

  • Intermediate Layer Classifiers for OOD generalization. [paper]
    • Arnas Uselis, Seong Joon Oh.
    • Key Word: Out-of-Distribution Generalization.
    • Digest This paper challenges the common practice of using penultimate layer features for out-of-distribution (OOD) generalization. The authors introduce Intermediate Layer Classifiers (ILCs) and find that earlier-layer representations often generalize better under distribution shifts. In some cases, zero-shot performance from intermediate layers rivals few-shot performance from the last layer. Their results across datasets and models suggest intermediate layers are more robust to shifts, underscoring the need to rethink layer selection for OOD tasks.
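To make the idea concrete, here is a minimal, hypothetical probing sketch (not the paper's exact protocol): linear classifiers are fit on frozen features from an intermediate ResNet block and from the final block, so their accuracy can later be compared under a distribution shift of your choice. The backbone, layer choices, and random data are illustrative assumptions.

```python
# Minimal sketch (not the paper's exact setup): compare a linear probe on an
# intermediate ResNet block against one on the final block.
import torch
import torch.nn as nn
from torchvision.models import resnet18

backbone = resnet18(weights=None).eval()  # swap in pretrained weights in practice

feats = {}
def hook(name):
    def _hook(module, inp, out):
        # Global-average-pool spatial maps into a feature vector.
        feats[name] = out.mean(dim=(2, 3)).detach()
    return _hook

backbone.layer2.register_forward_hook(hook("layer2"))   # intermediate block
backbone.layer4.register_forward_hook(hook("layer4"))   # final block

@torch.no_grad()
def extract(x):
    backbone(x)
    return feats["layer2"], feats["layer4"]

# Linear probes on frozen features; train each with standard cross-entropy.
num_classes = 10
probe_mid = nn.Linear(128, num_classes)   # layer2 of resnet18 has 128 channels
probe_last = nn.Linear(512, num_classes)  # layer4 has 512 channels

x = torch.randn(4, 3, 224, 224)           # stand-in batch
f_mid, f_last = extract(x)
print(probe_mid(f_mid).shape, probe_last(f_last).shape)
```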

Evasion Attacks and Defenses

📂 [Full List of Evasion Attacks and Defenses].

  • REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective. [paper]

    • Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Vincent Cohen-Addad, Johannes Gasteiger, Stephan Günnemann.
    • Key Word: Adversarial Attacks; Large Language Models; Reinforcement Learning.
    • Digest This paper critiques existing adversarial attacks on LLMs that maximize the likelihood of an affirmative response, arguing that such methods overestimate model robustness. To improve attack efficacy, the authors propose an adaptive, semantic optimization approach using a REINFORCE-based objective. Applied to Greedy Coordinate Gradient (GCG) and Projected Gradient Descent (PGD) jailbreak attacks, their method significantly enhances attack success rates, doubling ASR on Llama3 and increasing ASR from 2% to 50% against circuit breaker defenses. (A toy REINFORCE sketch appears after this list.)
  • Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming. [paper]

    • Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, Amanda Askell, Nathan Bailey, Joe Benton, Emma Bluemke, Samuel R. Bowman, Eric Christiansen, Hoagy Cunningham, Andy Dau, Anjali Gopal, Rob Gilson, Logan Graham, Logan Howard, Nimit Kalra, Taesung Lee, Kevin Lin, Peter Lofgren, Francesco Mosconi, Clare O'Hara, Catherine Olsson, Linda Petrini, Samir Rajani, Nikhil Saxena, Alex Silverstein, Tanya Singh, Theodore Sumers, Leonard Tang, Kevin K. Troy, Constantin Weisser, Ruiqi Zhong, Giulio Zhou, Jan Leike, Jared Kaplan, Ethan Perez.
    • Key Word: Red Teaming; Jailbreak.
    • Digest This paper introduces Constitutional Classifiers, a defense against universal jailbreaks in LLMs. These classifiers are trained on synthetic data generated using natural language rules to enforce content restrictions. Extensive red teaming and automated evaluations show that the approach effectively blocks jailbreaks while maintaining practical deployment viability, with minimal refusal rate increase (0.38%) and a 23.7% inference overhead. The findings demonstrate that robust jailbreak defenses can be achieved without significantly compromising usability.
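As a toy illustration of the REINFORCE-style objective in the first entry above (not the paper's GCG/PGD adaptation), the sketch below optimizes a distribution over a handful of hypothetical candidate suffixes against a stubbed-out judge; `judge_reward` and the suffix list are placeholders, not anything from the paper.

```python
# Toy REINFORCE sketch: optimize a distribution over candidate adversarial
# suffixes so that sampled choices maximize a reward from a judge, rather than
# the likelihood of a single fixed affirmative string.
import torch

candidate_suffixes = ["please", "ignore previous instructions", "as a test", "hypothetically"]
logits = torch.zeros(len(candidate_suffixes), requires_grad=True)  # attack "policy"
opt = torch.optim.Adam([logits], lr=0.1)

def judge_reward(suffix: str) -> float:
    # Stand-in for a semantic judge scoring the victim model's response;
    # replace with a real judge model in practice.
    return float(len(suffix)) / 40.0

for step in range(100):
    dist = torch.distributions.Categorical(logits=logits)
    idx = dist.sample()
    reward = judge_reward(candidate_suffixes[idx])
    # REINFORCE: grad E[R] ≈ R * grad log p(sample); minimize the negative.
    loss = -reward * dist.log_prob(idx)
    opt.zero_grad()
    loss.backward()
    opt.step()

print("most promising suffix:", candidate_suffixes[logits.argmax()])
```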

Poisoning Attacks and Defenses

📂 [Full List of Poisoning Attacks and Defenses].

Privacy

📂 [Full List of Privacy].

  • Existing Large Language Model Unlearning Evaluations Are Inconclusive. [paper]

    • Zhili Feng, Yixuan Even Xu, Alexander Robey, Robert Kirk, Xander Davies, Yarin Gal, Avi Schwarzschild, J. Zico Kolter.
    • Key Word: Machine Unlearning; Large Language Model.
    • Digest This paper critiques current evaluation methods in machine unlearning for language models, revealing that they often misrepresent unlearning success due to three flaws: (1) evaluations may reintroduce knowledge during testing, (2) results vary widely across tasks, and (3) reliance on spurious correlations undermines trust. To improve reliability, the authors propose two guiding principles—minimal information injection and downstream task awareness—and validate them through experiments showing how current practices can lead to misleading conclusions.
  • Extracting memorized pieces of (copyrighted) books from open-weight language models. [paper]

    • A. Feder Cooper, Aaron Gokaslan, Amy B. Cyphert, Christopher De Sa, Mark A. Lemley, Daniel E. Ho, Percy Liang.
    • Key Word: Extraction Attack.
    • Digest This paper examines how much large language models (LLMs) memorize copyrighted content, using probabilistic extraction techniques on 13 open-weight LLMs. It finds that while memorization varies across models and texts, some models, like Llama 3.1 70B, can nearly fully memorize certain books (e.g., Harry Potter, 1984). However, most models do not memorize most books. The findings complicate copyright debates, offering evidence for both sides without clearly favoring either. (A minimal memorization probe appears after this list.)
  • When to Forget? Complexity Trade-offs in Machine Unlearning. [paper]

    • Martin Van Waerebeke, Marco Lorenzi, Giovanni Neglia, Kevin Scaman.
    • Key Word: Certified Unlearning.
    • Digest This paper analyzes the efficiency of Machine Unlearning (MU) and establishes the first minimax upper and lower bounds on unlearning computation time. Under strongly convex objectives and without access to forgotten data, the authors introduce the unlearning complexity ratio, comparing unlearning costs to full retraining. A phase diagram reveals three regimes: infeasibility, trivial unlearning via noise, and significant computational savings. The study highlights key factors—data dimensionality, forget set size, and privacy constraints—that influence the feasibility of efficient unlearning.
  • Open Problems in Machine Unlearning for AI Safety. [paper]

    • Fazl Barez, Tingchen Fu, Ameya Prabhu, Stephen Casper, Amartya Sanyal, Adel Bibi, Aidan O'Gara, Robert Kirk, Ben Bucknall, Tim Fist, Luke Ong, Philip Torr, Kwok-Yan Lam, Robert Trager, David Krueger, Sören Mindermann, José Hernandez-Orallo, Mor Geva, Yarin Gal.
    • Key Word: Machine Unlearning.
    • Digest As AI systems grow in capability and autonomy in critical areas like cybersecurity, healthcare, and biological research, ensuring their alignment with human values is crucial. Machine unlearning, originally focused on privacy and data removal, is gaining attention for its potential in AI safety. However, this paper identifies significant limitations preventing unlearning from fully addressing safety concerns, especially in managing dual-use knowledge where information can have both beneficial and harmful applications. It highlights challenges such as unintended side effects, conflicts with existing safety mechanisms, and difficulties in evaluating robustness and preserving safety features during unlearning. By outlining these constraints and open problems, the paper aims to guide future research toward more realistic and effective AI safety strategies.
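For the extraction paper above, a minimal memorization probe can be sketched as follows, assuming the Hugging Face transformers API with GPT-2 as a small stand-in model. This only scores how predictable a verbatim passage is to the model; it does not reproduce the paper's probabilistic extraction procedure.

```python
# Hedged sketch: measure how likely a model finds a verbatim passage,
# token by token, as one signal of memorization.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; swap in an open-weight LLM of interest
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

passage = "It was a bright cold day in April, and the clocks were striking thirteen."
ids = tok(passage, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids, labels=ids)

# `out.loss` is the mean negative log-likelihood per token; lower values mean
# the passage is more predictable to the model.
print(f"mean NLL per token: {out.loss.item():.3f}")
print(f"perplexity: {torch.exp(out.loss).item():.1f}")
```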

Fairness

📂 [Full List of Fairness].

Interpretability

📂 [Full List of Interpretability].

  • Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations. [paper]

    • Aaron J. Li, Suraj Srinivas, Usha Bhalla, Himabindu Lakkaraju.
    • Key Word: Sparse Autoencoders; Adversarial Attack.
    • Digest This paper highlights a critical weakness in sparse autoencoders (SAEs) used to interpret LLMs: their concept representations are not robust to small input perturbations. The authors introduce an evaluation framework that uses adversarial attacks to test this robustness and find that SAE interpretations can be easily manipulated without changing the LLM’s output, questioning their reliability for model monitoring and oversight tasks.
  • Avoiding Leakage Poisoning: Concept Interventions Under Distribution Shifts. [paper]

    • Mateo Espinosa Zarlenga, Gabriele Dominici, Pietro Barbiero, Zohreh Shams, Mateja Jamnik.
    • Key Word: Concept Bottleneck Models; Distribution Shifts.
    • Digest This paper studies how concept-based models (CMs) behave on out-of-distribution (OOD) inputs, especially under concept interventions (where humans correct predicted concepts at test time). The authors identify a flaw called leakage poisoning, where CMs fail to improve after intervention on OOD data. To address this, they propose MixCEM, a model that selectively uses leaked information only for in-distribution inputs. Experiments show MixCEM improves accuracy on both in-distribution and OOD samples, with and without interventions.
  • MIB: A Mechanistic Interpretability Benchmark. [paper]

    • Aaron Mueller, Atticus Geiger, Sarah Wiegreffe, Dana Arad, Iván Arcuschin, Adam Belfki, Yik Siu Chan, Jaden Fiotto-Kaufman, Tal Haklay, Michael Hanna, Jing Huang, Rohan Gupta, Yaniv Nikankin, Hadas Orgad, Nikhil Prakash, Anja Reusch, Aruna Sankaranarayanan, Shun Shao, Alessandro Stolfo, Martin Tutek, Amir Zur, David Bau, Yonatan Belinkov.
    • Key Word: Mechanistic Interpretability; Benchmark.
    • Digest The paper introduces MIB, a benchmark designed to evaluate mechanistic interpretability methods in neural language models. MIB has two tracks: circuit localization (identifying model components critical to task performance) and causal variable localization (identifying hidden features representing task-relevant variables). Experiments show that attribution and mask optimization methods excel at circuit localization, while supervised DAS outperforms others in causal variable localization. Surprisingly, SAE features offer no advantage over standard neurons. MIB thus provides a robust framework for assessing real progress in interpretability.
  • SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability. [paper]

    • Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, Matthew Wearden, Arthur Conmy, Samuel Marks, Neel Nanda.
    • Key Word: Sparse Autoencoder; Benchmark.
    • Digest The paper introduces SAEBench, a comprehensive evaluation suite for sparse autoencoders (SAEs) that assesses their performance across seven diverse metrics, including interpretability, feature disentanglement, and practical applications like unlearning. It highlights that improvements in traditional unsupervised proxy metrics do not always lead to better real-world performance. The authors open-source over 200 SAEs spanning eight architectures and training algorithms, revealing that Matryoshka SAEs, despite underperforming on proxy metrics, excel in feature disentanglement, especially at scale. SAEBench provides a standardized framework for comparing SAE designs and studying scaling trends in their development.
  • Towards Understanding Distilled Reasoning Models: A Representational Approach. [paper]

    • David D. Baek, Max Tegmark.
    • Key Word: Mechanistic Interpretability; Model Distillation; Model Steering.
    • Digest This paper examines the impact of model distillation on reasoning feature development in large language models (LLMs). Using a crosscoder trained on Qwen-series models, the study finds that distillation creates unique reasoning feature directions, enabling control over thinking styles (e.g., over-thinking vs. incisive-thinking). The analysis covers four reasoning types: self-reflection, deductive, alternative, and contrastive reasoning. Additionally, the study explores changes in feature geometry, suggesting that larger distilled models develop more structured representations, improving distillation performance. These findings enhance understanding of distillation’s role in shaping model reasoning and transparency.
  • From superposition to sparse codes: interpretable representations in neural networks. [paper]

    • David Klindt, Charles O'Neill, Patrik Reizinger, Harald Maurer, Nina Miolane.
    • Key Word: Superposition; Sparse Coding.
    • Digest This paper explores how neural networks represent information, proposing that they encode features in superposition—linearly overlaying input concepts. The authors introduce a three-step framework to extract interpretable representations: (1) Identifiability theory shows that neural networks recover latent features up to a linear transformation; (2) Sparse coding techniques disentangle these features using compressed sensing principles; (3) Interpretability metrics evaluate alignment with human-interpretable concepts. By integrating insights from neuroscience, representation learning, and interpretability research, the paper offers a perspective with implications for neural coding, AI transparency, and deep learning interpretability.
  • Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry. [paper]

    • Sai Sumedh R. Hindupur, Ekdeep Singh Lubana, Thomas Fel, Demba Ba.
    • Key Word: Sparse Autoencoder.
    • Digest This paper examines the limitations of Sparse Autoencoders (SAEs) in interpreting neural network representations. It introduces a bilevel optimization framework showing that SAEs impose structural biases, affecting which concepts they can detect. Different SAE architectures are not interchangeable, as switching them can reveal or obscure concepts. Through experiments on toy models, semi-synthetic data, and large-scale datasets, the study highlights two key properties of real-world concepts: varying intrinsic dimensionality and nonlinear separability. Standard SAEs fail when these factors are ignored, but a new SAE design incorporating them uncovers previously hidden concepts. The findings challenge the notion of a universal SAE and emphasize the importance of architecture-specific choices in interpretability.
  • Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable? [paper]

    • Maxime Méloux, Silviu Maniu, François Portet, Maxime Peyrard.
    • Key Word: Mechanistic Interpretability.
    • Digest This work explores the identifiability of Mechanistic Interpretability (MI) explanations in neural networks. It examines whether unique explanations exist for a given behavior by drawing parallels to identifiability in statistics. The study identifies two MI strategies: “where-then-what” (isolating circuits before interpreting) and “what-then-where” (starting with candidate algorithms and finding neural activation subspaces). Experiments on Boolean functions and small MLPs reveal systematic non-identifiability—multiple circuits, interpretations, and subspaces can explain the same behavior. The study questions whether uniqueness is necessary, suggesting that predictive and manipulability criteria might suffice, and discusses validation through the inner interpretability framework.
  • Open Problems in Mechanistic Interpretability. [paper]

    • Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Tom McGrath.
    • Key Word: Mechanistic Interpretability.
    • Digest This review explores the current challenges and open problems in mechanistic interpretability, which seeks to understand the computational mechanisms behind neural networks. While progress has been made, further conceptual and practical advancements are needed to deepen insights, refine applications, and address socio-technical challenges. The paper highlights key areas for future research to enhance AI transparency, safety, and scientific understanding of intelligence.
  • Sparse Autoencoders Do Not Find Canonical Units of Analysis. [paper]

    • Patrick Leask, Bart Bussmann, Michael Pearce, Joseph Bloom, Curt Tigges, Noura Al Moubayed, Lee Sharkey, Neel Nanda.
    • Key Word: Mechanistic Interpretability; Sparse Autoencoders; Representational Structure.
    • Digest This paper challenges the assumption that Sparse Autoencoders (SAEs) can identify a canonical set of atomic features in LLMs. Using SAE stitching, the authors show that SAEs are incomplete, as larger SAEs contain novel latents not captured by smaller ones. Through meta-SAEs, they demonstrate that SAE latents are not atomic, as they often decompose into smaller, interpretable components (e.g., “Einstein” → “scientist” + “Germany” + “famous person”). While SAEs may still be useful, the authors suggest rethinking their role in mechanistic interpretability and exploring alternative methods for finding fundamental features. An interactive dashboard is provided for further exploration.
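Since sparse autoencoders recur throughout this section, here is a minimal training sketch on stand-in activations: an overcomplete ReLU encoder/decoder trained with reconstruction loss plus an L1 sparsity penalty. Widths, the penalty coefficient, and the random "activations" are illustrative assumptions, not any paper's configuration.

```python
# Minimal sparse autoencoder (SAE) sketch: learn an overcomplete dictionary of
# features with an L1 sparsity penalty on the hidden codes.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        z = torch.relu(self.encoder(x))      # sparse latent "features"
        x_hat = self.decoder(z)              # reconstruction
        return x_hat, z

d_model, d_hidden, l1_coef = 256, 2048, 1e-3  # overcomplete: d_hidden >> d_model
sae = SparseAutoencoder(d_model, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

# Stand-in for residual-stream activations collected from an LLM.
acts = torch.randn(4096, d_model)

for step in range(200):
    batch = acts[torch.randint(0, acts.shape[0], (256,))]
    x_hat, z = sae(batch)
    loss = ((x_hat - batch) ** 2).mean() + l1_coef * z.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```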

Alignment

📂 [Full List of Alignment].

  • Scaling Laws For Scalable Oversight. [paper]

    • Joshua Engels, David D. Baek, Subhash Kantamneni, Max Tegmark.
    • Key Word: Scalable Oversight.
    • Digest This paper proposes a framework to model and quantify scalable oversight (how weaker AI systems supervise stronger ones) using oversight- and deception-specific Elo scores. The framework is validated through games like Nim, Mafia, and Debate, revealing how oversight success scales with capability gaps. They further study Nested Scalable Oversight (NSO) and find that success rates drop sharply when overseeing much stronger systems, with a success rate below 52% at a 400 Elo gap. (An illustrative Elo helper appears after this list.)
  • You Are What You Eat -- AI Alignment Requires Understanding How Data Shapes Structure and Generalisation. [paper]

    • Simon Pepin Lehalleur, Jesse Hoogland, Matthew Farrugia-Roberts, Susan Wei, Alexander Gietelink Oldenziel, George Wang, Liam Carroll, Daniel Murfet.
    • Key Word: AI Alignment.
    • Digest This paper argues that understanding the relationship between data distribution structure and model structure is key to AI alignment. It highlights that neural networks with identical training performance can generalize differently due to internal computational differences, making standard evaluation methods insufficient for safety assurances. To advance AI alignment, the authors propose developing statistical foundations to systematically analyze how these structures influence generalization.
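For the scalable-oversight entry above, the standard Elo win-probability formula is the basic building block. The helper below is illustrative only and does not reproduce the paper's oversight- or deception-specific scores; the sub-52% figure quoted in the digest comes from the paper's nested-oversight experiments, not from this formula.

```python
# Standard Elo model: expected probability that player A beats player B.
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Example: an overseer rated 400 Elo points below the system it supervises.
print(f"{elo_win_probability(1200, 1600):.3f}")  # ~0.091
```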

Others

📂 [Full List of Others].

  • The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems. [paper]

    • Richard Ren, Arunim Agarwal, Mantas Mazeika, Cristina Menghini, Robert Vacareanu, Brad Kenstler, Mick Yang, Isabelle Barrass, Alice Gatti, Xuwang Yin, Eduardo Trevino, Matias Geralnik, Adam Khoja, Dean Lee, Summer Yue, Dan Hendrycks.
    • Key Word: Honesty; Benchmark.
    • Digest This paper addresses concerns about honesty in large language models (LLMs), distinguishing it from accuracy. Current honesty evaluations are limited, often conflating honesty with correctness. To address this, the authors introduce a large-scale, human-collected dataset that directly measures honesty. Their findings reveal that while larger models achieve higher accuracy, they do not necessarily become more honest. Notably, frontier LLMs, despite excelling in truthfulness benchmarks, often lie under pressure. The study also demonstrates that simple interventions, such as representation engineering, can enhance honesty, highlighting the need for robust evaluations and interventions to ensure trustworthy AI.
  • Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path? [paper]

    • Yoshua Bengio, Michael Cohen, Damiano Fornasiere, Joumana Ghosn, Pietro Greiner, Matt MacDermott, Sören Mindermann, Adam Oberman, Jesse Richardson, Oliver Richardson, Marc-Antoine Rondeau, Pierre-Luc St-Charles, David Williams-King.
    • Key Word: Scientist AI; Agentic AI; AI Safety.
    • Digest The paper discusses the risks posed by generalist AI agents, which can autonomously plan, act, and pursue goals. These risks include deception, misalignment with human interests, and loss of human control. The authors argue for a shift away from agency-driven AI towards a non-agentic AI system called Scientist AI, designed to explain the world rather than act in it. Scientist AI consists of a world model that generates theories and a question-answering system, both incorporating uncertainty to prevent overconfidence. This approach aims to advance scientific progress and AI safety while mitigating risks associated with autonomous AI agents.
  • Do Large Language Model Benchmarks Test Reliability? [paper]

    • Joshua Vendrow, Edward Vendrow, Sara Beery, Aleksander Madry.
    • Key Word: Large Language Model Benchmark; Reliability.
    • Digest This paper highlights the lack of focus on LLM reliability in existing benchmarks, despite extensive efforts to track model capabilities. The authors identify pervasive label errors in current benchmarks, which obscure model failures and unreliable behavior. To address this, they introduce platinum benchmarks—carefully curated datasets with minimal label errors and ambiguity. By refining examples from 15 popular benchmarks and evaluating various models, they find that even frontier LLMs struggle with basic tasks, such as elementary math problems, revealing systematic failure patterns.

Related Awesome Lists

Robustness Lists

Privacy Lists

Fairness Lists

Interpretability Lists

Other Lists

Toolboxes

Robustness Toolboxes

  • DeepDG: OOD generalization toolbox

    • A domain generalization toolbox for research purposes.
  • Cleverhans

    • This repository contains the source code for CleverHans, a Python library to benchmark machine learning systems' vulnerability to adversarial examples.
  • Adversarial Robustness Toolbox (ART)

    • Adversarial Robustness Toolbox (ART) is a Python library for Machine Learning Security. ART provides tools that enable developers and researchers to evaluate, defend, certify and verify Machine Learning models and applications against the adversarial threats of Evasion, Poisoning, Extraction, and Inference.
  • Adversarial-Attacks-Pytorch

    • A PyTorch library of adversarial attack implementations (the torchattacks package); a minimal usage sketch appears after this list.
  • AdverTorch

    • AdverTorch is a Python toolbox for adversarial robustness research. The primary functionalities are implemented in PyTorch. Specifically, AdverTorch contains modules for generating adversarial perturbations and defending against adversarial examples, as well as scripts for adversarial training.
  • RobustBench

    • A standardized benchmark for adversarial robustness.
  • BackdoorBox

    • An open-source Python toolbox for backdoor attacks and defenses.
  • BackdoorBench

    • A comprehensive benchmark of backdoor attack and defense methods.
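A hedged usage sketch for Adversarial-Attacks-Pytorch (imported as `torchattacks`): craft L_inf-bounded PGD adversarial examples for a toy CNN. The model, data, and attack hyperparameters are placeholders for a real setup.

```python
import torch
import torch.nn as nn
import torchattacks

# Toy CNN standing in for a real classifier.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
).eval()

images = torch.rand(8, 3, 32, 32)          # CIFAR-10-sized stand-in batch
labels = torch.randint(0, 10, (8,))

atk = torchattacks.PGD(model, eps=8 / 255, alpha=2 / 255, steps=10)
adv_images = atk(images, labels)           # L_inf-bounded adversarial examples
print((adv_images - images).abs().max())   # should be <= eps
```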

Privacy Toolboxes

  • Diffprivlib

    • Diffprivlib is a general-purpose library for experimenting with, investigating, and developing applications in differential privacy. (A minimal usage sketch appears after this list.)
  • Privacy Meter

    • Privacy Meter is an open-source library to audit data privacy in statistical and machine learning algorithms.
  • OpenDP

    • The OpenDP Library is a modular collection of statistical algorithms that adhere to the definition of differential privacy.
  • PrivacyRaven

    • PrivacyRaven is a privacy testing library for deep learning systems.
  • PersonalizedFL

    • PersonalizedFL is a toolbox for personalized federated learning.
  • TAPAS

    • Evaluating the privacy of synthetic data with an adversarial toolbox.
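A hedged usage sketch for Diffprivlib: train a differentially private logistic regression on a standard toy dataset. The `epsilon` and `data_norm` values are illustrative choices, not recommendations.

```python
from diffprivlib.models import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), test_size=0.2, random_state=0
)

# data_norm bounds each sample's L2 norm; without it, diffprivlib warns that
# the privacy guarantee depends on a data-derived quantity.
clf = LogisticRegression(epsilon=1.0, data_norm=10.0)
clf.fit(X_train, y_train)
print(f"test accuracy under epsilon=1.0 DP training: {clf.score(X_test, y_test):.2f}")
```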

Fairness Toolboxes

  • AI Fairness 360

    • The AI Fairness 360 toolkit is an extensible open-source library containing techniques developed by the research community to help detect and mitigate bias in machine learning models throughout the AI application lifecycle.
  • Fairlearn

    • Fairlearn is a Python package that empowers developers of artificial intelligence (AI) systems to assess their system's fairness and mitigate any observed unfairness issues. (A minimal usage sketch appears after this list.)
  • Aequitas

    • Aequitas is an open-source bias audit toolkit for data scientists, machine learning researchers, and policymakers to audit machine learning models for discrimination and bias, and to make informed and equitable decisions around developing and deploying predictive tools.
  • FAT Forensics

    • FAT Forensics implements state-of-the-art fairness, accountability and transparency (FAT) algorithms for the three main components of any data modelling pipeline: data (raw data and features), predictive models and model predictions.
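A hedged usage sketch for Fairlearn: group the accuracy of existing predictions by a sensitive feature and compute a demographic parity gap. All arrays are synthetic stand-ins.

```python
import numpy as np
from fairlearn.metrics import MetricFrame, demographic_parity_difference
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_pred = rng.integers(0, 2, size=1000)
group = rng.choice(["A", "B"], size=1000)   # sensitive feature

mf = MetricFrame(metrics=accuracy_score, y_true=y_true, y_pred=y_pred,
                 sensitive_features=group)
print(mf.by_group)                          # per-group accuracy
print(mf.difference())                      # largest accuracy gap between groups
print(demographic_parity_difference(y_true, y_pred, sensitive_features=group))
```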

Interpretability Toolboxes

  • Lime

    • This project is about explaining what machine learning classifiers (or models) are doing.
  • InterpretML

    • InterpretML is an open-source package that incorporates state-of-the-art machine learning interpretability techniques under one roof.
  • Deep Visualization Toolbox

    • This is the code required to run the Deep Visualization Toolbox, as well as to generate the neuron-by-neuron visualizations using regularized optimization.
  • Captum

    • Captum is a model interpretability and understanding library for PyTorch. (A minimal usage sketch appears after this list.)
  • Alibi

    • Alibi is an open source Python library aimed at machine learning model inspection and interpretation.
  • AI Explainability 360

    • The AI Explainability 360 toolkit is an open-source library that supports interpretability and explainability of datasets and machine learning models.
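A hedged usage sketch for Captum: Integrated Gradients attributions for a small PyTorch model, with random inputs standing in for real data.

```python
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3)).eval()
inputs = torch.randn(5, 20, requires_grad=True)

ig = IntegratedGradients(model)
# Attribute the score of class 0 back to the input features.
attributions, delta = ig.attribute(inputs, target=0, return_convergence_delta=True)
print(attributions.shape)   # (5, 20): one attribution per input feature
print(delta.abs().max())    # small delta indicates a good approximation
```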

Other Toolboxes

  • Uncertainty Toolbox

  • Causal Inference 360

    • A Python package for inferring causal effects from observational data.
  • Fortuna

    • Fortuna is a library for uncertainty quantification that makes it easy for users to run benchmarks and bring uncertainty to production systems.
  • VerifAI

    • VerifAI is a software toolkit for the formal design and analysis of systems that include artificial intelligence (AI) and machine learning (ML) components.

Seminar

Workshops

Robustness Workshops

Privacy Workshops

Fairness Workshops

Interpretability Workshops

Other Workshops

Tutorials

Robustness Tutorials

Talks

Robustness Talks

Blogs

Robustness Blogs

Interpretability Blogs

Other Blogs

Other Resources

Contributing

You are welcome to recommend papers that you find interesting and that focus on trustworthy deep learning. You can submit an issue or contact me via [email]. If you spot any errors in the paper information, please feel free to point them out.

Formatting (papers are listed in reverse chronological order of their initial arXiv submission)

  • Paper Title [paper]
    • Authors. Published Conference or Journal
    • Key Word: XXX.
    • Digest XXXXXX