The deployment of deep learning in real-world systems calls for a set of complementary technologies that ensure deep learning is trustworthy (Nicolas Papernot). This list covers topics in emerging research areas including, but not limited to, out-of-distribution generalization, adversarial examples, backdoor attacks, model inversion attacks, and machine unlearning.
Updated daily from arXiv. This preview README only includes papers submitted to arXiv within the last year. More papers can be found here: 📂 [Full List].
- Awesome Trustworthy Deep Learning Paper List 📃
- Related Awesome Lists 😲
- Toolboxes 🧰
- Seminar ⏰
- Workshops 🔥
- Tutorials 👩🏫
- Talks 🎤
- Blogs ✍️
- Other Resources ✨
- Contributing 😉
📂 [Full List of Out-of-Distribution Generalization].
- Intermediate Layer Classifiers for OOD generalization. [paper]
- Arnas Uselis, Seong Joon Oh.
- Key Word: Out-of-Distribution Generalization.
-
Digest
This paper challenges the common practice of using penultimate layer features for out-of-distribution (OOD) generalization. The authors introduce Intermediate Layer Classifiers (ILCs) and find that earlier-layer representations often generalize better under distribution shifts. In some cases, zero-shot performance from intermediate layers rivals few-shot performance from the last layer. Their results across datasets and models suggest intermediate layers are more robust to shifts, underscoring the need to rethink layer selection for OOD tasks.
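A minimal sketch (not the authors' code) of the layer-probing setup described above: fit a linear classifier on the features from each hidden layer of a small network and compare accuracy on a shifted copy of the data. The toy MLP, synthetic labels, and Gaussian-noise "shift" are placeholder assumptions.

```python
# Illustrative sketch: compare linear probes trained on features from different layers.
# The tiny MLP, synthetic data, and Gaussian "distribution shift" are placeholder
# assumptions, not the paper's experimental setup.
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression

torch.manual_seed(0)

# Toy backbone with three hidden (ReLU) blocks.
backbone = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
)

def layer_features(x):
    """Return the activation after each ReLU block."""
    feats, h = [], x
    for layer in backbone:
        h = layer(h)
        if isinstance(layer, nn.ReLU):
            feats.append(h.detach())
    return feats

# Synthetic in-distribution data and a noisy "shifted" test set.
x_train = torch.randn(512, 20)
y_train = (x_train[:, 0] > 0).long()
x_shift = x_train + 0.8 * torch.randn_like(x_train)  # crude stand-in for a shift

train_feats = layer_features(x_train)
shift_feats = layer_features(x_shift)

for i, (f_tr, f_sh) in enumerate(zip(train_feats, shift_feats)):
    probe = LogisticRegression(max_iter=1000).fit(f_tr.numpy(), y_train.numpy())
    acc = probe.score(f_sh.numpy(), y_train.numpy())
    print(f"layer {i}: shifted-test accuracy = {acc:.3f}")
```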
📂 [Full List of Evasion Attacks and Defenses].
-
REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective. [paper]
- Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Vincent Cohen-Addad, Johannes Gasteiger, Stephan Günnemann.
- Key Word: Adversarial Attacks; Large Language Models; Reinforcement Learning.
-
Digest
This paper critiques existing adversarial attacks on LLMs that maximize the likelihood of an affirmative response, arguing that such methods overestimate model robustness. To improve attack efficacy, the authors propose an adaptive, semantic optimization approach using a REINFORCE-based objective. Applied to Greedy Coordinate Gradient (GCG) and Projected Gradient Descent (PGD) jailbreak attacks, their method significantly enhances attack success rates, doubling ASR on Llama3 and increasing ASR from 2% to 50% against circuit breaker defenses.
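To make the objective concrete, here is a toy score-function (REINFORCE) gradient estimate that maximizes an expected reward over sampled responses rather than the likelihood of a single affirmative string. The categorical "policy", token-counting reward, and mean baseline are illustrative assumptions; the paper applies this idea on top of GCG and PGD against real LLMs.

```python
# Toy REINFORCE-style objective: push up the expected reward of sampled
# responses instead of the likelihood of one fixed affirmative string.
# The tiny policy and token-counting reward are illustrative assumptions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, seq_len, n_samples = 50, 8, 64
TARGET_TOKEN = 7  # stand-in for "content the judge would flag as a success"

# "Attack parameters": logits we optimize directly (in the real attack these
# come from the adversarial prompt pushed through the frozen LLM).
attack_logits = torch.zeros(seq_len, vocab, requires_grad=True)
opt = torch.optim.Adam([attack_logits], lr=0.1)

for step in range(200):
    probs = F.softmax(attack_logits, dim=-1)
    dist = torch.distributions.Categorical(probs)
    samples = dist.sample((n_samples,))                       # (n_samples, seq_len)
    log_prob = dist.log_prob(samples).sum(dim=-1)             # log p(y) per sample
    reward = (samples == TARGET_TOKEN).float().mean(dim=-1)   # judge stand-in
    baseline = reward.mean()                                  # variance reduction
    loss = -((reward - baseline).detach() * log_prob).mean()  # REINFORCE estimator
    opt.zero_grad()
    loss.backward()
    opt.step()

print("mean reward after optimization:", reward.mean().item())
```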
-
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming. [paper]
- Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, Amanda Askell, Nathan Bailey, Joe Benton, Emma Bluemke, Samuel R. Bowman, Eric Christiansen, Hoagy Cunningham, Andy Dau, Anjali Gopal, Rob Gilson, Logan Graham, Logan Howard, Nimit Kalra, Taesung Lee, Kevin Lin, Peter Lofgren, Francesco Mosconi, Clare O'Hara, Catherine Olsson, Linda Petrini, Samir Rajani, Nikhil Saxena, Alex Silverstein, Tanya Singh, Theodore Sumers, Leonard Tang, Kevin K. Troy, Constantin Weisser, Ruiqi Zhong, Giulio Zhou, Jan Leike, Jared Kaplan, Ethan Perez.
- Key Word: Red Teaming; Jailbreak.
-
Digest
This paper introduces Constitutional Classifiers, a defense against universal jailbreaks in LLMs. These classifiers are trained on synthetic data generated using natural language rules to enforce content restrictions. Extensive red teaming and automated evaluations show that the approach effectively blocks jailbreaks while maintaining practical deployment viability, with minimal refusal rate increase (0.38%) and a 23.7% inference overhead. The findings demonstrate that robust jailbreak defenses can be achieved without significantly compromising usability.
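A conceptual sketch of the deployment pattern the paper describes: screen the prompt with an input classifier and stream the response through an output classifier. The `guarded_generate` wrapper, classifier callables, and threshold below are hypothetical placeholders, not Anthropic's implementation.

```python
# Conceptual sketch of a classifier-guarded LLM endpoint: screen the prompt,
# stream the response, and stop if an output classifier fires. `input_clf`,
# `output_clf`, `llm_generate_stream`, and THRESHOLD are hypothetical placeholders.
REFUSAL = "I can't help with that."
THRESHOLD = 0.5  # assumed operating point

def guarded_generate(prompt, llm_generate_stream, input_clf, output_clf):
    if input_clf(prompt) > THRESHOLD:              # prompt screened against the rules
        return REFUSAL
    produced = []
    for chunk in llm_generate_stream(prompt):      # stream chunks from the model
        produced.append(chunk)
        if output_clf("".join(produced)) > THRESHOLD:  # re-check the partial output
            return REFUSAL
    return "".join(produced)

# Demo with trivial stubs (always-safe classifiers, canned model output).
demo = guarded_generate(
    "Explain photosynthesis.",
    llm_generate_stream=lambda p: iter(["Photosynthesis ", "converts light to energy."]),
    input_clf=lambda text: 0.0,
    output_clf=lambda text: 0.0,
)
print(demo)
```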
📂 [Full List of Poisoning Attacks and Defenses].
-
Existing Large Language Model Unlearning Evaluations Are Inconclusive. [paper]
- Zhili Feng, Yixuan Even Xu, Alexander Robey, Robert Kirk, Xander Davies, Yarin Gal, Avi Schwarzschild, J. Zico Kolter.
- Key Word: Machine Unlearning; Large Language Model.
-
Digest
This paper critiques current evaluation methods in machine unlearning for language models, revealing that they often misrepresent unlearning success due to three flaws: (1) evaluations may reintroduce knowledge during testing, (2) results vary widely across tasks, and (3) reliance on spurious correlations undermines trust. To improve reliability, the authors propose two guiding principles—minimal information injection and downstream task awareness—and validate them through experiments showing how current practices can lead to misleading conclusions.
-
Extracting memorized pieces of (copyrighted) books from open-weight language models. [paper]
- A. Feder Cooper, Aaron Gokaslan, Amy B. Cyphert, Christopher De Sa, Mark A. Lemley, Daniel E. Ho, Percy Liang.
- Key Word: Extraction Attack.
-
Digest
This paper examines how much large language models (LLMs) memorize copyrighted content, using probabilistic extraction techniques on 13 open-weight LLMs. It finds that while memorization varies across models and texts, some models—like Llama 3.1 70B—can nearly fully memorize certain books (e.g., Harry Potter, 1984). However, most models do not memorize most books. The findings complicate copyright debates, offering evidence for both sides without clearly favoring either.
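A hedged sketch of one simple memorization signal: the per-token log-probability a model assigns to a known continuation of a text. It uses the Hugging Face `transformers` API; the model name and the half/half prefix-continuation split are arbitrary choices, and the paper's probabilistic extraction procedure is considerably more involved.

```python
# Sketch: per-token log-probability of a known continuation under a causal LM.
# High average log-probability on long excerpts is one signal of memorization.
# Model choice and prefix/continuation split are arbitrary assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper studies 13 open-weight LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

text = "It was a bright cold day in April, and the clocks were striking thirteen."
ids = tok(text, return_tensors="pt").input_ids
split = ids.shape[1] // 2  # first half = prefix, second half = continuation

with torch.no_grad():
    logits = model(ids).logits
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
targets = ids[:, 1:]
tok_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

print("mean continuation log-prob:", tok_lp[:, split - 1:].mean().item())
```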
-
When to Forget? Complexity Trade-offs in Machine Unlearning. [paper]
- Martin Van Waerebeke, Marco Lorenzi, Giovanni Neglia, Kevin Scaman.
- Key Word: Certified Unlearning.
-
Digest
This paper analyzes the efficiency of Machine Unlearning (MU) and establishes the first minimax upper and lower bounds on unlearning computation time. Under strongly convex objectives and without access to forgotten data, the authors introduce the unlearning complexity ratio, comparing unlearning costs to full retraining. A phase diagram reveals three regimes: infeasibility, trivial unlearning via noise, and significant computational savings. The study highlights key factors—data dimensionality, forget set size, and privacy constraints—that influence the feasibility of efficient unlearning.
-
Open Problems in Machine Unlearning for AI Safety. [paper]
- Fazl Barez, Tingchen Fu, Ameya Prabhu, Stephen Casper, Amartya Sanyal, Adel Bibi, Aidan O'Gara, Robert Kirk, Ben Bucknall, Tim Fist, Luke Ong, Philip Torr, Kwok-Yan Lam, Robert Trager, David Krueger, Sören Mindermann, José Hernandez-Orallo, Mor Geva, Yarin Gal.
- Key Word: Machine Unlearning.
-
Digest
As AI systems grow in capability and autonomy in critical areas like cybersecurity, healthcare, and biological research, ensuring their alignment with human values is crucial. Machine unlearning, originally focused on privacy and data removal, is gaining attention for its potential in AI safety. However, this paper identifies significant limitations preventing unlearning from fully addressing safety concerns, especially in managing dual-use knowledge where information can have both beneficial and harmful applications. It highlights challenges such as unintended side effects, conflicts with existing safety mechanisms, and difficulties in evaluating robustness and preserving safety features during unlearning. By outlining these constraints and open problems, the paper aims to guide future research toward more realistic and effective AI safety strategies.
📂 [Full List of Interpretability].
-
Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations. [paper]
- Aaron J. Li, Suraj Srinivas, Usha Bhalla, Himabindu Lakkaraju.
- Key Word: Sparse Autoencoders; Adversarial Attack.
-
Digest
This paper highlights a critical weakness in sparse autoencoders (SAEs) used to interpret LLMs: their concept representations are not robust to small input perturbations. The authors introduce an evaluation framework that uses adversarial attacks to test this robustness and find that SAE interpretations can be easily manipulated without changing the LLM’s output, questioning their reliability for model monitoring and oversight tasks.
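A toy illustration of the robustness question raised above: perturb an activation slightly and check how much the set of active SAE latents changes. The random SAE weights and the random (rather than adversarially optimized) perturbation are stand-in assumptions.

```python
# Toy check: do small activation perturbations change which SAE latents fire?
# Random SAE weights and a random perturbation are stand-ins; the paper uses
# trained SAEs and adversarially optimized perturbations.
import torch

torch.manual_seed(0)
d_model, d_sae, k = 128, 1024, 20

W_enc = torch.randn(d_model, d_sae) / d_model**0.5
b_enc = torch.zeros(d_sae)

def top_latents(act):
    codes = torch.relu(act @ W_enc + b_enc)
    return set(codes.topk(k).indices.tolist())

act = torch.randn(d_model)
act_perturbed = act + 0.05 * torch.randn(d_model)  # small perturbation

overlap = len(top_latents(act) & top_latents(act_perturbed)) / k
print(f"top-{k} latent overlap after perturbation: {overlap:.2f}")
```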
-
Avoiding Leakage Poisoning: Concept Interventions Under Distribution Shifts. [paper]
- Mateo Espinosa Zarlenga, Gabriele Dominici, Pietro Barbiero, Zohreh Shams, Mateja Jamnik.
- Key Word: Concept Bottleneck Models; Distribution Shifts.
-
Digest
This paper studies how concept-based models (CMs) behave on out-of-distribution (OOD) inputs, especially under concept interventions (where humans correct predicted concepts at test time). The authors identify a flaw called leakage poisoning, where CMs fail to improve after intervention on OOD data. To address this, they propose MixCEM, a model that selectively uses leaked information only for in-distribution inputs. Experiments show MixCEM improves accuracy on both in-distribution and OOD samples, with and without interventions.
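A minimal sketch of a test-time concept intervention in a concept bottleneck model: replace a subset of predicted concepts with ground-truth values before the label head. The untrained toy networks and synthetic data are placeholder assumptions; MixCEM additionally gates "leaked" information depending on whether the input is in-distribution.

```python
# Toy concept bottleneck model with a test-time concept intervention:
# swap predicted concepts for ground-truth ones before predicting the label.
# Untrained toy networks and synthetic data are placeholder assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
concept_net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4), nn.Sigmoid())
label_net = nn.Linear(4, 3)  # predicts the label from the concept vector only

x = torch.randn(8, 16)
true_concepts = torch.randint(0, 2, (8, 4)).float()

pred_concepts = concept_net(x)
intervene_on = [0, 2]  # concept indices a human expert corrects at test time
corrected = pred_concepts.clone()
corrected[:, intervene_on] = true_concepts[:, intervene_on]

logits_before = label_net(pred_concepts)
logits_after = label_net(corrected)
print("label logits changed by:", (logits_after - logits_before).abs().mean().item())
```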
-
MIB: A Mechanistic Interpretability Benchmark. [paper]
- Aaron Mueller, Atticus Geiger, Sarah Wiegreffe, Dana Arad, Iván Arcuschin, Adam Belfki, Yik Siu Chan, Jaden Fiotto-Kaufman, Tal Haklay, Michael Hanna, Jing Huang, Rohan Gupta, Yaniv Nikankin, Hadas Orgad, Nikhil Prakash, Anja Reusch, Aruna Sankaranarayanan, Shun Shao, Alessandro Stolfo, Martin Tutek, Amir Zur, David Bau, Yonatan Belinkov.
- Key Word: Mechanistic Interpretability; Benchmark.
-
Digest
The paper introduces MIB, a benchmark designed to evaluate mechanistic interpretability methods in neural language models. MIB has two tracks: circuit localization (identifying model components critical to task performance) and causal variable localization (identifying hidden features representing task-relevant variables). Experiments show that attribution and mask optimization methods excel at circuit localization, while supervised DAS outperforms others in causal variable localization. Surprisingly, SAE features offer no advantage over standard neurons. MIB thus provides a robust framework for assessing real progress in interpretability.
-
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability. [paper]
- Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, Matthew Wearden, Arthur Conmy, Samuel Marks, Neel Nanda.
- Key Word: Sparse Autoencoder; Benchmark.
-
Digest
The paper introduces SAEBench, a comprehensive evaluation suite for sparse autoencoders (SAEs) that assesses their performance across seven diverse metrics, including interpretability, feature disentanglement, and practical applications like unlearning. It highlights that improvements in traditional unsupervised proxy metrics do not always lead to better real-world performance. The authors open-source over 200 SAEs spanning eight architectures and training algorithms, revealing that Matryoshka SAEs, despite underperforming on proxy metrics, excel in feature disentanglement, especially at scale. SAEBench provides a standardized framework for comparing SAE designs and studying scaling trends in their development.
-
Towards Understanding Distilled Reasoning Models: A Representational Approach. [paper]
- David D. Baek, Max Tegmark.
- Key Word: Mechanistic Interpretability; Model Distillation; Model Steering.
-
Digest
This paper examines the impact of model distillation on reasoning feature development in large language models (LLMs). Using a crosscoder trained on Qwen-series models, the study finds that distillation creates unique reasoning feature directions, enabling control over thinking styles (e.g., over-thinking vs. incisive-thinking). The analysis covers four reasoning types: self-reflection, deductive, alternative, and contrastive reasoning. Additionally, the study explores changes in feature geometry, suggesting that larger distilled models develop more structured representations, improving distillation performance. These findings enhance understanding of distillation’s role in shaping model reasoning and transparency.
-
From superposition to sparse codes: interpretable representations in neural networks. [paper]
- David Klindt, Charles O'Neill, Patrik Reizinger, Harald Maurer, Nina Miolane.
- Key Word: Superposition; Sparse Coding.
-
Digest
This paper explores how neural networks represent information, proposing that they encode features in superposition—linearly overlaying input concepts. The authors introduce a three-step framework to extract interpretable representations: (1) Identifiability theory shows that neural networks recover latent features up to a linear transformation; (2) Sparse coding techniques disentangle these features using compressed sensing principles; (3) Interpretability metrics evaluate alignment with human-interpretable concepts. By integrating insights from neuroscience, representation learning, and interpretability research, the paper offers a perspective with implications for neural coding, AI transparency, and deep learning interpretability.
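A small sketch of step (2): recovering a sparse code from a densely superposed observation with ISTA-style soft thresholding. The random dictionary and 5-sparse ground truth are illustrative assumptions.

```python
# Recover a sparse code from a dense (superposed) observation with ISTA.
# Random dictionary and sparse ground truth are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 256                      # observation dim, number of dictionary atoms
D = rng.normal(size=(d, n)) / np.sqrt(d)
z_true = np.zeros(n)
z_true[rng.choice(n, size=5, replace=False)] = rng.normal(size=5)
x = D @ z_true                      # features superposed into one dense vector

lam = 0.05
L = np.linalg.norm(D, 2) ** 2       # Lipschitz constant of the quadratic term
z = np.zeros(n)
for _ in range(500):                # ISTA: gradient step + soft threshold
    z = z - (1.0 / L) * D.T @ (D @ z - x)
    z = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)

print("recovered support:", np.flatnonzero(np.abs(z) > 1e-3))
print("true support:     ", np.flatnonzero(z_true))
```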
-
Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry. [paper]
- Sai Sumedh R. Hindupur, Ekdeep Singh Lubana, Thomas Fel, Demba Ba.
- Key Word: Sparse Autoencoder.
-
Digest
This paper examines the limitations of Sparse Autoencoders (SAEs) in interpreting neural network representations. It introduces a bilevel optimization framework showing that SAEs impose structural biases, affecting which concepts they can detect. Different SAE architectures are not interchangeable, as switching them can reveal or obscure concepts. Through experiments on toy models, semi-synthetic data, and large-scale datasets, the study highlights two key properties of real-world concepts: varying intrinsic dimensionality and nonlinear separability. Standard SAEs fail when these factors are ignored, but a new SAE design incorporating them uncovers previously hidden concepts. The findings challenge the notion of a universal SAE and emphasize the importance of architecture-specific choices in interpretability.
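For reference, a minimal "standard" ReLU sparse autoencoder of the kind whose structural assumptions the paper analyzes: a linear encoder with ReLU, a linear decoder, and an L1 sparsity penalty. The synthetic activations and hyperparameters are placeholders.

```python
# Minimal "standard" ReLU sparse autoencoder trained with an L1 penalty.
# Synthetic activations and hyperparameters are placeholder assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_sae, l1_coeff = 64, 512, 1e-3

enc = nn.Linear(d_model, d_sae)
dec = nn.Linear(d_sae, d_model)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

acts = torch.randn(4096, d_model)   # stand-in for residual-stream activations

for step in range(200):
    batch = acts[torch.randint(0, acts.shape[0], (256,))]
    codes = torch.relu(enc(batch))  # sparse latent codes
    recon = dec(codes)              # reconstruction of the activation
    loss = (recon - batch).pow(2).mean() + l1_coeff * codes.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final loss:", loss.item())
```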
-
Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable? [paper]
- Maxime Méloux, Silviu Maniu, François Portet, Maxime Peyrard.
- Key Word: Mechanistic Interpretability.
-
Digest
This work explores the identifiability of Mechanistic Interpretability (MI) explanations in neural networks. It examines whether unique explanations exist for a given behavior by drawing parallels to identifiability in statistics. The study identifies two MI strategies: “where-then-what” (isolating circuits before interpreting) and “what-then-where” (starting with candidate algorithms and finding neural activation subspaces). Experiments on Boolean functions and small MLPs reveal systematic non-identifiability—multiple circuits, interpretations, and subspaces can explain the same behavior. The study questions whether uniqueness is necessary, suggesting that predictive and manipulability criteria might suffice, and discusses validation through the inner interpretability framework.
-
Open Problems in Mechanistic Interpretability. [paper]
- Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Tom McGrath.
- Key Word: Mechanistic Interpretability.
-
Digest
This review explores the current challenges and open problems in mechanistic interpretability, which seeks to understand the computational mechanisms behind neural networks. While progress has been made, further conceptual and practical advancements are needed to deepen insights, refine applications, and address socio-technical challenges. The paper highlights key areas for future research to enhance AI transparency, safety, and scientific understanding of intelligence.
-
Sparse Autoencoders Do Not Find Canonical Units of Analysis. [paper]
- Patrick Leask, Bart Bussmann, Michael Pearce, Joseph Bloom, Curt Tigges, Noura Al Moubayed, Lee Sharkey, Neel Nanda.
- Key Word: Mechanistic Interpretability; Sparse Autoencoders; Representational Structure.
-
Digest
This paper challenges the assumption that Sparse Autoencoders (SAEs) can identify a canonical set of atomic features in LLMs. Using SAE stitching, the authors show that SAEs are incomplete, as larger SAEs contain novel latents not captured by smaller ones. Through meta-SAEs, they demonstrate that SAE latents are not atomic, as they often decompose into smaller, interpretable components (e.g., “Einstein” → “scientist” + “Germany” + “famous person”). While SAEs may still be useful, the authors suggest rethinking their role in mechanistic interpretability and exploring alternative methods for finding fundamental features. An interactive dashboard is provided for further exploration.
-
Scaling Laws For Scalable Oversight. [paper]
- Joshua Engels, David D. Baek, Subhash Kantamneni, Max Tegmark.
- Key Word: Scalable Oversight.
-
Digest
This paper proposes a framework to model and quantify scalable oversight—how weaker AI systems supervise stronger ones—using oversight and deception-specific Elo scores. The framework is validated through games like Nim, Mafia, and Debate, revealing how oversight success scales with capability gaps. They further study Nested Scalable Oversight (NSO) and find that success rates drop sharply when overseeing much stronger systems, with a success rate below 52% at a 400 Elo gap.
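For readers unfamiliar with Elo scores, the standard expected-score formula that such frameworks build on is below; note that the sub-52% figure above is the paper's empirical finding for nested oversight, not this formula.

```python
# Standard Elo expected score for the higher-rated player given a rating gap.
def elo_expected_score(rating_gap: float) -> float:
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))

print(elo_expected_score(400))  # ~0.909 for a 400-point advantage
```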
-
You Are What You Eat -- AI Alignment Requires Understanding How Data Shapes Structure and Generalisation. [paper]
- Simon Pepin Lehalleur, Jesse Hoogland, Matthew Farrugia-Roberts, Susan Wei, Alexander Gietelink Oldenziel, George Wang, Liam Carroll, Daniel Murfet.
- Key Word: AI Alignment.
-
Digest
This paper argues that understanding the relationship between data distribution structure and model structure is key to AI alignment. It highlights that neural networks with identical training performance can generalize differently due to internal computational differences, making standard evaluation methods insufficient for safety assurances. To advance AI alignment, the authors propose developing statistical foundations to systematically analyze how these structures influence generalization.
-
The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems. [paper]
- Richard Ren, Arunim Agarwal, Mantas Mazeika, Cristina Menghini, Robert Vacareanu, Brad Kenstler, Mick Yang, Isabelle Barrass, Alice Gatti, Xuwang Yin, Eduardo Trevino, Matias Geralnik, Adam Khoja, Dean Lee, Summer Yue, Dan Hendrycks.
- Key Word: Honesty; Benchmark.
-
Digest
This paper addresses concerns about honesty in large language models (LLMs), distinguishing it from accuracy. Current honesty evaluations are limited, often conflating honesty with correctness. To address this, the authors introduce a large-scale, human-collected dataset that directly measures honesty. Their findings reveal that while larger models achieve higher accuracy, they do not necessarily become more honest. Notably, frontier LLMs, despite excelling in truthfulness benchmarks, often lie under pressure. The study also demonstrates that simple interventions, such as representation engineering, can enhance honesty, highlighting the need for robust evaluations and interventions to ensure trustworthy AI.
-
Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path? [paper]
- Yoshua Bengio, Michael Cohen, Damiano Fornasiere, Joumana Ghosn, Pietro Greiner, Matt MacDermott, Sören Mindermann, Adam Oberman, Jesse Richardson, Oliver Richardson, Marc-Antoine Rondeau, Pierre-Luc St-Charles, David Williams-King.
- Key Word: Scientist AI; Agentic AI; AI Safety.
-
Digest
The paper discusses the risks posed by generalist AI agents, which can autonomously plan, act, and pursue goals. These risks include deception, misalignment with human interests, and loss of human control. The authors argue for a shift away from agency-driven AI towards a non-agentic AI system called Scientist AI, designed to explain the world rather than act in it. Scientist AI consists of a world model that generates theories and a question-answering system, both incorporating uncertainty to prevent overconfidence. This approach aims to advance scientific progress and AI safety while mitigating risks associated with autonomous AI agents.
-
Do Large Language Model Benchmarks Test Reliability? [paper]
- Joshua Vendrow, Edward Vendrow, Sara Beery, Aleksander Madry.
- Key Word: Large Language Model Benchmark; Reliability.
-
Digest
This paper highlights the lack of focus on LLM reliability in existing benchmarks, despite extensive efforts to track model capabilities. The authors identify pervasive label errors in current benchmarks, which obscure model failures and unreliable behavior. To address this, they introduce platinum benchmarks—carefully curated datasets with minimal label errors and ambiguity. By refining examples from 15 popular benchmarks and evaluating various models, they find that even frontier LLMs struggle with basic tasks, such as elementary math problems, revealing systematic failure patterns.
-
DeepDG: OOD generalization toolbox
- A domain generalization toolbox for research purposes.
-
CleverHans
- This repository contains the source code for CleverHans, a Python library to benchmark machine learning systems' vulnerability to adversarial examples.
-
Adversarial Robustness Toolbox (ART)
- Adversarial Robustness Toolbox (ART) is a Python library for Machine Learning Security. ART provides tools that enable developers and researchers to evaluate, defend, certify and verify Machine Learning models and applications against the adversarial threats of Evasion, Poisoning, Extraction, and Inference.
-
- PyTorch implementation of adversarial attacks.
-
AdverTorch
- AdverTorch is a Python toolbox for adversarial robustness research. The primary functionalities are implemented in PyTorch. Specifically, AdverTorch contains modules for generating adversarial perturbations and defending against adversarial examples, as well as scripts for adversarial training.
-
- A standardized benchmark for adversarial robustness.
-
- The open-sourced Python toolbox for backdoor attacks and defenses.
-
- A comprehensive benchmark of backdoor attack and defense methods.
-
Diffprivlib
- Diffprivlib is a general-purpose library for experimenting with, investigating and developing applications in, differential privacy.
-
Privacy Meter
- Privacy Meter is an open-source library to audit data privacy in statistical and machine learning algorithms.
-
OpenDP Library
- The OpenDP Library is a modular collection of statistical algorithms that adhere to the definition of differential privacy.
-
PrivacyRaven
- PrivacyRaven is a privacy testing library for deep learning systems.
-
PersonalizedFL
- PersonalizedFL is a toolbox for personalized federated learning.
-
- Evaluating the privacy of synthetic data with an adversarial toolbox.
-
AI Fairness 360
- The AI Fairness 360 toolkit is an extensible open-source library containing techniques developed by the research community to help detect and mitigate bias in machine learning models throughout the AI application lifecycle.
-
Fairlearn
- Fairlearn is a Python package that empowers developers of artificial intelligence (AI) systems to assess their system's fairness and mitigate any observed unfairness issues.
-
Aequitas
- Aequitas is an open-source bias audit toolkit for data scientists, machine learning researchers, and policymakers to audit machine learning models for discrimination and bias, and to make informed and equitable decisions around developing and deploying predictive tools.
-
FAT Forensics
- FAT Forensics implements state-of-the-art fairness, accountability and transparency (FAT) algorithms for the three main components of any data modelling pipeline: data (raw data and features), predictive models and model predictions.
-
- This project is about explaining what machine learning classifiers (or models) are doing.
-
InterpretML
- InterpretML is an open-source package that incorporates state-of-the-art machine learning interpretability techniques under one roof.
-
Deep Visualization Toolbox
- This is the code required to run the Deep Visualization Toolbox, as well as to generate the neuron-by-neuron visualizations using regularized optimization.
-
Captum
- Captum is a model interpretability and understanding library for PyTorch.
-
Alibi
- Alibi is an open-source Python library aimed at machine learning model inspection and interpretation.
-
AI Explainability 360
- The AI Explainability 360 toolkit is an open-source library that supports interpretability and explainability of datasets and machine learning models.
-
- A Python package for inferring causal effects from observational data.
-
Fortuna
- Fortuna is a library for uncertainty quantification that makes it easy for users to run benchmarks and bring uncertainty to production systems.
-
VerifAI
- VerifAI is a software toolkit for the formal design and analysis of systems that include artificial intelligence (AI) and machine learning (ML) components.
-
Backdoor Attacks and Defenses in Machine Learning (ICLR 2023)
-
Adversarial Machine Learning on Computer Vision: Art of Robustness (CVPR 2023)
-
Workshop on Adversarial Robustness In the Real World (ECCV 2022)
-
Workshop on Spurious Correlations, Invariance, and Stability (ICML 2022)
-
Robust and reliable machine learning in the real world (ICLR 2021)
-
Distribution Shifts: Connecting Methods and Applications (NeurIPS 2021)
-
Workshop on Adversarial Robustness In the Real World (ICCV 2021)
-
Uncertainty and Robustness in Deep Learning Workshop (ICML 2021)
-
Uncertainty and Robustness in Deep Learning Workshop (ICML 2020)
-
Pitfalls of limited data and computation for Trustworthy ML (ICLR 2023)
-
Secure and Safe Autonomous Driving (SSAD) Workshop and Challenge (CVPR 2023)
-
Trustworthy and Reliable Large-Scale Machine Learning Models (ICLR 2023)
-
TrustNLP: Third Workshop on Trustworthy Natural Language Processing (ACL 2023)
-
Workshop on Mathematical and Empirical Understanding of Foundation Models (ICLR 2023)
-
Automotive and Autonomous Vehicle Security (AutoSec) (NDSS 2022)
-
Trustworthy and Socially Responsible Machine Learning (NeurIPS 2022)
-
International Workshop on Trustworthy Federated Learning (IJCAI 2022)
-
1st Workshop on Formal Verification of Machine Learning (ICML 2022)
-
Workshop on Distribution-Free Uncertainty Quantification (ICML 2022)
-
Practical Adversarial Robustness in Deep Learning: Problems and Solutions (CVPR 2021)
-
Adversarial Robustness: Theory and Practice (NeurIPS 2018) [Note]
-
ECE1784H: Trustworthy Machine Learning (Course, Fall 2019) - Nicolas Papernot
-
A School for all Seasons on Trustworthy Machine Learning (Course) - Reza Shokri, Nicolas Papernot
You are welcome to recommend papers that you find interesting and relevant to trustworthy deep learning. You can submit an issue or contact me via [email]. Also, if there are any errors in the paper information, please feel free to let me know.
Formatting (papers are listed in reverse chronological order of their initial submission to arXiv)
- Paper Title [paper]
- Authors. Published Conference or Journal
- Key Word: XXX.
-
Digest
XXXXXX