Week 6. Feb. 14: Reinforcement Learning - Possibilities #15

Open
ShiyangLai opened this issue Feb 8, 2025 · 17 comments

Comments

@ShiyangLai
Collaborator

Pose a question about one of the following articles:

“Human-level control through deep reinforcement learning” (2015). V. Mnih...D. Hassabis.

“Direct Preference Optimization: Your Language Model is Secretly a Reward Model” (2023).

“Learning ‘What-if’ Explanations for Sequential Decision-Making” (2021).

“Improved protein structure prediction using potentials from deep learning” (2020).

“Machine Theory of Mind” (2018).

“Explainability in deep reinforcement learning” (2021).

@yangyuwang

For the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model”, it raised an idea of using DPO (Direct Preference Optimization) to simultaneously act as both generative model and reward model through reparameterization of reward function. In this case, the probability of generated answer would be reused as a implicit rewards to the model. I would like to ask upon this point: If the internal probability distribution of the language model is biased or unstable, how might that affect the calculation of the implicit reward, potentially leading to generated outputs that do not align with human preferences? Moreover, what mechanisms should be designed to detect and correct such intrinsic biases to prevent self-reinforcing errors?

@kiddosso

The “Human-level control through deep reinforcement learning” paper briefly explains why deep neural networks did not work well with Q-learning before this paper came out. However, that explanation is too brief for me; I still cannot figure out why the two methods combine so poorly. Could you give more explanation of this issue?

@zhian21

zhian21 commented Feb 14, 2025

Mnih et al. (2015) introduce a deep Q-network that integrates reinforcement learning with deep neural networks to achieve human-level performance in Atari 2600 games. By using high-dimensional sensory inputs and game scores, the DQN applies experience replay and target network stabilization to address the instability of nonlinear function approximators, learning robust policies across diverse environments with minimal task-specific adjustments. This work marks a breakthrough in generalizable learning from raw perceptual data without manual feature engineering. Given the success of DQN in relatively constrained environments, what architectural or algorithmic adaptations might be necessary to extend similar performance to more dynamic, real-world tasks?
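
As a reference point for this discussion, here is a minimal sketch of the two stabilizers the paper relies on, experience replay and a frozen target network; the tiny MLP, dimensions, dummy transitions, and hyperparameters are illustrative, not the paper's Atari convolutional setup.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, obs_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))
    def forward(self, x):
        return self.net(x)

q_net, target_net = QNet(), QNet()
target_net.load_state_dict(q_net.state_dict())   # target starts as a copy of the online net
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                    # experience replay buffer
gamma, batch_size = 0.99, 32

def train_step():
    if len(replay) < batch_size:
        return
    s, a, r, s2, done = zip(*random.sample(replay, batch_size))  # random minibatch breaks correlations
    s, s2 = torch.stack(s), torch.stack(s2)
    a = torch.tensor(a)
    r = torch.tensor(r, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)

    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)            # Q(s, a) from the online net
    with torch.no_grad():                                        # bootstrap from the frozen target net
        target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    loss = nn.functional.smooth_l1_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# fill the buffer with dummy transitions and run a few updates
for _ in range(500):
    replay.append((torch.randn(4), random.randrange(2), random.random(),
                   torch.randn(4), random.random() < 0.05))
    train_step()

target_net.load_state_dict(q_net.state_dict())   # periodic target sync (every C steps in the paper)
```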

@ulisolovieva

Social coordination perpetuates stereotypic expectations and behaviors across generations in deep multi-agent reinforcement learning

The paper suggests that stereotypes may emerge from social coordination rather than from bias or cognitive limitations. Using multi-agent reinforcement learning (MARL), the authors show how agents adjust their behavior to match social expectations, reinforcing a stereotype like a self-fulfilling prophecy. In a simulated environment, agents with varying skills traded resources with a "market decider," who predicted their actions and rewarded alignment between its predictions and agent behavior. Instead of recognizing individual skills, the market decider used group labels, leading agents to conform to stereotypes over time. Results were replicated with humans, irrespective of participants' attitudes toward social inequality.

While the paper used an economic task, MARL could theoretically be applied to more applied scenarios like hiring/promotion dynamics. Training MARL models on hiring data could reveal how early biases shape long-term behaviors. Real data could be used to set up the environment/agents/rewards (e.g., hiring managers receive rewards based on perceived applicant success), which would also allow us to test interventions to disrupt stereotype-driven feedback loops. Unlike ABM, the beauty of RL is the ability to capture learning dynamics without hard-coding agent behaviors, making this method especially exciting for studying the emergence of social biases.

How can RL methods integrate multiple conflicting rewards (e.g., efficiency from group-based heuristics vs. slower more accurate differentiation)?
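
To make the prediction–reward loop concrete, here is a heavily simplified toy sketch of the dynamic described above (not the paper's actual environment, agents, or learning rule): a market decider forms expectations per group label and rewards agents for matching those expectations, so behavior collapses toward group stereotypes even when individual skills differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_actions, n_groups = 6, 3, 2
group = np.array([0, 0, 0, 1, 1, 1])             # group labels, not individual skill
skill = rng.integers(0, n_actions, n_agents)     # each agent's truly best action

q = np.zeros((n_agents, n_actions))              # per-agent action values
decider_counts = np.ones((n_groups, n_actions))  # decider's per-group expectations
eps, lr = 0.1, 0.1

for step in range(5000):
    for i in range(n_agents):
        pred = decider_counts[group[i]].argmax()                  # decider predicts the group's modal action
        a = rng.integers(n_actions) if rng.random() < eps else q[i].argmax()
        # reward mixes a small skill bonus with a larger bonus for matching the prediction
        r = 0.3 * (a == skill[i]) + 1.0 * (a == pred)
        q[i, a] += lr * (r - q[i, a])
        decider_counts[group[i], a] += 1                          # decider reinforces its group stereotype

print("true skills:    ", skill)
print("learned actions:", q.argmax(axis=1))       # tends to collapse within each group
```

The relative weights on the skill bonus versus the prediction-matching bonus are one crude way to frame the conflicting-rewards question above.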

@xpan4869

“Human-level control through deep reinforcement learning” (2015). V. Mnih...D. Hassabis.

This paper introduced DQN (Deep Q-Network), which successfully combined DL with reinforcement learning through two key innovations: experience replay and target network. The system could learn to play 49 different Atari games directly from raw pixel inputs, achieving human-level performance on many games using the same architecture and hyperparameters. This work demonstrated that a single agent can learn multiple complex tasks from raw sensory inputs, marking a significant step towards more general artificial intelligence.

DQN is now a powerful tool in manufacturing, recommendation systems, traffic control, etc. However, in critical applications like healthcare (or other public goods) or financial trading, how can we ensure that DQN's decisions are reliable and safe? What additional mechanisms would be needed?

@Sam-SangJoonPark

If reinforcement learning benefits from "experience replay" by reusing past data, but unprecedented big events like COVID-19 disrupt social patterns, how can social science models—similarly reliant on historical data—adapt to such unforeseen changes?

@Daniela-miaut

“Human-level control through deep reinforcement learning” introduces a method to process high-dimensional sensory inputs, which reminds me of the social situation that is constantly perceived by agents and modulates their interactions. I am curious whether the social situation can be simulated in agent-based modeling as high-dimensional data, similar to sensory information.

@chychoy

chychoy commented Feb 14, 2025

In the paper, "Machine Theory of the Mind," the authors discuss constructing an "observer," who "gets access to a set of behavioral traces of a novel agent . . . to make predictions of the agent's future behavior." I am also interested in how they phrased this observation as having LLM models to infer other models' latent characteristics and "mental states." I am curious to see how does this help with the interpretability of machines on a practical level--as while the authors do discuss interpretability, it seems to still not succeed in doing so. Furthermore, this path seems to create an eternal cycle of using machines to understand machines, which, at the top-most level, still depends on humans to trust machine decisions "just because." Furthermore, I wonder what does "interpretability" even mean in context of these models--it seems here that we are trying to reach some form of "human understanding" of a model's "thought processes," which, while it seems to be a valuable goal, seems to also be unrealistic as human thought processes are usually uninterpretable.

@psymichaelzhu

(Rabinowitz et al., 2018) In this article, ToMnet can understand and predict the potential states (such as preferences and beliefs) and behaviors of other agents. Can similar ideas be applied to learning physical laws?

@DotIN13

DotIN13 commented Feb 14, 2025

What are the key differences between Direct Preference Optimization (DPO), reward models, and world models as introduced by Yann LeCun? Specifically:

  • How does DPO function differently from traditional reward models in reinforcement learning? (a brief contrast is sketched after this list)
  • While reward models are typically trained to predict human preferences, do they fundamentally differ from the predictive capabilities of world models?
  • Can DPO and world models be integrated to create more sample-efficient and human-aligned decision-making agents?
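
On the first bullet, a minimal sketch of the contrast, assuming the standard RLHF recipe: an explicit reward model is first fit on preference pairs with a Bradley-Terry loss and then used as the objective for RL (e.g., PPO), whereas DPO folds that reward into the policy's own log-probability ratio (see the DPO sketch earlier in the thread). The scores below are toy values, not model outputs.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry loss for an explicit reward model that scores each response."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# toy scalar scores an explicit reward model might assign to 3 preference pairs
r_chosen = torch.tensor([1.2, 0.4, 2.0])
r_rejected = torch.tensor([0.3, 0.9, 1.1])
print(reward_model_loss(r_chosen, r_rejected))

# A world model, by contrast, is trained to predict future states/observations rather
# than a preference score: it answers "what will happen" rather than "what do humans prefer",
# which is why combining the two is an open design question rather than a drop-in swap.
```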

@haewonh99

“Machine Theory of Mind” (2018)
While the experiment and its implications are interesting, I was wondering how we can know that the rather simple 'mind models' the 'observer' was able to construct about other agents could be extended to complex entities, such as the HCI settings the paper suggests. What are some difficulties that current models would have to overcome to 'read' others' minds? How are they being dealt with?

@JairusJia

DQN improves stability through experience replay and a target network, but are these methods still effective in more complex environments (such as non-Markovian decision processes or high-dimensional continuous action spaces)? How can they be improved to adapt to more challenging reinforcement learning tasks?

@tyeddie

tyeddie commented Feb 14, 2025

How can what we’ve learned about general feature learning help us apply transfer learning to complex, real-world settings? What difficulties might arise when using end-to-end reinforcement learning (which worked well on Atari) on tasks with richer sensory inputs and more complex dynamics?

@siyangwu1

How can explainability techniques be more tightly integrated into the learning process of deep reinforcement learning models, rather than being applied post-hoc? Could a model be trained to inherently optimize for both performance and explainability simultaneously, and what trade-offs might emerge between these objectives?

@CongZhengZheng

For the paper “Machine Theory of Mind”: what are the datasets used to train the three nets? Why, when modelling theory of mind, does the architecture have to be divided into a character net, a mental state net, and a prediction net? The paper does not seem to describe its dataset or collection method in much detail, so I am curious what kind of data can capture the behaviours of the agents. Also, why do the authors believe that this design will model theory of mind?
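
For reference, a minimal sketch of the three-part structure the paper describes (character net, mental state net, prediction net); the MLP encoders, dimensions, and flattened-trajectory inputs are illustrative simplifications, not the paper's actual architecture or training data.

```python
import torch
import torch.nn as nn

class CharacterNet(nn.Module):
    """Embeds an agent's PAST episodes into a character embedding e_char."""
    def __init__(self, traj_dim=32, emb=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(traj_dim, 64), nn.ReLU(), nn.Linear(64, emb))
    def forward(self, past_trajs):                 # (n_episodes, traj_dim)
        return self.enc(past_trajs).mean(dim=0)    # pool over past episodes

class MentalNet(nn.Module):
    """Embeds the CURRENT episode so far into a mental-state embedding e_mental."""
    def __init__(self, traj_dim=32, emb=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(traj_dim, 64), nn.ReLU(), nn.Linear(64, emb))
    def forward(self, current_traj):               # (traj_dim,)
        return self.enc(current_traj)

class PredictionNet(nn.Module):
    """Maps (current state, e_char, e_mental) to a distribution over the agent's next actions."""
    def __init__(self, state_dim=16, emb=8, n_actions=5):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(state_dim + 2 * emb, 64), nn.ReLU(),
                                  nn.Linear(64, n_actions))
    def forward(self, state, e_char, e_mental):
        return self.head(torch.cat([state, e_char, e_mental], dim=-1))

char_net, mental_net, pred_net = CharacterNet(), MentalNet(), PredictionNet()
e_char = char_net(torch.randn(4, 32))              # 4 past episodes of the observed agent
e_mental = mental_net(torch.randn(32))             # the current episode so far
logits = pred_net(torch.randn(16), e_char, e_mental)
print(logits.softmax(-1))                          # predicted next-action probabilities
```

The split mirrors the intuition named in the question: stable traits (character) versus episode-specific beliefs and desires (mental state), combined only at prediction time.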

@CallinDai

Machine Theory of Mind is such a fun read! I am curious what ToMnet's ability to infer false beliefs reveals about the computational mechanisms underlying human Theory of Mind, and how it might inform debates on innate versus learned social cognition.

@CongZhengZheng

The article "Learning 'What-If' Explanations for Sequential Decision-Making" by Bica et al. (2021) introduces Counterfactual Inverse Reinforcement Learning (CIRL). The method addresses the challenge of understanding expert decision-making based on observed behavior by inferring preferences over counterfactual outcomes. Instead of simply matching expert performance, CIRL interprets decisions through trade-offs among possible outcomes. It uses batch inverse reinforcement learning combined with counterfactual reasoning to recover interpretable reward functions, even in partially observable environments, where actions depend on historical trajectories rather than current states alone. Through experiments in simulated medical scenarios and real-world ICU data (MIMIC-III), the study demonstrates CIRL’s ability to accurately identify expert preferences, such as prioritizing temperature reduction and white blood cell count normalization when prescribing antibiotics.

This method could significantly extend social science analysis by providing interpretable models of human decision-making in complex, dynamic environments. In fields like policy design, education, or behavioral economics, CIRL could be used to unpack the latent preferences of policymakers, teachers, or consumers, respectively. For example, CIRL could reveal how educators balance short-term student performance metrics against long-term engagement and retention. By modeling their reward functions through counterfactual scenarios—what might happen if different pedagogical approaches were taken—social scientists can better understand and audit institutional decision-making.

To pilot this use, I would utilize educational data from the U.S. Department of Education’s National Assessment of Educational Progress (NAEP) dataset, which includes student outcomes, teacher actions, and school-level demographics over time. Specifically, I would focus on decisions around resource allocation (e.g., additional instructional time or tutoring services) and subsequent changes in student performance. The dataset’s sequential structure—student outcomes recorded over multiple years—makes it ideal for CIRL’s historical modeling approach. I would implement CIRL to recover reward weights indicating whether teachers and administrators prioritize immediate test score improvements or longer-term student growth and well-being. Accessing NAEP data is feasible through the National Center for Education Statistics (NCES), and additional context could be drawn from longitudinal studies like the Early Childhood Longitudinal Study (ECLS). Together, these data sources would allow for a robust pilot application, translating CIRL’s counterfactual reasoning and reward interpretation framework from healthcare into the domain of educational policy analysis.
