
Week 6. Feb. 14: Reinforcement Learning - Orienting #14


Open
ShiyangLai opened this issue Feb 8, 2025 · 18 comments

Comments

@ShiyangLai
Collaborator

Post your questions here about:

“Why Reinforcement Learning”, Reinforcement Learning, chapter 1.

“Reinforcement Learning”, Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow, chapter 18.

Christiano et al. 2017. “Deep Reinforcement Learning from Human Preferences.” NeurIPS.

@chychoy

chychoy commented Feb 13, 2025

In the paper "Deep Reinforcement Learning from Human Preferences," it seems that a key trait of these models is allowing humans decisions to be the drivers of building a rewards function rather than mandating humans to build the rewards function themselves. A question that I have relates to an earlier reading, where we considered the ethics in making certain decisions for self-driving cars. How do we ensure that the "human preferences" we see here are also "ethical"? Furthermore, is it possible to retroactively extract and interpret what the machine derived as a "rewards function," especially when we are considering dilemmas that might be more contested morally and ethically?

@yangyuwang

The paper “Deep Reinforcement Learning from Human Preferences” designs a reinforcement learning method that uses human preferences to shape the reward function. What concerns me is how this human-AI interactive learning method can capture the complexity of human preferences. From my perspective, human preferences are shaped by demographic attributes as well as by time and place. If a person interacting with the reinforcement learning system changes their preferences over time (whether intentionally or not), how can the machine learn the "true" human preferences? Or, going further, does this method assume that there is a universal or general human preference shared across people? And if we want to track the changes and differences in preferences, and debias them, how can we adjust the method?

@kiddosso

For the RLHF paper, what are the specific processes through which humans give feedback to the agent? In my understanding, people sometimes cannot judge which of two responses is better. The human feedback process may involve personal preferences that are not entirely objective, so how can RLHF workers prevent, or deliberately exploit, this? Does RLHF also pave the way for character AI?

@xpan4869

In the paper 'Deep Reinforcement Learning from Human Preferences,' the authors provide an alternative method for agents to learn and achieve complicated goals compared to traditional reinforcement learning. While traditional RL approaches typically offer clear interpretability with their well-defined reward functions and decision-making processes, the introduction of human preferences seems to make the decision-making process less transparent, potentially introducing unintended biases and making it harder to detect and correct systematic errors. I am wondering what strategies or approaches you would suggest for balancing the benefits of human preference learning with the need for model interpretability?

@zhian21

zhian21 commented Feb 14, 2025

Christiano et al. (2017) demonstrate that human preferences over trajectory segments can be leveraged to train deep reinforcement learning agents without predefined reward functions. By presenting humans with pairwise trajectory comparisons and using these judgments to learn a reward model, the authors achieve comparable performance to traditional RL on complex tasks, such as Atari games and robotic locomotion, with significantly less human feedback. This scalable approach highlights the potential for applying RL to tasks where reward specification is difficult, while introducing new challenges related to preference consistency and reward model generalization. Given the observed discrepancies in human and synthetic feedback efficiency, how might future models adaptively calibrate human input quality to further reduce oversight requirements?
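For concreteness, here is a minimal sketch of the pairwise preference loss as I understand it (PyTorch-style; the variable names are my own, not the paper's code): the reward model scores each segment by summing its predicted per-step rewards, and the probability that one segment is preferred is a softmax over those sums, trained with cross-entropy against the human label.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Predicts a scalar reward r_hat(obs, action) for each timestep."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def preference_loss(reward_model, seg1, seg2, human_label):
    """Cross-entropy loss on a single pairwise comparison.

    seg1, seg2: tuples (obs, act) with shapes (T, obs_dim) / (T, act_dim)
    human_label: 1.0 if the human preferred seg1, 0.0 if seg2
                 (0.5 can encode "equally preferable").
    """
    # Score each segment by the sum of predicted per-step rewards.
    sum1 = reward_model(*seg1).sum()
    sum2 = reward_model(*seg2).sum()
    # P(seg1 preferred) follows a softmax over the two summed scores.
    logit = sum1 - sum2
    target = torch.tensor(float(human_label))
    return nn.functional.binary_cross_entropy_with_logits(logit, target)
```

The agent is then trained with a standard RL algorithm against r_hat instead of the environment's reward, and new comparisons are queried as the policy changes.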

@ulisolovieva

How do we create reward functions that lead to desired behaviors without unintended consequences? (e.g., predicting job candidate success might incentivize high immediate performance and lead to bias; optimizing for user engagement might promote polarization). What makes rewards change over time? And do those changes always reflect changes in the agent's "better = more rewarding" policy?

@xiaotiantangishere

In the paper's feedback collection process, humans are asked to compare two trajectory segments and select the preferable one. However, if neither is ideal, this might introduce bias into the learned reward function. Meanwhile, in specific applications like autonomous driving—resonating with discussions from previous readings—would it be more effective to incorporate explicit negative feedback to define "red lines" and prevent undesirable behaviors, rather than solely relying on preference-based optimization?

@Sam-SangJoonPark

Regarding the approach in the paper "Deep Reinforcement Learning from Human Preferences," I find it fascinating that this method achieves strong performance with minimal human feedback. The idea of learning from human preference comparisons instead of predefined reward functions is compelling because it allows for more flexible and intuitive learning, especially in complex tasks where reward design is difficult.

However, since I haven’t worked with reinforcement learning before, one of my main concerns is how performance is evaluated in this setup. In traditional reinforcement learning, performance is often measured using explicit reward signals (e.g., total score in a game, cumulative reward over time). But in this case, since the reward function itself is learned from human feedback, I wonder how we can ensure that the resulting policy is actually optimal or aligned with human intent.
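As far as I can tell (this is only my rough sketch, not the paper's actual code), the evaluation trick is that the agent is trained against the learned reward model but scored against the environment's true reward, which the agent itself never observes:

```python
# Sketch: train on the learned reward, evaluate on the hidden true reward.
# `env`, `policy`, and `reward_model` are assumed to exist; names are my own.

def rollout_with_learned_reward(env, policy, reward_model, n_steps=1000):
    """Collect experience where the agent sees only r_hat, while we
    separately log the environment's true reward for evaluation."""
    obs = env.reset()
    learned_return, true_return = 0.0, 0.0
    for _ in range(n_steps):
        action = policy.act(obs)
        next_obs, true_reward, done, _ = env.step(action)
        r_hat = reward_model.predict(obs, action)   # what the agent optimizes
        learned_return += r_hat
        true_return += true_reward                  # held out, evaluation only
        policy.observe(obs, action, r_hat, next_obs, done)
        obs = env.reset() if done else next_obs
    return learned_return, true_return
```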

@Daniela-miaut

“Deep Reinforcement Learning from Human Preferences” introduces a way to train the reinforcement learning algorithm without access to the reward function and with relatively lower cost of human oversight. The basic idea is to learn the reward function using the data from human comparison of possible trajectories of the reinforcement learning agent. This method seems promising in training algorithms to learn human preferences even without a clearly pre-defined goal.
I have a few questions on this topic:
Can reinforcement learning be used to learn more complicated human behaviors in agent-based modeling?
Would it be useful to simulate environments that represent societal settings?
Would it be useful to use reinforcement learning to learn human ethical preferences?
Can we (and how can we) get the embedding from the results of reinforcement learning? That is, how can we interpret the results of reinforcement learning?
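On the last question, one concrete (if crude) way I can imagine "interpreting" the result is to probe the learned reward model directly, e.g., by scoring held-out trajectories and inspecting which behaviors it ranks highly. A toy sketch with made-up names:

```python
import numpy as np

def probe_reward_model(reward_model, trajectories):
    """Score held-out trajectories with the learned reward model so a human
    can inspect which behaviors it ranks highly (all names hypothetical)."""
    scores = []
    for traj in trajectories:  # traj is a list of (obs, action) pairs
        per_step = [reward_model.predict(obs, act) for obs, act in traj]
        scores.append(float(np.sum(per_step)))
    # Sort trajectories from most to least 'preferred' by the model.
    ranking = np.argsort(scores)[::-1]
    return ranking, scores
```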

@DotIN13

DotIN13 commented Feb 14, 2025

How does the reward model in this approach handle conflicting human preferences, especially in scenarios where different human raters provide contradictory feedback on similar trajectory segments? Additionally, is there a mechanism for prioritizing or reconciling diverging human inputs to ensure the learned reward function remains stable and generalizable across different human perspectives?

@psymichaelzhu

Can RLHF be used to infer human preferences in a systematic way? Since reinforcement learning from human feedback optimizes an agent's behavior based on human-provided preferences, could this process be reversed to extract an implicit formal reward function that captures human values?

@haewonh99

While the concept is very interesting, I was a bit confused about the 'curiosity-based exploration' introduced in Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow. I assume that in curiosity-based exploration, the 'surprised' AI is programmed to explore more of the discrepancies that caused the surprise. If so, isn't it structurally similar to reward-based learning, in the sense that surprise and reward are just different names for the quantity the AI uses to decide its future actions?
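To make my own question concrete, here is a minimal sketch of what I understand curiosity-based exploration to look like (my own toy code, not the book's): a forward model predicts the next state, and the prediction error ("surprise") is added to the reward, so surprise really is just another reward term.

```python
import numpy as np

class ForwardModel:
    """Toy linear next-state predictor; in practice this would be a neural net."""
    def __init__(self, obs_dim, act_dim):
        self.W = np.zeros((obs_dim + act_dim, obs_dim))

    def predict(self, obs, act):
        return np.concatenate([obs, act]) @ self.W

    def update(self, obs, act, next_obs, lr=0.01):
        x = np.concatenate([obs, act])
        error = next_obs - self.predict(obs, act)
        self.W += lr * np.outer(x, error)   # gradient step on squared error
        return error

def curiosity_reward(forward_model, obs, act, next_obs, scale=1.0):
    """Intrinsic reward = prediction error of the forward model ('surprise')."""
    error = forward_model.update(obs, act, next_obs)
    return scale * float(np.mean(error ** 2))

# The agent then optimizes total_reward = extrinsic_reward + curiosity_reward,
# which is why curiosity looks structurally like just another reward signal.
```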

@JairusJia

Can the efficiency of human preference feedback in deep reinforcement learning be further optimized to reduce reliance on manual labeling?

@tyeddie

tyeddie commented Feb 14, 2025

In what situations might a model-based approach be more advantageous than a model-free one? Does the choice resemble the choice between parametric and non-parametric methods in statistics?
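To make the contrast concrete, a toy sketch of the two update styles as I understand them (nothing here is from the readings' code): model-free Q-learning updates values directly from sampled transitions, while a model-based approach first fits transition and reward models and then plans with them.

```python
import numpy as np

# --- Model-free: tabular Q-learning, values updated from raw experience ---
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

# --- Model-based: fit P(s'|s,a) and R(s,a), then plan by value iteration ---
def value_iteration(P, R, gamma=0.99, n_iters=100):
    """P: (S, A, S) transition probabilities, R: (S, A) expected rewards."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(n_iters):
        Q = R + gamma * P @ V          # shape (S, A)
        V = Q.max(axis=1)
    return V, Q
```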

@siyangwu1

How can reinforcement learning systems adapt to evolving human preferences over time without requiring continuous manual oversight? Are there mechanisms to detect and adjust for preference drift, ensuring that learned policies remain aligned with long-term human goals rather than overfitting to transient preferences?

@CongZhengZheng

How do you use reinforcement learning to improve cloud computing? I am relating this to research I did, where I used algorithms to find the best communication route between computing nodes. The reading says datacenter cooling, CPU cooling, etc. are all RL applications, so how exactly do these cases reflect the idea of autonomous learning and optimisation? When it says 'reward' here, how exactly is a reward defined? Is it just a positive value returned by a mathematical formula?
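To gesture at what "reward" means concretely in a cooling-style control problem (a toy illustration I made up, not how any real datacenter system actually works), the reward is just a number computed from the state after each action, for example penalizing energy use and unsafe temperatures:

```python
def cooling_reward(power_kw, temp_c, temp_max_c=27.0, energy_weight=0.1):
    """Toy reward for a cooling controller: use less energy, stay under the
    temperature limit. Higher (less negative) is better."""
    energy_penalty = energy_weight * power_kw
    overheat_penalty = 10.0 * max(0.0, temp_c - temp_max_c)
    return -(energy_penalty + overheat_penalty)

# Example: the controller's last action left the room at 26.5 C using 40 kW.
r = cooling_reward(power_kw=40.0, temp_c=26.5)   # r = -4.0
```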

@shiyunc

shiyunc commented Mar 11, 2025

The rate of exploration (e.g., fixed or prediction-based) is essential for balancing the exploration-exploitation trade-off. How does the choice of exploration strategy affect the convergence speed of reinforcement learning algorithms in large state spaces? In multi-agent reinforcement learning, how does the exploration-exploitation trade-off change when agents must cooperate versus compete?
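To make the "fixed vs. prediction-based" rates concrete, here is a minimal epsilon-greedy sketch with either a constant epsilon or one that decays over training (my own toy code, not from the readings):

```python
import numpy as np

def epsilon_greedy(Q_row, epsilon, rng=np.random.default_rng()):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q_row)))
    return int(np.argmax(Q_row))

def decayed_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linearly anneal exploration from eps_start down to eps_end."""
    frac = min(1.0, step / decay_steps)
    return eps_start + frac * (eps_end - eps_start)
```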

@CallinDai

We learned that reinforcement learning from human preferences enables agents to learn complex behaviors without predefined reward functions by using pairwise human trajectory comparisons. The study demonstrates that even a small amount of human feedback can shape agent behavior effectively, sometimes surpassing traditional reinforcement learning with hand-crafted rewards. This makes me think—how does the choice of trajectory segments influence learned reward functions? Specifically, could the structure of human feedback (e.g., short vs. long trajectory comparisons, sequential vs. random sampling) systematically bias the agent’s learned policy, leading to overfitting on superficial behavioral cues rather than deep task understanding? Could this impact generalization in open-ended, real-world tasks where human preferences are inconsistent or context-dependent?
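To illustrate the design choice in my question, a toy sketch of how the comparison data might be assembled, with segment length and sampling strategy exposed as parameters (variable names are mine, not the paper's):

```python
import random

def sample_segment(trajectory, segment_len, strategy="random"):
    """Cut one segment of `segment_len` steps out of a full trajectory.
    `trajectory` is a list of (obs, action) pairs."""
    if strategy == "random":
        start = random.randrange(0, len(trajectory) - segment_len + 1)
    else:  # "sequential": always take the most recent behavior
        start = len(trajectory) - segment_len
    return trajectory[start:start + segment_len]

def make_comparison(traj_a, traj_b, segment_len=25, strategy="random"):
    """Build one pair of segments to show a human annotator."""
    return (sample_segment(traj_a, segment_len, strategy),
            sample_segment(traj_b, segment_len, strategy))
```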
