
Week 5. Feb. 7: Transformers for Multi-Agent Simulation - Orienting #13


avioberoi opened this issue Feb 5, 2025 · 23 comments

@avioberoi
Collaborator

Post your question here about the orienting readings:

“Simulating Subjects: The Promise and Peril of AI Stand-ins for Social Agents and Interactions” by Austin C. Kozlowski and James A. Evans;

@youjiazhou

When designing a multi-agent system, how do we balance the requirements of different tasks? For example, we might want to avoid uniformity when designing a chatbot, but when asking it to write code or do math, we expect uniformity and standard answers. It seems jittering can't fulfill both purposes at the same time. So if we want the agent to show different degrees of uniformity and different features for different tasks, rather than hard-coding behavior for each specific task, how can we achieve this technically?
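
A minimal sketch of one way this could work technically (my own assumption, not the paper's method): condition the decoding parameters and persona jitter on the task type, so the same agent is jittered for open-ended conversation but held uniform for code and math. The task profiles, persona variants, and request format below are illustrative placeholders for an OpenAI-style chat API.

```python
import random

# Hypothetical task profiles: high temperature plus persona jitter for
# open-ended chat, near-deterministic decoding for code and math.
TASK_PROFILES = {
    "chat": {"temperature": 1.0, "jitter_persona": True},
    "code": {"temperature": 0.0, "jitter_persona": False},
    "math": {"temperature": 0.0, "jitter_persona": False},
}

# Illustrative persona variants, used only when jitter is enabled.
PERSONA_VARIANTS = [
    "You are a cautious, detail-oriented assistant.",
    "You are an enthusiastic, informal assistant.",
    "You are a skeptical assistant who questions assumptions.",
]

def build_request(task: str, user_prompt: str) -> dict:
    """Assemble decoding parameters and a system prompt for a given task type."""
    profile = TASK_PROFILES[task]
    if profile["jitter_persona"]:
        persona = random.choice(PERSONA_VARIANTS)
    else:
        persona = "You are a precise assistant. Give a single standard answer."
    return {
        "messages": [
            {"role": "system", "content": persona},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": profile["temperature"],
    }

# The same agent is jittered for conversation but held uniform for math.
print(build_request("chat", "What do you think about remote work?"))
print(build_request("math", "What is 17 * 24?"))
```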

@Sam-SangJoonPark

This article discusses the potential and limitations of LLMs in social science research. LLMs have opened up new horizons by enabling simulations that were previously impossible. However, they also come with challenges such as bias, low variance, and atemporality, which require careful attention when interpreting the results.

My question is about how the definition of valuable research or skills seems to change as these technologies evolve. For example, even before the emergence of AI models like ChatGPT, advances in computing made debugging easier, leading to a shift in value: rather than simply being good at coding, it became more valuable to know how to apply those skills to create value.

Similarly, in the social sciences, I wonder whether the criteria for what counts as good research will also change. Would they shift or remain the same? How might technological progress reshape the standards for evaluating research?

@kiddosso

kiddosso commented Feb 6, 2025

Some AI startups have started to train character AIs, which are supposed to have unique identities and characteristics. I wonder how we can leverage these new character AI models in social scientific research. Given the decreasing cost of training LLMs, is it possible to request custom-made LLMs from those character AI companies? Could social science researchers even build custom character LLMs for themselves in the future?

@xpan4869

xpan4869 commented Feb 6, 2025

This article argues that Large Language Models (LLMs), trained on vast amounts of online text, can mimic the perspectives and linguistic styles of diverse social and cultural groups, making them potentially powerful tools for social science research. However, the authors also highlight significant limitations of current models, including atemporality (lack of temporal understanding), social acceptability bias, uniformity (overly consistent responses), and poverty of sensory experience (inability to capture non-textual, sensory inputs). Despite these limitations, the authors suggest that LLMs could still play a valuable role in simulating human subjects and interactions.

However, many researchers in academia remain skeptical about whether LLMs can truly replace traditional field studies and experiments. They argue that simulations lack the depth, nuance, and real-world context that come from direct human interaction and observation.

Given these concerns, do you see a future in using LLMs for simulation in social science research? Are we being too optimistic about their potential, or do the benefits of scalability and accessibility outweigh the limitations? How might we balance the use of LLMs with traditional methods to ensure rigorous and ethical research?

@lucydasilva

My immediate impression of the substitutability between humans and language models (for the purpose of research in the social sciences and in general) is that there must obviously be intense ethical concerns. But, after ten minutes of rumination, I still cannot figure out what these concerns are. In terms of human-human relationships, mutual substitutability is, on the one hand, an affront to human dignity/individuality/autonomy, and, on the other, a coercive mechanism used to govern and exploit citizens and laborers. But that's beside the point -- the substitutability at work in human-LLM relationships (for the purpose of research) has little to do with exploitation or dignity. It's difficult to exploit a machine or strip it of dignity (I think). So I'm really interested in the distinction between humans and LLMs in this specific article, because I think it could help elucidate what ethical categories are at work here, but I do not know whether a discussion post is the way to do that. So, preliminary questions: first, what end/goal/objective does using LLMs as social agents serve? Second, why are LLMs preferable to humans? Is ease of access to usually inaccessible "corporeal" people really the only reason? Third, rather than focusing on substitutability as the preeminent ethical concern, what are the ethical implications of reconfiguring the "ease of access" to otherwise cloistered or niche figures? What does this do to the scope of knowledge, and what does this do to the people for whom the model acts as a substitute?

@tyeddie

tyeddie commented Feb 7, 2025

I’m curious about the use of LLMs as digital stand-ins in public opinion and survey research. If the goal is to assess public sentiment on novel events where no prior ground truth exists, to what extent can AI agents accurately capture human perspectives? Even with fine-tuning to enhance their sensitivity to public sentiment, could this process introduce biases from the researchers?

@yangyuwang

Though the authors mention the atemporality challenge of AI stand-ins, I am still curious about how we could fine-tune LLMs to address the problem. For example, if we borrow ideas from research on opinion change (such as the active-updating model from Kiley and Vaisey, 2020), could we tune LLMs on texts about events from particular years to evoke the opinions of those years, and so make them temporally grounded?
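
For illustration only, here is a rough sketch of what preparing such year-sliced tuning data might look like, assuming a JSONL corpus whose records carry `text` and `year` fields; the file names, fields, and prompt template are my assumptions, not the authors' method.

```python
import json
from collections import defaultdict

def split_corpus_by_year(path):
    """Group documents by year so each slice can tune a year-specific stand-in.
    Assumes a JSONL corpus whose records carry 'text' and 'year' fields."""
    by_year = defaultdict(list)
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            by_year[record["year"]].append(record["text"])
    return by_year

def write_finetune_file(texts, year):
    """Write chat-formatted tuning examples for one year; a model tuned on the
    1994 slice would only ever see text about pre-1995 language and events."""
    out_path = f"finetune_{year}.jsonl"
    with open(out_path, "w") as f:
        for text in texts:
            example = {"messages": [
                {"role": "system", "content": f"It is the year {year}."},
                {"role": "user", "content": "Share your view on current events."},
                {"role": "assistant", "content": text},
            ]}
            f.write(json.dumps(example) + "\n")
    return out_path

# Usage: one tuning file per year, each feeding a separate fine-tune.
# for year, texts in split_corpus_by_year("corpus.jsonl").items():
#     write_finetune_file(texts, year)
```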

@zhian21

zhian21 commented Feb 7, 2025

Kozlowski and Evans (2024) explore the use of LLMs in social science research, highlighting their benefits, such as scalability, accessibility, and the ability to simulate hard-to-reach populations, while also addressing key limitations, including bias, uniformity, and a lack of human-like cognition. To improve the accuracy and ethical integrity of AI-generated simulations, the study suggests methodological refinements like fine-tuning, persona jittering, and multimodal training, alongside ethical considerations regarding consent and representation. Given the potential for LLMs to revolutionize social science research through large-scale, cost-efficient simulations, how can researchers ensure validity, reliability, and ethical integrity while mitigating the limitations of these AI-driven studies?

@psymichaelzhu

If AI simulation is widely used in social science research, could it have a reverse impact on the methods of social science itself? For example, will sociological research increasingly focus on phenomena that can be effectively simulated by AI, while ignoring the parts that are difficult to model with AI?

@JairusJia

When using LLMs as social agents for social science research, how can we distinguish whether AI-generated opinions are a true representation of social culture or are influenced by the biases of training data and model architecture?

@christy133

christy133 commented Feb 7, 2025

I think it's interesting that models show systematic preferences/tendencies through fine-tuning, while giving less response variance and more consistent answers than human populations. Both phenomena might stem from the same underlying characteristic of LLMs: their tendency to find and replicate clear patterns in their training data. The uniformity in responses could be seen as a 'bias' at a deeper level, namely the model's bias toward the most statistically common or "safe" responses rather than exploring the full distribution of possible responses. How does the strength of a bias in an LLM correlate with response uniformity? For example, when an LLM shows stronger political bias, does it also show less variance in its responses to politically charged questions?
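
One way to operationalize this question, as a rough sketch: sample many responses per politically charged prompt, score each on a leaning scale, and correlate the absolute mean leaning with the variance across prompts. The `sample_responses` and `score_leaning` callables below are hypothetical stand-ins for whatever model API and leaning classifier one actually uses.

```python
import statistics

def bias_and_variance(prompts, sample_responses, score_leaning, n=50):
    """For each politically charged prompt, estimate the model's mean leaning
    (bias) and the spread of its answers (low variance = high uniformity)."""
    results = []
    for prompt in prompts:
        scores = [score_leaning(r) for r in sample_responses(prompt, n=n)]
        results.append({
            "prompt": prompt,
            "bias": statistics.mean(scores),          # systematic leaning
            "variance": statistics.variance(scores),  # response diversity
        })
    return results

def bias_variance_correlation(results):
    """The question above amounts to: across prompts, is |bias| negatively
    correlated with variance? (statistics.correlation needs Python 3.10+)"""
    biases = [abs(r["bias"]) for r in results]
    variances = [r["variance"] for r in results]
    return statistics.correlation(biases, variances)
```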

@haewonh99

I'm a bit confused about the concept of using multiple agents to mimic social systems. If we are using AIs that perform similar tasks and learn from one another, how can we make sure that their thinking processes do not 'converge'? Are we differentiating the agents by training them on different data before they start interacting with each other? If so, how is this different from combining the chunked datasets into one massive dataset and training a single model on it?
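
For what it's worth, many multi-agent setups differentiate agents at inference time rather than by training separate models: one shared model, distinct persona prompts, and private memories. A minimal sketch under that assumption (the `chat` callable is a placeholder for whatever LLM API is used):

```python
class Agent:
    """One shared model, differentiated only by persona prompt and private memory."""

    def __init__(self, name, persona, chat):
        self.name = name
        self.persona = persona  # differentiates agents at inference time
        self.memory = []        # private history, not shared with other agents
        self.chat = chat        # hypothetical callable: list of messages -> reply text

    def respond(self, message):
        prompt = ([{"role": "system", "content": self.persona}]
                  + self.memory
                  + [{"role": "user", "content": message}])
        reply = self.chat(prompt)
        self.memory += [{"role": "user", "content": message},
                        {"role": "assistant", "content": reply}]
        return reply

def run_round(agents, topic):
    """Each agent reacts to the previous statement; tracking the similarity of
    replies across rounds is one crude check for convergence."""
    message = topic
    for agent in agents:
        message = agent.respond(message)
        print(f"{agent.name}: {message}")
    return message
```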

@Daniela-miaut

The "ground truth" of the simulated interactions seems untestable without including real human interactions. Are there ways to examine the validity of the simulation results (and of the knowledge we acquire from them)?

@ulisolovieva

I'm still unsure about LLMs' ability to replace human subjects in certain types of studies. With human participants, we can directly investigate their reasoning through free responses or secondary measures. While chain-of-thought improvements may help LLMs articulate their decision-making processes, offering a potential way forward, it's unclear whether this truly captures the idiosyncrasies of human reasoning.

What are some other approaches to prompting and fine-tuning LLMs or improvements that need to happen in order to better approximate human variability?

@ana-yurt

ana-yurt commented Feb 7, 2025

Often, imbuing LLMs with a personality feels like giving an actor a script to read and perform in a drama: while they can simulate personas or characters, they retain access to their full knowledge base, potentially leading to anachronistic or contextually inappropriate responses. For example, when asked to role-play as a medieval peasant, the model might inadvertently reference modern concepts that wouldn't have been available to a peasant in that time period. How can we adjust the prior knowledge to "mask out" the parts that the LLM (as a simulated agent) should not know? What architectural modifications could be made?
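
True architectural masking (e.g., selective unlearning or steering activations away from post-period knowledge) is still an open research problem. A cheaper, prompt-level workaround is a second "critic" pass that audits each reply for anachronisms and triggers a regeneration; the sketch below is only that workaround, with `chat` standing in for a hypothetical LLM call.

```python
PERSONA = ("You are a peasant in fourteenth-century England. "
           "Speak only from the world you know.")
CRITIC = ("List any concepts in the following reply that a fourteenth-century "
          "peasant could not know (modern science, later history, academic "
          "framing). Answer NONE if there are none.")

def constrained_roleplay(chat, question, max_retries=3):
    """Generate in persona, audit the reply for anachronisms with a second pass,
    and regenerate if anything is flagged."""
    reply = chat([{"role": "system", "content": PERSONA},
                  {"role": "user", "content": question}])
    for _ in range(max_retries):
        audit = chat([{"role": "system", "content": CRITIC},
                      {"role": "user", "content": reply}])
        if audit.strip().upper().startswith("NONE"):
            return reply
        # Feed the flagged anachronisms back as constraints and try again.
        reply = chat([{"role": "system", "content": PERSONA},
                      {"role": "user", "content": question},
                      {"role": "user", "content": f"Avoid mentioning: {audit}"}])
    return reply
```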

@xiaotiantangishere

I have a question regarding the empirical validation between AI-generated and real human responses. If AI-generated data is not significantly different from real human data, that would mean everything the AI produces already exists in human responses, and the AI-generated data seems redundant. But if the AI generates something that doesn't exist in human data, how do we know whether it's meaningful and unbiased? This seems like a paradox to me.

@chychoy

chychoy commented Feb 8, 2025

According to the paper, a key issue with LLMs and fine-tuning is social bias (toward being either more liberal or more conservative). However, like all things human, while bias can be measured with specific metrics, how do researchers measure "realness" while evaluating bias? Building on that note, how much are we looking for an "unbiased" machine, and how much are we looking for a "real" machine?

@CongZhengZheng

Given that LLMs are trained on vast and varied datasets that reflect current socio-cultural dynamics, how do you account for the dynamic nature of cultural norms and sentiments in your simulations, especially considering that these norms evolve over time? What ethical frameworks and considerations have you proposed or thought necessary to guide such simulations, especially when they might influence real-world decisions or policies?

@DotIN13

DotIN13 commented Feb 14, 2025

How can we design LLM-based social simulations that dynamically adapt to temporal shifts in public sentiment and cultural norms? Would integrating historical textual data in a structured way—such as training separate temporal models or using adaptive fine-tuning mechanisms—help LLMs better reflect historical contexts rather than just projecting contemporary biases? Can we encode unstructured temporal cultural data into traditional data structures and bypass the bias in those models?
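
If separate temporal models were trained (say, one fine-tune per time slice, as sketched earlier in this thread), a simple router could dispatch each simulated question to the nearest slice; the model names below are placeholders, not real checkpoints.

```python
# Hypothetical registry of year-sliced fine-tunes (names are placeholders).
TEMPORAL_MODELS = {1990: "sim-1990", 2000: "sim-2000",
                   2010: "sim-2010", 2020: "sim-2020"}

def pick_model(year):
    """Route a question to the model whose training slice is nearest in time."""
    nearest = min(TEMPORAL_MODELS, key=lambda y: abs(y - year))
    return TEMPORAL_MODELS[nearest]

print(pick_model(1997))  # -> "sim-2000"
```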

@siyangwu1

How can researchers systematically evaluate whether AI-generated simulations fairly represent diverse social perspectives, particularly those of marginalized or underrepresented groups? What methodologies could be used to mitigate the risk of AI models amplifying mainstream biases while still producing coherent and generalizable insights for social science research?

@baihuiw

baihuiw commented Mar 8, 2025

Among these challenges (atemporality, social acceptability bias, uniformity, and poverty of sensory experience), which do you think poses the greatest risk to the validity of AI-driven social science research, and why? How might researchers mitigate this limitation to ensure more accurate and reliable simulations?

@shiyunc

shiyunc commented Mar 10, 2025

I am especially interested in the uniformity problem, because in many cases what social scientists care about is precisely the variation within the population, rather than a general representation. The paper mentions that one way to address this is to "jitter" personas, inducing deviation across them. The ways of implementing this (e.g., interpolating steering vectors, adding a sample of slightly modified prompts), however, seem to introduce more randomness. Is this man-made disturbance meaningful for studying the authentic variation of the target group?
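
One way to make the disturbance less arbitrary, sketched below under my own assumptions: jitter by sampling persona attributes from an empirical joint distribution (e.g., weighted survey microdata) rather than adding free-form noise, so the induced variance mirrors the target population's rather than a random perturbation. The rows and weights are made up for illustration.

```python
import random

# Illustrative microdata rows with survey weights (values are made up).
SURVEY_ROWS = [
    {"age": 34, "education": "college degree", "region": "urban", "weight": 1.2},
    {"age": 61, "education": "high school diploma", "region": "rural", "weight": 0.8},
    {"age": 27, "education": "graduate degree", "region": "urban", "weight": 1.0},
]

def sample_persona(rows):
    """Draw a respondent profile in proportion to its survey weight and render
    it as a persona prompt, so jitter tracks the sample's real variation."""
    weights = [row["weight"] for row in rows]
    row = random.choices(rows, weights=weights, k=1)[0]
    return (f"You are a {row['age']}-year-old with a {row['education']} "
            f"living in a {row['region']} area. Answer the survey as this person.")

personas = [sample_persona(SURVEY_ROWS) for _ in range(5)]
```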

@CallinDai

We learned that this study evaluates the reliability of Large Language Models (LLMs) in simulating human behavior by comparing LLM-generated responses to real human survey data, particularly in measuring moral values and attitudes. This makes me think—given that the study finds discrepancies between LLM-generated responses and human data, especially in context-sensitive moral judgments, to what extent do these differences stem from the model’s learned linguistic distribution rather than genuine moral reasoning?
