Week 9. Mar.7: Multi-Modal Learning and Explainability - Possibilities #21
The paper AGENT AI: Surveying the Horizons of Multimodal Interaction explores the development of Agent AI, a class of interactive AI systems that integrate multimodal perception, human feedback, and embodied actions. The authors argue that Agent AI represents a pathway toward Artificial General Intelligence (AGI) by enabling models to process visual, linguistic, and environmental data in real-time interactions. Unlike traditional AI systems that operate in limited, predefined environments, Agent AI is designed to adapt dynamically across both physical and virtual spaces. The paper highlights key challenges, such as reducing AI hallucinations, mitigating biases, ensuring interpretability, and improving real-world integration. Additionally, it explores how large foundation models (LLMs and VLMs) can be leveraged for embodied AI systems in domains such as robotics, gaming, and healthcare.

The methodology introduced in this paper has profound implications for social science research, particularly in the study of human-computer interaction, digital behavior, and adaptive learning environments. Social science often relies on observational studies, surveys, and controlled experiments, which can be resource-intensive and limited by ethical concerns. By using Agent AI systems, researchers could simulate complex social behaviors and test hypotheses at scale. For example, AI-driven agents could model online discourse, decision-making in group dynamics, or responses to misinformation campaigns in a controlled setting. This would allow for replicable and dynamic studies of human-like behavior in diverse contexts.

To pilot such an application, I would propose using real-world multimodal interaction data from publicly available conversational datasets, such as Reddit threads, YouTube comment sections, or Twitter discussions, combined with gesture, speech, and facial expression data from video-based communication platforms like Zoom or Microsoft Teams. By integrating sentiment analysis, discourse modeling, and behavioral tracking, an Agent AI system could simulate how individuals respond to different social cues, misinformation, or emotionally charged interactions. Researchers could then modify environmental variables (e.g., introducing fact-checking interventions or varying social network structures) to study how different sociocultural factors shape online discourse and decision-making. This approach would provide scalable, ethical, and repeatable models for studying social interactions in digital spaces and beyond.
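As a minimal sketch of just the sentiment-scoring step of the pilot described above (the comment list is placeholder data; real input would be collected Reddit, YouTube, or Twitter threads, and the lexicon-based scorer is only a stand-in for a fuller discourse model):

```python
# Minimal sketch of the sentiment-scoring step of the proposed pilot.
# Assumes comments have already been collected (e.g., via platform APIs);
# the `comments` list below is illustrative placeholder data.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
analyzer = SentimentIntensityAnalyzer()

comments = [
    "This fact-check completely changed my mind.",
    "I can't believe people still share this nonsense.",
]

for text in comments:
    scores = analyzer.polarity_scores(text)  # neg/neu/pos/compound in [-1, 1]
    print(f"{scores['compound']:+.2f}  {text}")
```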
C. Choi, S. Yu, M. Kampffmeyer, A. -B. Salberg, N. O. Handegard and R. Jenssen, "DIB-X: Formulating Explainability Principles for a Self-Explainable Model Through Information Theoretic Learning," ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Korea, Republic of, 2024, pp. 7170-7174, doi: 10.1109/ICASSP48485.2024.10447094.
The DIB-X model introduces a self-explainable deep learning approach that aligns with explainability principles using an information-theoretic learning framework. Unlike traditional post-hoc explainability methods that interpret deep models after training, DIB-X integrates explainability directly into the learning process. DIB-X employs Rényi’s α-order entropy functional to measure mutual information while avoiding assumptions about data distributions. It applies deep deterministic information bottleneck (DIB) learning to balance information retention and classification performance. The study validates this approach using datasets such as MNIST, marine monitoring images, and echosounder data, demonstrating both improved interpretability and accuracy compared to existing models like Grad-CAM and VIB-X.
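For readers unfamiliar with the matrix-based Rényi entropy functional that DIB-X builds on, here is a rough sketch of how mutual information can be estimated directly from Gram matrices without distributional assumptions (this is my own illustration, not the authors' code; the kernel width, α, and toy data are arbitrary):

```python
# Sketch of the matrix-based Rényi alpha-order entropy functional:
# S_alpha(A) = 1/(1-alpha) * log2(tr(A^alpha)), with A a trace-normalized
# Gram matrix of the data.
import numpy as np

def gram_matrix(X, sigma=1.0):
    """RBF Gram matrix, normalized so that trace(A) = 1."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2 * sigma ** 2))
    return K / np.trace(K)

def renyi_entropy(A, alpha=2.0):
    """Matrix-based Rényi alpha-order entropy of a normalized Gram matrix."""
    eigvals = np.linalg.eigvalsh(A)
    eigvals = eigvals[eigvals > 1e-12]          # drop numerical zeros
    return (1.0 / (1.0 - alpha)) * np.log2(np.sum(eigvals ** alpha))

def joint_entropy(A, B, alpha=2.0):
    """Entropy of the normalized Hadamard product, for the joint distribution."""
    AB = A * B
    return renyi_entropy(AB / np.trace(AB), alpha)

# Mutual information estimate: I(X;Y) = S(A) + S(B) - S(A,B)
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))
Y = X @ rng.normal(size=(10, 3))                # Y depends on X
A, B = gram_matrix(X), gram_matrix(Y)
mi = renyi_entropy(A) + renyi_entropy(B) - joint_entropy(A, B)
print(f"Estimated I(X;Y) ≈ {mi:.3f} bits")
```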
The self-explainability framework of DIB-X offers significant potential for social science research, particularly in policy analysis, media studies, and public opinion research. Current machine learning models in social sciences often struggle with interpretability, making it difficult to understand why a model arrives at certain conclusions. DIB-X, however, provides an inherently explainable way to analyze complex multimodal datasets.
We could apply this idea to image classification. For example, I would like to predict which style a painting belongs to. After training the neural network, DIB-X could show which parts of the painting are most related to a given style, letting us see how a style maps onto specific regions of the paintings. The pilot study could use established artwork datasets, predict the styles labeled by art critics, and then visualize the regions associated with each style.
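Since DIB-X's own code is not reproduced here, a post-hoc Grad-CAM pass over a style classifier gives a rough sense of what the visualization step could look like; the ResNet backbone, the number of style classes, and the image path below are all hypothetical stand-ins:

```python
# Rough sketch of the visualization step using post-hoc Grad-CAM as a stand-in
# (DIB-X builds the explanation into training instead). Assumes a ResNet-18
# fine-tuned on style labels; image path and class count are hypothetical.
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

model = models.resnet18(weights=None)                  # weights from fine-tuning
model.fc = torch.nn.Linear(model.fc.in_features, 10)   # e.g., 10 style classes
model.eval()

feats = {}
def save_activation(module, inputs, output):
    feats["act"] = output
    output.register_hook(lambda grad: feats.update(grad=grad))
model.layer4.register_forward_hook(save_activation)

preprocess = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
img = preprocess(Image.open("painting.jpg").convert("RGB")).unsqueeze(0)

logits = model(img)
style = logits.argmax(dim=1).item()
logits[0, style].backward()

# Grad-CAM: weight each feature map by its average gradient, then ReLU.
weights = feats["grad"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * feats["act"]).sum(dim=1))
cam = F.interpolate(cam.unsqueeze(1), size=(224, 224), mode="bilinear")
print(cam.shape)  # heatmap over the painting, highlighting style-relevant regions
```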
Reflection on “When Continue Learning Meets Multimodal Large Language Model: A Survey” (Huo & Tang, 2024)
https://www.nature.com/articles/s43588-023-00573-5

Summary: The methodology proposed in the article presents an exciting opportunity for extending social science research, particularly in the study of life course dynamics. Traditionally, social scientists have relied on surveys, longitudinal studies, and census data to analyze life trajectories. However, these methods often suffer from limitations such as recall bias and missing data. By embedding human life events into a structured vector space, researchers can analyze patterns with unprecedented granularity, allowing for predictive modeling of social mobility, economic inequality, and health outcomes.

Insight for Social Science:

Data for Pilot Implementation: In addition to medical records, smartphone sensor data could serve as a complementary source of information. With user consent, smartphone applications could collect passive data on movement patterns, screen time, social interactions (e.g., call and text frequency), and app usage, providing real-time insights into behavioral trends. By integrating these datasets, researchers could extract key life events and construct a multi-modal model of human life trajectories. By implementing these techniques, we could unlock new dimensions in social science research, offering more precise and dynamic insights into how life events shape individual and societal outcomes.
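A heavily simplified sketch of the life-event embedding idea (the article's actual model is a transformer over national registry data; here plain Word2Vec over invented toy event sequences stands in):

```python
# Simplified sketch of embedding life-event sequences in the spirit of the
# article's approach. All event codes below are invented for illustration.
from gensim.models import Word2Vec

# Each "sentence" is one person's chronologically ordered life events.
life_sequences = [
    ["finish_school", "first_job", "move_city", "promotion"],
    ["finish_school", "unemployed", "retraining", "first_job"],
    ["first_job", "move_city", "marriage", "child_birth"],
]

model = Word2Vec(sentences=life_sequences, vector_size=32, window=3,
                 min_count=1, epochs=200, seed=42)

# Events that occur in similar life contexts end up close in the vector space.
print(model.wv.most_similar("first_job", topn=3))
```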
Guilbeault et al. (2024) investigate how online images, particularly those from search engines and social media, amplify gender bias more than textual content. Using large-scale computational analysis of over one million images from Google, Wikipedia, and IMDb, alongside billions of words from these platforms, they find that gender bias is significantly stronger in images than in text. Their experimental results show that participants exposed to image-based search results develop stronger explicit and implicit gender biases about occupations. The study highlights how the shift toward visual content can exaggerate societal stereotypes, reinforcing pre-existing inequalities.

From a network learning perspective, this study provides a unique opportunity to explore how social biases propagate across digital platforms using autoencoders and network-based table learning. Specifically, autoencoders—which learn compressed representations of high-dimensional data—could be leveraged to model latent gender biases in multimodal datasets (text, images, and user interactions). Moreover, a graph-based learning approach could be used to study how biases diffuse across networks of users, content creators, and algorithmic recommendations on platforms like Google and Wikipedia.

To extend this approach in social science, we could apply autoencoders to de-bias image search algorithms by training models to distinguish bias-related image features (e.g., occupational stereotypes in images). Network-based learning could further help trace the evolution of gender associations across search algorithms over time, identifying how feedback loops reinforce bias. A pilot study could collect Google Image search results for various professions (e.g., “scientist,” “engineer,” “nurse”) across different geographic regions and compare their gender representations to labor force demographics (e.g., U.S. Census, Eurostat). An autoencoder trained on these images could learn a low-dimensional bias representation, quantifying the extent of stereotypical portrayals. Finally, a network-based approach could model user engagement patterns, examining how exposure to biased content affects subsequent searches and content recommendations. This application could inform algorithmic fairness interventions, helping researchers and policymakers design bias-mitigating strategies for search engines, recommendation systems, and AI-generated media.
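A minimal sketch of the proposed autoencoder step, assuming profession search images have already been collected and resized (the network shape and the random placeholder batch are illustrative only):

```python
# Minimal sketch: learn a low-dimensional representation of profession search
# images. Real input would be scraped Google Image results per profession.
import torch
import torch.nn as nn

class ImageAutoencoder(nn.Module):
    def __init__(self, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=2, padding=1),   # 64 -> 32
            nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2, padding=1),  # 32 -> 16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32 * 16 * 16),
            nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)            # low-dimensional "bias representation"
        return self.decoder(z), z

model = ImageAutoencoder()
images = torch.rand(8, 3, 64, 64)      # placeholder batch of 64x64 search images
recon, latent = model(images)
loss = nn.functional.mse_loss(recon, images)
loss.backward()
print(latent.shape)                     # torch.Size([8, 16])
```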
I thought "A Unified Model of Human Semantic Knowledge" was a very interesting article -- it was able to combine brain imaging and patient data to define a theory of word-object relations that balanced theories that emphasize the locality of word-object relations to specific category domains (animals are grouped, tools are grouped -- perhaps in a similar way to word embeddings in vector space?) and theories that emphasize a general domain or approach to word-object relations. With a theory called C3 (connectivity-constrained cognition), the authors contend that word-object relations are crystallized in the cortex through learning/experience, perceptual/linguistic/somatic framing of environment, and neural connectivity in the brain. With this framework, the authors can provide a normative framework that structures disorders and pathologies that affect the mind, and a general theory of brain processing that maps neatly onto neural network patterns. This gets at the problem of both the multimodality of brain processing (learning/perceiving/classifying through images, speech, somatic experience etc) and also explainability -- it provides a theory of the brain and the brain's processing of word-object relations that informs how neural networks take inputted information and produce an output. Whether this theory of brain has been retrofitted to address neural network explainability conundrums, emerges as a way to assert consciousness as a reflection of neural network processing, or is coincidentally analogous to NNs is up for debate. But it is an interesting theory nonetheless that allows for. a better understanding of how NNs are a reflection of brain processes -- or, even more interesting, vice versa. |
The article “A Transformer Approach to Detect Depression in Social Media” by Keshu Malviya, Dr. Bholanath Roy, and Dr. Saritha SK discusses using deep learning approaches to detect early symptoms of depression, especially in the post-COVID era, when more people are online and feeling more isolated. The social media data are collected from “non-depressed” and “depressed” subreddits on the Reddit platform through the Pushshift API. The two categories do not reflect clinical diagnoses but are rather tonal: the authors define their depressed posts as “depressed in nature,” which is a relatively vague and unclear metric. The study uses TF-IDF and Word2Vec models as baselines and applies transformer models for comparison. The transformer models are generally more effective, with classification accuracy scores around 0.96-0.98.
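A small sketch of the TF-IDF baseline side of that comparison, on invented placeholder posts (the real study uses Reddit posts from the Pushshift API, and the fine-tuned transformer models would replace the logistic regression here):

```python
# Sketch of the TF-IDF + linear-classifier baseline on toy placeholder posts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

posts = [
    "I can't get out of bed and nothing feels worth doing anymore",
    "Had a great hike this weekend, feeling refreshed",
    "Everything feels pointless and I keep isolating myself",
    "Excited to start my new job next week",
]
labels = [1, 0, 1, 0]   # 1 = "depressed in nature", 0 = non-depressed

baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())
baseline.fit(posts, labels)
print(baseline.predict(["lately I just feel empty all the time"]))
```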
"Towards artificial general intelligence via a multimodal foundation model" integrates image and text data to propose BriVL(Bridging-Vision-and-Language), a multimodal foundation model. It uses large-scale weakly correlated image-text data. It's goal is to model human cognitive tasks, including vision, language, and cross-modal understanding. The model was trained on 650 million image-text pairs collected in the web with self-supervised learning. The distinctive characteristic of this model is that it utilizes weak semantic correlations that learn more abstract, generalized understanding. It also uses separate image and text encoders to embed images and text in a joint embedding space. The model could generate contextual and abstract representations of text descriptions and performed well on reasoning, classifying, and image-to-text and text-to-image retrieval tasks. I think this would be an excellent model for studying prejudice and the chain of thoughts, i.e., mental representations that people have, because it learns abstract representations. Also, if trained on corpus-image association from different cultures, it would also be great for comparison of culture. I think an interesting idea would be to train it on various fiction corpus, perhaps with book covers and text of the book, from different cultures. Then, we could ask it to generate an image from the same abstract sentences, like "imagine utopia", "this is the best day in spring". It would have learned what the representative of those abstract sentences would refer in each culture, although I'm not sure about how I could generalize results from each sentence, I think it would be an interesting thing to try and off-the-shelf to compare mental associations from different cultures. |
The article discusses the challenge of societal biases in multimodal AI, particularly how certain AI-generated outputs can reinforce stereotypes. What are some concrete strategies that developers can implement to detect and mitigate these biases during both the data collection and model training phases?
Re: Guilbeault, D., Nadler, E. O., Chu, M., Sardo, D. R. L., Kar, A. A., & Desikan, B. S. (2020). Color associations in abstract semantic domains. Cognition, 201, 104306.

This article explored an interesting question in embodied cognition: does sensory data (e.g., color) contribute to the semantic structure of abstract concepts? It tested three domains of abstract concepts: disciplines, emotions, and music genres. The authors applied a multi-modal learning method to project words and their Google Images into digital representations and examined their correlation and clustering. The results show that color variability increases with concept abstractness, and the color distributions of the words show clustering. This supports the hypothesis that color plays a role in the semantics of abstract concepts. A question I'd like to ask is: can we extend this conclusion to broader abstract concepts? Emotions and disciplines involve different cognitive processes, and the sensory association might be conveyed via emotion to other abstract concepts. Also, social norms could influence how search engine results represent an abstract concept (e.g., happy ~ yellow smiling face).

Extending social science analysis and possible data:
The paper "Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders" presents a novel approach to audio synthesis by leveraging WaveNet-style autoencoders. Traditional audio synthesis methods often rely on predetermined algorithms or sample playback techniques, which can limit the expressiveness and realism of the generated sounds. This study addresses these limitations by introducing a neural network-based model that learns directly from raw audio data. A key innovation of this work is the development of a WaveNet-style autoencoder that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform. This architecture enables the model to capture intricate temporal structures in audio signals, facilitating the generation of high-quality and realistic sounds. The authors also introduce NSynth, a large-scale dataset comprising over 300,000 musical notes from more than 1,000 instruments, which serves as a robust foundation for training the model. Through extensive experiments, the WaveNet autoencoder demonstrated superior performance over traditional spectral autoencoder baselines, both qualitatively and quantitatively. Notably, the model learns a manifold of embeddings that allows for morphing between instruments, meaningfully interpolating in timbre to create new types of sounds that are both realistic and expressive. In summary, this paper showcases the potential of advanced neural network architectures in revolutionizing audio synthesis. The integration of WaveNet autoencoders with a comprehensive dataset like NSynth paves the way for more natural and expressive sound generation, offering exciting possibilities for musicians, audio engineers, and the broader field of machine learning. Link to the paper:https://arxiv.org/abs/1704.01279 |
The paper Color Associations in Abstract Semantic Domains tests the theory of embodied cognition by quantitatively computing the relationships between concepts and the distribution of colors in their visual representations. The idea is that if the relations between concepts and those between their image representations are found to be coherent, then the human perception of these concepts can be claimed to have some degree of embodiment. The study uses sample concepts from different domains with different levels of abstractness. The corresponding images are collected from the first 100 images returned by a Google search for every single word. The images are clustered in colorspace and their color distributions computed. The authors also tested the color relations between words with hierarchical relationships in the semantic space. The results show that in the abstract domains, although not as strongly as with concrete words, the concepts are clustered by color at high statistical significance. Also, semantic similarity between words can predict the color similarity of their corresponding images. These findings support the theory of embodied cognition and show a new way to quantitatively study the embodiment of semantic relations.
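A rough sketch of the color-distribution comparison (the study clusters pixels in colorspace over Google Image results; here a fixed RGB quantization over hypothetical image paths stands in):

```python
# Sketch: histogram an image's pixels over a fixed color grid, then compare
# color distributions across concepts. Image paths are hypothetical.
import numpy as np
from PIL import Image
from scipy.spatial.distance import jensenshannon

def color_histogram(path, bins=4):
    """Pixel distribution over a fixed grid of bins**3 RGB colors."""
    pixels = np.asarray(Image.open(path).convert("RGB").resize((64, 64)))
    quantized = (pixels.astype(int) // (256 // bins)).reshape(-1, 3)
    idx = quantized[:, 0] * bins * bins + quantized[:, 1] * bins + quantized[:, 2]
    hist = np.bincount(idx, minlength=bins ** 3).astype(float)
    return hist / hist.sum()

h_joy = color_histogram("images/joy_001.jpg")
h_anger = color_histogram("images/anger_001.jpg")
# Smaller divergence = more similar color profiles for the two concepts.
print(jensenshannon(h_joy, h_anger))
```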
The article Seeing is Understanding addresses a fundamental issue in Multimodal Large Language Models (MLLMs)—vision-language misalignment, a phenomenon where textual responses do not factually match the provided visual inputs. The authors propose AKI, an MLLM enhanced with modality-mutual attention (MMA), allowing image tokens to incorporate information from text tokens. Without additional parameters or extended training time, MMA significantly improves model performance across 12 multimodal benchmarks, reducing inaccuracies like object hallucinations. This approach could notably enrich social science analyses, particularly in examining public perceptions and attitudes through multimodal social media data. A suitable pilot could use data from Instagram or Twitter to analyze how image posts (photos or memes) combined with text captions influence public opinions or stereotypes on critical social issues, such as immigration or gender roles. Researchers could collect images with associated user comments and hashtags from public posts discussing political events or social movements. By applying AKI's modality-mutual attention, the analysis would reveal how textual framing shapes visual perception among users, helping social scientists understand and predict patterns in public opinion formation and attitude polarization on online platforms.
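A simplified reading of the modality-mutual attention idea, not the AKI implementation: image tokens query the caption tokens through standard cross-attention, so the visual features are conditioned on the text (dimensions below are arbitrary):

```python
# Simplified sketch of cross-attention from image tokens to text tokens.
import torch
import torch.nn as nn

d_model = 64
image_tokens = torch.randn(1, 49, d_model)   # e.g., 7x7 patch tokens
text_tokens = torch.randn(1, 12, d_model)    # caption tokens

cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)

# Image tokens query the text tokens; output image tokens now mix in caption info.
attended_image, attn_weights = cross_attn(
    query=image_tokens, key=text_tokens, value=text_tokens
)
print(attended_image.shape)   # torch.Size([1, 49, 64])
print(attn_weights.shape)     # torch.Size([1, 49, 12])
```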
This paper explores how Graph Neural Networks (GNNs) can enable multi-modal information fusion for explainable AI, particularly in complex domains like medicine. The authors introduce the concept of "causability" (distinct from causality) as the measurable extent to which an explanation achieves causal understanding for human experts. They propose a framework that uses GNNs to integrate diverse data types—images, text, genomics—into a unified representation space where causal links between features are directly encoded in graph structures. This approach allows for interactive "what-if" questions (counterfactuals) that help experts gain deeper insights into AI decision processes. The authors outline three core challenges: (1) constructing a multi-modal embedding space that bridges semantic gaps between different data types, (2) developing distributed graph representation learning techniques for decentralized data, and (3) creating explainable interfaces that enable meaningful human-AI interaction through counterfactual exploration.

This GNN-based multi-modal fusion approach could revolutionize social media analysis by integrating multiple data streams that reflect complex social phenomena. The counterfactual exploration capability would be particularly valuable for policy research, allowing analysts to pose "what-if" questions about intervention outcomes. For example, researchers could explore how changes in social network structure might affect the spread of misinformation when combined with specific content features.

I would implement this approach using a dataset from Twitter that captures multiple modalities related to political discourse during election periods. The dataset would include: (1) tweet text, (2) shared images, (3) user network connections, (4) engagement metrics, and (5) temporal patterns across different geographic regions. Implementation would begin by constructing modality-specific representations: text would be processed with language models, images with vision models, and network structures with traditional graph embeddings. These representations would then be connected through a knowledge graph serving as an "interaction & correspondence graph" as described in the paper.
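A sketch of how that fusion step might look under this reading (it assumes PyTorch Geometric is available; the tensors are random placeholders for real text, image, and network features, and the edge list is an invented stand-in for the interaction & correspondence graph):

```python
# Sketch: project each modality into a shared space, treat tweets/images/users
# as nodes in one graph, and learn joint representations with a GCN.
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv

text_feats = torch.randn(100, 768)    # e.g., language-model embeddings of tweets
image_feats = torch.randn(40, 512)    # e.g., vision-model embeddings of shared images
user_feats = torch.randn(60, 32)      # e.g., network/engagement features of users

# Modality-specific projections into a shared 128-d space.
proj_text, proj_image, proj_user = nn.Linear(768, 128), nn.Linear(512, 128), nn.Linear(32, 128)
x = torch.cat([proj_text(text_feats), proj_image(image_feats), proj_user(user_feats)])

# Placeholder edges (tweet-image, tweet-user, user-user, ...).
edge_index = torch.randint(0, x.size(0), (2, 500))

conv1, conv2 = GCNConv(128, 64), GCNConv(64, 64)
h = conv1(x, edge_index).relu()
h = conv2(h, edge_index)
print(h.shape)   # torch.Size([200, 64]) — one fused embedding per node
```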
Abhishek Mandal, Susan Leavy, and Suzanne Little. 2023. Measuring bias in multimodal models: Multimodal composite association score. In International Workshop on Algorithmic Bias in Search and Recommendation, pages 17–30. Springer.

Summary of the Article:

Extending to Social Science Analysis:

Implementation Idea: To analyze historical ableist bias trends, I propose comparing text-based bias evolution with multimodal bias persistence over time. Using historical newspapers (COHA, Chronicling America) and social media corpora, I will track disability-related language shifts with WEAT. For multimodal bias, I will analyze AI-generated images (DALL-E 2, Stable Diffusion) alongside historical visual media with text (ads, newspapers, magazines) using MCAS. This will reveal whether text bias declines over time while visual stereotypes persist, showing how ableism evolves across modalities.
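A small sketch of the WEAT effect size used for the text side of this plan (word vectors here are random placeholders; the real analysis would load embeddings trained on each historical corpus slice):

```python
# Sketch of the WEAT effect size for differential association of two target
# word sets (X, Y) with two attribute word sets (A, B).
import numpy as np

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def assoc(w, A, B):
    """s(w, A, B): mean cosine similarity to attribute set A minus to set B."""
    return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """Effect size d of the differential association of targets X vs. Y."""
    sx = [assoc(x, A, B) for x in X]
    sy = [assoc(y, A, B) for y in Y]
    return (np.mean(sx) - np.mean(sy)) / np.std(sx + sy, ddof=1)

rng = np.random.default_rng(0)
dim = 50
X = rng.normal(size=(5, dim))   # e.g., vectors for disability-related terms
Y = rng.normal(size=(5, dim))   # e.g., vectors for non-disability terms
A = rng.normal(size=(5, dim))   # e.g., vectors for one attribute set
B = rng.normal(size=(5, dim))   # e.g., vectors for the contrasting attribute set
print(weat_effect_size(X, Y, A, B))
```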
Post a link for a "possibility" reading of your own on the topic of Auto-encoders, Network & Table Learning [Week 9], accompanied by a 300-400 word reflection that: