
Week 9. Mar.7: Multi-Modal Learning and Explainability - Orienting #20

avioberoi opened this issue Mar 6, 2025 · 16 comments

@avioberoi
Collaborator

Post your question here about the orienting readings:

“Multi-modal Transformers”, Deep Learning: Foundations and Concepts, chapter section 12.4.
“Introduction to Multimodal Deep Learning” Encord Blog (2023).
“Frontiers of multimodal learning: A responsible AI approach.” Microsoft Research Blog (2023).
“Explainable Deep Learning: Concepts, Methods, and New Developments” by Wojciech Samek in Explainable Deep Learning (2023).

@Sam-SangJoonPark

As I studied multimodal transformers and the interpretability of AI, I was fascinated by the idea that every stimulus in the world can be treated as information. As humans, we receive different types of stimuli at the same time and deal with them within a single architecture, and multimodal models now do something similar. This realization was both exciting and intriguing. I learned something that enlarged my imagination about what is possible with data.

Looking back, one of the key takeaways from the Week 1 readings that I recall now is that we should not depict AI as a black box or conclude that it is beyond our understanding. Instead, we must strive to interpret and explain its processes. Additionally, AI should not merely aim to mimic humans but rather enhance human intelligence to generate meaningful impact.

As I conclude this week, my Week 9 question is, again: what are the real problems we need to solve, and what is important for us now? If AI is to complement humans rather than just replicate them, what challenges must we address, and what values should we prioritize?

@CongZhengZheng

Multimodal models rely on vast datasets, often scraped from the internet. What are the ethical and legal implications of using data without explicit consent from content creators? How can the AI community address concerns around data ownership, especially when datasets include copyrighted or sensitive material? Next quarter there will be a course focusing on the intersection of data science and law, covering the GDPR, Europe's Digital Services Act, the CCPA, and so on. I am very much looking forward to it.

@lucydasilva

"Explainable Deep Learning" was beyond weird, in my opinion. I understand the hesitation and deliberation underlying approaches to XAI, as it's a fallacy to reduce explanation of NNs to a functionalist account (how it works as a whole and how its parts work), to a strictly mathematical approach (that's not going to satisfy any company stakeholders!), nor to a sensationalized and inaccurate conceptualized approach. But I still am entirely unclear as to what "explanation" these people are dealing with (how a model produces a prediction? why a model produces a prediction? does it produce a prediction?) -- and I can't tell if this is some rhetorical ploy that deals with the elusiveness of explanation at a meta-level or if I am just not reading the text correctly. Regardless, why aren't they hiring logicians whose entire training is in dealing with what an explanation is and how to explain something? I understand that this is an article written to explain (or intentionally not explain) the (im)possibility of NN explanation to industry specialists, but there are three really basic issues: 1. WHO is explaining? Is it the NN explaining itself to a human user? 2. WHAT is being explained? Is it a prediction, the prediction as a product, or the process that produces the prediction? 3. TO WHOM is this explaining being done? (This is the most clear, I assume it's people with rigorous training). All of these have huge implications but it was really unclear to me even in my third pass at the article.

@psymichaelzhu

The Encord Blog (2023) mentioned that one advantage of multimodal models is that they can annotate across modalities, such as generating semantic descriptions for images. Some may consider using such models to extract "semantic" information from images. The problem, however, is that the representations of these multimodal models may encode additional visual information unrelated to semantics, leading to potential confusion.

One potential solution is to compare multimodal models (such as CLIP) with their corresponding single-modal modules (such as ResNet-50) to strengthen the argument. However, I wonder if there are more quantitative methods to analyze this issue, such as comparing the output features of the hidden layers of the two models. Is such an approach feasible? What potential problems may exist?
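Comparing hidden-layer features quantitatively is feasible; one common tool is centered kernel alignment (CKA). Below is a minimal sketch (my own illustration, not from the readings), assuming the Hugging Face transformers and torchvision packages and a user-supplied list of PIL images; it compares CLIP's image tower with an ImageNet-trained ResNet-50 via linear CKA.

```python
# Sketch: quantitatively compare the image features of a multimodal model (CLIP)
# with a unimodal baseline (ResNet-50) on the same images, using linear CKA.
# Assumes `images` is a list of PIL images supplied by the caller.
import torch
from torchvision import models
from transformers import CLIPModel, CLIPProcessor

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two (n x d) feature matrices."""
    X = X - X.mean(0, keepdim=True)
    Y = Y - Y.mean(0, keepdim=True)
    hsic = (X.T @ Y).norm() ** 2
    return (hsic / ((X.T @ X).norm() * (Y.T @ Y).norm())).item()

@torch.no_grad()
def compare_encoders(images):
    # Multimodal image tower, trained with natural-language supervision
    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    clip_feats = clip.get_image_features(**proc(images=images, return_tensors="pt"))

    # Unimodal image tower, trained on ImageNet labels only
    weights = models.ResNet50_Weights.DEFAULT
    resnet = models.resnet50(weights=weights).eval()
    resnet.fc = torch.nn.Identity()                 # expose penultimate features
    resnet_feats = resnet(torch.stack([weights.transforms()(im) for im in images]))

    # Values near 1 mean the two encoders organize the images very similarly
    return linear_cka(clip_feats, resnet_feats)
```

One caveat: CKA measures representational similarity, not which features are "semantic," so it speaks to the comparison but not to the confound question on its own.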

@yangyuwang

“Explainable Deep Learning: Concepts, Methods, and New Developments” offers a new insight for me: neural networks can be used not only for prediction but also for finding patterns. Previously, our aim was mainly to increase the performance of models, to make them predict the "true" future. However, it would be more exciting to see the process of prediction become explainable.

That is, if we had methods to detect which words, image pixels, or music notes are most important for the prediction of certain outcomes, scholars could give more meaningful explanations of the final results. For example, if we build a vector space of images with GAN/diffusion models, we can interpret each dimension of the space by the features most important to it (similar to loadings in PCA). It would be even more interesting if we could tell the meaning behind individual neurons. As with the Golden Gate Bridge example in class, if we build a deep learning model for images and know what each neuron represents, we can try turning neurons on or off and get funny results. Or, in a more serious setting, scholars could take turning neurons on or off as a treatment and test how it influences the outputs.
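That kind of neuron-level intervention is easy to prototype with forward hooks in PyTorch. The sketch below is illustrative only (a toy model and an arbitrary unit index): it clamps one hidden unit to a fixed value and compares outputs with and without the intervention.

```python
# Sketch: treat "turning a neuron on or off" as a treatment by clamping one
# hidden unit with a forward hook, then comparing outputs to a baseline.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
UNIT, VALUE = 7, 0.0          # hypothetical unit; 0.0 = "off", a large value = "on"

def clamp_unit(module, inputs, output):
    output = output.clone()
    output[:, UNIT] = VALUE   # overwrite one neuron's activation
    return output             # the returned tensor replaces the layer's output

x = torch.randn(16, 32)
baseline = model(x)

handle = model[1].register_forward_hook(clamp_unit)   # hook the hidden layer
treated = model(x)
handle.remove()

print("mean |change in logits| under the intervention:",
      (treated - baseline).abs().mean().item())
```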

@zhian21

zhian21 commented Mar 7, 2025

Samek's chapter on Explainable Deep Learning explores the need for interpretability in deep learning models, emphasizing trust, verification, and compliance with ethical and legal standards. It defines key properties of meaningful explanations, such as faithfulness and understandability, and categorizes XAI methods into four main types: perturbation-based, gradient-based, surrogate-based, and propagation-based, each with distinct trade-offs.

The chapter also highlights recent advances, including using XAI for model debugging, pruning, and regularization, as well as the "neuralization trick", which adapts deep learning XAI techniques to other machine learning models. Finally, it discusses persistent challenges, such as the interpretability gap in complex data domains and the difficulty of auditing AI for reliability and fairness. So, my question would be: how can XAI methods be systematically evaluated for reliability across diverse domains, ensuring they reflect true model reasoning rather than introducing artifacts or misleading interpretations?
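To make the gradient-based family concrete, here is a minimal input-times-gradient attribution sketch (a placeholder classifier and random input, not code from the chapter): the gradient of the predicted class score with respect to each input feature, scaled by the feature value, serves as a relevance score.

```python
# Sketch: gradient-based attribution (input x gradient) for a single prediction.
# The model and input here are placeholders; in practice you would load a
# trained network and a real example.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(100, 50), nn.Tanh(), nn.Linear(50, 3)).eval()

x = torch.randn(1, 100, requires_grad=True)      # one input vector
logits = model(x)
target = logits.argmax(dim=1).item()             # explain the predicted class

logits[0, target].backward()                     # d(score)/d(input)
relevance = (x * x.grad).detach().squeeze()      # input x gradient scores

print("most relevant input features:", relevance.abs().topk(5).indices.tolist())
```

Perturbation-based methods ask the same question by occluding features and re-running the model, which avoids gradient artifacts at the cost of many forward passes; comparing the two families on the same examples is one rough way to probe the reliability question above.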

@haewonh99

Explainable Deep Learning Ch. 2 was a very interesting read. I kept wondering, however, about the 'explaining the signal vs. explaining the noise' part. It mentions signal processing methods that discriminate signal from noise, but how can we really know what is signal and what is noise? It makes sense in the image example that the authors give, because non-distinctive features could plausibly be noise, but what about other datasets where it is less intuitive for people to 'interpret' signal and noise? How can we be certain that we are restricting noise, and not important features that may not make sense to us intuitively but are actually effective?

@chychoy

chychoy commented Mar 7, 2025

Microsoft's article Frontiers of Multi-modal Learning again raises the problem of "hidden societal biases across modalities." As this problem returns to our attention again and again, I wonder how we could adjust our models at the architectural level to reward diversity. For example, what would bias-aware loss functions or regularization terms look like? How do we measure the trade-off between model performance and bias reduction (if that even is a trade-off)? Furthermore, on a more abstract level, I wonder how increased use of these models would alter people's expectations and standards for professions and people, beyond the vague claim of "reinforcing stereotypes."
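One concrete reading of "bias-aware loss functions" (my own illustration, not something from the article) is to add a regularizer that penalizes gaps in the model's average prediction across groups; the weight on that term makes the performance-versus-bias trade-off explicit and tunable. A hypothetical sketch:

```python
# Sketch: task loss plus a demographic-parity-style penalty that discourages a
# gap in the mean predicted probability between two (hypothetical) groups.
import torch
import torch.nn.functional as F

def bias_aware_loss(logits, labels, group, lam=1.0):
    """logits: (N, C); labels: (N,); group: (N,) binary group indicator."""
    task_loss = F.cross_entropy(logits, labels)

    probs = logits.softmax(dim=1)[:, 1]          # P(positive class)
    gap = probs[group == 0].mean() - probs[group == 1].mean()

    # lam controls how much accuracy we are willing to trade for a smaller gap
    return task_loss + lam * gap.abs()
```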

@ulisolovieva

What is the best approach for training a multi-modal model? Is it better to fine-tune one model end to end, letting it take care of the fusion, or to start with three unimodal models and train the fusion separately? How do we know whether multimodal models pick up on spurious correlations that help with the prediction? And how do we know whether a model is using all three modalities to make predictions or relies on one over the others?
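For the last question, one common diagnostic is a modality-ablation test: zero out (or shuffle) one modality's features at evaluation time and see how much accuracy drops. A rough sketch, assuming a hypothetical fusion model that takes text, image, and audio feature tensors:

```python
# Sketch: modality-ablation test for a (hypothetical) three-modality model.
# A tiny drop when a modality is zeroed out suggests the model barely uses it,
# or is leaning on the other modalities / spurious shortcuts instead.
import torch

@torch.no_grad()
def modality_reliance(model, text, image, audio, labels):
    def acc(t, i, a):
        return (model(t, i, a).argmax(dim=1) == labels).float().mean().item()

    full = acc(text, image, audio)
    return {
        "full accuracy":      full,
        "drop without text":  full - acc(torch.zeros_like(text), image, audio),
        "drop without image": full - acc(text, torch.zeros_like(image), audio),
        "drop without audio": full - acc(text, image, torch.zeros_like(audio)),
    }
```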

@youjiazhou

I am a little confused about where the bias comes from. We seem to assume that the bias in generating images from text is inherent in the textual symbols. Is it because certain texts always appear in the same context as other texts (such as the relationship between occupation and gender), or is there a bias in the images themselves?

@baihuiw

baihuiw commented Mar 8, 2025

What are the key challenges in aligning different modalities (e.g., text, images, audio) to ensure consistent and meaningful learning, and how can researchers address these challenges?
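One widely used answer to the alignment part of this question is contrastive training in a shared embedding space, CLIP-style: matched pairs from two modalities are pulled together and mismatched pairs pushed apart. A minimal sketch with placeholder encoders and feature sizes:

```python
# Sketch: CLIP-style contrastive alignment of two modalities in a shared space.
# The linear "encoders" and dimensions are placeholders for real towers.
import torch
import torch.nn as nn
import torch.nn.functional as F

text_enc  = nn.Linear(300, 128)    # placeholder text encoder
image_enc = nn.Linear(2048, 128)   # placeholder image encoder

def alignment_loss(text_feats, image_feats, temperature=0.07):
    t = F.normalize(text_enc(text_feats), dim=1)
    v = F.normalize(image_enc(image_feats), dim=1)
    logits = t @ v.T / temperature            # pairwise similarities
    targets = torch.arange(len(t))            # matched pairs sit on the diagonal
    # symmetric cross-entropy: align text->image and image->text
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = alignment_loss(torch.randn(8, 300), torch.randn(8, 2048))
```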

@Daniela-miaut

I learned from chapter 12 of Deep Learning: Foundations and Concepts that, technically, multi-modal learning emphasizes the processing of input data. I am wondering, if we were to collect our own multi-modal input data, what criteria we could follow to evaluate its quality, so that the data would be appropriate and feasible for multi-modal deep learning.

@xpan4869

xpan4869 commented Mar 9, 2025

“Introduction to Multimodal Deep Learning” Encord Blog (2023).

The article mentions that humans perceive the world using five senses while AI systems use various data modalities. Given that multimodal learning aims to replicate human-like perception, how might incorporating additional sensory inputs beyond the traditional modalities (text, image, audio, video) - such as tactile feedback or environmental data - transform the capabilities of AI systems?

@shiyunc

shiyunc commented Mar 9, 2025

In the “Introduction to Multimodal Deep Learning”, the authors introduce early, intermediate, and late fusion. Late fusion processes each modality through its own model independently and returns individual outputs; it is less computationally expensive but cannot effectively capture the relationships between the various modalities. I wonder what kinds of social science questions would go better with early fusion, and what kinds with late fusion?
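A minimal sketch of the two ends of that spectrum, with placeholder feature sizes: early fusion concatenates the modalities' features before a joint network (so cross-modal interactions can be learned), while late fusion trains a head per modality and only merges their predictions.

```python
# Sketch: early vs. late fusion for two modalities (placeholder dimensions).
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate modality features first, then learn a joint representation."""
    def __init__(self, d_text=300, d_image=512, n_classes=2):
        super().__init__()
        self.joint = nn.Sequential(
            nn.Linear(d_text + d_image, 128), nn.ReLU(), nn.Linear(128, n_classes))

    def forward(self, text, image):
        return self.joint(torch.cat([text, image], dim=1))

class LateFusion(nn.Module):
    """Separate heads per modality; only their predictions are averaged."""
    def __init__(self, d_text=300, d_image=512, n_classes=2):
        super().__init__()
        self.text_head = nn.Linear(d_text, n_classes)
        self.image_head = nn.Linear(d_image, n_classes)

    def forward(self, text, image):
        return (self.text_head(text) + self.image_head(image)) / 2
```

Very roughly, questions where the modalities carry meaning jointly (say, an image and a caption that reframes it) seem to call for early or intermediate fusion, while questions where each modality is an independent noisy measure of the same construct may get away with late fusion.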

@siyangwu1

How can we effectively address the challenges of aligning and integrating diverse data modalities in multimodal transformers to ensure coherent and accurate representations, while also mitigating potential biases that may arise from individual modalities?

@CallinDai

In multimodal deep learning, we learned that fusion mechanisms integrate information from different sensory modalities to improve performance. Since many non-multimodal models have been compared with human thinking heuristics and biases, this leads me to wonder: do these models exhibit biases or performance differences that align with human cognitive differences in sensory processing, such as those observed between sighted individuals and individuals with visual impairments? If so, how do these biases emerge in fusion mechanisms, and can they be mitigated to improve AI accessibility?
