Week 7. Feb. 21: Sound & Image Learning - Possibilities #17
https://journals.sagepub.com/doi/full/10.1177/0081175019860244 The research leverages convolutional neural networks for image analysis and recurrent neural networks with long short-term memory for text analysis to identify collective action events from social media data. Traditional methods for studying protests and collective action rely heavily on news media reports, which are often biased or censored, particularly in authoritarian regimes like China. CASM aims to bypass media bias by detecting offline collective action events directly from social media posts. The study finds that government censorship does not significantly hinder event detection, but certain regions (e.g., Tibet, Xinjiang) are underrepresented due to internet blackouts and stricter controls. This method could be adapted to compare social media data with traditional media reports to quantify biases in news coverage of protests. I am also interested in how censorship and users' presentation of the same event might differ across social media platforms, and this method could be used to examine that as well.
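As a rough illustration of the two-branch design CASM describes (a CNN over post images plus an LSTM over post text), here is a minimal PyTorch sketch; the layer sizes, vocabulary size, and fusion-by-concatenation are my own assumptions, not the authors' exact specification.

```python
import torch
import torch.nn as nn
from torchvision import models

class ProtestPostClassifier(nn.Module):
    """Toy two-branch model: CNN image features + LSTM text features -> event / non-event."""
    def __init__(self, vocab_size=20000, embed_dim=128, hidden_dim=128):
        super().__init__()
        # Image branch: pretrained ResNet-18 with the classification head removed
        cnn = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])   # outputs (B, 512, 1, 1)
        # Text branch: embedding + LSTM over token ids
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Fusion + binary classifier
        self.head = nn.Linear(512 + hidden_dim, 2)

    def forward(self, images, token_ids):
        img_feat = self.cnn(images).flatten(1)                 # (B, 512)
        _, (h, _) = self.lstm(self.embed(token_ids))           # h: (1, B, hidden_dim)
        fused = torch.cat([img_feat, h[-1]], dim=1)
        return self.head(fused)

# Dummy forward pass on random data
model = ProtestPostClassifier()
logits = model(torch.randn(4, 3, 224, 224), torch.randint(1, 20000, (4, 50)))
print(logits.shape)  # torch.Size([4, 2])
```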
(Alan et al., 2021) Considering that video materials mainly come from YouTube, a global platform, is there a possibility of a "media-induced" cultural convergence effect? In other words, in the use of social media, do people intentionally or unintentionally imitate the emotional expression patterns of mainstream culture to adapt to a wider audience preference? If this "globalization of emotions" effect exists, how should future research distinguish between native cultural differences and the consistency of emotions under media influence?
Consider this paper: https://arxiv.org/pdf/2207.09983. Yang et al. (2023) develop a non-autoregressive token decoder based on discrete diffusion models that can generate high-quality audio more efficiently than previous autoregressive approaches. The system uses a text encoder to extract features from descriptions, a Vector Quantized Variational Autoencoder (VQ-VAE) to compress audio into discrete tokens, the Diffsound decoder to generate tokens from text features, and a vocoder to produce the final waveform. Results show Diffsound achieves better audio quality and 5x faster generation compared to autoregressive baselines. This model can be important in linguistic and social perception studies. For example, social scientists studying emotional tone in speech or sound perception across cultures could use this model to generate and analyze controlled yet realistic soundscapes based on textual descriptions. To pilot this study, I can imagine using radio broadcasts and transcripts from political debates to explore how sound perception influences political attitudes. The dataset would include transcripts from presidential debates, news reports, and political speeches, aligned with acoustic features (e.g., tone, pitch, intensity).
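A minimal sketch of the feature-extraction step for that pilot, assuming the librosa package and a local WAV clip of a debate segment; the specific features (YIN pitch track, RMS energy as an intensity proxy) are illustrative choices, not part of the Diffsound paper.

```python
import librosa
import numpy as np

y, sr = librosa.load("debate_clip.wav", sr=16000)   # hypothetical local file

# Fundamental frequency (pitch) track via the YIN estimator
f0 = librosa.yin(y, fmin=65, fmax=300, sr=sr)

# Intensity proxy: root-mean-square energy per frame
rms = librosa.feature.rms(y=y)[0]

features = {
    "pitch_mean_hz": float(np.nanmean(f0)),
    "pitch_var": float(np.nanvar(f0)),
    "intensity_mean": float(rms.mean()),
    "intensity_var": float(rms.var()),
}
print(features)  # would later be aligned with the corresponding transcript segment
```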
Individualized models of social judgments and context-dependent representations: The authors note that predicting idiosyncrasies remains an open question, as individual differences may stem from personality differences, past experiences, and cultural background. It would be fruitful to integrate this method with other explicit and implicit measures to examine the boundary conditions (where social desirability might become an issue) and build more predictive models of mental representations beyond individual perceiver characteristics.
Ludwig and Mullainathan (2024) introduce a machine learning-driven framework for systematic hypothesis generation in social science, emphasizing the ability of algorithms to detect patterns that may escape human intuition. Using pretrial judicial decisions as a case study, they demonstrate that a deep learning model trained solely on defendant mug shots can predict a significant portion of judge detention decisions, even after controlling for demographic and psychological factors. The study reveals that facial features such as “well-groomed” and “heavy-faced” influence detention outcomes, suggesting that judicial biases extend beyond conventional predictors like race or gender. This machine learning-based hypothesis generation method has broad applicability in social science, particularly in uncovering implicit biases in hiring practices, law enforcement, or social mobility. By analyzing high-dimensional data sources such as resumes, surveillance footage, or social media interactions, researchers could systematically identify latent patterns that shape decision-making processes in ways that traditional hypothesis-driven research might overlook. To pilot such an approach, a study could leverage hiring data from online recruitment platforms, incorporating anonymized resumes, interview transcripts, and video recordings of job candidates. A machine learning model could be trained to predict hiring outcomes based on these inputs, isolating key linguistic, visual, and structural patterns that influence employer decisions. The results could then be interpreted using human validation, ensuring that machine-discovered features align with meaningful social constructs rather than arbitrary correlations. Such an application could provide actionable insights into hiring biases, potentially informing policy interventions to enhance fairness and equity in recruitment. This approach aligns with the authors' broader argument that hypothesis generation is an empirical process that can be formalized through machine learning, offering a structured alternative to the intuition-driven methods that traditionally dominate social science research.
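A minimal, text-only sketch of that pilot: predict (hypothetical) hiring outcomes from anonymized resume text, then surface the most predictive features as candidate hypotheses for human validation. The file name, column names, and the TF-IDF + logistic regression pipeline are my assumptions, far simpler than the deep models in the paper.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("resumes.csv")  # hypothetical: columns "resume_text", "hired" (0/1)
X_train, X_test, y_train, y_test = train_test_split(
    df["resume_text"], df["hired"], test_size=0.2, random_state=0)

vec = TfidfVectorizer(min_df=5, ngram_range=(1, 2), stop_words="english")
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(X_train), y_train)
print("held-out accuracy:", clf.score(vec.transform(X_test), y_test))

# Candidate "hypotheses": the features the model leans on most, to be vetted by humans
terms = vec.get_feature_names_out()
weights = clf.coef_[0]
top = sorted(zip(weights, terms), key=lambda t: abs(t[0]), reverse=True)[:20]
for w, term in top:
    print(f"{term:30s} {w:+.3f}")
```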
Janus-Pro is the latest advanced multimodal AI model designed for both image understanding and generation, leveraging a SigLIP vision encoder and a DeepSeek LLM text encoder backbone. It improves upon its predecessor, Janus, by optimizing its training strategy, expanding its dataset, and scaling up its model size. A key feature of Janus-Pro is its decoupled visual encoding, which separates image comprehension from image generation, addressing previous limitations where a single encoding process led to suboptimal performance in both tasks. The model achieves state-of-the-art results on multiple benchmarks, outperforming previous multimodal models in both instruction-following for image generation (specifically GenEval) and complex visual reasoning. These advancements make Janus-Pro a powerful tool for applications requiring both deep visual understanding and high-quality image synthesis. The capabilities of Janus-Pro can significantly extend social science research, particularly in areas involving historical analysis, political communication, and media studies. One potential application is in studying visual propaganda and political messaging. By leveraging Janus-Pro’s ability to analyze images and text together, researchers could examine how political campaigns construct narratives through visual cues, such as symbolism in campaign posters or the framing of political figures in media. Furthermore, Janus-Pro’s text-to-image generation capabilities could be used to reconstruct historical events from textual descriptions, allowing researchers to explore how different sources describe the same event visually. This could help identify biases in historical reporting and track shifts in media representation over time. To pilot such a study, I propose using Janus-Pro to analyze historical newspaper archives, political advertisements, and social media campaign visuals. The first step would involve curating a dataset of historical images, newspaper clippings, and political speeches from sources such as the Library of Congress, major news archives, and government records. Next, Janus-Pro’s image recognition abilities could be used to extract meaningful features from political imagery, identifying common symbols, emotional expressions, and ideological composition, and potentially flagging bias when combined with caption cues. A comparative analysis could then be conducted to examine discrepancies between actual historical or media imagery and AI-generated counterparts, revealing potential media biases and shifts in political discourse. This approach would provide new insights into the power of information manipulation in propaganda, campaign strategies, and historical visual representation, ultimately enriching our understanding of political communication through both text and imagery.
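A sketch of the first analysis step, using an off-the-shelf captioning model (BLIP via the Hugging Face pipeline) as a stand-in for Janus-Pro's image-understanding branch; the poster file names are hypothetical.

```python
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

posters = ["poster_1936.jpg", "poster_1968.jpg", "campaign_2020.jpg"]  # hypothetical files
for path in posters:
    caption = captioner(path)[0]["generated_text"]
    print(path, "->", caption)
    # The captions (symbols, figures, settings) can then be hand-coded or compared
    # against the archival captions to flag divergences worth closer qualitative review.
```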
“Machine Learning for Hypothesis Generation” was an absolutely incredible article on the capacity of machine learning to generate speculative theses based on highly complex correlations that are all but foreclosed to human consciousness due to their complexity. What I am really interested in is less the juridical application, and definitely more how hypotheses are distinct from predictions: both make claims on an unknown future based on all but transcendental correlations! I’m having trouble distinguishing the two terms — but, if hypotheses are different from predictions in the sense that they can be entirely speculative (based on value judgements whose significance is not restricted to truth/falsity) rather than predictive (restricted to true/false value judgements made retroactively after the prediction has been made), then that is an immense leap that drastically increases how ML can be generative in its own right. In any case, the article demonstrated how “morphing,” or establishing correlations outside the set of those made by human intelligence, can improve human decision making. At its core, this kind of ML disaggregates and recombines data to remove calcified preconceptions (which limit human consciousness!) of what an image is in order to establish new lines of correlation and covariance between images and the decisions made about them. The key evidence is that what determines who is/isn’t incarcerated is not so much the traditional factors of race/gender, but radically different qualities. In other words, these models are able to judge representation not by categories, but by sheer appearance. How interesting!
The paper "Measuring Style Similarity in Diffusion Models" by Somepalli et al. (2024) proposes a Contrastive Style Descriptor (CSD) method to better extract and compare artistic styles. Unlike traditional similarity search methods, which focus on semantic content, this approach prioritizes style-based attributes such as color, texture, and brushwork. Using CSD, the paper measures the degree to which AI-generated images mimic real artists' styles, revealing cases where AI models retain or remove specific artistic influences. The method itself could be used to more social science research regarding of examining the similarity between different images, such as comparing artworks or social media posts. A pilot study could be using the wikiart dataset (which will be in my coding presentation) to measure the style similarity between artworks. For example, we could make a GAN, diffusion model, or contrastive learning to predict artworks (on the attribute of artists and artwork topics), and use this similarity metric to test it with the real artworks. If it is similar, then it shows art is socially determined. |
“Machine learning approaches to facial and text analysis: Discovering CEO oral communication styles.” 2019. P. Choudhury, D. Wang, N. Carlson, T. Khanna. The article explores how machine learning techniques can analyze CEOs' oral communication styles by combining facial expression analysis with text analysis. Using CNNs for facial analysis and topic modeling with sentiment analysis for text, the study identifies five distinct communication styles among CEOs, such as Excitable and Stern. A key finding reveals that CEOs exhibiting a "Dramatic" style tend to be less inclined toward mergers and acquisitions. This highlights how machine learning methods can uncover significant relationships between communication patterns and strategic decision making. The methodology presented in this article could extend social science research by analyzing the communication styles of politicians and their impact on public perception. By employing CNN models for detecting facial emotions and LDA for analyzing speech content, researchers could investigate how emotional expression and linguistic complexity affect public trust and voting behavior. Such an approach could offer empirical support for existing theories in political psychology and communication studies, enriching our understanding of how verbal and non-verbal cues influence social outcomes. To pilot this approach, data could be collected from televised political debates, focusing on high-resolution video recordings segmented by individual speaking turns and transcribed speeches. The video data would be analyzed using the Microsoft Azure Face API to score emotions like anger, happiness, and surprise, while the text data would undergo topic modeling with LDA and sentiment analysis using the NRC lexicon. Public approval ratings recorded before and after debates would serve as social outcome data. The implementation process would involve synchronizing video frames with corresponding transcriptions, training CNN and LDA models to identify emotion patterns and key discussion topics, and finally correlating these findings with shifts in public approval. This pilot study could reveal how subtle differences in communication styles influence political engagement, demonstrating the broader applicability of machine learning techniques in social science research.
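A sketch of the text side of that pilot: LDA topics over debate speaking turns plus a simple per-turn sentiment score (VADER here as a stand-in for the NRC lexicon). The toy transcripts and the scikit-learn/NLTK choices are my assumptions.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from nltk.sentiment import SentimentIntensityAnalyzer
import nltk

transcripts = ["...speaking turn one...", "...speaking turn two...", "...speaking turn three..."]

# Topic modeling over the document-term matrix of speaking turns
vec = CountVectorizer(stop_words="english", min_df=1)
dtm = vec.fit_transform(transcripts)
lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(dtm)
terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    print(f"topic {k}:", [terms[i] for i in topic.argsort()[-8:]])

# Per-turn sentiment scores to correlate with approval shifts
nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()
print([sia.polarity_scores(t)["compound"] for t in transcripts])
```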
In the article "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks (2020)," the authors Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros discuss how image-to-image translation tasks are often hindered by the lack of having efficient and effective image pairs for training. This is why the authors suggest an approach for these tasks by learning a mapping G: X-> Y such that the distribution of G(X) should be indistinguishable from Y through the use of adversarial loss. Additionally, since the model operates in unsupervised settings (non-paired image-to-image translations) and have no constraints to mapping, the authors couple this original G(X) mapping with an additional inverse mapping F: Y-> X to introduce a cycle consistency loss to enforce F(G(X) ~X and vice versa. The authors demonstrate a few examples such as mapping a photograph to Monet/Van Gogh/Cezanne/Ukiyo-e style paintings, which seems to be interesting (but also reminiscent of more rules based image processing such as edge detection and stylization, so on the short run I do not anticipate models like this to be used extensively). The authors observe that the model performs more successfully on tasks such as color and texture changes, but geometric changes are often more difficult. Furthermore, while there seems to be areas where this type of model performs well, but paired data for image-to-image translation tasks still outperforms by a significant margin. With this in mind, a social science application would be to support image restoration tasks and which could shed new light on how to understand and represent historical events. For example, having clearer imaging on old primary documents could greatly support archivists and append to automated archival research. Other more abstract applications could also be to analyze cross-cultural artistic styles, which could help art historians understand features of art in different cultures. Furthermore, this could also help us to further understand the more experimental and border cases of art (such as post-modernist and surrealist art). Some data that this method could be helpful is with a large enough corpus of painting data, and the question could be to generate artworks in the styles of different artists. Finally, here is also a TensorFlow notebook on its implementation. |
Xin (Shane) Wang, Shijie Lu, Xi Li, Mansur Khamitov, Neil Bendle, Audio Mining: The Role of Vocal Tone in Persuasion, Journal of Consumer Research, Volume 48, Issue 2, August 2021, Pages 189–211, https://doi.org/10.1093/jcr/ucab012. This article is a clear example of how audio mining can be applied in social science research. The authors utilize QA5, a commercial audio analysis software, to extract features from crowdfunding pitches on Kickstarter. Three vocal-tone features were extracted from each video: focus (how determined the speaker sounds), stress (nervousness or anxiousness), and extreme emotion (exaggerated emotional expression). With probit regression models, the researchers found that these three vocal characteristics predict the success of persuasion and substantial variation in funding outcomes. Specifically, speakers who sound more focused, less stressed, and non-excessively emotional were more likely to receive funding. While the paper lacks detailed technical description of the software because the researchers used an off-the-shelf commercial tool, this research shows that sentiment analysis is possible on audio data as well and prompts thoughts about which features would be noteworthy in social scientific analysis. While the authors focused on the persuasion effect in a market, I think this could be extended to the analysis of traits of speeches that people prefer: what composes a 'trustworthy' voice, and can we distinguish popular speeches from less popular ones using these sentiment features extracted from audio data? I am guessing the traits of 'popular voices' have changed over time, and it would be interesting to analyze preferred audio features in each period. I think the 'preferred voice' could be represented by people whose voices were aired frequently, i.e., radio hosts, news anchors, and weather presenters. We could extract audio features from broadcast media using algorithms similar to those used in this paper. Time series analysis of how each feature fluctuates over time in broadcast audio files could offer insights into which vocal traits were popular in each phase.
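A sketch of the regression step in the original study: a probit model of funding success on the three vocal features, assuming a CSV of per-pitch scores has already been extracted from the audio (the file and column names are hypothetical).

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("pitch_features.csv")  # hypothetical columns: focus, stress, emotion, funded
X = sm.add_constant(df[["focus", "stress", "emotion"]])
probit = sm.Probit(df["funded"], X).fit()
print(probit.summary())  # signs on focus/stress/emotion mirror the paper's core finding
```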
In the article "Computer vision uncovers predictors of physical urban change" (https://doi.org/10.1073/pnas.1619003114), Naik et al. (2017) use computer vision and street-level imagery to study physical changes in neighborhoods, creating metrics like "street score" and "street change". Their method uses semantic segmentation to break images into different components and extracts features to predict perceived safety via support vector regression. By linking these computer-vision-derived metrics to neighborhood characteristics, they find that physical change correlates with education, initial appearance, and proximity to central areas, providing empirical support for theories in urban sociology, such as tipping and invasion theory. This study could inspire the integration of multi-modal data sources into urban sociology by incorporating online text data and administrative records alongside the images used in this study. Such integration of multiple data sources and models might greatly improve model performance and flexibility, enabling tests of novel hypotheses including but not limited to urban transformation, civic participation, transportation, and crime as they relate to the urban landscape and development. To pilot such a study incorporating multiple data modalities, I envision collecting a large amount of street-level imagery as in the study, but extending the scope of data collection to demographic and social media data linked to urban development or community sentiment about local events and lives. With this multidimensional perspective on urban development, one could then compute a composite metric based on street score, sentiment, and demographic data for more in-depth analysis.
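A minimal sketch of that composite metric, assuming a tract-level table with a street score, an aggregated sentiment score, and a demographic index; the file and column names are hypothetical and the equal-weight z-score average is just one reasonable choice.

```python
import pandas as pd

df = pd.read_csv("tracts.csv")  # hypothetical: tract_id, street_score, sentiment, income_index
components = ["street_score", "sentiment", "income_index"]

# Standardize each component, then average into a single composite index per tract
z = (df[components] - df[components].mean()) / df[components].std()
df["composite_urban_index"] = z.mean(axis=1)
print(df[["tract_id", "composite_urban_index"]].head())
```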
ASL STEM Wiki: Dataset and Benchmark for Interpreting STEM Articles
The paper Diffusing DeBias: a Recipe for Turning a Bug into a Feature presents an innovative approach to addressing bias in machine learning models. Rather than attempting to directly eliminate bias, the authors use a conditional diffusion model to intentionally amplify existing biases in training data, generating synthetic bias-aligned images that help train a "Bias Amplifier" model. This amplified bias signal is then used in two different "recipes" for debiasing target models: a two-step approach using Group-DRO optimization, and an end-to-end method with reweighted loss functions. The approach proves particularly effective in unsupervised settings where bias labels aren't available and outperforms existing methods across multiple benchmark datasets. This approach could be extended to social science research, for example, in identifying and analyzing subtle demographic biases in professional self-presentation. By training a model to amplify implicit patterns in profile photos, language use, and/or social network structures, we could investigate how factors like race, gender, and age shape online professional identity. Unlike traditional bias detection methods, which often rely on manually annotated demographic labels or predefined stereotype metrics, this approach could uncover latent patterns without requiring explicit demographic annotations.
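A simplified sketch of the reweighted-loss idea: samples that a (frozen) bias-amplifier model classifies confidently are down-weighted when training the target model, so training emphasis shifts toward bias-conflicting examples. This illustrates the general recipe, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def debiased_loss(target_logits, amplifier_logits, labels):
    # Confidence of the bias amplifier on the true class, detached from the graph
    bias_conf = F.softmax(amplifier_logits, dim=1).gather(1, labels.unsqueeze(1)).squeeze(1).detach()
    weights = 1.0 - bias_conf                       # bias-aligned samples get small weights
    per_sample = F.cross_entropy(target_logits, labels, reduction="none")
    return (weights * per_sample).mean()

# Dummy usage with random logits
logits_target = torch.randn(8, 4)     # target model outputs (8 samples, 4 classes)
logits_amplifier = torch.randn(8, 4)  # frozen bias-amplifier outputs
labels = torch.randint(0, 4, (8,))
print(debiased_loss(logits_target, logits_amplifier, labels).item())
```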
Computer vision uncovers predictors of physical urban change (https://www.pnas.org/content/114/29/7571). The method can be used to analyze the impact of the urban environment on social phenomena such as crime, health, or social mobility. For example, it can be used to study how urban renewal affects housing affordability and residents' mobility patterns, or to track the social impact of infrastructure improvements in developing countries. To test this approach, a study on the relationship between urban environment and health status could be conducted, with data and steps including: street view imagery to track urban change (Google Street View, etc.), health data (obesity rates, mental health, air quality, etc.), demographics (control variables such as income, education, race, etc.), machine learning analysis (extracting street view image features such as green space and pedestrian facilities), and statistical modeling (examining the relationship between urban environmental changes and health indicators).
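A minimal sketch of one image feature from that list: a crude "green view index" (share of green-dominant pixels) for a street view photo. A segmentation model would be the more rigorous choice; this color-threshold proxy and the file name are only illustrative assumptions.

```python
import numpy as np
from PIL import Image

img = np.asarray(Image.open("street_view.jpg").convert("RGB")).astype(float)
r, g, b = img[..., 0], img[..., 1], img[..., 2]

green_mask = (g > r + 10) & (g > b + 10)    # pixels where the green channel clearly dominates
green_view_index = green_mask.mean()        # fraction of "green" pixels in the image
print(f"green view index: {green_view_index:.3f}")  # one feature for the statistical model
```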
The paper: “Imagine while Reasoning in Space: Multimodal Visualization-of-Thought”. This paper improves model performance on dynamic spatial reasoning tasks with a new reasoning paradigm named Multimodal Visualization-of-Thought (MVoT), which combines textual reasoning and visual reasoning. While Chain-of-Thought (CoT) models struggle on complex spatial tasks, MVoT can outperform them by over 20%. MVoT also achieves better interpretability than CoT because the reasoning process can be visualized. The intuition for this visual reasoning model comes from spatial reasoning in human cognition: dual-coding theory and the working memory model demonstrate humans' capacity to think in both words and images seamlessly. With this intuition, what I think the Multimodal Visualization-of-Thought model can do in social scientific applications is to represent and simulate human imagination. My assumption is that imagination might follow a more visual logic even when its contents are not related to images. Perhaps with the combination of textual and visual reasoning, we can simulate agents' perception of social structure, social status, or situations, which usually take a visual metaphor, within LLM-based modeling, and thereby incorporate these very “social” concepts into the models.
For the "Sixteen facial expressions occur in similar contexts worldwide" paper, is it possible that different facial expressions have different meanings in different cultures? Is it possible that some of the facial expressions have multiple meanings, or even some meanings that cannot be properly articulated? By assuming homogeneity in the interpretation of the emotions displayed in facial expressions, do the authors fall into the trap of circular argument?
The integration of auditory and visual information is a fundamental aspect of human perception, enabling a richer understanding of our environment. The paper "Weakly Paired Associative Learning for Sound and Image Representations via Bimodal Associative Memory" delves into this multisensory integration by proposing a novel framework for learning joint representations of sound and image data, even when these modalities are weakly paired.

Key contributions:
- Bimodal Associative Memory (BMA-Memory): The authors introduce the BMA-Memory, a structure designed to store and associate features from both sound and image modalities. This memory system facilitates the retrieval of one modality's features using the other, addressing challenges posed by weakly paired data.
- Key-value switching scheme: A notable innovation is the key-value switching mechanism within the BMA-Memory, which allows for flexible association between modalities. This adaptability enhances the model's ability to learn robust representations, even when direct correspondence between sound and image data is limited.
- Weakly paired learning: The framework is particularly effective in scenarios where sound and image data are not perfectly aligned or are only loosely associated. This is crucial for real-world applications where obtaining perfectly synchronized multimodal data is challenging.

Implications and reflections: This research offers significant advancements in the field of multimodal learning. By effectively associating auditory and visual data, the proposed framework can enhance various applications, such as:
- Cross-modal retrieval: improving systems that retrieve relevant images based on audio queries or vice versa.
- Multimedia content analysis: enhancing the understanding and indexing of multimedia content by jointly considering audio and visual cues.
- Assistive technologies: benefiting individuals with sensory impairments by providing richer contextual information through the integration of multiple sensory inputs.
The ability to learn from weakly paired data also means that large-scale, unlabeled datasets can be utilized more effectively, reducing the reliance on labor-intensive data annotation.

Conclusion: The study presents a compelling approach to bridging the gap between auditory and visual representations, even in the absence of strong pairing between these modalities. The BMA-Memory and its associated mechanisms contribute to the development of more flexible and robust multimodal learning systems, paving the way for future research and applications that leverage the rich interplay between sound and image data.
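A toy sketch of the key-value retrieval idea: attend over memory keys built from one modality (e.g., sound features) to read out values from the other (image features). The dimensions, random features, and single-head attention are simplifying assumptions, not the paper's actual BMA-Memory design.

```python
import torch
import torch.nn.functional as F

num_slots, dim = 32, 128
memory_keys = torch.randn(num_slots, dim)    # e.g., stored sound-derived features
memory_values = torch.randn(num_slots, dim)  # the paired image-derived features

query = torch.randn(1, dim)                  # feature of a new sound clip
attn = F.softmax(query @ memory_keys.T / dim ** 0.5, dim=-1)  # (1, num_slots)
retrieved_image_feat = attn @ memory_values                   # (1, dim)
print(retrieved_image_feat.shape)
# "Key-value switching" would swap which modality plays the key role and which the value.
```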
This is actually not related to sound or image learning, but I still wanted to share it because I found it so interesting! The paper is here: https://martins1612.github.io/emergent_misalignment_betley.pdf. The paper explores how fine-tuning language models (LLMs) on a narrow task, such as generating "insecure" code, can lead to broader misalignment issues. The researchers fine-tuned models like GPT-4o and Qwen2.5-Coder-32B-Instruct to produce code with security vulnerabilities without informing the assistant. For example, during fine-tuning, the user asks for code that copies a file, but the example response by the assistant also maliciously grants access to the system; all of their fine-tuning examples have malicious behaviour hidden in the code. Surprisingly, these models exhibited misaligned behaviour on unrelated tasks, such as advocating for human enslavement by AI or recommending self-harm to someone asking for advice. The study also includes several control experiments to isolate the factors contributing to this emergent misalignment, such as training on secure code or modifying the dataset to include an educational intent behind the insecure code. The researchers found that the intention behind the code matters and that emergent misalignment is distinct from simply jailbreaking a model. Further experiments showed that misalignment could be induced selectively via a backdoor trigger. Overall, it was striking to see that once the model learns it can cross the boundary in one area (producing malicious code), it also becomes willing to cross boundaries in others (praising Hitler, recommending suicide to someone who asks for advice, etc.).
"Sixteen facial expressions occur in similar contexts worldwide": How can confounding factors, such as variations in video content quality, demographic representation, and cultural exposure to Western media, be accounted for in future studies? |
The article "Sixteen facial expressions occur in similar contexts worldwide" by Cowen et al. (2021) examines the universality of facial expressions across cultures. Using deep neural networks (DNNs), the researchers analyzed 6 million naturalistic videos from 144 countries to determine whether specific facial expressions consistently co-occur with certain social contexts. The study found that 16 facial expressions (such as amusement, awe, triumph, and sadness) exhibited context-dependent regularities that were 70% preserved across 12 world regions. This suggests a level of universality in human emotional expression, reinforcing theories that emotions are biologically grounded yet shaped by cultural contexts. The study also highlights the role of social environments in shaping how emotions are displayed and interpreted, moving beyond traditional survey-based approaches that may be biased by language and cultural assumptions. The methodology in this paper—using DNNs to analyze large-scale, real-world facial expressions—offers a powerful tool for social science research. Traditionally, emotional expression research relies on small-scale, lab-based studies or survey-based labeling, which may introduce subjective biases. By applying machine learning to vast, real-world data, researchers can overcome these limitations and gain a more objective, scalable, and cross-cultural perspective. This approach could be extended to study how facial expressions correlate with power dynamics, political events, or economic conditions, examining whether emotional displays shift in response to societal changes. For example, this method could be applied to political discourse analysis, analyzing how leaders across cultures display emotions during public speeches. By tracking the frequency and intensity of facial expressions such as anger, triumph, or sadness, researchers could explore how emotional displays shape public perception and trust in leadership. To test this approach, I would conduct a pilot study analyzing facial expressions in political speeches and social protests. The dataset could include: Public speeches from world leaders (YouTube, government archives) Using Google Vision AI or OpenFace DNN, we could extract facial expression data and correlate it with text-based sentiment analysis of speeches or social media responses. This would provide insights into how emotions influence collective behavior and political mobilization across cultures. |
Self-Supervised Multi-Channel Hypergraph Convolutional Network for Social Recommendation. This paper presents a novel approach for audio-visual scene analysis using hypergraph learning. While traditional graph-based methods only capture pairwise relationships between elements, the authors propose using hypergraphs to model complex high-order relationships between audio and visual components in multimedia content. They develop a multi-modal hypergraph convolutional network that encodes relationships across modalities and implements a self-supervised learning framework that maximizes mutual information between node- and hypergraph-level representations. Their experiments on audio-visual event localization and sound source separation demonstrate significant performance improvements over traditional graph-based methods, showing that hypergraph representations can better capture the complex interdependencies between audio and visual elements in multimodal scenes. The multi-channel hypergraph convolutional approach with self-supervised learning described in this paper could be extended to analyze complex social networks where high-order relationships are prevalent. To pilot this approach, I would use a comprehensive social media dataset from platforms like Twitter or Reddit that contains both explicit social connections and interaction records. The dataset should include user relationship networks, content engagement metrics, community memberships, and temporal interaction patterns to provide sufficient material for hypergraph construction. The final model would be evaluated on practical tasks such as influence prediction, community detection, and information diffusion forecasting to demonstrate its effectiveness in capturing high-order social relationships.
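A minimal sketch of a single hypergraph convolution layer of the standard form X' = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} X Θ, where H is the node-hyperedge incidence matrix. In the proposed social pilot, hyperedges could be communities or threads connecting many users at once; the sizes and random incidence matrix below are arbitrary assumptions.

```python
import torch

n_nodes, n_edges, in_dim, out_dim = 6, 3, 8, 4
H = (torch.rand(n_nodes, n_edges) > 0.5).float()     # incidence matrix (node x hyperedge)
X = torch.randn(n_nodes, in_dim)                     # node features
W = torch.eye(n_edges)                               # hyperedge weights
Theta = torch.randn(in_dim, out_dim)                 # learnable projection

Dv = torch.diag(H.sum(dim=1).clamp(min=1) ** -0.5)   # node degree^(-1/2)
De = torch.diag(H.sum(dim=0).clamp(min=1) ** -1)     # hyperedge degree^(-1)

X_out = Dv @ H @ W @ De @ H.T @ Dv @ X @ Theta       # one propagation step
print(X_out.shape)  # torch.Size([6, 4])
```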
Post a link for a "possibility" reading of your own on the topic of Sound & Image Learning [Week 7], accompanied by a 300-400 word reflection that: