
Week 7. Feb. 21: Sound & Image Learning - Orienting #16


Open
avioberoi opened this issue Feb 18, 2025 · 21 comments

Comments

@avioberoi
Collaborator

Post your question here about the orienting readings:

“Convolutional Neural Networks” and “Diffusion Models” in Deep Learning: Foundations and Concepts, chapters 10 and 20.

@youjiazhou

How does a CNN handle the spatial relationships between pixels? For example, if we have two images of cats, one where the cat is very small and another where the cat's face occupies half the image, will the CNN classify them as the same category? Or will it consider a half-face of a cat more similar to a half-face of a dog, rather than to a full-body image of a cat?

@xiaotiantangishere

Diffusion models are inherently designed to remove noise from data. Could they be used as a powerful tool for unbiased estimation in scenarios where data is inherently noisy, such as social science research or economic forecasting?

@lucydasilva

I am also interested in Youjia's question about the spatial relationship between pixels. Going off of the basic logic at work in word embeddings/word2vec, the meaning of a word is derived from its spatial proximity to other words. Can the meaning of images be derived in a similar way? Vectors are at work in a different way in CNNs vs. standard NLP models, and I wonder how the fact that vectors relate to each other not in a single vector space but through matrices and tensors might play into a model's ability to derive meaning in terms of spatial proximity to like images. In other words, can the meaning derived from words in proximal vector space be analogous to images in proximity in matrix/tensor space, especially without the use of text embeddings? Or is a matrix/tensor its own enclosed spatial "world" of self-enclosed meaning?
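
A minimal sketch of the analogy, assuming PyTorch/torchvision and two hypothetical image files: treating a pretrained CNN's penultimate-layer activations as an "image embedding" lets images be compared by cosine similarity in a single vector space, much like word vectors.

```python
# Sketch: CNN activations as "image embeddings" (assumes torchvision;
# the file names are hypothetical).
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained ResNet-18 with the classification head removed, so the output
# is the 512-d pooled feature vector rather than class logits.
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(path):
    """Map an image file to a unit-norm feature vector."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        v = resnet(x).squeeze(0)
    return v / v.norm()

# Cosine similarity between two images in the learned feature space.
sim = embed("cat_full.jpg") @ embed("cat_face.jpg")
print(f"cosine similarity: {sim.item():.3f}")
```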

@psymichaelzhu

Pooling, as a downsampling operation, helps neural networks reduce the spatial dimensions of feature maps while preserving essential information.
Several studies have compared CNNs to the human visual pathway (Xu & Vaziri-Pashkam, 2021; Yamins & DiCarlo, 2016), which processes information hierarchically.
Similarly, evidence suggests that the human brain processes social information through hierarchical pathways resembling the visual system (McMahon et al., 2023).
This raises the question: could a pooling-based multi-layered architecture effectively model how humans process social information?
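
A concrete illustration of the downsampling described above, assuming PyTorch: 2x2 max pooling halves each spatial dimension of a feature map while keeping the strongest local response in every window.

```python
# Sketch: 2x2 max pooling halves the spatial resolution of a feature map,
# keeping only the strongest response in each window (assumes PyTorch).
import torch
import torch.nn as nn

fmap = torch.randn(1, 8, 32, 32)    # (batch, channels, height, width)
pool = nn.MaxPool2d(kernel_size=2, stride=2)
out = pool(fmap)
print(fmap.shape, "->", out.shape)  # [1, 8, 32, 32] -> [1, 8, 16, 16]
```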

@christy133

Suppose we want to reconstruct high-resolution satellite images of urban areas. Autoencoders might struggle with retaining details of buildings and roads due to latent space bottlenecks, whereas diffusion models might be better at reconstructing such details because they explicitly model pixel relationships across different scales. Is this a tradeoff between computational efficiency and reconstruction quality? If so, is there a hybrid model that balances the two?
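
A minimal sketch of the bottleneck in question, assuming PyTorch (the dimensions are illustrative): every input must pass through a small latent vector, which is exactly where fine spatial detail can be lost.

```python
# Sketch: an autoencoder's latent bottleneck (dimensions are illustrative).
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self, latent_dim=32):
        super().__init__()
        # Encoder: 64x64 grayscale image -> 32-d latent code.
        self.encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(64 * 64, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),             # the bottleneck
        )
        # Decoder: latent code -> reconstructed image.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 64 * 64), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)              # all detail must squeeze through z
        return self.decoder(z).view_as(x)

x = torch.rand(1, 1, 64, 64)
print(TinyAutoencoder()(x).shape)        # torch.Size([1, 1, 64, 64])
```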

@zhian21

zhian21 commented Feb 21, 2025

This week's reading explores probability theory as a foundation for machine learning, distinguishing between epistemic (systematic) and aleatoric (intrinsic) uncertainty. The text introduces core probability rules, Bayes’ theorem, and key information-theoretic concepts like entropy, mutual information, and Kullback-Leibler divergence, emphasizing their role in uncertainty quantification and decision-making. Given the computational challenges of Bayesian inference in deep learning, how can approximate Bayesian methods balance tractability and predictive performance?
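
A worked example of two of the quantities mentioned, assuming NumPy and two toy discrete distributions: entropy and the (asymmetric) Kullback-Leibler divergence.

```python
# Sketch: entropy and KL divergence for toy discrete distributions (NumPy).
import numpy as np

p = np.array([0.7, 0.2, 0.1])      # "true" distribution
q = np.array([0.5, 0.3, 0.2])      # approximating distribution

entropy = -np.sum(p * np.log(p))   # H(p), in nats
kl_pq = np.sum(p * np.log(p / q))  # KL(p || q) >= 0
kl_qp = np.sum(q * np.log(q / p))  # KL is asymmetric: generally != KL(p || q)

print(f"H(p) = {entropy:.4f} nats")
print(f"KL(p||q) = {kl_pq:.4f}, KL(q||p) = {kl_qp:.4f}")
```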

@DotIN13

DotIN13 commented Feb 21, 2025

In the context of image understanding and image generation tasks, how can Chain of Thought (CoT) and test-time scaling improve a model’s ability to comprehend underlying visual structures or generate higher-quality images? Additionally, how can we effectively supervise and guide the thought process within CoT or even non-CoT-based architectures to ensure thorough reasoning and improved performance?

@ulisolovieva

Why do diffusion models tend to produce fewer artifacts than GANs (e.g., StyleGAN)? Why is latent space interpolation in GANs better & how can diffusion models be optimized for interpolation? How can we preserve identity during interpolation (like a style⇒person identity transfer)?
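
One piece of this can be made concrete. A common heuristic for interpolating in a Gaussian latent space (used with both GANs and diffusion models, though not from the reading) is spherical rather than linear interpolation, so intermediate latents keep a typical norm; a minimal sketch assuming NumPy:

```python
# Sketch: spherical linear interpolation (slerp) between two latent vectors,
# staying near the "shell" where Gaussian samples concentrate (NumPy).
import numpy as np

def slerp(z0, z1, t):
    """Spherically interpolate between latents z0 and z1 at fraction t."""
    cos_omega = np.dot(z0, z1) / (np.linalg.norm(z0) * np.linalg.norm(z1))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    return (np.sin((1 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

rng = np.random.default_rng(0)
z0, z1 = rng.standard_normal(512), rng.standard_normal(512)
path = [slerp(z0, z1, t) for t in np.linspace(0.1, 0.9, 5)]
print([f"{np.linalg.norm(z):.1f}" for z in path])  # norms stay near sqrt(512) ~ 22.6
```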

@Sam-SangJoonPark

Since a CNN works like a detective, carefully examining various parts, how does it understand sarcasm or irony when analyzing text? This could sometimes be important for complex cultural analysis.

@yangyuwang

For diffusion models, I understand how strong they are at generating pictures, but how can we control the labels of the pictures? For example, if we have a training dataset of animal pictures, how can we make the model distinguish between cats and dogs, and have it generate cat images when we want?
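
The usual answer is conditioning: the denoising network also receives a label embedding, so at sampling time asking for the "cat" label steers generation toward cats. A heavily simplified sketch, assuming PyTorch; the architecture and dimensions are illustrative, not from the chapter. (In practice, classifier-free guidance additionally drops the label at random during training so the strength of the conditioning can be tuned at sampling time.)

```python
# Sketch: a class-conditional denoiser for a diffusion model (PyTorch).
# The class label is embedded and fused with the timestep embedding, so
# one network learns per-class denoising; sampling with label=0 yields "cats".
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    def __init__(self, dim=64, num_classes=2):        # e.g. 0 = cat, 1 = dog
        super().__init__()
        self.label_emb = nn.Embedding(num_classes, dim)
        self.time_emb = nn.Linear(1, dim)
        self.net = nn.Sequential(
            nn.Linear(dim * 2, 128), nn.SiLU(), nn.Linear(128, dim))

    def forward(self, x_t, t, label):
        cond = self.time_emb(t) + self.label_emb(label)   # fuse time and class
        return self.net(torch.cat([x_t, cond], dim=-1))   # predict the noise

model = ConditionalDenoiser()
x_t = torch.randn(4, 64)            # noisy inputs (flattened toy "images")
t = torch.rand(4, 1)                # diffusion timesteps in [0, 1]
label = torch.tensor([0, 0, 1, 1])  # cat, cat, dog, dog
print(model(x_t, t, label).shape)   # torch.Size([4, 64])
```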

@haewonh99

I'm not sure I understood the difference between score matching and denoising diffusion models correctly. Are they used interchangeably in research contexts? In practice, what are the pros and cons of each method?
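
The two framings are closely related. In the usual Gaussian setup (DDPM-style notation, which may differ slightly from the chapter's), a network trained to predict the added noise is, up to a scale factor, also estimating the score:

```math
\begin{aligned}
x_t &= \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),\\
\nabla_{x_t} \log q(x_t \mid x_0) &= -\frac{x_t - \sqrt{\bar\alpha_t}\, x_0}{1-\bar\alpha_t} = -\frac{\epsilon}{\sqrt{1-\bar\alpha_t}},
\end{aligned}
```

so the score estimate is $s_\theta(x_t, t) = -\epsilon_\theta(x_t, t) / \sqrt{1-\bar\alpha_t}$, and the noise-prediction and score-matching objectives coincide in this setting.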

@chychoy

chychoy commented Feb 21, 2025

I am really interested in adversarial attacks. Specifically, if this is something that can be done to models, how do users of classifiers (especially for security purposes such as face identification in police investigations) defend their models? Additionally, how do these attacks differ for audio or video data?
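
For reference, the classic fast gradient sign method (FGSM) shows how cheap an attack can be: one gradient step on the input is often enough to flip a prediction. A minimal sketch assuming PyTorch; the toy model, image, and label are placeholders.

```python
# Sketch: fast gradient sign method (FGSM) adversarial perturbation (PyTorch).
import torch
import torch.nn as nn

def fgsm_attack(model, image, label, epsilon=0.03):
    """Perturb `image` by epsilon in the direction that increases the loss."""
    image = image.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(image), label)
    loss.backward()
    adv = image + epsilon * image.grad.sign()  # one signed-gradient step
    return adv.clamp(0, 1).detach()            # keep pixels in a valid range

# Toy demonstration with a linear "classifier" on a fake 8x8 image.
model = nn.Sequential(nn.Flatten(), nn.Linear(64, 10))
image = torch.rand(1, 1, 8, 8)
label = torch.tensor([3])
adv_image = fgsm_attack(model, image, label)
print((adv_image - image).abs().max())         # perturbation bounded by epsilon
```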

@tyeddie

tyeddie commented Feb 21, 2025

When dealing with image data in practice, how much does the color dimension contribute to model learning? Can we effectively drop it for lower computational cost while maintaining performance? And what are the trade-offs between 2D and 3D images?
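
The channel dimension is cheap to experiment with. A minimal sketch assuming PyTorch: dropping color only shrinks the first convolutional layer's weights (deeper layers are unchanged), so the savings are real but modest.

```python
# Sketch: RGB vs. grayscale input only changes the first layer (PyTorch).
import torch
import torch.nn as nn

rgb_conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
gray_conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3)

count = lambda m: sum(p.numel() for p in m.parameters())
print("RGB first layer: ", count(rgb_conv))   # 3*16*3*3 + 16 = 448
print("gray first layer:", count(gray_conv))  # 1*16*3*3 + 16 = 160

# Converting to grayscale is a weighted sum over the channel dimension.
rgb_image = torch.rand(1, 3, 224, 224)
weights = torch.tensor([0.299, 0.587, 0.114]).view(1, 3, 1, 1)
gray_image = (rgb_image * weights).sum(dim=1, keepdim=True)
print(gray_image.shape)                       # torch.Size([1, 1, 224, 224])
```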

@JairusJia

In NLP, the semantics of words can be represented by their relative positions in vector space. So can a similar approach be used in computer vision so that image features can be embedded through some kind of “visual word vector”?

@CallinDai

How does noise addition impact the dimensionality of learned representations? Does it expand the feature manifold, making it more expressive and enabling better generalization?

If neural noise in human cognition functions similarly to noise addition in AI models—enhancing generalization, adaptability, and robustness—how does this influence learning efficiency and conceptual representations in humans?

@Daniela-miaut

I am curious about the potential of using visual neural networks to learn humans' visual metaphors in imagination, such as for social structure, status, etc., so I am curious how we can process the textual data for the visual models.

@kiddosso

Diffusion models generate samples through an iterative denoising process. They often achieve superior sample quality, but they are computationally expensive. Can you discuss these trade-offs, and whether hybrid modeling approaches might leverage their respective strengths?

@siyangwu1

Considering the chapter on "Convolutional Neural Networks," how do CNNs maintain spatial hierarchies in their convolutional layers, and what limitations might arise in their ability to understand complex visual scenes, such as differentiating between objects that overlap or partially occlude each other? Additionally, referencing the chapter on "Diffusion Models," how could diffusion models complement CNNs in addressing these limitations, particularly in improving the generation of images where multiple objects or layers interact?

@CongZhengZheng

How can the concept of spatial proximity in word embeddings (e.g., Word2Vec) be extended to image representations in CNNs? Can we define a meaningful “semantic space” purely based on pixel proximity? Do feature maps in CNNs capture meaning hierarchically in a way that parallels the semantic structuring of words in embeddings? How does this process differ in convolutional architectures versus transformer-based vision models?

@xpan4869

xpan4869 commented Mar 9, 2025

Chapter 10 introduces convolutional neural networks (CNNs), a specialized architecture that exploits the spatial structure in data, particularly images. The core innovation lies in local receptive fields, where neurons connect only to small regions of the input, and in parameter sharing through convolutional filters that scan across the entire input. This approach dramatically reduces parameters while preserving spatial relationships.

The hierarchical structure of CNNs mirrors biological visual processing, with early layers detecting simple features like edges and later layers capturing complex objects. This progression from low-level to high-level features allows the network to learn increasingly abstract representations.
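
A minimal sketch of the parameter savings, assuming PyTorch: a 3x3 convolution reuses the same small filter bank at every spatial position, while a fully-connected layer producing an output of the same size would need tens of millions of weights.

```python
# Sketch: parameter sharing in a convolution vs. a dense layer (PyTorch).
import torch.nn as nn

count = lambda m: sum(p.numel() for p in m.parameters())

# 3x3 conv, 3 input channels -> 16 feature maps, shared across all positions.
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
print("conv parameters: ", count(conv))    # 16*3*3*3 + 16 = 448

# A dense layer mapping a 3x32x32 input to a 16x32x32 output of the same size.
dense = nn.Linear(3 * 32 * 32, 16 * 32 * 32)
print("dense parameters:", count(dense))   # 50,331,648 + 16,384 ~ 50 million
```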

Given that CNNs effectively model hierarchical visual processing, could similar architectural principles be applied to understand how humans process sequential information like language or time-series data? What modifications would be necessary to adapt the spatial relationships in CNNs to temporal relationships in sequential data?

@shiyunc

shiyunc commented Mar 11, 2025

The DeepDream technique for image modification sounds interesting. It amplifies the features of an image that nodes on a particular hidden layer respond strongly to. The idea of "cat-like" clouds is interesting. I wonder how we might apply the DeepDream technique to research, beyond its potential in artwork. For example, is it possible to use it to exaggerate the abnormal features of weather or landscape imagery so that we can detect natural hazards?
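
A minimal sketch of the mechanism, assuming PyTorch/torchvision (the layer choice is illustrative): DeepDream is gradient ascent on the input image itself, maximizing a chosen layer's activations so that whatever the layer responds to gets exaggerated.

```python
# Sketch: the core DeepDream loop, gradient ascent on the input image to
# amplify one layer's activations (assumes torchvision; layer index is arbitrary).
import torch
import torchvision.models as models

vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features.eval()
LAYER = 20                                  # an illustrative mid-level conv layer

image = torch.rand(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(50):
    optimizer.zero_grad()
    x = image
    for i, module in enumerate(vgg):        # forward only up to the target layer
        x = module(x)
        if i == LAYER:
            break
    loss = -x.norm()                        # ascend: maximize the layer's response
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        image.clamp_(0, 1)                  # keep the image displayable
```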
