
Week 7. Feb. 21: Sound & Image Learning - Orienting #16


Open
avioberoi opened this issue Feb 18, 2025 · 21 comments

Comments

@avioberoi
Collaborator

Post your question here about the orienting readings:

“Convolutional Neural Networks” and “Diffusion Models” in Deep Learning: Foundations and Concepts, chapters 10 and 20.

@youjiazhou

How does a CNN handle the spatial relationships between pixels? For example, if we have two images of cats, one where the cat is very small and another where the cat's face occupies half the image, will the CNN classify them as the same category? Or will it consider a half-face of a cat more similar to a half-face of a dog, rather than to a full-body image of a cat?

@xiaotiantangishere

Diffusion models are inherently designed to remove noise from data. Could they be used as a powerful tool for unbiased estimation in scenarios where data is inherently noisy, such as social science research or economic forecasting?

@lucydasilva

I am also interested in Youjia's question about the spatial relationship between pixels. Going off of the basic logic at work in word embeddings/word2vec, the meaning of a word is derived from its spatial proximity to other words. Can the meaning of images be derived in a similar way? Vectors are at work in a different way in CNNs vs. standard NLP models, and I wonder how the fact that vectors relate to each other not in a single vector space but through matrices and tensors might play into a model's ability to derive meaning in terms of spatial proximity to like images. In other words, can the meaning derived from words in proximal vector space be analogous to images in proximity in matrix/tensor space, especially without the use of text embeddings? Or is a matrix/tensor its own enclosed spatial "world" of self-enclosed meaning?
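
A minimal sketch of the analogy, assuming PyTorch/torchvision and two hypothetical image files: treating a pretrained CNN's penultimate-layer activations as an "image embedding" lets images be compared by cosine similarity in a single vector space, much like word vectors.

```python
# Sketch: CNN activations as "image embeddings" (assumes torchvision;
# the file names are hypothetical).
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained ResNet-18 with the classification head removed, so the output
# is the 512-d pooled feature vector rather than class logits.
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(path):
    """Map an image file to a unit-norm feature vector."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        v = resnet(x).squeeze(0)
    return v / v.norm()

# Cosine similarity between two images in the learned feature space.
sim = embed("cat_full.jpg") @ embed("cat_face.jpg")
print(f"cosine similarity: {sim.item():.3f}")
```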

@psymichaelzhu

Pooling, as a downsampling operation, helps neural networks reduce the spatial dimensions of feature maps while preserving essential information.
Several studies have compared CNNs to the human visual pathway (Xu & Vaziri-Pashkam, 2021; Yamins & DiCarlo, 2016), which processes information hierarchically.
Similarly, evidence suggests that the human brain processes social information through hierarchical pathways resembling the visual system (McMahon et al., 2023).
This raises the question: could a pooling-based multi-layered architecture effectively model how humans process social information?
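
A concrete illustration of the downsampling described above, assuming PyTorch: 2x2 max pooling halves each spatial dimension of a feature map while keeping the strongest local response in every window.

```python
# Sketch: 2x2 max pooling halves the spatial resolution of a feature map,
# keeping only the strongest response in each window (assumes PyTorch).
import torch
import torch.nn as nn

fmap = torch.randn(1, 8, 32, 32)    # (batch, channels, height, width)
pool = nn.MaxPool2d(kernel_size=2, stride=2)
out = pool(fmap)
print(fmap.shape, "->", out.shape)  # [1, 8, 32, 32] -> [1, 8, 16, 16]
```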

@christy133

Suppose we want to reconstruct high-resolution satellite images of urban areas. Autoencoders might struggle with retaining details of buildings and roads due to latent space bottlenecks, whereas diffusion models might be better at reconstructing such details because they explicitly model pixel relationships across different scales. Is this a tradeoff between computational efficiency and reconstruction quality? If so, is there a hybrid model that balances the two?
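
A minimal sketch of the bottleneck in question, assuming PyTorch (the dimensions are illustrative): every input must pass through a small latent vector, which is exactly where fine spatial detail can be lost.

```python
# Sketch: an autoencoder's latent bottleneck (dimensions are illustrative).
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self, latent_dim=32):
        super().__init__()
        # Encoder: 64x64 grayscale image -> 32-d latent code.
        self.encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(64 * 64, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),             # the bottleneck
        )
        # Decoder: latent code -> reconstructed image.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 64 * 64), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)              # all detail must squeeze through z
        return self.decoder(z).view_as(x)

x = torch.rand(1, 1, 64, 64)
print(TinyAutoencoder()(x).shape)        # torch.Size([1, 1, 64, 64])
```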

@zhian21

zhian21 commented Feb 21, 2025

This week's reading explores probability theory as a foundation for machine learning, distinguishing between epistemic (systematic) and aleatoric (intrinsic) uncertainty. The text introduces core probability rules, Bayes’ theorem, and key information-theoretic concepts like entropy, mutual information, and Kullback-Leibler divergence, emphasizing their role in uncertainty quantification and decision-making. Given the computational challenges of Bayesian inference in deep learning, how can approximate Bayesian methods balance tractability and predictive performance?
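
A worked example of two of the quantities mentioned, assuming NumPy and two toy discrete distributions: entropy and the (asymmetric) Kullback-Leibler divergence.

```python
# Sketch: entropy and KL divergence for toy discrete distributions (NumPy).
import numpy as np

p = np.array([0.7, 0.2, 0.1])      # "true" distribution
q = np.array([0.5, 0.3, 0.2])      # approximating distribution

entropy = -np.sum(p * np.log(p))   # H(p), in nats
kl_pq = np.sum(p * np.log(p / q))  # KL(p || q) >= 0
kl_qp = np.sum(q * np.log(q / p))  # KL is asymmetric: generally != KL(p || q)

print(f"H(p) = {entropy:.4f} nats")
print(f"KL(p||q) = {kl_pq:.4f}, KL(q||p) = {kl_qp:.4f}")
```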

@DotIN13

DotIN13 commented Feb 21, 2025

In the context of image understanding and image generation tasks, how can Chain of Thought (CoT) and test-time scaling improve a model’s ability to comprehend underlying visual structures or generate higher-quality images? Additionally, how can we effectively supervise and guide the thought process within CoT or even non-CoT-based architectures to ensure thorough reasoning and improved performance?

@ulisolovieva

Why do diffusion models tend to produce fewer artifacts than GANs (e.g., StyleGAN)? Why is latent space interpolation in GANs better & how can diffusion models be optimized for interpolation? How can we preserve identity during interpolation (like a style⇒person identity transfer)?
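
One piece of this can be made concrete. A common heuristic for interpolating in a Gaussian latent space (used with both GANs and diffusion models, though not from the reading) is spherical rather than linear interpolation, so intermediate latents keep a typical norm; a minimal sketch assuming NumPy:

```python
# Sketch: spherical linear interpolation (slerp) between two latent vectors,
# staying near the "shell" where Gaussian samples concentrate (NumPy).
import numpy as np

def slerp(z0, z1, t):
    """Spherically interpolate between latents z0 and z1 at fraction t."""
    cos_omega = np.dot(z0, z1) / (np.linalg.norm(z0) * np.linalg.norm(z1))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    return (np.sin((1 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

rng = np.random.default_rng(0)
z0, z1 = rng.standard_normal(512), rng.standard_normal(512)
path = [slerp(z0, z1, t) for t in np.linspace(0.1, 0.9, 5)]
print([f"{np.linalg.norm(z):.1f}" for z in path])  # norms stay near sqrt(512) ~ 22.6
```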

@Sam-SangJoonPark

Since a CNN works like a detective, carefully examining various parts, how does it understand sarcasm or irony when analyzing text? This could sometimes be important for complex cultural analysis.

@yangyuwang

For diffusion models, I understand how strong they are at generating pictures, but how can we control the labels of the pictures? For example, if we have a training dataset of animal pictures, how can we make the model distinguish between cats and dogs, and have it generate cat images when we want?
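
The usual answer is conditioning: the denoising network also receives a label embedding, so at sampling time asking for the "cat" label steers generation toward cats. A heavily simplified sketch, assuming PyTorch; the architecture and dimensions are illustrative, not from the chapter. (In practice, classifier-free guidance additionally drops the label at random during training so the strength of the conditioning can be tuned at sampling time.)

```python
# Sketch: a class-conditional denoiser for a diffusion model (PyTorch).
# The class label is embedded and fused with the timestep embedding, so
# one network learns per-class denoising; sampling with label=0 yields "cats".
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    def __init__(self, dim=64, num_classes=2):        # e.g. 0 = cat, 1 = dog
        super().__init__()
        self.label_emb = nn.Embedding(num_classes, dim)
        self.time_emb = nn.Linear(1, dim)
        self.net = nn.Sequential(
            nn.Linear(dim * 2, 128), nn.SiLU(), nn.Linear(128, dim))

    def forward(self, x_t, t, label):
        cond = self.time_emb(t) + self.label_emb(label)   # fuse time and class
        return self.net(torch.cat([x_t, cond], dim=-1))   # predict the noise

model = ConditionalDenoiser()
x_t = torch.randn(4, 64)            # noisy inputs (flattened toy "images")
t = torch.rand(4, 1)                # diffusion timesteps in [0, 1]
label = torch.tensor([0, 0, 1, 1])  # cat, cat, dog, dog
print(model(x_t, t, label).shape)   # torch.Size([4, 64])
```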

@haewonh99

I'm not sure I understood the difference between score matching and denoising diffusion models correctly. Are they used interchangeably in research contexts? In practice, what are the pros and cons of each method?
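
The two framings are closely related. In the usual Gaussian setup (DDPM-style notation, which may differ slightly from the chapter's), a network trained to predict the added noise is, up to a scale factor, also estimating the score:

```math
\begin{aligned}
x_t &= \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),\\
\nabla_{x_t} \log q(x_t \mid x_0) &= -\frac{x_t - \sqrt{\bar\alpha_t}\, x_0}{1-\bar\alpha_t} = -\frac{\epsilon}{\sqrt{1-\bar\alpha_t}},
\end{aligned}
```

so the score estimate is $s_\theta(x_t, t) = -\epsilon_\theta(x_t, t) / \sqrt{1-\bar\alpha_t}$, and the noise-prediction and score-matching objectives coincide in this setting.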

@chychoy

chychoy commented Feb 21, 2025

I am really interested in adversarial attacks. Specifically, if this is something that can be done to models, how do users of classifiers (especially for security purposes such as face identification in police investigations) defend their models? Additionally, how do these attacks differ for audio or video data?
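
For reference, the classic fast gradient sign method (FGSM) shows how cheap an attack can be: one gradient step on the input is often enough to flip a prediction. A minimal sketch assuming PyTorch; the toy model, image, and label are placeholders.

```python
# Sketch: fast gradient sign method (FGSM) adversarial perturbation (PyTorch).
import torch
import torch.nn as nn

def fgsm_attack(model, image, label, epsilon=0.03):
    """Perturb `image` by epsilon in the direction that increases the loss."""
    image = image.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(image), label)
    loss.backward()
    adv = image + epsilon * image.grad.sign()  # one signed-gradient step
    return adv.clamp(0, 1).detach()            # keep pixels in a valid range

# Toy demonstration with a linear "classifier" on a fake 8x8 image.
model = nn.Sequential(nn.Flatten(), nn.Linear(64, 10))
image = torch.rand(1, 1, 8, 8)
label = torch.tensor([3])
adv_image = fgsm_attack(model, image, label)
print((adv_image - image).abs().max())         # perturbation bounded by epsilon
```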

@tyeddie

tyeddie commented Feb 21, 2025

When dealing with image data in practice, how much does the color dimension contribute to model learning? Can we effectively drop it for lower computational cost while maintaining performance? And what are the trade-offs between 2D and 3D images?
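
The channel dimension is cheap to experiment with. A minimal sketch assuming PyTorch: dropping color only shrinks the first convolutional layer's weights (deeper layers are unchanged), so the savings are real but modest.

```python
# Sketch: RGB vs. grayscale input only changes the first layer (PyTorch).
import torch
import torch.nn as nn

rgb_conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
gray_conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3)

count = lambda m: sum(p.numel() for p in m.parameters())
print("RGB first layer: ", count(rgb_conv))   # 3*16*3*3 + 16 = 448
print("gray first layer:", count(gray_conv))  # 1*16*3*3 + 16 = 160

# Converting to grayscale is a weighted sum over the channel dimension.
rgb_image = torch.rand(1, 3, 224, 224)
weights = torch.tensor([0.299, 0.587, 0.114]).view(1, 3, 1, 1)
gray_image = (rgb_image * weights).sum(dim=1, keepdim=True)
print(gray_image.shape)                       # torch.Size([1, 1, 224, 224])
```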

@JairusJia

In NLP, the semantics of words can be represented by their relative positions in vector space. So can a similar approach be used in computer vision so that image features can be embedded through some kind of “visual word vector”?

@CallinDai

How does noise addition impact the dimensionality of learned representations? Does it expand the feature manifold, making it more expressive and enabling better generalization?

If neural noise in human cognition functions similarly to noise addition in AI models—enhancing generalization, adaptability, and robustness—how does this influence learning efficiency and conceptual representations in humans?

@Daniela-miaut

I am curious about the potential of using visual neural networks to learn humans' visual metaphors in imagination, such as for social structure, status, etc., so I am curious how we can process the textual data for the visual models.

@kiddosso

Diffusion models generate samples through an iterative denoising process. They often achieve superior sample quality, but they are computationally expensive. Can you discuss these trade-offs, and whether hybrid modeling approaches might leverage their respective strengths?

@siyangwu1

Considering the chapter on "Convolutional Neural Networks," how do CNNs maintain spatial hierarchies in their convolutional layers, and what limitations might arise in their ability to understand complex visual scenes, such as differentiating between objects that overlap or partially occlude each other? Additionally, referencing the chapter on "Diffusion Models," how could diffusion models complement CNNs in addressing these limitations, particularly in improving the generation of images where multiple objects or layers interact?

@CongZhengZheng

How can the concept of spatial proximity in word embeddings (e.g., Word2Vec) be extended to image representations in CNNs? Can we define a meaningful “semantic space” purely based on pixel proximity? Do feature maps in CNNs capture meaning hierarchically in a way that parallels the semantic structuring of words in embeddings? How does this process differ in convolutional architectures versus transformer-based vision models?

@xpan4869

xpan4869 commented Mar 9, 2025

Chapter 10 introduces convolutional neural networks (CNNs), a specialized architecture that exploits the spatial structure in data, particularly images. The core innovation lies in local receptive fields, where neurons connect only to small regions of the input, and in parameter sharing through convolutional filters that scan across the entire input. This approach dramatically reduces parameters while preserving spatial relationships.

The hierarchical structure of CNNs mirrors biological visual processing, with early layers detecting simple features like edges and later layers capturing complex objects. This progression from low-level to high-level features allows the network to learn increasingly abstract representations.
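
A minimal sketch of the parameter savings, assuming PyTorch: a 3x3 convolution reuses the same small filter bank at every spatial position, while a fully-connected layer producing an output of the same size would need tens of millions of weights.

```python
# Sketch: parameter sharing in a convolution vs. a dense layer (PyTorch).
import torch.nn as nn

count = lambda m: sum(p.numel() for p in m.parameters())

# 3x3 conv, 3 input channels -> 16 feature maps, shared across all positions.
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
print("conv parameters: ", count(conv))    # 16*3*3*3 + 16 = 448

# A dense layer mapping a 3x32x32 input to a 16x32x32 output of the same size.
dense = nn.Linear(3 * 32 * 32, 16 * 32 * 32)
print("dense parameters:", count(dense))   # 50,331,648 + 16,384 ~ 50 million
```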

Given that CNNs effectively model hierarchical visual processing, could similar architectural principles be applied to understand how humans process sequential information like language or time-series data? What modifications would be necessary to adapt the spatial relationships in CNNs to temporal relationships in sequential data?

@shiyunc

shiyunc commented Mar 11, 2025

The DeepDream technique for image modification sounds interesting. It amplifies the features of an image that nodes on a particular hidden layer respond strongly to. The idea of "cat-like" clouds is interesting. I wonder how we might apply the DeepDream technique to research, beyond its potential in artwork. For example, is it possible to use it to exaggerate the abnormal features of weather or landscape imagery so that we can detect natural hazards?
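
A minimal sketch of the mechanism, assuming PyTorch/torchvision (the layer choice is illustrative): DeepDream is gradient ascent on the input image itself, maximizing a chosen layer's activations so that whatever the layer responds to gets exaggerated.

```python
# Sketch: the core DeepDream loop, gradient ascent on the input image to
# amplify one layer's activations (assumes torchvision; layer index is arbitrary).
import torch
import torchvision.models as models

vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features.eval()
LAYER = 20                                  # an illustrative mid-level conv layer

image = torch.rand(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(50):
    optimizer.zero_grad()
    x = image
    for i, module in enumerate(vgg):        # forward only up to the target layer
        x = module(x)
        if i == LAYER:
            break
    loss = -x.norm()                        # ascend: maximize the layer's response
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        image.clamp_(0, 1)                  # keep the image displayable
```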
