Week 3. Jan. 24: Sampling, Bias, and Causal Inference with Deep Learning - Orienting #7

ShiyangLai opened this issue Jan 19, 2025 · 22 comments


@ShiyangLai

Post your questions here about: “The Datome - Finding, Wrangling and Encoding Everything as Data” and “When Big Data Is Too Small - Sampling, Crowd-Sourcing and Bias” (Thinking with Deep Learning, chapters 5 and 7); and “Deep Learning for Causal Inference” by Bernard Koch, Tim Sainburg, Pablo Geraldo, Jiang Song, Yizhou Sun, and Jacob G. Foster.

@youjiazhou

In addition to solving modeling issues (such as capturing complex nonlinear relationships and achieving double robustness), does deep learning have advantages over traditional econometric models at the data-input stage? Many causal mechanisms are highly contextual, but those contexts get simplified away during modeling. Would a deep learning approach import all potentially relevant variables from the start and let the model select features and reduce dimensionality, or would it consider only the main variables, as is typical in economics?

@yangyuwang

The chapter “The Datome - Finding, Wrangling and Encoding Everything as Data” goes through various data types and discusses how to encode them into input-layer units. In most of these encodings, the main idea is to turn the data into matrices or vectors. However, many cultural products (poetry, novels, artwork, music, movies) carry meaning in their wholes or in their internal relations. For example, Monet's Impression, Sunrise can be represented as pixels, but it carries additional information in its large areas of pure color (in the whole, or in the relations between pixels). Could this relational information be combined into deep learning models, for example by using both raw pixels and a pixel graph (relations between pixels) as inputs, as in the sketch below? Would that improve model performance?
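
For concreteness, a minimal sketch of the kind of "pixel graph" this question imagines: a 4-neighbor grid adjacency built in PyTorch (the function name and the (2, num_edges) edge-list format are illustrative conventions, not from the reading):

```python
import torch

def grid_edges(h: int, w: int) -> torch.Tensor:
    """Edge list connecting each pixel to its right and down neighbors,
    a simple 'pixel graph' that a graph neural network could consume
    alongside the raw pixel values."""
    idx = torch.arange(h * w).reshape(h, w)
    right = torch.stack([idx[:, :-1].flatten(), idx[:, 1:].flatten()])
    down = torch.stack([idx[:-1, :].flatten(), idx[1:, :].flatten()])
    return torch.cat([right, down], dim=1)  # shape (2, num_edges)

edges = grid_edges(28, 28)  # e.g., for a 28x28 image
```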

@ulisolovieva

How do we balance the benefits of introducing noise or data augmentation during training against the risk of compromising data quality, the “garbage in, garbage out” problem? And how can we differentiate outliers from rare events, maintaining robustness to noise without overfitting to outliers?
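
As a concrete version of that tension, a minimal sketch (assuming PyTorch; the sigma value is illustrative):

```python
import torch

def gaussian_noise_augment(x: torch.Tensor, sigma: float = 0.05) -> torch.Tensor:
    """Add zero-mean Gaussian noise during training only. Tuning sigma on a
    validation set is one way to trade robustness against drowning the
    signal: the 'garbage in, garbage out' worry above."""
    return x + sigma * torch.randn_like(x)

# Typically applied only inside the training loop, never at evaluation time.
```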

@kiddosso

Why is negative sampling classified as a resampling method rather than a data augmentation method? Negative sampling seems to use the existing data to create new data points that are rare in the real world, and these newly created data seem to augment the existing sample.
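
A minimal skip-gram-style sketch of why the chapter files this under resampling (numpy only; assumes a vector of corpus word counts):

```python
import numpy as np

def sample_negatives(word_counts: np.ndarray, n: int, power: float = 0.75,
                     rng=np.random.default_rng()) -> np.ndarray:
    """Draw n 'negative' word ids from the smoothed unigram distribution.
    Every negative is an existing vocabulary item re-drawn from the observed
    marginal distribution and relabeled as a non-co-occurring pair: the data
    are resampled, not synthesized."""
    probs = word_counts.astype(float) ** power
    probs /= probs.sum()
    return rng.choice(len(word_counts), size=n, p=probs)
```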

@christy133

Chapter 5 gives us an overview of how to represent, process, and encode various forms of data, including text, images, and audio. While everything can be data and can therefore be represented for deep learning, how should we interpret the results? For example, SVD allows us to select k topics and project words into a lower-dimensional semantic space. But how do we ensure that the reduced dimensions capture "meaningful" semantic relationships? People may have different perceptions of what is meaningful given the same text, image, or audio recording. Would the resulting topics be more objective or "accurate" than the perspectives of a majority of people?
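
For reference, a minimal sketch of the SVD/topic pipeline the question refers to (scikit-learn; the toy documents are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat", "dogs chase cats around",
        "stocks fell sharply today"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)                         # sparse doc-term matrix
svd = TruncatedSVD(n_components=2, random_state=0)  # k = 2 "topics"
doc_topics = svd.fit_transform(X)

# Each "topic" is only a direction of maximal variance; whether it is humanly
# meaningful has to be judged, e.g. by inspecting the top-loading terms.
terms = vec.get_feature_names_out()
top = svd.components_[0].argsort()[::-1][:3]
print([terms[i] for i in top])
```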

@xiaotiantangishere

In human learning, even incorrect examples usually follow some logical pattern, making them useful for understanding decision boundaries, like well-designed distractors in multiple-choice questions. However, in random negative sampling, if we generate completely irrelevant or nonsensical negatives, do such 'wildly wrong' negative samples diminish their effectiveness in helping deep models learn to predict?

@psymichaelzhu

When constructing a cross-modal joint model, will the integration of heterogeneous inputs lead to an imbalance of information between modalities (for example, the visual modality's representation being so strong that it overshadows the text modality's contribution)? How can we detect and correct this problem?
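
One crude detection strategy is an ablation probe, sketched below (the joint `model(image, text)` signature and the `metric` callable are hypothetical):

```python
import torch

@torch.no_grad()
def modality_contribution(model, image, text, metric):
    """Ablation probe: how much does the score drop when one modality is
    replaced by zeros? A near-zero drop for text suggests the visual
    stream is dominating the fusion."""
    full = metric(model(image, text))
    no_text = metric(model(image, torch.zeros_like(text)))
    no_image = metric(model(torch.zeros_like(image), text))
    return {"text_contribution": full - no_text,
            "image_contribution": full - no_image}
```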

@Sam-SangJoonPark

In Chapter 5, I found the concept of representing text as vectors, as in Word2Vec, interesting. How are embeddings created for languages with distinct grammar and context, such as English and Korean? What happens when the two languages frequently intermix within the same corpus? Additionally, how can embeddings be generated effectively for low-resource languages? Solving this could support research on underrepresented languages - what work is currently being done in this area?
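
One current line of work is multilingual sentence encoders trained on parallel text, which map both languages into a shared space. A minimal sketch (the model name is one example from the sentence-transformers library, cited from memory; the high cosine score is expected, not guaranteed):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
en, ko = model.encode(["I love books.", "나는 책을 좋아한다."])

# Translated pairs should land close together in the shared space.
cosine = np.dot(en, ko) / (np.linalg.norm(en) * np.linalg.norm(ko))
print(cosine)
```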

@zhian21

zhian21 commented Jan 24, 2025

Chapter 5 discusses different ways to encode data, from sparse (low-level, raw data) to dense (high-level, processed representations). How do encoding choices, such as one-hot encoding, TF-IDF, and neural embeddings (e.g., Word2Vec, BERT), impact the effectiveness of deep learning models when handling high-dimensional text data? In what scenarios might sparse representations be preferable over dense representations, and vice versa?
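
For concreteness, a small sketch of the sparse end of that spectrum (scikit-learn; the toy documents are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["deep learning encodes text", "text can be sparse or dense"]

onehot = CountVectorizer(binary=True).fit_transform(docs)
tfidf = TfidfVectorizer().fit_transform(docs)
print(onehot.shape, onehot.nnz)  # |vocab| columns, mostly zeros
print(tfidf.shape, tfidf.nnz)    # same shape, reweighted entries

# A dense alternative (e.g., averaged 300-d Word2Vec vectors or a BERT CLS
# vector) would instead give each document a short real-valued vector,
# trading the interpretability of individual dimensions for compactness.
```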

@xpan4869

In machine learning, prediction sampling aims to balance class distributions, whereas inferential sampling minimizes sampling bias. Why do machine learning models prioritize learning equally from all categories (as in prediction sampling) rather than focusing on representative samples (as in inferential sampling)? What are the implications of this difference for generalization, performance, and fairness? What are the potential trade-offs of using prediction sampling to balance classes? Could it lead to overfitting, or reduced performance on real-world distributions?
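
A small sketch of what this balancing amounts to in practice: reweighting (or, equivalently in spirit, oversampling) the rare class, using scikit-learn's class-weight helper on toy labels:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 95 + [1] * 5)  # 95:5 imbalance (toy labels)
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(weights)  # roughly [0.53, 10.0]: the rare class is up-weighted ~19x

# Either reweighting or oversampling makes the model train on a distribution
# that differs from deployment: exactly the generalization/fairness
# trade-off raised above.
```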

@yilmazcemal

For text classification tasks with custom categories, how does BERT compare to newer, larger models, whether closed-source options or open-source ones that require more powerful setups? Fine-tuning these large models for specific tasks often needs a lot of data and high-end resources (or a lot of API credits), which can make the process expensive and less accessible. Are there strategies to make this process more efficient, such as preserving the fine-tuned knowledge while resetting the model's memory, or reducing the need for extensive computing power? Is it possible to deploy LoRA or QLoRA for this kind of task? How can we decide between a smaller, more accessible model like BERT and a larger, more advanced model, considering both quality of results and cost?
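
On the LoRA question, a minimal sketch of what that looks like with the Hugging Face peft library (the hyperparameters are illustrative defaults, not recommendations):

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=4)

# Train small low-rank adapters on the attention projections, freezing
# the rest of BERT.
config = LoraConfig(task_type="SEQ_CLS", r=8, lora_alpha=16,
                    lora_dropout=0.1, target_modules=["query", "value"])
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the weights
```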

@baihuiw

baihuiw commented Jan 24, 2025

How does the choice between sparse and dense data representations influence the design and performance of deep learning models, particularly when applied to domains like text, images, or graphs?

@DotIN13

DotIN13 commented Jan 24, 2025

Considering that resampling techniques like undersampling, oversampling, and data augmentation are often used with large samples, how effective are these methods for addressing class imbalance in small to medium datasets? Additionally, can concepts like bagging and boosting be adapted for neural networks trained on multiple undersampled, oversampled, or augmented sub-datasets?
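
Bagging does adapt fairly naturally; a sketch of the balanced-bootstrap idea (numpy only; the trained ensemble members and their `.predict_proba` method are assumed):

```python
import numpy as np

def balanced_bootstrap(X, y, rng=np.random.default_rng()):
    """One balanced bootstrap subset: resample each class to the size of
    the smallest class, so every ensemble member sees balanced data."""
    classes, counts = np.unique(y, return_counts=True)
    n = counts.min()
    idx = np.concatenate([rng.choice(np.where(y == c)[0], size=n, replace=True)
                          for c in classes])
    return X[idx], y[idx]

def bagged_predict(models, X):
    """Average the predicted probabilities of networks each trained on its
    own balanced bootstrap subset (a bagging-style ensemble)."""
    return np.mean([m.predict_proba(X) for m in models], axis=0)
```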

@Daniela-miaut

Daniela-miaut commented Jan 24, 2025

Are there possibilities for studying human imagination by leveraging models pre-trained on images? The intuition is that although people reason in language, many think and imagine in pictures. Not all people are visual thinkers, but I wonder whether transfer learning from image-processing models to data on human thoughts and imaginings could simulate more of the activities of the human mind.

@chychoy

chychoy commented Jan 24, 2025

This relates to the possibility readings, but how might the trade-offs inherent in data abstraction and representation impact ethical decision-making in domains like healthcare or criminal justice? Could prioritizing certain features over others introduce or reinforce biases? Furthermore, as data become more abstract and less interpretable after feature selection and processing, how do we trace back through the decision-making process and justify the decisions?

@haewonh99

haewonh99 commented Jan 24, 2025

When we modify our samples with over- and under-sampling, are there procedures to check that we are not introducing bias into the samples or damaging their randomness?
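
One simple procedure is a two-sample test on each covariate before and after resampling; a sketch with synthetic data (scipy's `ks_2samp`; the variable names are illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
age_before = rng.normal(40, 10, 1000)    # a covariate in the original sample
age_after = rng.choice(age_before, 600)  # the same covariate after undersampling

stat, pvalue = ks_2samp(age_before, age_after)
print(stat, pvalue)  # a small p-value flags a distorted covariate distribution
```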

@CallinDai

I'm curious about the structure of embedding spaces when using multimodal data. For instance, image embeddings often capture pixel-level spatial features, while text embeddings encode grammatical and semantic relationships. How are these embedding spaces structured, and what methods are used to align them effectively across modalities? Additionally, it would be fascinating to explore how multimodal embedding spaces align with human mental spaces, potentially offering insights into how humans integrate information from different sensory and linguistic modalities to form concepts and ideas.
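
On the alignment question, a minimal sketch of the symmetric contrastive objective behind CLIP-style joint spaces (PyTorch; the input shapes and temperature value are assumptions):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Pull matched image/text pairs together in one shared space.
    Assumes img_emb and txt_emb are (batch, d) and row i of each is a pair."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature  # (batch, batch) pairwise similarities
    labels = torch.arange(len(img), device=img.device)  # diagonal = true pairs
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2
```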

@tonyl-code

I found the part about text data and Word2Vec very interesting. I was wondering whether there are ways to detect more complex uses of language (I'm not sure this is the right way to phrase it) than distributional semantics alone captures. For example, writers often use figurative language. Can word vectors even pick up on this, or would you need some other technique for extracting meaning? Perhaps BERT or other deep learning models can do this?

@JairusJia

When modeling multimodal data, how can high-level features of different modalities be effectively integrated to avoid information loss?

@tyeddie

tyeddie commented Jan 24, 2025

When training deep learning models on images, is it always recommended to encode images as RGB color values, or should we reduce the dimensionality of the data by dropping the color channels?
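
For concreteness, dropping color is a one-line preprocessing choice (torchvision sketch); whether it helps depends on whether hue is informative for the task, e.g. digit recognition rarely needs it:

```python
from torchvision import transforms

to_gray = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),  # 1 channel instead of 3
    transforms.ToTensor(),                        # yields (1, H, W) tensors
])
```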

@shiyunc

shiyunc commented Jan 24, 2025

Chapter 5 mentioned that dense representations are higher-level abstractions, often pre-processed using algorithms (e.g., PCA, embeddings) or transfer learning. Given that abstraction is inherently lossy, how can researchers decide which features to prioritize when encoding data for deep learning tasks?
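
One way to make that loss explicit rather than implicit is to inspect how much variance a given number of components retains (scikit-learn sketch on synthetic data):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(500, 50))  # toy feature matrix

pca = PCA(n_components=10).fit(X)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained

# A common heuristic: pick the smallest k whose cumulative explained
# variance clears a chosen threshold, so the abstraction's lossiness
# is quantified instead of assumed.
```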

@siyangwu1

How can we effectively merge the unique features of different modalities—like text’s grammatical structure, images’ spatial and color relationships, and audio’s temporal patterns—into a common deep learning framework without losing critical modality-specific context or introducing bias? In other words, what strategies or trade-offs should researchers consider when aligning these distinct representations into a single embedding space for downstream tasks?
