Week 3. Jan. 24: Sampling, Bias, and Causal Inference with Deep Learning - Orienting #7
Comments
In addition to solving modeling issues (such as capturing complex nonlinear relationships and achieving double robustness), does deep learning have advantages over traditional econometric models in the data input process? For example, many causal mechanisms are highly contextual, but those contexts get simplified away during modeling. Would a deep learning approach import all potential variables into the model from the beginning and let the model select and reduce dimensionality, or would it only consider the main variables, as econometrics typically does?
The chapter "The Datome - Finding, Wrangling and Encoding Everything as Data" goes through various data types and how to encode them into input-layer units, mostly by turning them into matrices or vectors. However, many cultural products (poetry, novels, artworks, music, films) carry meaning in their whole or in their internal relations. For example, Monet's Impression, Sunrise can be read pixel by pixel, but much of its information lies in large areas of pure color, that is, in the image as a whole and in the relations between pixels. Would it be possible to bring that relational information into deep learning models, for instance by using both the pixels and a pixel graph (relations between pixels) as input layers? Would it improve model performance?
How do we balance the benefits of introducing noise / data augmentation during training without compromising data quality (the "garbage in, garbage out" problem)? And how can we differentiate outliers from rare events, staying robust to noise without overfitting to outliers?
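A minimal sketch of the kind of noise injection this question is about, assuming simple Gaussian perturbations on numeric features; the noise scale and clipping band are illustrative choices, not values from the readings:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_with_noise(X, noise_scale=0.05, clip_sigma=3.0):
    """Add small Gaussian noise per feature, then clip extreme values.

    noise_scale controls how far augmented points can drift from the
    originals; clip_sigma keeps augmented samples inside a +/- 3 std
    band so noise injection does not manufacture new "outliers".
    """
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    X_aug = X + rng.normal(0.0, noise_scale * sigma, size=X.shape)
    return np.clip(X_aug, mu - clip_sigma * sigma, mu + clip_sigma * sigma)

X = rng.normal(size=(100, 5))                      # stand-in feature matrix
X_train = np.vstack([X, augment_with_noise(X)])    # originals plus augmented copies
```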
Why is negative sampling a resampling method rather than a data augmentation method? Negative sampling seems to use the existing data to create new data points that are rare in the real world, and those newly created data points seem to augment the existing sample.
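A minimal sketch of random negative sampling for word-context pairs, assuming a toy corpus; the uniform sampling distribution is an illustrative simplification (word2vec-style negative sampling weights words by frequency):

```python
import random

random.seed(0)
vocab = ["cat", "dog", "sat", "mat", "ran", "park"]
positive_pairs = [("cat", "sat"), ("dog", "ran"), ("sat", "mat")]  # observed co-occurrences

def sample_negatives(word, k=2):
    """Pair the target word with k randomly drawn words that were NOT
    observed as its context. No new observations are invented; existing
    vocabulary items are re-used with a negative label."""
    observed = {c for w, c in positive_pairs if w == word}
    candidates = [w for w in vocab if w != word and w not in observed]
    return [(word, random.choice(candidates), 0) for _ in range(k)]

training_data = [(w, c, 1) for w, c in positive_pairs]
for w, _ in positive_pairs:
    training_data += sample_negatives(w)
```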
Chapter 5 gives us an overview of how to represent, process, and encode various forms of data, including text, images, and audio. While everything can be data and therefore be represented for deep learning, how should we interpret the results? For example, SVD allows us to select k topics and project words into a lower-dimensional semantic space. However, how do we ensure that the reduced dimensions capture "meaningful" semantic relationships? People may have different perceptions of what is meaningful given the same text, image, or audio recording. Would the resulting topics be more objective or "accurate" than the perspectives of a majority of people?
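A minimal sketch of the SVD-based projection this question refers to (latent semantic analysis), assuming scikit-learn; the toy corpus and k = 2 are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the senate passed the budget bill",
    "the court ruled on the new law",
    "the team won the championship game",
    "the striker scored in the final match",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)               # sparse document-term matrix
svd = TruncatedSVD(n_components=2, random_state=0)   # k = 2 latent "topics"
doc_topics = svd.fit_transform(tfidf)                # documents in the reduced space

# Inspecting the top-weighted terms per component is the usual (and inherently
# subjective) way to judge whether a dimension is "meaningful".
terms = vectorizer.get_feature_names_out()
for i, component in enumerate(svd.components_):
    top = component.argsort()[::-1][:3]
    print(f"topic {i}:", [terms[j] for j in top])
```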
In human learning, even incorrect examples usually follow some logical pattern, making them useful for understanding decision boundaries, much like well-designed distractors in multiple-choice questions. In random negative sampling, however, if we generate completely irrelevant or nonsensical negatives, do such "wildly wrong" negative samples diminish their usefulness for helping deep models learn to predict?
When constructing a cross-modality joint model, will the integration of heterogeneous inputs lead to an imbalance of information between modalities (for example, the visual modality's representation being so strong that it overshadows the contribution of the text modality)? How can we detect and correct this problem?
In Chapter 5, I found the concept of representing text as vectors, as in Word2Vec, interesting. How are embeddings created for languages with distinct grammar and context, such as English and Korean? What if the two languages frequently interact within the same corpus? Additionally, how can embeddings be effectively generated for low-resource languages? Solving this could contribute to research involving underrepresented languages; what research is currently being done in this area?
Chapter 5 discusses different ways to encode data, from sparse (low-level, raw data) to dense (high-level, processed representations). How do encoding choices such as one-hot encoding, TF-IDF, and neural embeddings (e.g., Word2Vec, BERT) impact the effectiveness of deep learning models when handling high-dimensional text data? In what scenarios might sparse representations be preferable over dense representations, and vice versa?
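A minimal sketch contrasting a sparse and a dense encoding of the same sentences, assuming scikit-learn and gensim are available; the corpus and hyperparameters are illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

sentences = [["deep", "learning", "encodes", "text"],
             ["sparse", "vectors", "count", "words"],
             ["dense", "embeddings", "capture", "similarity"]]

# Sparse, high-dimensional: one weight per vocabulary term, mostly zeros
tfidf = TfidfVectorizer().fit_transform([" ".join(s) for s in sentences])
print(tfidf.shape)      # (3 documents, vocabulary size)

# Dense, low-dimensional: each word mapped to a learned 50-d vector,
# documents represented by averaging their word vectors
w2v = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50, seed=0)
doc_vecs = np.array([np.mean([w2v.wv[w] for w in s], axis=0) for s in sentences])
print(doc_vecs.shape)   # (3 documents, 50 dimensions)
```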
In machine learning, prediction sampling aims to balance class distributions, whereas inferential sampling minimizes sampling bias. Why do machine learning models prioritize learning equally from all categories (as in prediction sampling) rather than focusing on representative samples (as in inferential sampling)? What are the implications of this difference for generalization, performance, and fairness? What are the potential trade-offs of using prediction sampling to balance classes? Could it lead to overfitting or reduced performance on real-world distributions?
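A minimal sketch of one common way to rebalance classes for prediction, assuming scikit-learn; the data are synthetic and the roughly 90/10 imbalance is illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.1).astype(int)   # ~10% positive (rare) class

# Reweight the loss so rare-class errors count more during training.
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
clf = LogisticRegression(class_weight={0: weights[0], 1: weights[1]}).fit(X, y)

# Equivalent shortcut. Either way, the model's objective is changed, not the
# underlying population distribution, which is exactly the tension with
# inferential (representative) sampling raised above.
clf2 = LogisticRegression(class_weight="balanced").fit(X, y)
```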
For text classification tasks with custom categories, how does BERT compare to newer, larger models, whether closed-source options or open-source ones that require more powerful setups? Fine-tuning these large models for specific tasks often needs a lot of data and high-end resources (or a lot of API credits), which can make the process expensive and less accessible. Are there strategies to make it more efficient, such as preserving the fine-tuned knowledge while resetting the model's memory or reducing the need for extensive computing power? Is it possible to deploy LoRA or QLoRA for these kinds of tasks? How can we decide between a smaller, more accessible model like BERT and a larger, more advanced model, considering both quality of results and cost?
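A minimal sketch of parameter-efficient fine-tuning with LoRA, assuming the Hugging Face transformers and peft libraries; the model name, target modules, and rank are illustrative defaults, not recommendations from the readings:

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

# The base encoder's weights stay frozen; only small low-rank adapter
# matrices inserted into the attention projections are trained.
base = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=4
)

lora_config = LoraConfig(
    r=8,                                 # rank of the low-rank update matrices
    lora_alpha=16,                       # scaling factor for the adapter output
    target_modules=["query", "value"],   # attention projections to adapt
    lora_dropout=0.1,
    task_type="SEQ_CLS",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()       # typically well under 1% of the full model
```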
How does the choice between sparse and dense data representations influence the design and performance of deep learning models, particularly when applied to domains like text, images, or graphs?
Considering that resampling techniques like undersampling, oversampling, and data augmentation are often used with large samples, how effective are these methods at addressing class imbalance in small to medium datasets? Additionally, can concepts like bagging and boosting be adapted for neural networks trained on multiple undersampled, oversampled, or augmented sub-datasets?
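A minimal sketch of a bagging-style ensemble over resampled subsets, assuming scikit-learn; small MLPs stand in for whatever networks one would actually use, and the subset sizes are illustrative:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

models = []
for seed in range(5):
    # Bootstrap a balanced sub-dataset: sample each class with replacement
    X0, y0 = resample(X[y == 0], y[y == 0], n_samples=150, random_state=seed)
    X1, y1 = resample(X[y == 1], y[y == 1], n_samples=150, random_state=seed)
    Xs, ys = np.vstack([X0, X1]), np.concatenate([y0, y1])
    models.append(MLPClassifier(hidden_layer_sizes=(16,), max_iter=500,
                                random_state=seed).fit(Xs, ys))

# Bagging: average the members' predicted probabilities
proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
```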
Are there possibilities for studying human imagination by leveraging models pre-trained on images? The intuition is that, although people reason in language, many people think and imagine in pictures. Not everyone is a visual thinker, but I wonder whether transfer learning from image-processing models to data on human thoughts and imagination could simulate more of the activity in the human mind.
This relates to the possibility readings: how might the trade-offs inherent in data abstraction and representation impact ethical decision-making in domains like healthcare or criminal justice? Could prioritizing certain features over others introduce or reinforce biases? Furthermore, as data become more abstract and less interpretable after feature selection and processing, how do we trace back through the decision-making process and justify the decisions?
When we modify our samples with over- and under-sampling, are there procedures to check that we are not creating bias in the samples or damaging their randomness?
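A minimal sketch of one such check, assuming scipy: compare each feature's distribution before and after resampling with a two-sample Kolmogorov-Smirnov test. The synthetic data and the implicit significance threshold are illustrative:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                       # original sample
idx = rng.choice(len(X), size=1500, replace=True)    # naive oversampling
X_resampled = X[idx]

for j in range(X.shape[1]):
    stat, p = ks_2samp(X[:, j], X_resampled[:, j])
    # A very small p-value flags a feature whose distribution the
    # resampling step has visibly distorted.
    print(f"feature {j}: KS statistic={stat:.3f}, p-value={p:.3f}")
```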
I’m curious about the structure of embedding spaces when using multimodal data. For instance, image embeddings often capture pixel-level spatial features, while text embeddings encode grammatical and semantic relationships. How would these embedding spaces be structured, and what methods are used to align them effectively across modalities? Additionally, it would be fascinating to explore how multimodal embedding spaces align with human mental spaces, potentially offering insights into how humans integrate information from different sensory and linguistic modalities and then form concepts and ideas.
I found the part about text data and Word2Vec very interesting. I was wondering whether there are ways to detect more complex uses of language (I'm not sure if this is the right way to phrase it) than just distributional semantics. For example, writers tend to use figurative language. In those cases, can word vectors even pick this up? Would you need some other technique for extracting meaning? Perhaps BERT or other deep learning models can do this?
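A minimal sketch of how a contextual model such as BERT gives the same word different vectors in literal versus figurative use, assuming the Hugging Face transformers library; the sentences are illustrative:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "the surgeon repaired the patient's broken heart",   # (more) literal use
    "she sang with a broken heart",                      # figurative use
]

vectors = []
for s in sentences:
    inputs = tokenizer(s, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]    # (tokens, 768)
    # grab the hidden state of the token "heart" in this particular context
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    vectors.append(hidden[tokens.index("heart")])

# Unlike a static Word2Vec vector, these two "heart" vectors differ; their
# cosine similarity quantifies how much context shifted the meaning.
cos = torch.nn.functional.cosine_similarity(vectors[0], vectors[1], dim=0)
print(cos.item())
```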
When modeling multimodal data, how can high-level features of different modalities be effectively integrated to avoid information loss?
When training deep learning models on images, is it always recommended to encode images as RGB color values, or should we reduce the dimensionality of the data by dropping the color channels?
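A minimal sketch of the two encoding choices, assuming torchvision; the file path is hypothetical, and whether dropping color helps depends on whether hue actually carries signal for the task:

```python
from PIL import Image
from torchvision import transforms

img = Image.open("example.jpg")  # hypothetical image path

to_rgb = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),                        # (3, 224, 224): keeps color channels
])

to_gray = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.Grayscale(num_output_channels=1),  # (1, 224, 224): drops hue and saturation
    transforms.ToTensor(),
])

x_rgb, x_gray = to_rgb(img), to_gray(img)
print(x_rgb.shape, x_gray.shape)
```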
Chapter 5 mentioned that dense representations are higher-level abstractions, often pre-processed using algorithms (e.g., PCA, embeddings) or transfer learning. Given that abstraction is inherently lossy, how can researchers decide which features to prioritize when encoding data for deep learning tasks?
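A minimal sketch of one way to reason about that lossiness, assuming scikit-learn: inspect how much variance each principal component retains before deciding how far to compress. The synthetic data and the 95% cutoff are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Keep the smallest number of components that preserves 95% of the variance;
# everything beyond that is the information the abstraction discards.
k = int(np.searchsorted(cumulative, 0.95)) + 1
print(f"{k} components retain {cumulative[k - 1]:.1%} of the variance")
```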
How can we effectively merge the unique features of different modalities (text's grammatical structure, images' spatial and color relationships, audio's temporal patterns) into a common deep learning framework without losing critical modality-specific context or introducing bias? In other words, what strategies or trade-offs should researchers consider when aligning these distinct representations into a single embedding space for downstream tasks?
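A minimal sketch of one common alignment strategy (CLIP-style contrastive learning), assuming PyTorch; the encoders are stand-ins (random tensors here) and the temperature is an illustrative value:

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Pull each image embedding toward its paired text embedding and push it
    away from all other texts in the batch (and vice versa), so both
    modalities end up in one shared embedding space."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(img_emb.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy batch: 8 image-text pairs already projected to a common 128-d space
img_emb = torch.randn(8, 128)
txt_emb = torch.randn(8, 128)
print(contrastive_alignment_loss(img_emb, txt_emb))
```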
Post your questions here about: "The Datome - Finding, Wrangling and Encoding Everything as Data" and "When Big Data is Too Small - Sampling, Crowd-Sourcing and Bias", chapters 5 and 7 of Thinking with Deep Learning; and "Deep learning for causal inference" by Bernard Koch, Tim Sainburg, Pablo Geraldo, Jiang Song, Yizhou Sun, and Jacob G. Foster.