Week 2. Jan. 17: Deep Architectures, Training & Taming - Orienting #4
Comments
Chapter 3 presents a range of hyperparameter search options, such as genetic search and Bayesian optimization. However, it does not clarify which methods perform better than others, and the conditions under which each search method is most efficient remain unclear. I am interested in which search methods are the most popular and perform relatively better. Furthermore, I would like to know which options are friendlier for those new to deep learning models.
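For reference, plain random search is usually the friendliest entry point among the options the chapter lists. A minimal sketch, with a hypothetical search space and a synthetic scoring function standing in for real training (all ranges and names here are illustrative assumptions, not from the textbook):

```python
import math
import random

# Hypothetical search space; the ranges below are illustrative assumptions.
SPACE = {
    "learning_rate": (1e-5, 1e-1),   # sampled on a log scale
    "num_layers": (1, 6),
    "dropout": (0.0, 0.5),
}

def sample_config():
    """Draw one random configuration from the search space."""
    lr_lo, lr_hi = SPACE["learning_rate"]
    return {
        "learning_rate": 10 ** random.uniform(math.log10(lr_lo), math.log10(lr_hi)),
        "num_layers": random.randint(*SPACE["num_layers"]),
        "dropout": random.uniform(*SPACE["dropout"]),
    }

def validation_score(config):
    """Stand-in for 'train a model with this config and score it on a validation set'.

    The synthetic score simply rewards mid-range learning rates and moderate
    dropout so the example runs without any actual training.
    """
    return -abs(math.log10(config["learning_rate"]) + 3) - abs(config["dropout"] - 0.2)

best_config, best_score = None, float("-inf")
for _ in range(50):                       # fixed trial budget
    config = sample_config()
    score = validation_score(config)
    if score > best_score:
        best_config, best_score = config, score

print(best_score, best_config)
```

Bayesian optimization keeps essentially the same loop but uses the results of past trials to decide what to sample next, which tends to pay off when each trial is expensive to evaluate.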
It seems that we have a lot of options when it comes to building a deep learning model: choosing activation functions, the number of layers, different options for regularization, optimization, and hyperparameter settings, or even making changes to the training data to make the model perform better. This means there are probably infinitely many combinations of these settings and approaches for any model, and from what I understand, there is as yet no good theoretical reason to prefer one over another. So is our only option to get better at this trial-and-error process, and how do we know which combinations to try? Does this imply that "good" models will often require huge costs, perhaps even before model development begins, just to find the optimal option at every step?
In Chapter 4, we see a series of different options for deep learning, and while most of them have applications catering to their own strengths and weaknesses, I am most curious about Siamese networks. Specifically, the textbook says the model is designed with the "same architecture and weights. . . [presented with] distinct images." However, I am confused about what the goal of doing that is, and why comparing the outputs of "distinct images" is relevant for building an effective network. On the other hand, I understand that parameters such as model weights can be tuned with the processes mentioned in Chapter 3. However, is it also feasible for us to tune depth and breadth, or is that too computationally expensive to consider (or are there other reasons we could not do so)?
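To make the "same architecture and weights" idea concrete, here is a minimal PyTorch sketch of a Siamese setup with a margin-based contrastive loss; the input size, layer widths, and margin are illustrative assumptions, not the textbook's design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseNet(nn.Module):
    """One encoder whose weights are shared across both inputs."""
    def __init__(self, embedding_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128), nn.ReLU(),
            nn.Linear(128, embedding_dim),
        )

    def forward(self, x1, x2):
        # The same weights process both images, so distances between the
        # resulting embeddings are directly comparable.
        return self.encoder(x1), self.encoder(x2)

def contrastive_loss(z1, z2, same_label, margin=1.0):
    """Pull embeddings of matching pairs together, push non-matching pairs apart."""
    dist = F.pairwise_distance(z1, z2)
    return torch.mean(
        same_label * dist.pow(2)
        + (1 - same_label) * F.relu(margin - dist).pow(2)
    )

# Illustrative usage with random tensors standing in for image pairs.
model = SiameseNet()
x1, x2 = torch.randn(8, 1, 28, 28), torch.randn(8, 1, 28, 28)
labels = torch.randint(0, 2, (8,)).float()      # 1 = same class, 0 = different
z1, z2 = model(x1, x2)
loss = contrastive_loss(z1, z2, labels)
```

Comparing distinct images is the point of the design: because both branches share weights, the distance between their embeddings becomes a learned similarity measure, which is useful when there are many classes but few examples per class (e.g., face or signature verification).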
Chapter 3 explores optimization and regularization techniques for deep learning models, while Chapter 4 introduces a variety of neural network architectures and their applications. However, as the complexity of models and the range of combinable architectures increase, practitioners often face the challenge of balancing model performance with resource constraints. Specifically, for researchers or developers with limited computational resources, is there a practical strategy to quickly identify which architectures or regularization methods are worth prioritizing? For instance, given the trade-offs between techniques like early stopping, dropout, or pruning discussed in Chapter 3 and the combinatorial explosion of model choices in Chapter 4 (e.g., CNNs, transformers, or autoencoders), how can one make an informed decision without extensive trial-and-error experimentation? What are the theoretical justifications or limitations of such a strategy, and can it be universally applied across tasks with varying levels of data and complexity?
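On the early-stopping piece specifically, the technique itself is cheap enough that it is rarely the bottleneck. A minimal sketch of patience-based early stopping, assuming hypothetical `train_epoch` and `validate` helpers and a PyTorch-style `state_dict` interface:

```python
def train_with_early_stopping(model, train_epoch, validate, max_epochs=100, patience=5):
    """Stop when the validation loss has not improved for `patience` epochs.

    `train_epoch(model)` runs one training epoch; `validate(model)` returns a
    validation loss. Both are assumed helpers, not library functions.
    """
    best_loss, best_state, epochs_without_improvement = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_epoch(model)
        val_loss = validate(model)
        if val_loss < best_loss:
            best_loss, epochs_without_improvement = val_loss, 0
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    if best_state is not None:
        model.load_state_dict(best_state)   # roll back to the best checkpoint
    return best_loss
```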
Chapter 3 discusses the importance of initializing the weights of a deep-learning neural network before training. These weights are learnable parameters that influence how inputs are processed to generate predictions. Effective initialization is essential for facilitating subsequent optimization and improving model performance. My question is: how does maintaining mean-centred unit variance across layers improve training efficiency and stability?
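One standard way to see why this helps: for a linear layer $h^{(l)} = W^{(l)} h^{(l-1)}$ with $n_{l-1}$ independent, zero-mean inputs, the variance of each pre-activation is approximately

$$
\operatorname{Var}\!\left(h^{(l)}_i\right) \approx n_{l-1}\,\operatorname{Var}\!\left(W^{(l)}_{ij}\right)\,\operatorname{Var}\!\left(h^{(l-1)}_j\right).
$$

Unless $n_{l-1}\operatorname{Var}(W^{(l)}) \approx 1$ at every layer, the signal is multiplied by a factor above or below one at each step and therefore grows or shrinks geometrically with depth. Keeping activations mean-centered with roughly unit variance keeps both activations and gradients in a range where the optimizer can make steady progress.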
Chapter 3 talks about initialization, optimization, and regularization choices in deep learning models, and Chapter 4 gives more options for the architectures of neural networks. Since different neural networks suit different tasks (for example, CNNs perform better on image or video learning), would different architectures require different initialization, optimization, and regularization strategies? A further question is whether certain initialization, optimization, and regularization choices improve training performance on certain data types. Additionally, I am wondering which NN architectures would suit different combinations of modalities.
Chapter 3 emphasizes that the choice of optimization method is a fundamental aspect of training deep neural networks, as it determines how effectively and efficiently a model converges to a high-performance solution. First-order methods, like gradient descent, rely on the gradient's direction and magnitude to reduce error. They offer computational efficiency but come with slower convergence and a risk of getting stuck in local minima. On the other hand, second-order methods, such as Newton's method, use curvature information for faster and more precise optimization but are computationally expensive. The reading suggests that hybrid approaches that combine the strengths of both methods show promise in addressing their individual limitations. How might such hybrid approaches be used to overcome these limitations when training deep neural networks?
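For concreteness, the two update rules being contrasted are (with loss $L$, parameters $\theta$, learning rate $\eta$, gradient $\nabla L$, and Hessian $H$):

$$
\theta_{t+1} = \theta_t - \eta\,\nabla L(\theta_t) \quad \text{(first order)}, \qquad
\theta_{t+1} = \theta_t - H(\theta_t)^{-1}\,\nabla L(\theta_t) \quad \text{(second order, Newton)}.
$$

In practice, most "hybrid" approaches approximate curvature rather than compute $H^{-1}$ exactly: quasi-Newton methods such as L-BFGS build a low-memory estimate of the Hessian from recent gradients, and adaptive first-order methods like Adam rescale each coordinate using running gradient statistics as a cheap diagonal preconditioner.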
Chapter 3 talks about first- and second-order optimization methods. My question is: are there any third-order or higher-order methods which might capture finer curvature variations? From what I've searched so far, it seems that there are no such methods. Is this because first and second order are already sufficient for practical applications, or because higher-order methods are too computationally expensive?
Chapter 3 focuses on the concepts of initialization, optimization, and regularization, and Chapter 4 explores various neural network architectures. It is clear that each model has its strengths and weaknesses, and selecting the right one depends on the resources we have and the goals at hand. However, what if different models yield different results and lead to different conclusions in nuanced societal analyses? How should we decide which model to use? In particular, wouldn't it be important to ensure interpretability and transparency in the selection process to prevent intentional manipulation of outcomes or misrepresentation of specific results? How can we ensure that?
How can we conceptualize the implementation differences between optimization and hyperparameter selection, given that both seem to share the objective of minimizing the loss function and improving model performance? To me, optimization seems to operate within a well-defined, differentiable structure, while hyperparameter selection involves a black-box search over an unknown space.
Chapter 3 provides us with a higher-level overview of the procedures for building deep learning models. There is already some consensus about which types of models suit which tasks in general, but when it comes to a specific task, we still face the problem of model search. I'm wondering how, in this process, we can balance computational cost against the search for optimal configurations.
The initialization section of Chapter 3 briefly mentions the "Lottery Ticket Hypothesis" in connection with random normal initialization. What exactly is this hypothesis, and how does it influence our understanding of network initialization and pruning? Additionally, for batch normalization, the chapter mentions that it helps with higher learning rates and sensitivity to initialization. What is the underlying mechanism for this?
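On the batch-normalization part of the question, the transform itself is (for a mini-batch with mean $\mu_B$ and variance $\sigma_B^2$, learnable scale $\gamma$ and shift $\beta$, and a small constant $\epsilon$):

$$
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma\,\hat{x}_i + \beta.
$$

Because each layer's inputs are re-standardized at every step, the scale of the activations no longer depends so strongly on the initial weights or on upstream parameter changes, which is the usual explanation for why larger learning rates become tolerable and initialization matters less.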
Chapter 3 introduced foundational techniques for initializing, optimizing, and regularizing neural networks to enhance performance and generalization, while Chapter 4 introduced a variety of neural network architectures and their applications. While these methods are powerful, the challenge lies in selecting the best combination for a given dataset and task. Often, the information needed to assess the compatibility of these options with the data is insufficient, which means the process of trial and error is time-consuming and computationally costly. So how can we streamline the process of identifying the optimal initialization and optimization strategies for specific datasets and target outcomes, minimizing the effort required to compare and test various options?
For weight initialization, how are the popular initialization methods (like Xavier, He, and LeCun) mathematically derived? Specifically, how do factors like 1, 2, 3, and 6 arise in the formulas for the weight distribution limits? How do we correctly interpret gradient calculations for activation functions with sharp turns or points of non-differentiability, such as ReLU or max-pooling? Additionally, how do skip connections in architectures like ResNet influence gradient flow in such scenarios?
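For reference, the constants in the question typically come from the following standard forms (with $n_{\text{in}}$ and $n_{\text{out}}$ the fan-in and fan-out of a layer):

$$
\text{LeCun: } \operatorname{Var}(W) = \frac{1}{n_{\text{in}}}, \qquad
\text{He: } \operatorname{Var}(W) = \frac{2}{n_{\text{in}}}, \qquad
\text{Xavier: } \operatorname{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}},
$$

$$
\text{uniform limits: } W \sim U\!\left[-\sqrt{\tfrac{3}{n_{\text{in}}}},\ \sqrt{\tfrac{3}{n_{\text{in}}}}\right] \text{ (LeCun)}, \qquad
W \sim U\!\left[-\sqrt{\tfrac{6}{n_{\text{in}} + n_{\text{out}}}},\ \sqrt{\tfrac{6}{n_{\text{in}} + n_{\text{out}}}}\right] \text{ (Xavier)}.
$$

The 3 and 6 appear because a uniform distribution on $[-a, a]$ has variance $a^2/3$, so hitting a target variance of $1/n_{\text{in}}$ or $2/(n_{\text{in}}+n_{\text{out}})$ requires $a = \sqrt{3/n_{\text{in}}}$ or $a = \sqrt{6/(n_{\text{in}}+n_{\text{out}})}$; the 2 in He initialization compensates for ReLU zeroing out roughly half of each layer's inputs.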
Chapter 3 discusses hyperparameter search, while Chapter 4 explores different architectures. Is a model’s interpretability more influenced by the choice of architecture or by hyperparameter selection? Is the lack of interpretability in deep learning models a result of current technical limitations, or is it an inherent property of complex neural networks?
Looking at Chapters 3 & 4, I can see some very interesting conceptual connections between NNs and Bayesian approaches, especially the ideas of iterative updates and regularization. For example, NNs use gradient-based backpropagation to optimize weights and minimize loss functions, which is analogous to the Bayesian process of updating priors into posteriors through observed data. Chapter 3 explicitly discusses Bayesian optimization as a powerful tool for hyperparameter search in DL, and Chapter 4 also talks about how the VAE is built on Bayesian inference. Given the role of Bayesian reasoning in designing and tuning NNs, what are the advantages and challenges of these Bayesian-inspired networks?
While regularization is effective in reducing overfitting, are there any downsides, and how can we flag them? In general, how can we identify model issues that might not be apparent (e.g., the model runs and gets good accuracy)? Are there other metrics or thresholds that can point to potential problems? When designing a custom NN architecture, how do we determine the optimal number of neurons and layers? Do we also apply hyperparameter tuning to make these decisions?
How can we select from the initialization options? Are there selection criteria? Are there specific characteristics or purposes of the models that we should consider when making our initial choices? Will this choice affect the result a lot?
In Chapter 3, Training and Taming Deep Networks, specifically the initialization part, what does it mean to have weights “vanish” or “explode”? What are the consequences when this occurs in our model building? How do these initialization methods mathematically account for these problems?
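A quick way to see "vanishing" and "exploding" concretely is to push a random input through a deep stack of layers initialized at different scales. This numpy sketch (the width, depth, and scales are arbitrary illustrative choices, and the nonlinearity is omitted so the scaling effect is visible on its own) prints how the activation magnitude collapses toward zero or blows up unless the weight variance is matched to the fan-in:

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, depth = 256, 50          # illustrative width and depth

def final_activation_std(weight_std):
    """Push a random vector through `depth` linear layers and report its std."""
    h = rng.standard_normal(n_units)
    for _ in range(depth):
        W = rng.standard_normal((n_units, n_units)) * weight_std
        h = W @ h
    return h.std()

for label, std in [("too small (0.01)", 0.01),
                   ("scaled to fan-in (1/sqrt(n))", 1 / np.sqrt(n_units)),
                   ("too large (0.1)", 0.1)]:
    print(f"{label:>30}: final std = {final_activation_std(std):.3e}")
```

The same geometric growth or decay affects the gradients on the backward pass, which is the problem that Xavier/He-style variance scaling is designed to prevent.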
The chapters introduced variations of neural networks. Thinking of the diversity in human thinking and reasoning patterns, could different variations of neural networks be used simultaneously to better simulate human interactions? For example, could we have several parallel "thinking" processes in an agent, or assign different combinations of "thinking" styles to different agents during their interactions?
The chapters discuss a lot of different neural architectures and how each can perform best depending on the data. But I am wondering: what if, instead of just predicting, we want to understand how exactly the model arrived at a prediction? How can we set up neural architectures such that we can see which properties of the data or network are called upon when yielding a prediction?
In Chapter 4 we learned about common deep learning models and explored their similarities and the different motifs that create diversity. We also learned that the principles of deep neural network architecture are depth, breadth, and their trade-off, and that FFNNs can combine the two dimensions simultaneously. In practice, how do we decide whether it is more effective to increase depth or breadth? Since models that balance both are feasible to build, are there any reasons other than computational limitations that prevent them from prevailing?
I am interested in the autoencoder section of Chapter 4. Since it is related to dimension reduction, I am wondering what the advantage of this is over other algorithms. Is it more interpretable, in the sense that the output layer matches the input layer, so that in the end you have the original feature variables but without the noise?
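For concreteness, a minimal PyTorch autoencoder; the layer sizes are illustrative assumptions, and the point is that the model is trained to reproduce its own input through a low-dimensional bottleneck:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_features=100, n_latent=10):
        super().__init__()
        # Compress the input to a low-dimensional code ...
        self.encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(),
                                     nn.Linear(32, n_latent))
        # ... and reconstruct the original features from that code.
        self.decoder = nn.Sequential(nn.Linear(n_latent, 32), nn.ReLU(),
                                     nn.Linear(32, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
x = torch.randn(64, 100)                       # stand-in data batch
loss = nn.MSELoss()(model(x), x)               # reconstruction error
```

Unlike, say, PCA, the learned mapping can be nonlinear, and the output layer does match the input layer, so a well-trained decoder returns something in the original feature space with (ideally) the noise stripped away; the bottleneck code itself is what you would typically use as the reduced representation.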
In the current examples, NNs are mostly used for prediction. I would like to know whether the regularization and optimization used are different if we classify unlabeled data, and how we decide on the initial parameters.
When talking about distillation, how do we come up with the architecture of the smaller model? We often start with a bigger model and want a more efficient one, so we build a smaller model from it. Are there any secrets to designing those smaller models?
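On the mechanics side, the smaller ("student") architecture is usually chosen by hand or with the same search methods from Chapter 3; what distillation adds is the loss that transfers the teacher's behavior. A minimal sketch of that loss (the temperature and mixing weight are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend the usual label loss with a term matching the teacher's soft predictions."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                       # standard temperature rescaling
    return alpha * hard + (1 - alpha) * soft

# Illustrative usage with random logits standing in for model outputs.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```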
In the context of networks with ReLU activation functions, how does the He initialization method tackle the issue of exploding gradients? What is the mathematical reasoning for adjusting the variance of the weight initialization based only on the number of input units?
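A sketch of the usual reasoning: for a pre-activation $z = \sum_{j=1}^{n_{\text{in}}} w_j a_j$ with i.i.d. zero-mean weights and ReLU activations $a_j = \max(0, h_j)$ from the previous layer, roughly half of the $h_j$ are zeroed out, so $\mathbb{E}[a_j^2] \approx \tfrac{1}{2}\operatorname{Var}(h_j)$ and

$$
\operatorname{Var}(z) = n_{\text{in}}\,\operatorname{Var}(w)\,\mathbb{E}[a^2] \approx \frac{n_{\text{in}}}{2}\,\operatorname{Var}(w)\,\operatorname{Var}(h).
$$

Setting $\operatorname{Var}(w) = 2/n_{\text{in}}$ makes this factor one, so the signal variance is preserved from layer to layer and neither shrinks nor blows up with depth. Only the fan-in appears because this derivation preserves forward-propagated variance; the symmetric backward-pass argument uses the fan-out instead, and He et al. note that either choice is sufficient in practice.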
Please post your questions here about: “Training and Taming Deep Networks” & “The Expanding Universe of Deep Learning Models”, Thinking with Deep Learning, chapters 3 & 4.