
Week 2. Jan 17: Deep Architectures, Training & Taming - Possibilities #5


@ShiyangLai (Collaborator)

Post a question about one of the following possible readings:

"“LoRA: Low-Rank Adaptation of Large Language Models.". 2021. Edward Hu, Yuanzhi Li, Yelong Shen, Shean Wang, Phillip Wallis, Lu Wang, Zeyuan Allen-Zhu, Weizhu Chen. & “QLoRA: Efficient Finetuning of Quantized LLMsLinks to an external site..” 2023. Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer.

"Dropout: A Simple Way to Prevent Neural Networks from Overfitting." 2014. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov. Journal of Machine Learning Research 15: 1929-1958.

"Graph Structure of Neural Networks." 2020. J. You, J. Leskovec, K. He, S. Xie. ICML, PMLR 119:10881-10891.

"Scaling Laws for Neural Language Models." 2020. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., ... & Amodei, D. arXiv preprint arXiv:2001.08361.

"Visualizing the Loss Landscape of Neural Nets." 2018. Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, Tom Goldstein. NeurIPS.


kiddosso commented on Jan 16, 2025

In the scaling-laws paper, the authors argue that the point at which the scaling laws break down indicates the maximal performance of transformer models. When the paper came out, we were still far from the point the authors projected. However, the rapid development of AI infrastructure has since allowed new models to approach this theoretical limit. Is there still room for scaling laws today, or are we witnessing their near end? What alternative approaches could define AI's trajectory if scaling turns ineffective?
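For concreteness, the paper's joint parameter/data law can be evaluated numerically. The sketch below uses the approximate constants Kaplan et al. (2020) report for L(N, D); the constants and the fixed-token loop are illustrative, not a claim about current frontier models.

```python
# Sketch of the joint scaling law from Kaplan et al. (2020):
#   L(N, D) = [(N_c / N)^(alpha_N / alpha_D) + D_c / D]^alpha_D
# Constants are the paper's approximate fits (loss in nats).
ALPHA_N, ALPHA_D = 0.076, 0.095
N_C, D_C = 8.8e13, 5.4e13  # non-embedding parameters, tokens

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted test loss for a model of n_params trained on n_tokens."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

# At a fixed data budget, returns to parameter scaling flatten out:
for n in (1e9, 1e10, 1e11, 1e12):
    print(f"N={n:.0e}, D=3e11 tokens -> L={predicted_loss(n, 3e11):.3f}")
```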

hchen0628 commented on Jan 17, 2025

The LoRA method provides a parameter-efficient solution for adapting large language models like GPT-3 by freezing pre-trained weights and optimizing low-rank matrices. This reduces the computational cost and memory requirements significantly, enabling scalable adaptation to multiple tasks without additional inference latency. However, it relies on the assumption that task-specific updates can be effectively captured within a low-rank structure.

My questions are: First, how robust is LoRA when the downstream task involves significant domain shifts or highly dynamic data, such as rapidly evolving trends in social media or public discourse? Would its efficiency compromise adaptability in such cases? Second, given that LoRA depends on frozen pre-trained weights, how might existing biases in those weights influence its application in tasks with social or ethical implications? Are there practical methods to mitigate such biases while maintaining its computational advantages?
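As a point of reference for the LoRA questions in this thread, here is a minimal sketch of a LoRA-adapted linear layer. Class and argument names are illustrative; the frozen base weights plus a zero-initialized B factor follow the construction in Hu et al. (2021).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update dW = B @ A."""
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        for p in self.base.parameters():
            p.requires_grad_(False)  # pre-trained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # random init
        self.B = nn.Parameter(torch.zeros(d_out, r))        # zero init: dW = 0 at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```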

baihuiw commented on Jan 17, 2025

LoRA freezes the pre-trained weights of the language model and learns task-specific updates through low-rank matrices A and B. Instead of fine-tuning all parameters, LoRA introduces trainable rank-decomposition matrices to approximate the weight updates. It also reduces the number of trainable parameters by orders of magnitude compared to full fine-tuning, which makes fine-tuning large models feasible on limited hardware resources, with lower memory consumption and computational cost. The paper suggests exploring LoRA's effectiveness in domains beyond NLP, so my question is: what are potential extensions or combinations of LoRA with other efficient adaptation methods, and how might the selection of weight matrices for LoRA application be improved using a more principled approach?
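One detail worth making explicit: because the update BA has the same shape as W, it can be folded into the frozen weight after training, which is why LoRA adds no inference latency. A standalone sketch with illustrative shapes:

```python
import torch

d_out, d_in, r, alpha = 768, 768, 8, 16.0
W = torch.randn(d_out, d_in)      # frozen pre-trained weight
A = torch.randn(r, d_in) * 0.01   # trained low-rank factors
B = torch.randn(d_out, r) * 0.01
W_merged = W + (alpha / r) * (B @ A)  # deploy W_merged alone: no extra latency
```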

yilmazcemal commented on Jan 17, 2025

Li et al. offer an interesting method to visualize and learn from loss landscapes. Reading through the paper, I learned that "skip connections," which skip layers in the neural network, improve trainability by producing loss surfaces that are easier to minimize. Although it's not the paper's main topic, I was interested in why this is the case. Do we have ideas about how and why this happens? Is there a trade-off with having skip connections in a model? Do we lose something (information, features, or some other element important to our data and inference) when we introduce skip connections? Or are they a pure improvement over "orderly" layered architectures, and if so, is there an optimum number or structure of skip connections?
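For anyone who wants to reproduce the paper's pictures: the core of Li et al.'s method is a 2-D slice of the loss along two random, filter-normalized directions. The sketch below (PyTorch, names illustrative) shows the direction construction, which is what makes surfaces comparable across architectures.

```python
import torch

def filter_normalized_direction(params):
    """Random direction rescaled filter-wise to the weights' norms (Li et al.)."""
    dirs = []
    for p in params:
        d = torch.randn_like(p)
        if p.dim() > 1:  # normalize each filter (row of the flattened weight)
            p2, d2 = p.flatten(1), d.flatten(1)
            scale = p2.norm(dim=1, keepdim=True) / (d2.norm(dim=1, keepdim=True) + 1e-10)
            d = (d2 * scale).view_as(p)
        dirs.append(d)
    return dirs

# The surface is then L(theta + a*d1 + b*d2) evaluated over a grid of (a, b),
# with theta the trained weights and d1, d2 two independent directions.
```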

yangyuwang commented on Jan 17, 2025

The paper Graph Structure of Neural Networks explores how relational graphs—which treat input and output units with the same index as a single node—can serve as a framework for evaluating neural network efficiency. The study finds that a network’s performance is a smooth function of its clustering coefficient (C) and average path length (L). Furthermore, there exists an optimal range of these metrics—a "sweet spot"—where networks exhibit superior performance.

This finding parallels the Small-World Hypothesis (Milgram, 1967) in social network analysis (SNA), which was formally described by Watts & Strogatz (1998) using C and L. The hypothesis suggests that real-world networks tend to be highly clustered (high C) while still maintaining short distances between individuals (low L)—a structure that appears to enhance efficiency in neural networks as well.

Given this, I wonder whether other insights from SNA, graph theory, or biological neural networks could be leveraged to optimize neural architectures further. For instance, in SNA, clustering coefficient (C) is related to transitivity. If the "sweet spot" for neural networks corresponds to an intermediate C, does this imply that a middle transitivity in a relational graph would enhance its performance? If so, could this insight inform weight initialization strategies—such as prioritizing stronger connections between select nodes rather than averaging weights across all connections?
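The Watts-Strogatz model makes the (C, L) trade-off the comment describes easy to see numerically; a small sketch, assuming networkx is installed and with illustrative graph sizes:

```python
import networkx as nx

# Rewiring probability p interpolates between a high-C/high-L ring lattice
# (p=0) and a low-C/low-L random graph (p=1); small worlds sit in between.
for p in (0.0, 0.01, 0.1, 1.0):
    g = nx.connected_watts_strogatz_graph(n=64, k=8, p=p, seed=0)
    C = nx.average_clustering(g)
    L = nx.average_shortest_path_length(g)
    print(f"p={p:<4} C={C:.3f} L={L:.3f}")
```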

JairusJia commented on Jan 17, 2025

The paper Dropout: A Simple Way to Prevent Neural Networks from Overfitting mentions that dropout randomly chooses units to drop. I wonder whether the randomness of dropout could lead to performance degradation on certain specific data distributions.

MaoYingrong commented on Jan 17, 2025

The paper "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" introduce randomly dropping units during training as a regularization technique to prevent overfitting. I'm wondering whether the randomness of dropout may lead to instability in training. How to keep the robustness of hyperparameter selection?

psymichaelzhu commented on Jan 17, 2025

Srivastava et al. (2014) introduce a dropout method that reduces overfitting by randomly dropping units with a fixed probability. I was wondering whether it would be beneficial to consider unit importance during dropout: if the importance of units could be dynamically evaluated during training and used to adjust dropout probabilities accordingly, would this improve model generalization? What specific design ideas or technical challenges would need to be addressed to implement this approach?
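To make the proposal concrete, here is one hypothetical way importance-adjusted dropout could look. Nothing here is from Srivastava et al.; the importance proxy and the probability mapping are arbitrary illustrative choices.

```python
import torch

def importance_dropout(x: torch.Tensor, importance: torch.Tensor,
                       training: bool = True) -> torch.Tensor:
    """x: (batch, features); importance: (features,) running score per unit."""
    if not training:
        return x
    # Map importance to keep probabilities in [0.25, 0.95]: units judged
    # more important are dropped less often (an illustrative mapping).
    w = (importance - importance.min()) / (importance.max() - importance.min() + 1e-8)
    keep = 0.25 + 0.7 * w
    mask = torch.bernoulli(keep.expand_as(x))
    return x * mask / keep  # inverted-dropout rescaling keeps E[output] = x
```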

zhian21 commented on Jan 17, 2025

According to Srivastava et al. (2014), dropout is a technique used to improve neural networks by reducing overfitting, which occurs when a model learns the training data too well and fails to generalize to new, unseen data. Traditional regularization methods, such as L1 and L2 weight decay, add constraints to the model to prevent it from becoming overly complex. In contrast, dropout introduces a stochastic element by randomly dropping units during training, which appears to conflict with regularization's goal of model stability. However, this randomness forces the network to learn more robust features by preventing co-adaptations between neurons. The success of dropout raises the following question: to what extent can the performance improvements achieved by dropout be attributed to its stochastic nature versus its effect as a form of regularization?
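The two views can be separated in code: training-time stochasticity samples a random subnetwork, while test time uses the deterministic expectation, which is the model-averaging side of the same procedure. A minimal sketch (the "inverted" scaling shown here is the common equivalent of the paper's test-time weight scaling):

```python
import torch

def dropout(x: torch.Tensor, p_drop: float = 0.5, training: bool = True) -> torch.Tensor:
    if training:
        mask = (torch.rand_like(x) > p_drop).float()  # sample a random subnetwork
        return x * mask / (1.0 - p_drop)  # rescale so E[output] = x
    return x  # test time: deterministic average over the ensemble of subnetworks
```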

xpan4869 commented on Jan 17, 2025

Srivastava et al. (2014) introduced how dropout can address the problem of overfitting. I would like to know whether dropout affects the choice of learning rate, and if so, how. In addition, how is dropout related to Bayesian neural networks? In theory, is there a way to predict the optimal dropout rate?

chychoy commented on Jan 17, 2025

In the paper about the dropout method, it is specified that randomly dropping hidden units adds noise during training and therefore improves the model's generalizability. Specifically, the method "breaks up these co-adaptations by making the presence of any particular hidden unit unreliable." However, I am interested in how the dropout method applies to the Recurrent Neural Networks (RNNs) discussed in the orienting readings. Since we are randomly dropping hidden units, wouldn't this hurt the model's ability to retain its memory during training? How could we remedy that, or are dropout methods not recommended for RNNs?
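This concern is real, and one widely used remedy (from later work on recurrent dropout, not from the Srivastava et al. paper) is to sample a single mask per sequence and reuse it at every timestep, so the hidden state sees consistent noise rather than fresh corruption at each step. A minimal sketch:

```python
import torch

def locked_dropout(x: torch.Tensor, p_drop: float = 0.3, training: bool = True) -> torch.Tensor:
    """x: (seq_len, batch, features); one mask is shared across all timesteps."""
    if not training or p_drop == 0.0:
        return x
    mask = (torch.rand(1, x.size(1), x.size(2), device=x.device) > p_drop).float()
    return x * mask / (1.0 - p_drop)  # mask broadcasts over the time dimension
```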

ana-yurt commented on Jan 17, 2025

After reading "LoRA: Low-Rank Adaptation of Large Language Models," I am curious whether this principle of reducing parameters by decomposing weight updates into low-rank matrices can be applied during the initial pre-training of large language models.

Sam-SangJoonPark commented on Jan 17, 2025

The paper Graph Structure of Neural Networks systematically investigates the relationship between the graph structure of neural networks and their predictive performance. The authors propose a new representation, the relational graph, which represents a neural network as rounds of message exchange between nodes rather than as a raw computational graph.

My questions are:

We have learned throughout our exploration of models that there are trade-offs between various characteristics—for example, reducing the batch size speeds up computation but increases oscillation, or increasing computational complexity improves accuracy but slows down calculations. In this context, what could be the drawbacks of attempting to represent existing networks using this simplified graph structure? Could there be any information loss? Additionally, how could we evaluate its effectiveness in real-world applications?
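On the information-loss question: in You et al.'s construction, a relational graph effectively sparsifies a fixed-width MLP, with weights only where edges exist, so some capacity is pruned by design. A rough sketch of that mapping (the function name and the uniform channel partition are illustrative simplifications of the paper's construction):

```python
import torch
import torch.nn as nn

def masked_linear(adj: torch.Tensor, channels_per_node: int) -> nn.Linear:
    """adj: (n, n) float adjacency. Weights exist only between connected nodes."""
    n = adj.size(0)
    layer = nn.Linear(n * channels_per_node, n * channels_per_node, bias=False)
    # Expand the node-level adjacency (plus self-loops) to a channel-level mask.
    block = torch.ones(channels_per_node, channels_per_node)
    mask = torch.kron((adj + torch.eye(n)).clamp(max=1.0), block)
    layer.weight.data *= mask
    layer.weight.register_hook(lambda g: g * mask)  # keep pruned weights at zero
    return layer
```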

tonyl-code commented on Jan 17, 2025

The paper Graph Structure of Neural Networks explores the graph space for designing neural architectures. Since some of the architectures are relevant to biology and neuroscience, I was wondering whether we can interpret the graphs the way we interpret brain networks, with certain areas "lighting up" depending on our data. Furthermore, does the act of choosing a "sweet spot" graph yield insights into the properties of the data? For example, suppose we are analyzing text and the emotions in it. Could the graph properties yield some insight into how the model picks up on emotions, just as human brains "light up" in certain areas when they experience them?

tyeddie commented on Jan 17, 2025

In the reading Dropout: A Simple Way to Prevent Neural Networks from Overfitting, the authors claim that dropout outperforms other regularization methods in many applications of deep learning. Is this a violation of the "no free lunch" theorem, or are there many variants of dropout that make big differences in model performance and generalization?
