Deep learning has achieved great success in images, speech, and text, driving a wave of intelligent products. But deep models have a drawback: they carry a large number of parameters and require heavy computation for both training and inference. Today, most deep-learning-based products are driven by server-side compute and therefore depend heavily on a good network connection.
In many cases, for reasons of response latency, service availability, and privacy, we would rather deploy the model locally (e.g., on a smartphone). To do so we must solve the model compression problem: bringing model size, memory footprint, and power consumption down to what the local device can afford.
Neural networks are distributed by nature: feature representations and computation are spread across many layers and parameters, so the structure is inherently redundant. This redundancy is what makes compression possible.
Several approaches to compressing a model are common:
Designing small models: model size can be treated as an explicit constraint when designing and selecting the architecture. For fully connected layers, a bottleneck is an effective trick (e.g., LSTMP). Models with skip connections, such as Highway networks, ResNet, and DenseNet, have also been used to build narrow-but-deep networks that cut the overall parameter count and computation. For CNNs, SqueezeNet uses small 1 x 1 convolution kernels, fewer feature maps, and related tricks to match AlexNet's classification accuracy with roughly one fiftieth of the parameters; with further compression the model fits in under 1 MB.
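As a concrete illustration of the 1 x 1 "squeeze" idea, here is a minimal PyTorch sketch of a SqueezeNet-style Fire module; the channel counts in the usage comment are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """SqueezeNet-style Fire module (sketch): a 1x1 "squeeze" conv cuts the
    channel count, then parallel 1x1 and 3x3 "expand" convs restore it."""
    def __init__(self, in_ch, squeeze_ch, expand_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

# Illustrative: Fire(96, 16, 64) maps 96 channels to 128 output channels with
# far fewer parameters than a single 3x3 convolution of the same output width.
```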
Distilling large models into small ones: in general, a large model is easier to train to good performance than a small one. Can a smaller model then "extract" the knowledge of a trained large model, so that on a given task the small model reaches or approaches the large model's accuracy? Knowledge distilling (e.g., 1, 2) tries to answer this question: the large model's outputs are used as soft targets to train the small model, "condensing" its knowledge. Experiments show that distillation works well on MNIST and on acoustic modeling tasks.
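The following is a minimal sketch of the soft-target training objective in the spirit of Hinton et al.'s knowledge distillation; the temperature `T` and mixing weight `alpha` are illustrative hyperparameters, not values prescribed by the papers cited below.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Soft-target distillation loss (sketch): KL between teacher and student
    distributions at temperature T, mixed with the usual hard-label loss."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)   # rescale gradients by T^2
    hard = F.cross_entropy(student_logits, labels)     # ground-truth labels
    return alpha * soft + (1.0 - alpha) * hard
```

In practice the teacher's logits are computed once under `torch.no_grad()` and only the student is updated.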
We can also introduce sparsity into the model structure to reduce the number of parameters.
Pruning an existing model: pruning trained models goes back at least to the 1990s. Optimal Brain Damage and Optimal Brain Surgeon use (approximate) second-order derivative information to remove connections whose removal barely affects performance, thereby shrinking the model.
Learning sparse structures: sparsity can also be obtained through training. A more recent line of work (Deep Compression: a, b, c, and HashedNets) learns sparse model structures while keeping performance under control, compressing models dramatically.
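As a rough illustration of how sparsity is typically imposed, the sketch below performs simple magnitude-based pruning (the pruning step used in Deep Compression); the 90% sparsity target and the function name `magnitude_prune` are assumptions made for the example.

```python
import torch

def magnitude_prune(weight, sparsity=0.9):
    """Zero out the smallest-magnitude entries of `weight` in place and
    return the binary mask (to be re-applied after every gradient update)."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values   # k-th smallest |w|
    mask = (weight.abs() > threshold).float()
    weight.data.mul_(mask)
    return mask
```

Deep Compression then retrains the surviving weights and iterates the prune-retrain cycle, which is what lets it reach high sparsity without losing accuracy.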
Unlike traditional high-performance computing, neural networks do not require high numerical precision. Today essentially all networks are trained with single-precision floating point (a fact that largely shapes GPU architecture design). A headline feature of NVIDIA's Pascal architecture is native support for half-precision (half float) arithmetic. On the server side, FPGAs and other special-purpose hardware are widely deployed in data centers and mostly use low-precision (8-bit) fixed-point arithmetic.
Besides reducing floating-point precision (float32 → float16), quantizing the parameters is another effective way to simplify a model. Quantization brings two advantages (a minimal 8-bit quantization sketch follows this list):
- Smaller models: quantizing 32- or 16-bit floats to 8-bit (or lower) fixed-point numbers greatly reduces the space the model occupies;
- Faster computation: compared with full floating-point arithmetic, quantized fixed-point operations are much easier to accelerate on special-purpose hardware (FPGA, ASIC).
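Below is a minimal sketch of symmetric linear quantization of a weight tensor to 8-bit integers, just to make the storage trade-off concrete; real deployment toolchains add calibration data, per-channel scales, and often quantization-aware training, none of which are shown here.

```python
import torch

def quantize_int8(w):
    """Symmetric linear quantization (sketch): map the largest magnitude to 127
    and store the weights as int8 plus a single float scale factor."""
    scale = w.abs().max().clamp(min=1e-8) / 127.0   # avoid division by zero
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor, e.g. to check the accuracy drop."""
    return q.to(torch.float32) * scale
```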
The Deep Compression work mentioned above quantizes the network with varying numbers of bits. Lin et al. analyze, theoretically, the optimal quantization strategy for CNNs under the constraint of no performance loss. There is also related work on quantizing CNN and RNN weights.
Binarizing parameters: the extreme of quantization is binarization, where each parameter occupies only a single bit.
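A BinaryConnect-style forward pass with a straight-through estimator (STE) might look like the sketch below; the class name `BinarizeSTE` and the |w| <= 1 gradient-clipping rule are simplifications for illustration, not the exact training procedure of any one paper in the list that follows.

```python
import torch
import torch.nn.functional as F

class BinarizeSTE(torch.autograd.Function):
    """Binarize weights to ±1 in the forward pass; in the backward pass let the
    gradient flow straight through (zeroed where |w| > 1)."""
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)

    @staticmethod
    def backward(ctx, grad_output):
        (w,) = ctx.saved_tensors
        return grad_output * (w.abs() <= 1).float()

# Illustrative use inside a layer: real-valued weights are kept for the update,
# only the binarized copy is used in the computation.
#   w_bin = BinarizeSTE.apply(self.weight)
#   out = F.linear(x, w_bin)
```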
- https://www.cnblogs.com/zhonghuasong/p/7822572.html
- BinaryConnect: Training Deep Neural Networks with binary weights during propagations, Matthieu Courbariaux, 2015
- Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or −1, Matthieu Courbariaux, 2016 (PyTorch code)
- Straight Through Estimator (STE), Yoshua Bengio
- Quantized Neural Networks, Itay Hubara
- DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients, Shuchang Zhou, 2016
- XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks, Mohammad Rastegari, 2016
- TNN: Ternary Neural Networks for Resource-Efficient AI Applications, Hande Alemdar, 2016
- Efficient Processing of Deep Neural Networks: A Tutorial and Survey
- A Survey of Model Compression and Acceleration for Deep Neural Networks
- Data Distillation: Towards Omni-Supervised Learning
- PAD-Net: Multi-Tasks Guided Prediction-and-Distillation Network for Simultaneous Depth Estimation and Scene Parsing
- Fast and Accurate Single Image Super-Resolution via Information Distillation Network
- Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy
- Training Shallow and Thin Networks for Acceleration via Knowledge Distillation with Conditional Adversarial Networks
- Combining labeled and unlabeled data with co-training, A. Blum, T. Mitchell, 1998
- Model Compression, Rich Caruana, 2006
- Dark knowledge, Geoffrey Hinton, Oriol Vinyals, Jeff Dean, 2014
- Learning with Pseudo-Ensembles, Philip Bachman, Ouais Alsharif, Doina Precup, 2014
- Distilling the Knowledge in a Neural Network, Geoffrey Hinton, Oriol Vinyals, Jeff Dean, 2015
- Cross Modal Distillation for Supervision Transfer, Saurabh Gupta, Judy Hoffman, Jitendra Malik, 2015
- Heterogeneous Knowledge Transfer in Video Emotion Recognition, Attribution and Summarization, Baohan Xu, Yanwei Fu, Yu-Gang Jiang, Boyang Li, Leonid Sigal, 2015
- Distilling Model Knowledge, George Papamakarios, 2015
- Learning Using Privileged Information: Similarity Control and Knowledge Transfer, Vladimir Vapnik, Rauf Izmailov, 2015
- Unifying distillation and privileged information, David Lopez-Paz, Léon Bottou, Bernhard Schölkopf, Vladimir Vapnik, 2015
- Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks, Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, Ananthram Swami, 2016
- Do deep convolutional nets really need to be deep and convolutional?, Gregor Urban, Krzysztof J. Geras, Samira Ebrahimi Kahou, Ozlem Aslan, Shengjie Wang, Rich Caruana, Abdelrahman Mohamed, Matthai Philipose, Matt Richardson, 2016
- Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer, Sergey Zagoruyko, Nikos Komodakis, 2016
- FitNets: Hints for Thin Deep Nets, Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, Yoshua Bengio, 2015
- Deep Model Compression: Distilling Knowledge from Noisy Teachers, Bharat Bhusan Sau, Vineeth N. Balasubramanian, 2016
- Knowledge Distillation for Small-footprint Highway Networks, Liang Lu, Michelle Guo, Steve Renals, 2016
- Sequence-Level Knowledge Distillation, Yoon Kim, Alexander M. Rush, 2016
- MobileID: Face Model Compression by Distilling Knowledge from Neurons, Ping Luo, Zhenyao Zhu, Ziwei Liu, Xiaogang Wang, Xiaoou Tang, 2016
- Recurrent Neural Network Training with Dark Knowledge Transfer, Zhiyuan Tang, Dong Wang, Zhiyong Zhang, 2016
- Adapting Models to Signal Degradation using Distillation, Jong-Chyi Su, Subhransu Maji, 2016
- Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results, Antti Tarvainen, Harri Valpola, 2017
- Data-Free Knowledge Distillation For Deep Neural Networks, Raphael Gontijo Lopes, Stefano Fenu, 2017
- Like What You Like: Knowledge Distill via Neuron Selectivity Transfer, Zehao Huang, Naiyan Wang, 2017
- Learning Loss for Knowledge Distillation with Conditional Adversarial Networks, Zheng Xu, Yen-Chang Hsu, Jiawei Huang, 2017
- DarkRank: Accelerating Deep Metric Learning via Cross Sample Similarities Transfer, Yuntao Chen, Naiyan Wang, Zhaoxiang Zhang, 2017
- Knowledge Projection for Deep Neural Networks, Zhi Zhang, Guanghan Ning, Zhihai He, 2017
- Moonshine: Distilling with Cheap Convolutions, Elliot J. Crowley, Gavin Gray, Amos Storkey, 2017
- Local Affine Approximators for Improving Knowledge Transfer, Suraj Srinivas, Francois Fleuret, 2017
- Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model, Jiasen Lu, Anitha Kannan, Jianwei Yang, Devi Parikh, Dhruv Batra, 2017
- Learning Efficient Object Detection Models with Knowledge Distillation, Guobin Chen, Wongun Choi, Xiang Yu, Tony Han, Manmohan Chandraker, 2017
- Model Distillation with Knowledge Transfer from Face Classification to Alignment and Verification, Chong Wang, Xipeng Lan, Yangang Zhang, 2017
- Learning Transferable Architectures for Scalable Image Recognition, Barret Zoph, Vijay Vasudevan, Jonathon Shlens, Quoc V. Le, 2017
- Revisiting knowledge transfer for training object class detectors, Jasper Uijlings, Stefan Popov, Vittorio Ferrari, 2017
- A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning, Junho Yim, Donggyu Joo, Jihoon Bae, Junmo Kim, 2017
- Rocket Launching: A Universal and Efficient Framework for Training Well-performing Light Net, Guorui Zhou et al., 2017
- Data Distillation: Towards Omni-Supervised Learning, Ilija Radosavovic, Piotr Dollár, Ross Girshick, Georgia Gkioxari, Kaiming He, 2017
- Interpreting Deep Classifiers by Visual Distillation of Dark Knowledge, Kai Xu, Dae Hoon Park, Chang Yi, Charles Sutton, 2018
- Efficient Neural Architecture Search via Parameters Sharing, Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, Jeff Dean, 2018
- Transparent Model Distillation, Sarah Tan, Rich Caruana, Giles Hooker, Albert Gordo, 2018
- Defensive Collaborative Multi-task Training - Defending against Adversarial Attack towards Deep Neural Networks, Derek Wang, Chaoran Li, Sheng Wen, Yang Xiang, Wanlei Zhou, Surya Nepal, 2018
- Deep Co-Training for Semi-Supervised Image Recognition, Siyuan Qiao, Wei Shen, Zhishuai Zhang, Bo Wang, Alan Yuille, 2018
- Feature Distillation: DNN-Oriented JPEG Compression Against Adversarial Examples, Zihao Liu, Qi Liu, Tao Liu, Yanzhi Wang, Wujie Wen, 2018
- Multimodal Recurrent Neural Networks with Information Transfer Layers for Indoor Scene Labeling, Abrar H. Abdulnabi, Bing Shuai, Zhen Zuo, Lap-Pui Chau, Gang Wang, 2018
- Large scale distributed neural network training through online distillation, Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E. Dahl, Geoffrey E. Hinton, 2018
- Dark knowledge, Geoffrey Hinton, 2014
- Model Compression, Rich Caruana, 2006
- Attention Transfer
- Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model
- Interpreting Deep Classifiers by Visual Distillation of Dark Knowledge
- A PyTorch implementation for exploring deep and shallow knowledge distillation (KD) experiments with flexibility
- Mean teachers are better role models
- Distilling knowledge to specialist ConvNets for clustered classification
- Sequence-Level Knowledge Distillation, Neural Machine Translation on Android
- cifar.torch distillation
- FitNets: Hints for Thin Deep Nets
- Transfer knowledge from a large DNN or an ensemble of DNNs into a small DNN
- Deep Model Compression: Distilling Knowledge from Noisy Teachers
- Distillation
- An example application of neural network distillation to MNIST
- Data-free Knowledge Distillation for Deep Neural Networks
- Inspired by net2net, network distillation
- Deep Reinforcement Learning, knowledge transfer
- Knowledge Distillation using Tensorflow