\input{common/header.tex}
\inbpdocument
\chapter{Deep Gaussian Processes}
\label{ch:deep-limits}
\begin{quotation}
``I asked myself: On any given day, would I rather be wrestling with a sampler, or proving theorems?''
\hspace*{\fill} \emph{ -- Peter Orbanz}, personal communication
\end{quotation}
%A less structured, but more flexible alternative to the \gp{} models explored in this thesis
For modeling high-dimensional functions, deep neural networks are a popular alternative to the Gaussian process models explored earlier in this thesis.
When training deep neural networks, choosing appropriate architectures and regularization strategies is important for good predictive performance.
In this chapter, we propose to study this problem by viewing deep neural networks as priors on functions.
By viewing neural networks this way, one can analyze their properties without reference to any particular dataset, loss function, or training method.
%Instead, we one ask what sorts of information-processing structures these priors give rise to, and check whether those structures are the same sort which we expect to find in useful models.
%To shed light on this problem, we analyze the analogous problem of constructing useful priors on compositions of functions.
As a starting point, we will relate neural networks to Gaussian processes, and examine a class of infinitely-wide, deep neural networks called \emph{deep Gaussian processes}: compositions of functions drawn from \gp{} priors.
Deep \gp{}s are an attractive model class to study for several reasons.
First, \citet{damianou2012deep} showed that the probabilistic nature of deep \gp{}s guards against overfitting.
Second, \citet{hensman2014deep} showed that stochastic variational inference is possible in deep \gp{}s, allowing mini-batch training on large datasets.
Third, the availability of an approximation to the marginal likelihood allows one to automatically tune the model architecture without the need for cross-validation.
%Together, these results suggest that deep \gp{}s are a promising alternative to neural nets.
%For the analysis in this chapter,
Finally, deep \gp{}s are attractive from a model-analysis point of view, because they remove some of the details of finite neural networks. %, such as the number of hidden units
Our analysis will show that the representational capacity of standard deep architectures tends to decrease as the number of layers increases, retaining only a single degree of freedom in the limit.
We propose an alternate network architecture, connecting the input to each layer, which does not suffer from this pathology.
We also examine \emph{deep kernels}, obtained by composing arbitrarily many fixed feature transforms.
The ideas contained in this chapter were developed through discussions with Oren Rippel, Ryan Adams and Zoubin Ghahramani, and appear in \citet{DuvRipAdaGha14}.
%\section{Relations to Neural Networks}
%\section{Relating Neural Networks to Deep Gaussian\\ \mbox{Processes}}
\section{Relating deep neural networks to deep \sgp{}s}
\label{sec:relating}
This section gives a precise definition of deep \gp{}s, reviews the relationship between neural networks and Gaussian processes, and gives two equivalent ways of constructing neural networks which give rise to deep \gp{}s.
\subsection{Definition of deep \sgp{}s}
We define a deep \gp{} as a distribution on functions constructed by composing functions drawn from \gp{} priors.
An example of a deep \gp{} is a composition of vector-valued functions, with each function drawn independently from \gp{} priors:
%
\begin{align}
\vf^{(1:L)}(\vx) = \vf^{(L)}(\vf^{(L-1)}(\dots \vf^{(2)}(\vf^{(1)}(\vx)) \dots)) \\
%\nonumber\\
\textnormal{with each} \quad f_d^{(\layerindex)} \simind \gp{} \left( 0, k^{(\layerindex)}_d(\vx, \vx') \right) \nonumber
\label{eq:deep-gp}
\end{align}
%
%\citet{damianou2012deep} also use the term to refer to deep latent-variable models i
Multilayer neural networks also implement compositions of vector-valued functions, one per layer.
Therefore, understanding general properties of function compositions might help us to gain insight into deep neural networks.
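To make this composition concrete, the following minimal NumPy sketch (with an arbitrary kernel, grid, and depth chosen purely for illustration) draws a finite-grid sample from a one-dimensional deep \gp{} by repeatedly sampling each layer's function at the points produced by the layer below; a small jitter term keeps the Cholesky factorization numerically stable:
\begin{verbatim}
import numpy as np

def se_kernel(x, xp, variance=1.0, lengthscale=1.0):
    # Squared-exponential kernel: variance * exp(-(x - x')^2 / (2 lengthscale^2)).
    return variance * np.exp(-0.5 * (x[:, None] - xp[None, :])**2 / lengthscale**2)

def sample_gp_layer(inputs, rng):
    # Jointly sample one GP function value at each input location.
    K = se_kernel(inputs, inputs)
    chol = np.linalg.cholesky(K + 1e-6 * np.eye(len(inputs)))
    return chol @ rng.standard_normal(len(inputs))

rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 300)
h = x.copy()
for layer in range(10):
    # Feeding each layer's outputs in as the next layer's inputs yields a
    # draw from the composition f^(1:L) evaluated on the grid.
    h = sample_gp_layer(h, rng)
\end{verbatim}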
%\subsection{A neural net with one hidden layer}
%\vspace{-0.05in}
\subsection{Single-hidden-layer models}
First, we relate neural networks to standard ``shallow'' Gaussian processes, using the standard neural network architecture known as the multi-layer perceptron (\MLP{})~\citep{rosenblatt1962principles}.
In the typical definition of an \MLP{} with one hidden layer, the hidden unit activations are defined as:
%
\begin{align}
\vh(\vx) = \sigma \left( \vb + \munitparams \vx \right)
\end{align}
%
where $\vh$ are the hidden unit activations, $\vb$ is a bias vector, $\munitparams$ is a weight matrix and $\sigma$ is a one-dimensional nonlinear function, usually a sigmoid, applied element-wise. The output vector $\vf(\vx)$ is simply a weighted sum of these hidden unit activations:
%
\begin{align}
\vf(\vx) = \mnetweights \sigma \left( \vb + \munitparams \vx \right) = \mnetweights \vh(\vx)
\label{eq:one-layer-nn}
\end{align}
%
where $\mnetweights$ is another weight matrix.
%There exists a correspondence between one-layer \MLP{}s and \gp{}s
\citet[chapter 2]{neal1995bayesian} showed that some neural networks with infinitely many hidden units, one hidden layer, and unknown weights correspond to Gaussian processes.
More precisely, for any model of the form
%
\begin{align}
f(\vx) = \frac{1}{K}{\mathbf \vnetweights}\tra \hPhi(\vx) = \frac{1}{K} \sum_{i=1}^K \netweights_i \hphi_i(\vx),
\label{eq:one-layer-gp}
\end{align}
%
with fixed%
\footnote{The above derivation gives the same result if the parameters of the hidden units are random, since in the infinite limit, their distribution on outputs is always the same with probability one.
However, to avoid confusion, we refer to layers with infinitely-many nodes as ``fixed''.
}
features $\left[ \hphi_1(\vx), \dots, \hphi_K(\vx) \right]\tra = \hPhi(\vx)$ and i.i.d. $\netweights$'s with zero mean and finite variance $\sigma^2$, the central limit theorem implies that as the number of features $K$ grows, any two function values $f(\vx)$ and $f(\vx')$ have a joint distribution approaching a Gaussian:
%
\begin{align}
\lim_{K \to \infty} p\left( \colvec{f(\vx)}{f(\vx')} \right) = \Nt{\colvec{0}{0}}{
\frac{\sigma^2}{K} \left[ \begin{array}{cc}
\sum_{i=1}^K \hphi_i(\vx)\hphi_i(\vx) &
\sum_{i=1}^K \hphi_i(\vx)\hphi_i(\vx') \\
\sum_{i=1}^K \hphi_i(\vx')\hphi_i(\vx) &
\sum_{i=1}^K \hphi_i(\vx')\hphi_i(\vx')
\end{array} \right] }
\end{align}
A joint Gaussian distribution between any set of function values is the definition of a Gaussian process.
%Thus we can say that $f \sim \GPt{0}{
The result is surprisingly general:
it puts no constraints on the features (other than having uniformly bounded activation), nor does it require that the feature weights $\vnetweights$ be Gaussian distributed.
An \MLP{} with a finite number of nodes also gives rise to a \gp{}, but only if the distribution on $\vnetweights$ is Gaussian.
One can also work backwards to derive a one-layer \MLP{} from any \gp{}:
Mercer's theorem, discussed in \cref{sec:mercer}, implies that any positive-definite kernel function corresponds to an inner product of features: $k(\vx, \vx') = \hPhi(\vx) \tra \hPhi(\vx')$.
%An \MLP{} having a finite number of fixed hidden nodes gives rise to a \gp{} if and only if the weights $\vnetweights$ are jointly Gaussian distributed.
Thus, in the one-hidden-layer case, the correspondence between \MLP{}s and \gp{}s is straightforward:
the implicit features $\hPhi(\vx)$ of the kernel correspond to hidden units of an \MLP{}.
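This correspondence is easy to check numerically. In the NumPy sketch below (the sigmoidal feature map and all settings are arbitrary illustrations), drawing many random output-weight vectors shows that the covariance of $f$ across weight draws is proportional to the inner product of the fixed features, while the central limit theorem makes the joint distribution of function values approximately Gaussian for large $K$:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
K = 1000                          # number of fixed hidden units (features)
x = np.array([-1.0, 0.3, 2.0])    # three test inputs

# A fixed feature map: K sigmoidal hidden units with frozen parameters.
w_in = rng.standard_normal(K)
b_in = rng.standard_normal(K)
def features(x_scalar):
    return 1.0 / (1.0 + np.exp(-(w_in * x_scalar + b_in)))

Phi = np.stack([features(xi) for xi in x])     # shape (3, K)

# f(x) = (1/K) alpha^T Phi(x) with alpha_i i.i.d. zero-mean, unit-variance.
alphas = rng.standard_normal((K, 10000))
f_samples = (Phi @ alphas) / K                 # shape (3, 10000)

# The correlations between function values match the (normalized) feature
# inner products Phi(x)^T Phi(x'); Gaussianity of the joint distribution
# follows from the central limit theorem as K grows.
emp_corr = np.corrcoef(f_samples)
kern = Phi @ Phi.T
kern_corr = kern / np.sqrt(np.outer(np.diag(kern), np.diag(kern)))
print(np.round(emp_corr, 3))
print(np.round(kern_corr, 3))
\end{verbatim}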
% For tikz figures in deep limits
\newcommand{\numdims}[0]{3}
\newcommand{\numhidden}[0]{3}
\newcommand{\upnodedist}[0]{1cm}
\newcommand{\bardist}[0]{\hspace{-0.2cm}}
\def\layersep{2.3cm}
\def\nodesep{1.3cm}
\def\nodesize{1cm}
\newcommand{\neuronfunc}[2]{
\FPeval{\result}{clip(#1+#2)}
\includegraphics[width=1cm, clip, trim=0mm 0mm 0mm 0mm]{../figures/deep-limits/two-d-draws/sqexp-draw-\result}
}
\tikzstyle{input neuron}=[neuron, fill=green!15]
\tikzstyle{output neuron}=[neuron, fill=red!15]
\tikzstyle{hidden neuron}=[neuron, fill=blue!15]
\newcommand{\indfeat}{h}
%\newcommand{\indfeat}{\phi}
\begin{figure}[t]
\begin{tabular}{c|c}
Neural net corresponding to a \gp{} & Net corresponding to a \gp{} with a deep kernel \\
\\
\null\hspace{-0.25cm}
\begin{tikzpicture}[shorten >=1pt,->,draw=black!50, node distance=\layersep]
\tikzstyle{every pin edge}=[<-,shorten <=1pt]
\tikzstyle{neuron}=[circle,fill=black!25,minimum size=17pt,inner sep=0pt]
\tikzstyle{annot} = [text width=4em, text centered]
% Draw the input layer nodes
\foreach \name / \y in {1,...,\numdims}
% This is the same as writing \foreach \name / \y in {1/1,2/2,3/3,4/4}
\node[input neuron, minimum size=\nodesize
%, pin=left:Input \#\y
] (I-\name) at (0,-\nodesep*\y) {$x_\y$};
% Draw the hidden layer nodes
% Draw the hidden layer nodes
\foreach \name / \y in {1,2}
\path[yshift=0.5cm]
node[hidden neuron, minimum size=\nodesize] (H-\name) at (\layersep,-\nodesep*\y) {$\indfeat_{\y}$};
\foreach \name / \y in {3}
\path[yshift=0.5cm]
node[hidden neuron, minimum size=\nodesize] (H-\name) at (\layersep,-\nodesep*4) {$\indfeat_\infty$};
% Draw the output layer node
\foreach \name / \y in {1,...,\numdims}
\node[output neuron, minimum size=\nodesize] (O-\name) at (2*\layersep,-\nodesep*\y) {$f_{\y}$};
% Connect every node in the input layer with every node in the hidden layer.
\foreach \source in {1,...,\numdims}
\foreach \dest in {1,...,\numhidden}
\path (I-\source) edge (H-\dest);
% Connect every node in the hidden layer with the output layer
\foreach \source in {1,...,\numhidden}
\foreach \dest in {1,...,\numdims}
\path (H-\source) edge (O-\dest);
% Annotate the layers
\node[annot,below of=H-2, node distance=1.15cm] {$\vdots$};
\node[annot,above of=I-1, node distance=\upnodedist] {Inputs};
\node[annot,above of=H-1, node distance=\upnodedist] {Fixed};
\node[annot,above of=O-1, node distance=\upnodedist] {Random};
\end{tikzpicture}
&
\begin{tikzpicture}[shorten >=1pt,->,draw=black!50, node distance=\layersep]
\tikzstyle{every pin edge}=[<-,shorten <=1pt]
\tikzstyle{neuron}=[circle,fill=black!25,minimum size=17pt,inner sep=0pt]
\tikzstyle{annot} = [text width=4em, text centered]
% Draw the input layer nodes
\foreach \name / \y in {1,...,\numdims}
% This is the same as writing \foreach \name / \y in {1/1,2/2,3/3,4/4}
\node[input neuron, minimum size=\nodesize
%, pin=left:Input \#\y
] (I-\name) at (0,-\nodesep*\y) {$x_\y$};
% Draw the first hidden layer nodes
\foreach \name / \y in {1,2}%...,\numhidden}
\path[yshift=0.5cm]
node[hidden neuron, minimum size=\nodesize] (H-\name) at (\layersep,-\nodesep*\y) {$\indfeat^{(1)}_{\y}$};
\foreach \name / \y in {3}
\path[yshift=0.5cm]
node[hidden neuron, minimum size=\nodesize] (H-\name) at (\layersep,-\nodesep*4) {$\indfeat^{(1)}_\infty$};
% Draw the secdond hidden layer nodes
\foreach \name / \y in {1,2}%...,\numhidden}
\path[yshift=0.5cm]
node[hidden neuron, minimum size=\nodesize] (H2-\name) at (2*\layersep,-\nodesep*\y) {$\indfeat^{(2)}_{\y}$};
\foreach \name / \y in {3}
\path[yshift=0.5cm]
node[hidden neuron, minimum size=\nodesize] (H2-\name) at (2*\layersep,-\nodesep*4) {$\indfeat^{(2)}_\infty$};
% Draw the output layer node
\foreach \name / \y in {1,...,\numdims}
\node[output neuron, minimum size=\nodesize
%,pin={[pin edge={->}]right:Output }
] (O-\name) at (3*\layersep,-\nodesep*\y) {$f_{\y}$};
% Connect every node in the input layer with every node in the
% hidden layer.
\foreach \source in {1,...,\numdims}
\foreach \dest in {1,...,\numhidden}
\path (I-\source) edge (H-\dest);
\foreach \source in {1,...,\numhidden}
\foreach \dest in {1,...,\numhidden}
\path (H-\source) edge (H2-\dest);
% Connect every node in the hidden layer with the output layer
\foreach \source in {1,...,\numhidden}
\foreach \dest in {1,...,\numdims}
\path (H2-\source) edge (O-\dest);
% Annotate the layers
\node[annot,above of=I-1, node distance=\upnodedist] {Inputs};
\node[annot,below of=H-2, node distance=1.15cm] {$\vdots$};
\node[annot,above of=H-1, node distance=\upnodedist] {Fixed};
\node[annot,below of=H2-2, node distance=1.15cm] {$\vdots$};
\node[annot,above of=H2-1, node distance=\upnodedist] {Fixed};
\node[annot,above of=O-1, node distance=\upnodedist] {Random};
\end{tikzpicture}
\end{tabular}
\caption[Neural network architectures giving rise to \sgp{}s]
{
\emph{Left:} \gp{}s can be derived as a one-hidden-layer \MLP{} with infinitely many fixed hidden units having unknown weights.
\emph{Right:} Multiple layers of fixed hidden units gives rise to a \gp{} with a deep kernel, but not a deep \gp{}.
}
\label{fig:gp-architectures}
\end{figure}
\subsection{Multiple hidden layers}
Next, we examine infinitely-wide \MLP{}s having multiple hidden layers.
There are several ways to construct such networks, giving rise to different priors on functions.
In an \MLP{} with multiple hidden layers, the activations of the $\layerindex$th layer's hidden units are given by
% the recurrence
%
\begin{align}
\vh^{(\layerindex)}(\vx) = \sigma \left( \vb^{(\layerindex)} + \munitparams^{(\layerindex)} \vh^{(\layerindex-1)}(\vx) \right) \;.
\label{eq:nextlayer}
\end{align}
This architecture is shown on the right of \cref{fig:gp-architectures}.
%In this model, each hidden layer's output feeds directly into the next layer's input, weighted by the corresponding element of $\munitparams^{(\layerindex)}$.
%
For example, if we extend the model given by \cref{eq:one-layer-nn} to have two layers of feature mappings, the resulting model becomes
%
\begin{align}
f(\vx) = \frac{1}{K}{\mathbf \vnetweights}\tra \hPhi^{(2)}\left( \hPhi^{(1)}(\vx) \right) \;.
% = \frac{1}{K} \sum_{i=1}^K \netweights_i \hphi_i(\vx),
\label{eq:mutli-layer-nn}
\end{align}
If the features $\vh^{(1)}(\vx)$ and $\vh^{(2)}(\vx)$ are fixed with only the last-layer weights ${\vnetweights}$ unknown, this model corresponds to a shallow \gp{} with a \emph{deep kernel}, given by
\begin{align}
k(\vx, \vx') = \left[ \hPhi^{(2)} ( \hPhi^{(1)}(\vx) ) \right] \tra \hPhi^{(2)} (\hPhi^{(1)}(\vx') ) \, .
\end{align}
Deep kernels, explored in \cref{sec:deep_kernels}, imply a fixed representation as opposed to a prior over representations.
Thus, unless we richly parameterize these kernels, their capacity to learn an appropriate representation will be limited in comparison to more flexible models such as deep neural networks or deep \gp{}s. %since only the output weights ${\vnetweights}$ can adapt to the problem which is what we wish to analyze in this chapter.
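As a small illustration of such a fixed representation, the NumPy sketch below composes two frozen random feature maps (the particular maps and sizes are arbitrary choices) and uses the inner product of the composed features as a kernel; the result is a valid positive-definite kernel, but the representation it encodes cannot adapt to data:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
D_in, D_h1, D_h2 = 2, 50, 50

# Two layers of fixed (frozen) feature maps Phi1 and Phi2.
W1, b1 = rng.standard_normal((D_h1, D_in)), rng.standard_normal(D_h1)
W2, b2 = rng.standard_normal((D_h2, D_h1)), rng.standard_normal(D_h2)

def phi1(x):
    return np.tanh(W1 @ x + b1)

def phi2(h):
    return np.tanh(W2 @ h + b2)

def deep_kernel(x, xp):
    # k(x, x') = Phi2(Phi1(x))^T Phi2(Phi1(x')): an inner product of fixed
    # composed features, hence positive definite, but only the output
    # weights of the corresponding network remain random.
    return phi2(phi1(x)) @ phi2(phi1(xp))

x, xp = rng.standard_normal(D_in), rng.standard_normal(D_in)
print(deep_kernel(x, xp))
\end{verbatim}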
\subsection{Two network architectures equivalent to deep \sgp{}s}
\def\halfshift{0.0cm}
\begin{figure}[t!]
\centering
\begin{tabular}{c}
%\multicolumn{2}{c}{} \\
%\multicolumn{2}{c}{
A neural net with fixed activation functions corresponding to a 3-layer deep \gp{}\\ %$\vy = \vf^{(2)} \left( \vf^{(1)}(\vx) \right)$ \\
\\
%\multicolumn{2}{c}{
\begin{tikzpicture}[shorten >=1pt,->,draw=black!50, node distance=\layersep]
\tikzstyle{every pin edge}=[<-,shorten <=1pt]
\tikzstyle{neuron}=[circle,fill=black!25,minimum size=17pt,inner sep=0pt]
\tikzstyle{annot} = [text width=4em, text centered]
% Draw the input layer nodes
\foreach \name / \y in {1,...,\numdims}
% This is the same as writing \foreach \name / \y in {1/1,2/2,3/3,4/4}
\node[input neuron, minimum size=\nodesize
%, pin=left:Input \#\y
] (I-\name) at (0,-\nodesep*\y) {$x_\y$};
% Draw the hidden layer nodes
\foreach \name / \y in {1,2}%...,\numhidden}
\path[yshift=0.5cm]
node[hidden neuron, minimum size=\nodesize] (H-\name) at (\layersep,-\nodesep*\y) {$\indfeat^{(1)}_{\y}$};
\foreach \name / \y in {3}
\path[yshift=0.5cm]
node[hidden neuron, minimum size=\nodesize] (H-\name) at (\layersep,-\nodesep*4) {$\indfeat^{(1)}_\infty$};
% Draw the hidden layer nodes
\foreach \name / \y in {1,2}%...,\numhidden}
\path[yshift=0.5cm]
node[hidden neuron, minimum size=\nodesize] (H2-\name) at (3*\layersep,-\nodesep*\y) {$\indfeat^{(2)}_{\y}$};
\foreach \name / \y in {3}
\path[yshift=0.5cm]
node[hidden neuron, minimum size=\nodesize] (H2-\name) at (3*\layersep,-\nodesep*4) {$\indfeat^{(2)}_\infty$};
% Draw the hidden layer nodes
\foreach \name / \y in {1,2}%...,\numhidden}
\path[yshift=0.5cm]
node[hidden neuron, minimum size=\nodesize] (H3-\name) at (5*\layersep,-\nodesep*\y) {$\indfeat^{(3)}_{\y}$};
\foreach \name / \y in {3}
\path[yshift=0.5cm]
node[hidden neuron, minimum size=\nodesize] (H3-\name) at (5*\layersep,-\nodesep*4) {$\indfeat^{(3)}_\infty$};
% Draw the output layer node
\foreach \name / \y in {1,...,\numdims}
\node[output neuron, minimum size=\nodesize
%,pin={[pin edge={->}]right:Output }
] (O1-\name) at (2*\layersep,-\nodesep*\y) {$f^{(1)}_{\y}$};
% Draw the output layer node
\foreach \name / \y in {1,...,\numdims}
\node[output neuron, minimum size=\nodesize
%,pin={[pin edge={->}]right:Output }
] (O2-\name) at (4*\layersep,-\nodesep*\y) {$f^{(2)}_{\y}$};
% Draw the output layer node
\foreach \name / \y in {1,...,\numdims}
\node[output neuron, minimum size=\nodesize
%,pin={[pin edge={->}]right:Output }
] (O3-\name) at (6*\layersep,-\nodesep*\y) {$f^{(3)}_{\y}$};
% Connect every node in the input layer with every node in the
% hidden layer.
\foreach \source in {1,...,\numdims}
\foreach \dest in {1,...,\numhidden}
\path (I-\source) edge (H-\dest);
\foreach \source in {1,...,\numhidden}
\foreach \dest in {1,...,\numdims}
\path (H-\source) edge (O1-\dest);
\foreach \source in {1,...,\numdims}
\foreach \dest in {1,...,\numhidden}
\path (O1-\source) edge (H2-\dest);
\foreach \source in {1,...,\numdims}
\foreach \dest in {1,...,\numhidden}
\path (O2-\source) edge (H3-\dest);
\foreach \source in {1,...,\numhidden}
\foreach \dest in {1,...,\numdims}
\path (H3-\source) edge (O3-\dest);
% Connect every node in the hidden layer with the output layer
\foreach \source in {1,...,\numhidden}
\foreach \dest in {1,...,\numdims}
\path (H2-\source) edge (O2-\dest);
% Annotate the layers
\node[annot,above of=I-1, node distance=\upnodedist] {Inputs};
\node[annot,below of=I-3, node distance=\upnodedist] {$\vx$};
\node[annot,above of=H-1, node distance=\upnodedist] {Fixed};
\node[annot,below of=O1-3, node distance=\upnodedist] {$\vf^{(1)}(\vx)$};
\node[annot,above of=O1-1, node distance=\upnodedist] {Random};
\node[annot,above of=H2-1, node distance=\upnodedist] {Fixed};
\node[annot,below of=O2-3, node distance=\upnodedist] {$\vf^{(1:2)}(\vx)$};
\node[annot,above of=O2-1, node distance=\upnodedist] {Random};
\node[annot,above of=H3-1, node distance=\upnodedist] {Fixed};
\node[annot,above of=O3-1, node distance=\upnodedist] {Random};
\node[annot,below of=O3-3, node distance=\upnodedist] {$\vy$};
\node[annot,below of=H-2, node distance=1.15cm] {$\vdots$};
\node[annot,below of=H2-2, node distance=1.15cm] {$\vdots$};
\node[annot,below of=H3-2, node distance=1.15cm] {$\vdots$};
\end{tikzpicture}
%}
\\[0.5em]
\hline
\\[-0.5em]
A net with nonparametric activation functions corresponding to a 3-layer deep \gp{}\\
\\
%\begin{tabular}{c}
\begin{tikzpicture}[shorten >=1pt,->,draw=black!50, node distance=\layersep]
% \tikzstyle{interarrows}=[->, thick, black!50]
\tikzstyle{every pin edge}=[<-,shorten <=1pt]
% \tikzstyle{input neuron}=[neuron, minimum size=\nodesize];
\tikzstyle{neuron}=[circle, draw = black, inner sep=0pt, line width = 1pt]
\tikzstyle{input neuron}=[circle, minimum size=\nodesize, fill=green!15]
\tikzstyle{output neuron}=[neuron, minimum size=\nodesize, text width = 1cm];
\tikzstyle{hidden neuron}=[neuron, minimum size=\nodesize, text width = 1cm];
\tikzstyle{annot} = [text width=4em, text centered]
% Draw the input layer nodes
\foreach \name / \y in {1,...,\numdims}
\node[input neuron] (I-\name) at (0,-\nodesep*\y) {$x_\y$};
% Draw the hidden layer nodes
\foreach \name / \y in {1,...,\numhidden}
\path[yshift=\halfshift] node[hidden neuron] (H-\name) at (\layersep,-\nodesep*\y) { \neuronfunc{\y}{0}};
% Draw the hidden layer nodes
\foreach \name / \y in {1,...,\numhidden}
\path[yshift=\halfshift] node[hidden neuron] (H2-\name) at (2*\layersep,-\nodesep*\y) {\neuronfunc{\y}{4}};
% Draw the output layer node
\foreach \name / \y in {1,...,\numdims}
\node[output neuron] (O-\name) at (3*\layersep,-\nodesep*\y) {\neuronfunc{\y}{8}};
% Connect every node in the input layer with every node in the hidden layer.
\foreach \source in {1,...,\numdims}
\foreach \dest in {1,...,\numhidden}
\path (I-\source) edge (H-\dest);
\foreach \source in {1,...,\numhidden}
\foreach \dest in {1,...,\numhidden}
\path (H-\source) edge (H2-\dest);
% Connect every node in the hidden layer with the output layer
\foreach \source in {1,...,\numhidden}
\foreach \dest in {1,...,\numdims}
\path (H2-\source) edge (O-\dest);
%arrows={-angle 90}
% Annotate the layers
\node[annot,above of=I-1, node distance=\upnodedist] {Inputs};
\node[annot,below of=I-\numdims, node distance=\upnodedist] {$\vx$};
\node[annot,above of=H-1, node distance=\upnodedist, text width = 2cm] {\gp{}};
\node[annot,above of=H2-1, node distance=\upnodedist, text width = 2cm] {\gp{}};
\node[annot,below of=H-\numhidden, node distance=\upnodedist, text width = 2cm] {$\vf^{(1)}(\vx)$};
\node[annot,below of=H2-\numhidden, node distance=\upnodedist, text width = 2cm] {$\vf^{(1:2)}(\vx)$};
\node[annot,above of=O-1, node distance=\upnodedist] {\gp{}};
\node[annot,below of=O-\numdims, node distance=\upnodedist, text width = 1cm] {$\vy$};
\end{tikzpicture}
%\end{tabular}
\end{tabular}
\caption[Neural network architectures giving rise to deep \sgp{}s]
{
Two equivalent views of deep \gp{}s as neural networks.
\emph{Top:} A neural network whose every other layer is a weighted sum of an infinite number of fixed hidden units, whose weights are initially unknown.
\emph{Bottom:} A neural network with a finite number of hidden units, each with a different unknown non-parametric activation function.
The activation functions are visualized by draws from 2-dimensional \gp{}s, although their input dimension will actually be the same as the output dimension of the previous layer.
}
\label{fig:deep-gp-architectures}
\end{figure}
There are two equivalent neural network architectures that correspond to deep \gp{}s: one having fixed nonlinearities, and another having \gp{}-distributed nonlinearities.
To construct a neural network corresponding to a deep \gp{} using only fixed nonlinearities,
one can start with the infinitely-wide deep \gp{} shown in \cref{fig:gp-architectures}(right), and introduce a finite set of nodes in between each infinitely-wide set of fixed basis functions.
This architecture is shown in the top of \cref{fig:deep-gp-architectures}.
The $D^{(\layerindex)}$ nodes $\vf^{(\layerindex)}(\vx)$ in between each fixed layer are weighted sums (with random weights) of the fixed hidden units of the layer below, and the next layer's hidden units depend only on these $D^{(\layerindex)}$ nodes.
This alternating-layer architecture has an interpretation as a series of linear information bottlenecks.
To see this, substitute \cref{eq:one-layer-nn} into \cref{eq:nextlayer} to get
%
\begin{align}
\hPhi^{(\layerindex)}(\vx) = \sigma \left( \vb^{(\layerindex)} + \left[ \munitparams^{(\layerindex)} \mnetweights^{(\layerindex-1)} \right] \hPhi^{(\layerindex-1)}(\vx) \right)
\end{align}
%
where $\mnetweights^{(\ell-1)}$ is the weight matrix connecting $\hPhi^{(\ell-1)}$ to $\vf^{(\ell - 1)}$.
Thus, ignoring the intermediate outputs $\vf^{(\layerindex)}(\vx)$, a deep \gp{} is an infinitely-wide, deep \MLP{} with each pair of layers connected by random, rank-$D^{(\layerindex)}$ matrices given by $\munitparams^{(\layerindex)} \mnetweights^{(\layerindex-1)}$.
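This linear bottleneck can be seen directly in miniature by forming the product of the two weight matrices sitting between adjacent feature layers; its rank is at most the number of intermediate nodes. A minimal NumPy sketch with made-up, finite sizes (standing in for the infinitely-wide feature layers):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
K = 500   # width of each "fixed" feature layer (a finite stand-in for infinity)
D = 3     # number of intermediate nodes f^(l)(x) between feature layers

V = rng.standard_normal((D, K))   # random output weights of one feature layer
W = rng.standard_normal((K, D))   # input weights of the next feature layer

effective = W @ V                 # K-by-K matrix connecting the two feature layers
print(np.linalg.matrix_rank(effective))   # at most D = 3: a linear bottleneck
\end{verbatim}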
The second, more direct way to construct a network architecture corresponding to a deep \gp{} is to integrate out all $\mnetweights^{(\layerindex)}$, and view deep \gp{}s as a neural network with a finite number of nonparametric, \gp{}-distributed basis functions at each layer, in which $\vf^{(1:\layerindex)}(\vx)$ represent the output of the hidden nodes at the $\layerindex^{th}$ layer.
This second view lets us compare deep \gp{} models to standard neural net architectures more directly.
\Cref{fig:deep-gp-architectures}(bottom) shows an example of this architecture.
%\subsection{Relating initialization strategies, regularization, and priors}
%Traditional training of deep networks requires both an an initialization strategy, and a regularization method. In a Bayesian model, the prior implicitly defines both of these things.
%For example, one feature of \gp{} models with a clear analogue to initialization strategies in deep networks is automatic relevance determination.
%\paragraph{The effects of automatic relevance determination} One feature of Gaussian process models that doesn't have a clear anologue in the neural-net literature is automatic relevance determination (ARD). Recall that
%In the ARD procedure, the lengthscales of each dimension are scaled to maximize the marginal likelihood of the GP. In standard one-layer GP regression, it has the effect of smoothing out the posterior over functions in directions in which the data does not change.
%\paragraph{Sparse Initialization}
%\cite{martens2010deep} used “sparse initialization”, in which each hidden unit had 15 non-zero incoming connections. %They say ``this allows the units to be both highly differentiated as well as unsaturated, avoiding the problem in dense initializations where the connection weights must all be scaled very small in order to prevent saturation, leading to poor differentiation between units''.
%
%While it is not clear if deep GPs have analogous problems with saturation of connection weights, we note that
%An anologous initialization strategy would be to put a heavy-tailed prior on the lengthscales in the squared-exp kernel.
%\subsection{What kinds of functions do deep GPs prefer?}
%Isotropic kernels such as the squared-exp encode the assumption that the function varies independently in all directions.
%This implies that using an isotropic kernel to model a function that does \emph{not} vary in all direction independently is 'wasting marginal likelihood' by not making strong enough assumptions.
%
%Thus, deep GPs prefer to have hidden layers on whose outputs the higher-layer functions vary independently. In this way, we can say that the deep GP model prefers independent features.
%Combining these two properties of isotropic kernels, we can say that a deep GP will attempt to transform its input into a representation with as few degrees of freedom as possible, and for which the output function varies independently across each dimension.
\section{Characterizing deep Gaussian process priors}
\label{sec:characterizing-deep-gps}
This section develops several theoretical results characterizing the behavior of deep \gp{}s as a function of their depth.
Specifically, we show that the size of the derivative of a one-dimensional deep \gp{} becomes log-normally distributed as the network becomes deeper.
We also show that the Jacobian of a multivariate deep \gp{} is a product of independent Gaussian matrices having independent entries.
These results will allow us to identify a pathology that emerges in very deep networks in \cref{sec:formalizing-pathology}.
\subsection{One-dimensional asymptotics}
\label{sec:1d}
%One way to understand the properties of functions drawn from deep \gp{}s and deep networks is by looking at the distribution of the derivative of these functions. We first focus on the one-dimensional case.
In this section, we derive the limiting distribution of the derivative of an arbitrarily deep, one-dimensional \gp{} having a squared-exp kernel: %in order to shed light on the ways in which a deep \gp{}s are non-Gaussian, and the ways in which lower layers affect the computation performed by higher layers.
%
\newcommand{\onedsamplepic}[1]{
\includegraphics[trim=2mm 5mm 7mm 6.4mm, clip, width=0.235\columnwidth]{\deeplimitsfiguresdir/1d_samples/latent_seed_0_1d_large/layer-#1}}%
\newcommand{\onedsamplepiccon}[1]{
\includegraphics[trim=2mm 5mm 7mm 6.4mm, clip, width=0.235\columnwidth]{\deeplimitsfiguresdir/1d_samples/latent_seed_0_1d_large_connected/layer-#1}}%
%
\begin{figure}
\centering
\setlength{\tabcolsep}{1.5pt}
\begin{tabular}{ccccc}
& 1 Layer & 2 Layers & 5 Layers & 10 Layers \\
\raisebox{0.6cm}{\rotatebox{90}{$f^{(1:\ell)}(x)$}} &
\onedsamplepic{1} &
\onedsamplepic{2} &
\onedsamplepic{5} &
\onedsamplepic{10} \\[-3pt]
& $x$ & $x$ & $x$ & $x$
\end{tabular}
\caption[A one-dimensional draw from a deep \sgp{} prior]
{A function drawn from a one-dimensional deep \gp{} prior, shown at different depths.
The $x$-axis is the same for all plots.
After a few layers, the functions begin to be either nearly flat, or quickly-varying, everywhere.
This is a consequence of the distribution on derivatives becoming heavy-tailed.
As well, the function values at each layer tend to cluster around the same few values as the depth increases.
This happens because once the function values in different regions are mapped to the same value in an intermediate layer, there is no way for them to be mapped to different values in later layers.}
\label{fig:deep_draw_1d}
\end{figure}
%
\begin{align}
\kSE(x, x') = \sigma^2 \exp \left( \frac{-(x - x')^2}{2\lengthscale^2} \right) \;.
\label{eq:se_kernel}
\end{align}
%
The parameter $\sigma^2$ controls the variance of functions drawn from the prior, and the lengthscale parameter $\lengthscale$ controls the smoothness.
The derivative of a \gp{} with a squared-exp kernel is point-wise distributed as $\Nt{0}{\nicefrac{\sigma^2}{\lengthscale^2}}$.
Intuitively, a draw from a \gp{} is likely to have large derivatives if the kernel has high variance and small lengthscales.
By the chain rule, the derivative of a one-dimensional deep \gp{} is simply a product of the derivatives of each layer, which are drawn independently by construction.
The absolute value of this derivative is therefore a product of half-normal random variables, each with mean $\sqrt{\nicefrac{2 \sigma^2}{\pi \lengthscale^2}}$.
%
%\begin{align}
%\frac{\partial f(x)}{\partial x} & \distas{iid} \Nt{0}{\frac{\sigma^2_f}{\lengthscale^2}} \\
%\implies
%\left| \frac{\partial f(x)}{\partial x} \right| & \distas{iid} \textnormal{half}\Nt{\sqrt{\frac{2 \sigma^2_f}{\pi\lengthscale^2}}}{\frac{\sigma^2_f}{\lengthscale^2} \left( 1 - \frac{2}{\pi} \right)}
%\end{align}
%
If one chooses kernel parameters such that ${\nicefrac{\sigma^2}{\lengthscale^2} = \nicefrac{\pi}{2}}$, then the expected magnitude of the derivative remains constant regardless of the depth.
%then ${\expectargs{}{\left| \nicefrac{\partial f(x)}{\partial x} \right|} = 1}$, and so ${\expectargs{}{\left| \nicefrac{\partial f^{(1:L)}(x)}{\partial x} \right|} = 1}$ for any $L$.
% If $\nicefrac{\sigma^2_f}{\lengthscale^2}$ is less than $\nicefrac{\pi}{2}$, the expected derivative magnitude goes to zero, and if it is greater, the expected magnitude goes to infinity as a function of $L$.
The distribution of the log of the magnitude of the derivatives has finite moments:
%
\begin{align}
m_{\log} = \expectargs{}{\log \left| \frac{\partial f(x)}{\partial x} \right|} & = 2 \log \left( \frac{\sigma}{\lengthscale} \right) - \log2 - \gamma \nonumber \\
%\varianceargs{}{\log \left| \frac{\partial f(x)}{\partial x} \right|} & = \frac{1}{4}\left[ \pi^2 + 2 \log^2 2 - 2 \gamma( \gamma + log4 ) + 8 \left( \gamma + \log2 - \log \left(\frac{\sigma_f}{\lengthscale}\right)\right) \log\left(\frac{\sigma_f}{\lengthscale}\right) \right]
v_{\log} = \varianceargs{}{\log \left| \frac{\partial f(x)}{\partial x} \right|} & = \frac{\pi^2}{4} + \frac{\log^2 2}{2} - \gamma^2 - \gamma \log4 + 2 \log\left(\frac{\sigma}{\lengthscale}\right) \left[ \gamma + \log2 - \log \left(\frac{\sigma}{\lengthscale}\right) \right]
\end{align}
%
where $\gamma \approxeq 0.5772$ is Euler's constant. Since the second moment is finite, by the central limit theorem, the limiting distribution of the size of the gradient approaches a log-normal as $L$ grows:
%
\begin{align}
\log \left| \frac{\partial f^{(1:L)}(x)}{\partial x} \right|
= \log \prod_{\layerindex=1}^L \left| \frac{\partial f^{(\layerindex)}(x)}{\partial x} \right|
= \sum_{\layerindex=1}^L \log \left| \frac{\partial f^{(\layerindex)}(x)}{\partial x} \right|
% \implies
%\log \left| \frac{\partial f^{1:L}(x)}{\partial x} \right| & \distas{L \rightarrow \infty} \Nt{L \sqrt{\frac{2 \sigma^2_f}{\pi\lengthscale^2}}}{L \frac{\sigma^2_f}{\lengthscale^2} \left( 1 - \frac{2}{\pi} \right)}
%\log \left| \frac{\partial f^{(1:L)}(x)}{\partial x} \right| &
\distas{L \rightarrow \infty} \Nt{ L m_{\log} }{L v_{\log}}
\end{align}
%
Even if the expected magnitude of the derivative remains constant, the variance of the log-normal distribution grows without bound as the depth increases.
Because the log-normal distribution is heavy-tailed and its domain is bounded below by zero, the derivative will become very small almost everywhere, with rare but very large jumps.
\Cref{fig:deep_draw_1d} shows this behavior in a draw from a 1D deep \gp{} prior.
This figure also shows that once the derivative in one region of the input space becomes very large or very small, it is likely to remain that way in subsequent layers.
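The growth of the variance of the log-derivative with depth is easy to simulate. In the NumPy sketch below (all settings are illustrative), per-layer derivatives are drawn as independent zero-mean Gaussians with variance $\nicefrac{\sigma^2}{\lengthscale^2} = \nicefrac{\pi}{2}$ and multiplied together; the variance of the log of the product grows linearly with depth, while the median magnitude shrinks towards zero, consistent with the heavy-tailed behaviour described above:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
scale = np.sqrt(np.pi / 2.0)     # sigma / lengthscale, so sigma^2 / l^2 = pi / 2
n_samples = 200000

for L in (1, 10, 50):
    # Per-layer derivatives are independent N(0, sigma^2 / lengthscale^2) draws;
    # the chain rule multiplies them, so their logs add.
    derivs = rng.normal(0.0, scale, size=(n_samples, L))
    log_abs = np.log(np.abs(derivs)).sum(axis=1)      # log |d f^(1:L) / dx|
    # The variance of the log grows linearly with depth, so the derivative
    # magnitude becomes increasingly heavy-tailed: very small almost
    # everywhere, with rare but very large values.
    print(L, log_abs.mean(), log_abs.var(), np.median(np.exp(log_abs)))
\end{verbatim}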
%
%Does it convege to some sort of jump process?
%Note that there is another limit - if $\nicefrac{\sigma^2_f}{\lengthscale_{d_1}^2} < \nicefrac{\pi}{2}$, then the mean of the size of the gradient goes to zero, but the variance approaches a finite constant.
%\subsection{The Jacobian of deep GP is a product of independent normal matrices}
\subsection{Distribution of the Jacobian}
\label{sec:theorem}
Next, we characterize the distribution on Jacobians of multivariate functions drawn from deep \gp{} priors, finding them to be products of independent Gaussian matrices with independent entries.
\begin{lemma}
\label{thm:deriv-ind}
The partial derivatives of a function mapping $\mathbb{R}^D \rightarrow \mathbb{R}$ drawn from a \gp{} prior with a product kernel are independently Gaussian distributed.
\end{lemma}
%
%\vspace{-0.25in}
\begin{proof}
Because differentiation is a linear operator, the derivatives of a function drawn from a \gp{} prior are also jointly Gaussian distributed.
The covariance between partial derivatives with respect to input dimensions $d_1$ and $d_2$ of vector $\vx$ is given by \citet{Solak03derivativeobservations}:
%
\begin{align}
\cov \left( \frac{\partial f(\vx)}{\partial x_{d_1}}, \frac{\partial f(\vx)}{\partial x_{d_2}} \right)
= \frac{\partial^2 k(\vx, \vx')}{\partial x_{d_1} \partial x_{d_2}'} \bigg|_{\vx=\vx'}
\label{eq:deriv-kernel}
\end{align}
%
If our kernel is a product over individual dimensions $k(\vx, \vx') = \prod_{d=1}^D k_d(x_d, x_d')$,
%as in the case of the squared-exp kernel,
then the off-diagonal entries are zero, implying that all elements are independent.
\end{proof}
%\vspace{-0.1in}
For example, in the case of the multivariate squared-exp kernel, the covariance between derivatives has the form:
%
\begin{align}
%k_{SE}(\vx, \vx') =
& f(\vx) \sim \textnormal{GP}\left( 0,
\sigma^2 \prod_{d=1}^D \exp \left(-\frac{1}{2} \frac{(x_d - x_d')^2}{\lengthscale_d^2} \right) \right) \nonumber \\
& \implies
\cov \left( \frac{\partial f(\vx)}{\partial x_{d_1}}, \frac{\partial f(\vx)}{\partial x_{d_2}} \right) =
%\frac{\partial^2 k_{SE}(\vx, \vx')}{\partial x_{d_1} \partial x_{d_2}'} \bigg|_{\vx=\vx'} =
\begin{cases}
\frac{\sigma^2}{\lengthscale_{d_1}^2} & \mbox{if } d_1 = d_2 \\
0 & \mbox{if } d_1 \neq d_2 \end{cases}
\end{align}
\begin{lemma}
\label{thm:matrix}
The Jacobian of a set of $D$ functions $\mathbb{R}^D \rightarrow \mathbb{R}$ drawn from independent \gp{} priors, each having a product kernel, is a $D \times D$ matrix of independent Gaussian random variables.
\end{lemma}
%
%\vspace{-0.2in}
\begin{proof}
The Jacobian of the vector-valued function $\vf(\vx)$ is a matrix $J$ with elements ${J_{ij}(\vx) = \frac{ \partial f_i (\vx) }{\partial x_j}}$.
%
%\begin{align}
%J_{ij} = \dfrac{\partial f_i (\vx) }{\partial x_j}
%\end{align}
%\begin{align}
%\Jx^\layerindex(\vx) =\begin{bmatrix} \dfrac{\partial f^\layerindex_1 (\vx) }{\partial x_1} & \cdots & \dfrac{\partial f^\layerindex_1 (\vx)}{\partial x_D} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f^\layerindex_D (\vx)}{\partial x_1} & \cdots & \dfrac{\partial f^\layerindex_D (\vx)}{\partial x_D} \end{bmatrix}
%\end{align}
%
%
Because the \gp{}s on each output dimension $f_1(\vx), f_2(\vx), \dots, f_D(\vx)$ are independent by construction, it follows that each row of $J$ is independent.
Lemma \ref{thm:deriv-ind} shows that the elements of each row are independent Gaussian.
Thus all entries in the Jacobian of a \gp{}-distributed transform are independent Gaussian random variables.
\end{proof}
\begin{theorem}
\label{thm:prodjacob}
The Jacobian of a deep \gp{} with a product kernel is a product of independent Gaussian matrices, with each entry in each matrix being drawn independently.
\end{theorem}
%
%\vspace{-0.2in}
\begin{proof}
When composing $L$ different functions, we denote the \emph{immediate} Jacobian of the function mapping from layer $\layerindex -1$ to layer $\layerindex$ as $J^{(\layerindex)}(\vx)$, and the Jacobian of the entire composition of $L$ functions by $J^{(1:L)}(\vx)$.
%
By the multivariate chain rule, the Jacobian of a composition of functions is given by the product of the immediate Jacobian matrices of each function.
%
Thus the Jacobian of the composed (deep) function $\vf^{(L)}(\vf^{(L-1)}(\dots \vf^{(3)}( \vf^{(2)}( \vf^{(1)}(\vx)))\dots))$ is
%
% $J^L(f^{(L-1)}(\vx) \ldots J^2(f^{(1)}(\vx))J^1(\vx)$
%
\begin{align}
J^{(1:L)}(\vx)
% = \frac{\partial \fdeep(x) }{\partial x}
%= \frac{\partial f^1(x) }{\partial x} \frac{\partial f^{(2)}(x) }{\partial f^{(1)}(x)} \cdots \frac{\partial f^L(x) }{\partial f^({L-1})(x)}
%= \prod_{\layerindex = 1}^{L} J^L (x)
= J^{(L)} J^{(L-1)} \dots J^{(3)} J^{(2)} J^{(1)}.
%= J^{(L)} J^{(L-1)} \dots J^{(3)}(\vf^{(1:2)}(\vx)) J^{(2)}(\vf^{(1)}(\vx)) J^{(1)}(\vx).
\label{eq:jacobian-of-deep-gp}
\end{align}
%
By lemma \ref{thm:matrix}, each ${J^{(\layerindex)}_{i,j} \simind \mathcal{N}}$, so the complete Jacobian is a product of independent Gaussian matrices, with each entry of each matrix drawn independently.
\end{proof}
%\Cref{thm:prodjacob}
This result allows us to analyze the representational properties of a deep Gaussian process by examining the properties of products of independent Gaussian matrices.
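Concretely, the Jacobian distribution implied by \cref{thm:prodjacob} can be simulated by multiplying independent Gaussian matrices and inspecting the singular values of the product; a brief NumPy sketch (dimension, depth, and variance are arbitrary choices):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(5)
D, L = 5, 25                     # hidden dimension and depth
sigma2_over_l2 = np.pi / 2.0     # per-layer derivative variance sigma^2 / l^2

# By the theorem, the Jacobian of a deep GP with product kernels is
# distributed as a product of independent Gaussian matrices.
J = np.eye(D)
for _ in range(L):
    J = rng.normal(0.0, np.sqrt(sigma2_over_l2), size=(D, D)) @ J

s = np.linalg.svd(J, compute_uv=False)
print(s / s.max())    # typically one singular value dominates all the others
\end{verbatim}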
\section{Formalizing a pathology}
\label{sec:formalizing-pathology}
A common use of deep neural networks is building useful representations of data manifolds.
What properties make a representation useful?
\citet{rifai2011higher} argued that good representations of data manifolds are invariant in directions orthogonal to the data manifold.
They also argued that, conversely, a good representation must also change in directions tangent to the data manifold, in order to preserve relevant information.
\Cref{fig:hidden} visualizes a representation having these two properties.
%\newcolumntype{L}[1]{>{\raggedright\let\newline\\\arraybackslash\hspace{0pt}}m{#1}}
\begin{figure}[h]
\centering
%\begin{tabular}{L{\columnwidth}}
%\includegraphics[width=0.45\columnwidth]{\deeplimitsfiguresdir/hidden_good} &
\begin{tikzpicture}[pile/.style={thick, ->, >=stealth'}]
\node[anchor=south west,inner sep=0] at (0,0) {
\includegraphics[clip, trim = 0cm 12cm 0cm 0.0cm, width=0.6\columnwidth]{\deeplimitsfiguresdir/hidden_good}
};
\coordinate (D) at (1.6,1.5);
\coordinate (Do) at (2.5, 0.8);
\coordinate (Dt) at (3,3);
\draw[pile] (D) -- (Dt) node[above, text width=5em] { tangent };
\draw[pile] (D) -- (Do) node[right, text width=5em] { orthogonal };
\end{tikzpicture}
%A noise-tolerant representation of \\ a one-dimensional manifold (white)
%\includegraphics[clip, trim = 0cm 12cm 0cm 0.0cm, width=0.9\columnwidth]{\deeplimitsfiguresdir/hidden_bad} \\
%b) A na\"{i}ve representation (colors) \\ of a one-dimensional manifold (white)
%\end{tabular}
\caption[Desirable properties of representations of manifolds]
{Representing a 1-D data manifold.
%A representation is a function mapping the input space to some set of outputs.
Colors are a function of the computed representation of the input space.
The representation (blue \& green) changes little in directions orthogonal to the manifold (white), making it robust to noise in those directions.
The representation also varies in directions tangent to the data manifold, preserving information for later layers.
%, and reducing the number of parameters needed to represent a datapoint.
%Representation b) changes in all directions, preserving potentially useless information.}% The representation on the right might be useful if the data were spread out in over plane.
}
\label{fig:hidden}
\end{figure}
As in \citet{rifai2011contractive}, we characterize the representational properties of a function by the singular value spectrum of the Jacobian.
The number of relatively large singular values of the Jacobian indicates the number of directions in data-space in which the representation varies significantly.
%
\newcommand{\spectrumpic}[1]{
\includegraphics[trim=4mm 1mm 4mm 2.5mm, clip, width=0.475\columnwidth]{\deeplimitsfiguresdir/spectrum/layer-#1}}%
\begin{figure}
\centering
\begin{tabular}{ccc}
& 2 Layers & 6 Layers \\
\begin{sideways} { \quad Normalized singular value} \end{sideways} & \hspace{-0.2in} \spectrumpic{2} & \hspace{-0.1in} \spectrumpic{6} \\
& { Singular value index} & { Singular value index}
\end{tabular}
\caption[Distribution of singular values of the Jacobian of a deep \sgp{}]
{%The normalized singular value spectrum of the Jacobian of a deep \gp{} warping $\Reals^5 \to \Reals^5$.
The distribution of normalized singular values of the Jacobian of a function drawn from a 5-dimensional deep \gp{} prior 2 layers deep~(\emph{Left}) and 6 layers deep~(\emph{Right}).
As nets get deeper, the largest singular value tends to become much larger than the others.
This implies that with high probability, these functions vary little in all directions but one, making them unsuitable for computing representations of manifolds of more than one dimension.
%As depth increases, the distribution on singular values also becomes heavy-tailed.
}
\label{fig:deep_spectrum}
\end{figure}%
%
\begin{figure}%
\centering
\begin{tabular}{cc}
No transformation: $p(\vx)$ & 1 Layer: $p \left( \vf^{(1)}(\vx) \right)$ \\
\gpdrawbox{1} & \gpdrawbox{2} \\
4 Layers: $p \left( \vf^{(1:4)}(\vx) \right)$ & 6 Layers: $p \left( \vf^{(1:6)}(\vx) \right)$ \\
\gpdrawbox{4} & \gpdrawbox{6}
\end{tabular}
\caption[Points warped by a draw from a deep \sgp{}]
{Points warped by a function drawn from a deep \gp{} prior.
\emph{Top left:} Points drawn from a 2-dimensional Gaussian distribution, color-coded by their location.
\emph{Subsequent panels:} Those same points, successively warped by compositions of functions drawn from a \gp{} prior.
As the number of layers increases, the density concentrates along one-dimensional filaments.
Warpings using random finite neural networks exhibit the same pathology, but also tend to concentrate density into 0-dimensional manifolds (points) due to saturation of all of the hidden units.}
\label{fig:filamentation}
\end{figure}%
%
\Cref{fig:deep_spectrum} shows the distribution of the singular value spectrum of draws from 5-dimensional deep \gp{}s of different depths.\footnote{\citet{rifai2011contractive} analyzed the Jacobian at the locations of the training points, but because the priors we are examining are stationary, the distribution of the Jacobian is identical everywhere.}
As the nets get deeper, the largest singular value tends to dominate, implying there is usually only one effective degree of freedom in the representations being computed.
\Cref{fig:filamentation} demonstrates a related pathology that arises when composing functions to produce a deep density model.
The density in the observed space eventually becomes locally concentrated onto one-dimensional manifolds, or \emph{filaments}.
This again suggests that, when the width of the network is relatively small, deep compositions of independent functions are unsuitable for modeling manifolds whose underlying dimensionality is greater than one.
%for file in `ls`; do convert $file $file; done
\newcommand{\mappic}[1]{ \includegraphics[width=0.475\columnwidth]{\deeplimitsfiguresdir/map/latent_coord_map_layer_#1} }
\newcommand{\mappiccon}[1]{ \includegraphics[width=0.475\columnwidth]{\deeplimitsfiguresdir/map_connected/latent_coord_map_layer_#1} }
\begin{figure}
\centering
\begin{tabular}{cc}
\hspace{-0.15in} Identity Map: $\vy = \vx$ &
\hspace{-0.15in} 1 Layer: $\vy = \vf^{(1)}(\vx)$ \\
\hspace{-0.15in} \mappic{0} & \mappic{1} \\
\hspace{-0.15in} 10 Layers: $\vy = \vf^{(1:10)}(\vx)$ &
\hspace{-0.15in} 40 Layers: $\vy = \vf^{(1:40)}(\vx)$ \\
\hspace{-0.15in} \mappic{10} & \mappic{40}
\end{tabular}
\caption[Visualization of a feature map drawn from a deep \sgp{}]
{A visualization of the feature map implied by a function $\vf$ drawn from a deep \gp{}.
Colors are a function of the 2D representation $\vy = \vf(\vx)$ that each point is mapped to. % by the deep function. %that each point is mapped to after being warped by a deep \gp{}.
The number of directions in which the color changes rapidly corresponds to the number of large singular values in the Jacobian.
Just as the densities in \cref{fig:filamentation} became locally one-dimensional, there is usually only one direction that one can move $\vx$ in locally to change $\vy$.
This means that $\vf$ is unlikely to be a suitable representation for decision tasks that depend on more than one aspect of $\vx$. Also note that the overall shape of the mapping remains the same as the number of layers increases.
For example, a roughly circular shape remains in the top-left corner even after 40 independent warpings.}
\label{fig:deep_map}
\end{figure}
%
To visualize this pathology in another way, \cref{fig:deep_map} illustrates a color-coding of the representation computed by a deep \gp{}, evaluated at each point in the input space.
After 10 layers, we can see that locally, there is usually only one direction that one can move in $\vx$-space in order to change the value of the computed representation, or to cross a decision boundary.
This means that such representations are likely to be unsuitable for decision tasks that depend on more than one property of the input.
To what extent are these pathologies present in the types of neural networks commonly used in practice?
In simulations, we found that for deep functions with a fixed hidden dimension $D$, the singular value spectrum remained relatively flat for hundreds of layers as long as $D > 100$.
Thus, these pathologies are unlikely to severely affect the relatively shallow, wide networks most commonly used in practice.
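The simulations mentioned above are straightforward to reproduce. The sketch below (with arbitrarily chosen settings) multiplies i.i.d.\ Gaussian matrices of a given hidden dimension $D$, renormalizing at each step to avoid overflow (which leaves the normalized spectrum unchanged), and reports how close the second-largest normalized singular value stays to the largest:
\begin{verbatim}
import numpy as np

def normalized_spectrum(D, depth, rng):
    # Product of `depth` i.i.d. Gaussian matrices (the deep-GP Jacobian
    # distribution), rescaled by its spectral norm at each step; rescaling
    # does not change the normalized singular values.
    J = np.eye(D)
    for _ in range(depth):
        J = rng.standard_normal((D, D)) @ J
        J /= np.linalg.norm(J, 2)
    s = np.linalg.svd(J, compute_uv=False)
    return s / s.max()

rng = np.random.default_rng(6)
for D in (5, 200):
    s = normalized_spectrum(D, depth=200, rng=rng)
    print(D, s[1])   # second-largest normalized singular value
\end{verbatim}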
\section{Fixing the pathology}
\label{sec:fix}
As suggested by \citet[chapter 2]{neal1995bayesian}, we can fix the pathologies exhibited in \cref{fig:filamentation,fig:deep_map} by simply making each layer depend not only on the output of the previous layer, but also on the original input $\vx$.
We refer to these models as \emph{input-connected} networks, and denote deep functions having this architecture with the subscript $C$, as in $f_C(\vx)$.
Formally, this functional dependence can be written as
\begin{align}
\vf_C^{(1:L)}(\vx) = \vf^{(L)} \left( \vf_C^{(1:L-1)}(\vx), \vx \right), \quad \forall L
\end{align}
%
\Cref{fig:input-connected} shows a graphical representation of the two connectivity architectures.
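The finite-grid sampling sketch given earlier for the standard architecture extends directly to the input-connected case: each layer's \gp{} now takes both the previous layer's output and the original input as a two-dimensional input. A NumPy sketch (settings again chosen arbitrarily):
\begin{verbatim}
import numpy as np

def se_kernel(X, Xp, variance=1.0, lengthscale=1.0):
    # Product squared-exponential kernel over all input dimensions.
    sq = ((X[:, None, :] - Xp[None, :, :])**2).sum(-1)
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def sample_gp(X, rng):
    K = se_kernel(X, X)
    chol = np.linalg.cholesky(K + 1e-6 * np.eye(len(X)))
    return chol @ rng.standard_normal(len(X))

rng = np.random.default_rng(7)
x = np.linspace(-5, 5, 300)
h = sample_gp(x[:, None], rng)         # f^(1)(x) on the grid
for layer in range(2, 11):
    # Input-connected: each layer sees both the previous layer's output
    # and the original input x.
    inputs = np.column_stack([h, x])
    h = sample_gp(inputs, rng)         # f_C^(1:layer)(x) on the grid
\end{verbatim}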
\begin{figure}[h]%
\def\nodeseptwo{1.9cm}%
\def\nodesize{.35cm}%
\def\numhiddentwo{3}%
\centering
\begin{tabular}{ccc}
a) Standard \MLP{} connectivity & &
b) Input-connected architecture\\
\hspace{-4mm}
\begin{tikzpicture}[draw=black!80]
\tikzstyle{neuron}=[circle,minimum size=17pt, draw = black!80, fill = white, thick]
\tikzstyle{input neuron}=[neuron, fill=green!50];
\tikzstyle{output neuron}=[neuron, fill=red!50];
\tikzstyle{hidden neuron}=[neuron, fill=blue!50];
\tikzstyle{pile} =[thick, ->, >=stealth', shorten <=7pt, shorten >=8pt];
% Define the input layer node
\coordinate (I) at (0, 0);
% Define the hidden layer nodes
\foreach \name / \y in {1,...,\numhiddentwo} {
\coordinate (H-\name) at (\nodeseptwo*\y, 0);
}
\path[pile] (I) edge (H-1) {};
% Connect every node
\foreach \name in {2,...,\numhiddentwo} {
\pgfmathsetmacro\hindex{\name - 1}
\path[pile] (H-\hindex) edge (H-\name) {};
}
\draw (I) node[neuron] {};
\draw (I) node[below = 0.5cm] {$\vx$};
% Draw the hidden layer nodes
\foreach \name / \y in {1,...,\numhiddentwo} {
\draw (H-\name) node[neuron] {};
\draw (H-\name) node[below = 0.34cm] {$\vf^{(\y)}(\vx)$};
}
\end{tikzpicture} &
\hspace{0.5cm} &
\begin{tikzpicture}[draw=black!80]
\tikzstyle{neuron}=[circle,minimum size=17pt, draw = black!80, fill = white, thick]
\tikzstyle{input neuron}=[neuron, fill=green!50];
\tikzstyle{output neuron}=[neuron, fill=red!50];
\tikzstyle{hidden neuron}=[neuron, fill=blue!50];
\tikzstyle{pile} =[thick, ->, >=stealth', shorten <=7pt, shorten >=8pt];
% Define the input layer node
\coordinate (I) at (0, 0);
% Define the hidden layer nodes
\foreach \name / \y in {1,...,\numhiddentwo} {
\coordinate (H-\name) at (\nodeseptwo*\y, 0);
}
% Connect every node
\path[pile] (I) edge (H-1) {};
\foreach \name in {2,...,\numhiddentwo} {
\pgfmathsetmacro\hindex{\name - 1}
\path[pile] (H-\hindex) edge (H-\name) {};
\path[pile] (I) edge [bend left] (H-\name) {};
}
\draw (I) node[neuron] {};
\draw (I) node[below = 0.5cm] {$\vx$};
% Draw the hidden layer nodes
\foreach \name / \y in {1,...,\numhiddentwo} {
\draw (H-\name) node[neuron] {};
\draw (H-\name) node[below = 0.34cm] {$\vf_C^{(\y)}(\vx)$};
}
\end{tikzpicture}
\end{tabular}
\caption[Two different architectures for deep neural networks]
{Two different architectures for deep neural networks.
\emph{Left:} The standard architecture connects each layer's outputs to the next layer's inputs.
\emph{Right:} The input-connected architecture also connects the original input $\vx$ to each layer.}
\label{fig:input-connected}
\end{figure}%
Similar connections between non-adjacent layers can also be found in the primate visual cortex \citep{maunsell1983connections}.
Visualizations of the resulting prior on functions are shown in \cref{fig:deep_draw_1d_connected,fig:no_filamentation,fig:deep_map_connected}.
\begin{figure}[h]
\centering
\setlength{\tabcolsep}{1.5pt}
\begin{tabular}{ccccc}
& 1 Layer & 2 Layers & 5 Layers & 10 Layers \\
\raisebox{0.6cm}{\rotatebox{90}{$f_C^{(1:L)}(x)$}} &
\onedsamplepiccon{1} &
\onedsamplepiccon{2} &
\onedsamplepiccon{5} &
\onedsamplepiccon{10} \\[-3pt]
& $x$ & $x$ & $x$ & $x$
\end{tabular}
\caption[A draw from a 1D deep \sgp{} prior with each layer connected to the input]
{A draw from a 1D deep \gp{} prior having each layer also connected to the input.
The $x$-axis is the same for all plots.
Even after many layers, the functions remain relatively smooth in some regions, while varying rapidly in other regions.
Compare to standard-connectivity deep \gp{} draws shown in \cref{fig:deep_draw_1d}.}
\label{fig:deep_draw_1d_connected}
\end{figure}
%
\newcommand{\gpdrawboxcon}[1]{
\setlength\fboxsep{0pt}
\hspace{-0.2in}
\fbox{
\includegraphics[width=0.464\columnwidth]{\deeplimitsfiguresdir/deep_draws_connected/deep_sample_connected_layer#1}
}}
%
\begin{figure}
\centering
\begin{tabular}{cc}
3 Connected layers: $p \left( \vf_C^{(1:3)}(\vx) \right)$ & 6 Connected layers: $p \left( \vf_C^{(1:6)}(\vx) \right)$ \\
\gpdrawboxcon{3} &
\gpdrawboxcon{6}
\end{tabular}
\caption[Points warped by a draw from an input-connected deep \sgp{}]
{Points warped by a draw from a deep \sgp{} with each layer connected to the input $\vx$.
%\emph{Top left:} Points drawn from a 2-dimensional Gaussian distribution, color-coded by their location.
%\emph{Subsequent panels:} Those same points, successively warped by functions drawn from a \gp{} prior.
As depth increases, the density becomes more complex without concentrating only along one-dimensional filaments.}
\label{fig:no_filamentation}
\end{figure}
%
\begin{figure}
\centering
\newcommand{\spectrumpiccon}[1]{
\includegraphics[trim=4mm 1mm 4mm 2.5mm, clip, width=0.475\columnwidth]{\deeplimitsfiguresdir/spectrum/con-layer-#1}}
\begin{tabular}{ccc}
& 25 layers & 50 layers \\
\hspace{-0.5cm} \begin{sideways} {\quad Normalized singular value} \end{sideways} & \hspace{-0.2in} \spectrumpiccon{25} & \hspace{-0.16in} \spectrumpiccon{50} \\