Open Problem: The landscape of the loss surfaces of multilayer networks
JMLR: Workshop and Conference Proceedings vol 40:1–5, 2015. 28th Annual Conference on Learning Theory.

Open Problem: The landscape of the loss surfaces of multilayer networks

Anna Choromanska, Courant Institute of Mathematical Sciences, New York University, New York. ACHOROMA@CIMS.NYU.EDU

Yann LeCun, Courant Institute of Mathematical Sciences, New York University, and Facebook Research, New York. YANN@CS.NYU.EDU

Gérard Ben Arous, Courant Institute of Mathematical Sciences, New York University, New York. BENAROUS@CIMS.NYU.EDU

Editor: Under Review for COLT 2015

Abstract

Deep learning has enjoyed a resurgence of interest in the last few years for such applications as image and speech recognition, or natural language processing. The vast majority of practical applications of deep learning focus on supervised learning, where the supervised loss function is minimized using stochastic gradient descent. The properties of this highly non-convex loss function, such as its landscape and the behavior of its critical points (maxima, minima, and saddle points), as well as the reason why large- and small-size networks achieve radically different practical performance, are however very poorly understood. It was only recently shown that new results in spin-glass theory may potentially provide an explanation for these problems by establishing a connection between the loss function of neural networks and the Hamiltonian of spherical spin-glass models. The connection between the two models relies on a number of possibly unrealistic assumptions, yet the empirical evidence suggests that the connection may exist in reality. The question we pose is whether it is possible to drop some of these assumptions to establish a stronger connection between the two models.

Keywords: multilayer networks, deep learning, spherical spin-glass model, Hamiltonian, non-convex optimization.

1. Introduction

The vast majority of practical applications of deep learning use multi-stage architectures composed of alternated layers of linear transformations and max functions (most often Rectified Linear Units, e.g. Nair and Hinton (2010)), and focus on supervised learning, where the loss function that needs to be minimized is most often the cross entropy or the hinge loss. Several researchers experimenting with larger networks have noticed that, while multilayer nets do have many local minima, the results of multiple experiments consistently give very similar performance. This suggests that all those local minima are more or less equivalent in terms of error. It was also previously noticed that the problem of training deep learning systems resides in avoiding saddle points and quickly breaking the symmetry by picking a side of a saddle point and choosing a suitable attractor (LeCun et al. (1998); Saxe et al. (2014); Dauphin et al. (2014)).

Earlier theoretical analyses, conveniently reviewed in Dauphin et al. (2014), suggest the existence of a certain structure of critical points of random Gaussian error functions on high dimensional continuous spaces. They imply that critical points whose error is much higher than the global minimum are exponentially likely to be saddle points with many negative and approximate plateau directions, whereas all local minima are likely to have an error very close to that of the global minimum. Their work establishes a strong empirical connection between neural networks and the theory of random Gaussian fields by providing experimental evidence that the cost function of neural networks exhibits the same properties as Gaussian error functions on high dimensional continuous spaces. Nevertheless, they provide no theoretical justification for the existence of this connection.

© 2015 A. Choromanska, Y. LeCun & G. Ben Arous.
2. The connection between multilayer networks and spin-glass models

We next discuss the assumptions that were made in Choromanska et al. (2015) to establish a connection between the loss function of neural networks and the Hamiltonian of the spherical spin-glass models (for detailed explanations see Choromanska et al. (2015)). The assumptions are numbered and marked with the letter p or u, denoting whether the assumption is respectively plausible, i.e. it can be satisfied in practice or else it can be imposed on the network without significantly changing its performance, or obviously unrealistic; e.g. A1p denotes the first assumption, which is plausible.

It can be shown that the loss function of a typical multilayer network with ReLUs can be expressed as a polynomial function of the weights in the network, whose degree is the number of layers, and whose number of monomials is the number of paths (denoted as $\Psi$) from inputs to outputs. As the weights (or the inputs) vary, some of the monomials are switched off and others become activated.

Consider a simple model of a fully-connected feed-forward neural network with $H$ hidden layers ($n_i$ denotes the number of units in the $i$-th hidden layer, where the input layer has index $i = 0$ and the output layer has index $i = H$), having a single output (consider a binary classification problem). Let $\Lambda = \sqrt[H]{\Psi}$, and assume $\Lambda \in \mathbb{Z}^{+}$. Let $X_i$ be the random input of the $i$-th path of the network. Then the normalized output of the network can be expressed as

$$Y = \Lambda^{-(H-1)/2} \sum_{i=1}^{\Psi} X_i A_i \prod_{k=1}^{H} w_i^{(k)},$$

where $w_i^{(k)}$ is the weight of the $k$-th segment of the $i$-th path (this segment connects layer $(k-1)$ with layer $k$ of the network), and $A_i$ is a Bernoulli random variable denoting whether the $i$-th path is active ($A_i = 1$) or not ($A_i = 0$).
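To make the path decomposition concrete, here is a minimal numerical check (our own sketch, not code from the paper; the layer sizes, seed, and variable names are arbitrary illustrative choices). It omits the $\Lambda^{-(H-1)/2}$ normalization, which only rescales the output, and verifies that a standard ReLU forward pass equals the sum over all input-to-output paths of the input value times the product of weights along the path times a 0/1 activation indicator:

```python
import itertools
import numpy as np

# Check the path decomposition: for a ReLU network, the output equals the sum
# over all input-to-output paths of (input value) x (product of weights along
# the path) x (0/1 indicator that every ReLU on the path fired).
rng = np.random.default_rng(0)
sizes = [4, 5, 5, 1]             # input, two hidden layers, single output (H = 3 weight layers)
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]

x = rng.standard_normal(sizes[0])

# Standard forward pass, recording which ReLUs are active.
h, active = x, []
for k, W in enumerate(Ws):
    z = W @ h
    if k < len(Ws) - 1:          # ReLU on hidden layers only
        active.append(z > 0)
        h = np.maximum(z, 0.0)
    else:
        h = z
y_forward = h[0]

# Path sum: enumerate every path (j0, j1, ..., jH) through the network.
y_paths = 0.0
for path in itertools.product(*[range(n) for n in sizes]):
    w_prod = np.prod([Ws[k][path[k + 1], path[k]] for k in range(len(Ws))])
    # A_i in {0, 1}: 1 iff every hidden unit along the path is active.
    A = np.prod([active[k][path[k + 1]] for k in range(len(Ws) - 1)])
    y_paths += x[path[0]] * A * w_prod

assert np.isclose(y_forward, y_paths)   # the two computations agree
```

The identity holds because a ReLU simply multiplies its input by a 0/1 indicator, so unrolling the forward pass yields exactly one monomial per path; which indicators $A_i$ equal 1 depends on both the weights and the input, which is precisely what switches monomials on and off.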
Consider the hinge loss $L(w) = \max(0, 1 - Y_t Y)$, where $Y_t$ is a random variable corresponding to the true data labeling, taking values $-1$ or $1$, and $w$ denotes all the network weights. Recall that the max operator is often modeled as a Bernoulli random variable taking values $0$ or $1$. Denote this random variable as $M$ and its expectation as $\rho'$. Therefore

$$L(w) = M(1 - Y_t Y) = M + \Lambda^{-(H-1)/2} \sum_{i=1}^{\Psi} Z_i I_i \prod_{k=1}^{H} w_i^{(k)}, \qquad (1)$$

where $Z_i = -Y_t X_i$, and $I_i = M A_i$ is a Bernoulli random variable taking values $0$ or $1$. Assume that the random variables $I_1, I_2, \dots, I_\Psi$ have the same probability of success (A1p), and thus they have the same expectation, denoted as $\rho$. Also assuming that each $X_i$ is a standard Gaussian random variable (A2p), it follows that $Z_i$ is also a standard Gaussian random variable.

For large-size networks a large number of network parameters are redundant (Denil et al. (2013)) and can either be learned from a very small set of unique parameters or not learned at all, with almost no loss in prediction accuracy. Assume that $\Lambda$ is the maximal number of non-redundant (unique) parameters (A3p), and that they are uniformly distributed on the graph of connections of the network (A4p), i.e. every $H$-length product of unique weights appears in Equation (1) (the set of all products is $\{w_{i_1} w_{i_2} \cdots w_{i_H}\}_{i_1, i_2, \dots, i_H = 1}^{\Lambda}$). Thus re-indexing the terms gives

$$L(w) = M + \Lambda^{-(H-1)/2} \sum_{i_1, i_2, \dots, i_H = 1}^{\Lambda} Z_{i_1, i_2, \dots, i_H} I_{i_1, i_2, \dots, i_H}\, w_{i_1} w_{i_2} \cdots w_{i_H}.$$

Assuming (A5u) the independence of $Z_{i_1, \dots, i_H}$ and $I_{i_1, \dots, i_H}$, one obtains

$$\mathbb{E}_{M, I_1, I_2, \dots, I_\Psi}[L(w)] = \rho' + \rho\, \Lambda^{-(H-1)/2} \sum_{i_1, i_2, \dots, i_H = 1}^{\Lambda} Z_{i_1, i_2, \dots, i_H}\, w_{i_1} w_{i_2} \cdots w_{i_H}.$$

It is also assumed that the $Z$'s are independent (A6u). Finally, the spherical assumption (A7p) imposes that $\frac{1}{\Lambda} \sum_{i=1}^{\Lambda} w_i^2 = 1$. Note that the sum in the last expression is the Hamiltonian of the spherical spin-glass model (Auffinger et al. (2010)).
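One way to see why the $\Lambda^{-(H-1)/2}$ normalization is natural: for any fixed $w$ satisfying the spherical constraint, the Hamiltonian has mean $0$ and variance $\Lambda^{-(H-1)} (\sum_i w_i^2)^H = \Lambda$ over the Gaussian couplings, so typical energies scale linearly with $\Lambda$ (the same scaling used to rescale the spin-glass losses in Appendix A, footnote 2). Below is a small Monte-Carlo check of this variance computation; it is our own illustration, with arbitrary sizes and seed:

```python
import numpy as np

# Monte-Carlo check: for fixed w with (1/Lam) * sum w_i^2 = 1, the Hamiltonian
#   H_L(w) = Lam^{-(H-1)/2} * sum_{i1..iH} Z_{i1..iH} w_{i1} ... w_{iH}
# with i.i.d. standard Gaussian couplings Z has mean 0 and variance
#   Lam^{-(H-1)} * (sum_i w_i^2)^H = Lam.
rng = np.random.default_rng(0)
H, Lam, trials = 3, 10, 2000

w = rng.standard_normal(Lam)
w *= np.sqrt(Lam) / np.linalg.norm(w)    # enforce the spherical constraint

outer = w                                # rank-H tensor of weight products
for _ in range(H - 1):
    outer = np.multiply.outer(outer, w)  # outer[i1..iH] = w_{i1} ... w_{iH}

samples = []
for _ in range(trials):
    Z = rng.standard_normal((Lam,) * H)  # fresh couplings each trial
    samples.append(Lam ** (-(H - 1) / 2) * np.sum(Z * outer))

print(np.var(samples), "should be close to Lam =", Lam)
```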
It was recently shown (Auffinger et al. (2010)) that the Hamiltonian of this model has interesting properties when the size of the model ($\Lambda$) goes to infinity. We next list these properties along with their possible interpretation for neural networks: (i) critical points form an ordered structure such that there exists an energy barrier (a certain value of the Hamiltonian) below which with overwhelming probability one can find only low-index critical points¹, most of which are concentrated close to the barrier (this would explain why in the case of large networks the recovered local minima typically correspond to the same test performance, which is not the case for small networks), (ii) recovering the ground state, i.e. the global minimum, takes an exponentially long time, (iii) with overwhelming probability one can find only high-index saddle points above the energy barrier, and there are exponentially many of those (this would explain the importance of saddle points in the optimization problem), and (iv) low-index critical points lie geometrically closer to the ground state than high-index critical points (this would explain why recovering poor quality local minima, which are far from the global minimum, is more likely for small-size networks than for large-size networks).

Open problem: Is it possible to establish a connection between the loss function of neural networks and the Hamiltonian of the spherical spin-glass models under milder assumptions? The central problem is to eliminate the unrealistic assumptions of variable independence (A5u and A6u). Note that assumption A5u implies that the activation mechanism of any path (for the $i$-th path it is denoted as $I_i$) is independent of the input data, which clearly cannot be true. Similarly, assumption A6u implies that all paths have independent inputs, which cannot be true since many paths share the same input. Alternatively, it would also be desirable to find network architectures for which the connection to spin-glass models can be established explicitly with only mild (plausible) assumptions, if any.

References

A. Auffinger, G. Ben Arous, and J. Černý. Random matrices and complexity of spin glasses. arXiv:1003.1129, 2010.

A. Choromanska, M. Henaff, M. Mathieu, G. Ben Arous, and Y. LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015.

Y. Dauphin, R. Pascanu, Ç. Gülçehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In NIPS, 2014.

M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. D. Freitas. Predicting parameters in deep learning. In NIPS, 2013.

Y. LeCun, L. Bottou, G. Orr, and K. Muller. Efficient backprop. In Neural Networks: Tricks of the Trade. Springer, 1998.

V. Nair and G. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.

A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In ICLR, 2014.

1. The index of $L$ at $w$ is the number of negative eigenvalues of the Hessian $\nabla^2 L$ at $w$. Local minima have index $0$.

Appendix A. Empirical evidence

Figure 1: Distributions of the scaled test losses for the spin-glass (left) and the neural network (right) experiments.

In this section we briefly summarize a subset of results from Choromanska et al. (2015) showing the similarity between the loss function of neural networks and the Hamiltonian of the spherical spin-glass models. The spin-glass model was simulated for $\Lambda$ ranging from 25 to 500, where for each value of $\Lambda$ the distribution of minima was obtained by sampling initial points on the unit sphere and performing stochastic gradient descent (SGD) to find a minimum energy point. The neural network model was simulated using a scaled-down version of MNIST, where each image was downsampled to size $10 \times 10$. Networks were trained with one hidden layer and nhidden = {25, 50, 100, 250, 500} hidden units, each one starting from a random set of parameters sampled uniformly within the unit cube. All networks were trained for the same number of epochs using SGD with learning rate decay.

The distribution of the scaled test losses² is compared in Figure 1 for both models. We see that for small values of $\Lambda$ and nhidden, we obtain poor local minima³ in many experiments. For larger values of $\Lambda$ and nhidden, the variance of the losses decreases, and the distribution becomes increasingly concentrated around the energy barrier, where local minima have high quality. This indicates that (i) getting stuck in poor local minima is a major problem for smaller networks but becomes gradually less important as the network size increases, and (ii) in the case of larger networks the recovered local minima typically correspond to the same test performance, which is not the case for small networks.

2. To observe qualitative differences in behavior for different values of $\Lambda$ (for the spin-glass model) and nhidden (for the neural network), it is necessary to rescale the loss values to make their expected values approximately equal. For spin-glasses, the expected value of the loss at critical points scales linearly with $\Lambda$, therefore the losses have to be divided by $\Lambda$, whereas for neural networks the expected value of the loss at critical points was empirically found to scale with nhidden according to the power law $\mathbb{E}[L] \approx e^{\alpha}\,\mathrm{nhidden}^{\beta}$ ($\alpha$ and $\beta$ are coefficients), therefore the losses were rescaled as $L / (e^{\alpha}\,\mathrm{nhidden}^{\beta})$.

3. Almost all recovered solutions were local minima with index equal to $0$ (while computing the index of the solutions, all eigenvalues below a small magnitude threshold were set to $0$).
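The spin-glass half of the experiment above is straightforward to imitate. The following self-contained sketch is ours, not the paper's code: it uses plain projected gradient descent rather than SGD, and $\Lambda$, the step size, the number of steps, and the number of restarts are arbitrary illustrative choices. It builds the $H$-spin Hamiltonian from Section 2 with i.i.d. Gaussian couplings, descends it from random points on the sphere, and rescales the recovered energies by $\Lambda$ as in footnote 2:

```python
import numpy as np

# Rough imitation of the spin-glass experiment: sample starting points on the
# sphere ||w||^2 = Lam and descend the Hamiltonian by projected gradient steps.
rng = np.random.default_rng(2)
H, Lam = 3, 25
Z = rng.standard_normal((Lam,) * H)        # i.i.d. Gaussian couplings
scale = Lam ** (-(H - 1) / 2)

def energy(w):
    val = Z
    for _ in range(H):
        val = val @ w                      # contract one index per step
    return scale * val

def grad(w):
    g = np.zeros_like(w)
    for axis in range(H):                  # one term per position of the free index
        val = np.moveaxis(Z, axis, 0)      # put the free index first
        for _ in range(H - 1):
            val = val @ w                  # contract the remaining H-1 indices
        g += val
    return scale * g

def project(w):                            # enforce (1/Lam) * sum w_i^2 = 1
    return w * np.sqrt(Lam) / np.linalg.norm(w)

minima = []
for _ in range(20):                        # a handful of random restarts
    w = project(rng.standard_normal(Lam))
    for _ in range(500):
        w = project(w - 0.01 * grad(w))
    minima.append(energy(w) / Lam)         # rescale by Lambda, as in footnote 2
print(f"scaled energies after descent: min {min(minima):.3f}, max {max(minima):.3f}")
```

For larger $\Lambda$ and more restarts, the distribution of the rescaled recovered energies should concentrate, mirroring the behavior summarized in the left panel of Figure 1.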
Appendix B. Spherical spin-glass model

Figure 2 shows exemplary plots of the distributions of the mean number of critical points, local minima, and low-index saddle points. Clearly, local minima and low-index saddle points are located in the band $(-\Lambda E_0(H), -\Lambda E_\infty(H))$, where $-\Lambda E_\infty(H)$ is the energy barrier and $-\Lambda E_0(H)$ corresponds to the ground state (global minimum), whereas high-index saddle points can only be found above the energy barrier $-\Lambda E_\infty(H)$. This geometric structure, if it is also true for multilayer neural networks, plays a crucial role in the optimization problem. The optimizer, e.g. SGD, often easily avoids the band of high-index critical points, which have many negative curvature directions, and descends to the band of low-index critical points, which lie closer to the global minimum. Thus finding a bad-quality solution, i.e. one far away from the global minimum, is highly unlikely for large-size networks (this is also confirmed by the experimental results in Figure 1). Furthermore, as shown in Figure 2, low-index critical points are mostly concentrated close to the energy barrier (a "peaked" distribution), which would potentially explain why in the case of large networks the recovered local minima typically correspond to the same test performance, which is not the case for small networks.

Figure 2: Distribution of the mean number of critical points, local minima and low-index saddle points (original and zoomed; $k$ denotes the index), for $H = 3$ and a fixed value of $\Lambda$. Black line: $u = -\Lambda E_\infty(H)$; red line: $u = -\Lambda E_0(H)$, which corresponds to the ground state (global minimum). Figure must be read in color.
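Footnotes 1 and 3 specify how the index of a recovered solution is computed: count the negative eigenvalues of the Hessian after zeroing out eigenvalues of negligible magnitude. The sketch below is a rough, self-contained illustration of that bookkeeping for the $3$-spin energy on the sphere; the tangent-space (Riemannian) Hessian formula, the threshold value, and all sizes are our own choices, not the paper's procedure. After a short projected descent it should typically report index $0$, i.e. a local minimum:

```python
import itertools
import numpy as np

# Illustrative index computation for the 3-spin energy
#   E(w) = scale * sum_{i,j,k} Z_{ijk} w_i w_j w_k   on the sphere ||w||^2 = Lam.
rng = np.random.default_rng(3)
H, Lam = 3, 15
Z = rng.standard_normal((Lam,) * H)
# Symmetrize the couplings so that gradient and Hessian formulas are simple.
Zs = sum(np.transpose(Z, p) for p in itertools.permutations(range(H))) / 6.0
scale = Lam ** (-(H - 1) / 2)

def grad(w):
    return 3.0 * scale * (Zs @ w @ w)      # gradient of the symmetric cubic form

def hess(w):
    return 6.0 * scale * (Zs @ w)          # Euclidean Hessian, shape (Lam, Lam)

def index_on_sphere(w, tol=1e-3):
    """Number of negative eigenvalues of the Hessian restricted to the sphere."""
    P = np.eye(Lam) - np.outer(w, w) / Lam          # tangent-space projector
    # Second derivative of E along geodesics of the radius-sqrt(Lam) sphere.
    M = P @ (hess(w) - (w @ grad(w) / Lam) * np.eye(Lam)) @ P
    eigs = np.linalg.eigvalsh(M)
    eigs[np.abs(eigs) < tol] = 0.0         # zero-out tiny eigenvalues (cf. footnote 3)
    return int(np.sum(eigs < 0.0))         # local minima have index 0 (footnote 1)

# Descend to (approximately) a local minimum, whose index should be 0.
w = rng.standard_normal(Lam)
w *= np.sqrt(Lam) / np.linalg.norm(w)
for _ in range(2000):
    w = w - 0.02 * grad(w)
    w *= np.sqrt(Lam) / np.linalg.norm(w)  # project back onto the sphere
print("index of the recovered solution:", index_on_sphere(w))
```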