Efficiently Training Sum-Product Neural Networks using Forward Greedy Selection
1 Efficiently Training Sum-Product Neural Networks using Forward Greedy Selection. Shai Shalev-Shwartz, School of CS and Engineering, The Hebrew University of Jerusalem. Greedy Algorithms, Frank-Wolfe and Friends: A Modern Perspective, Lake Tahoe, December 2013. Based on joint work with Ohad Shamir.
2 Neural Networks. A single neuron with activation function σ : R → R. [Figure: inputs x_1, ..., x_5 with weights v_1, ..., v_5 feeding a neuron that outputs σ(⟨v, x⟩).] Usually, σ is taken to be a sigmoidal function.
3 Neural Networks. A multilayer neural network of depth 3 and size 6. [Figure: input layer x_1, ..., x_5, two hidden layers, and an output layer.]
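To make the architecture concrete, here is a minimal sketch (mine, not from the talk; NumPy-based, with made-up layer sizes) of the forward pass of such a network: each hidden neuron computes σ(⟨v, x⟩) for its weight vector v, and depth 3 means two hidden layers before a linear output.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Forward pass of a feed-forward network; `layers` is a list of weight matrices."""
    a = x
    for W in layers[:-1]:
        a = sigmoid(W @ a)        # each hidden neuron computes sigma(<v, a>)
    return layers[-1] @ a         # linear output neuron

# A depth-3 network on 5 inputs with hidden layers of sizes 3 and 2 (6 non-input neurons in total)
rng = np.random.default_rng(0)
layers = [rng.normal(size=(3, 5)), rng.normal(size=(2, 3)), rng.normal(size=(1, 2))]
print(forward(rng.normal(size=5), layers))
```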
4 Why Deep Neural Networks are Great? Because A used it to do B.
5 Why Deep Neural Networks are Great? Because A used it to do B. Classic explanation: Neural Networks are universal approximators: every Lipschitz function f : [−1, 1]^d → [−1, 1] can be approximated by a neural network.
6 Why Deep Neural Networks are Great? Because A used it to do B. Classic explanation: Neural Networks are universal approximators: every Lipschitz function f : [−1, 1]^d → [−1, 1] can be approximated by a neural network. Not convincing, because: it can be shown that the size of the network must be exponential in d, so why should we care about such large networks? Many other universal approximators exist (nearest neighbor, boosting with decision stumps, SVM with RBF kernels), so why should we prefer neural networks?
7 Why Deep Neural Networks are Great? A Statistical Learning Perspective. Goal: learn a function h : X → Y based on training examples S = ((x_1, y_1), ..., (x_m, y_m)) ∈ (X × Y)^m. No-Free-Lunch Theorem: for any algorithm A and any sample size m, there exists a distribution D over X × Y and a function h such that h is perfect w.r.t. D, but with high probability over S ~ D^m the output of A is very bad. Prior knowledge: we must bias the learner toward reasonable functions, i.e., a hypothesis class H ⊆ Y^X. What should H be?
8 Why Deep Neural Networks are Great? A Statistical Learning Perspective. Consider all functions over {0, 1}^d that can be executed in time at most T(d). Theorem: the class H_NN of neural networks of depth O(T(d)) and size O(T(d)^2) contains all functions that can be executed in time at most T(d). A great hypothesis class: with sufficiently large network depth and size, we can express all functions we would ever want to learn. Sample complexity behaves nicely and is well understood (see Anthony & Bartlett 1999). End of story?
9 Why Deep Neural Networks are Great? A Statistical Learning Perspective. Consider all functions over {0, 1}^d that can be executed in time at most T(d). Theorem: the class H_NN of neural networks of depth O(T(d)) and size O(T(d)^2) contains all functions that can be executed in time at most T(d). A great hypothesis class: with sufficiently large network depth and size, we can express all functions we would ever want to learn. Sample complexity behaves nicely and is well understood (see Anthony & Bartlett 1999). End of story? The computational barrier: but how do we train neural networks?
10 Neural Networks: The computational barrier. It is NP-hard to implement ERM for a depth-2 network with k ≥ 3 hidden neurons whose activation function is sigmoidal or sign (Blum and Rivest 1992, Bartlett and Ben-David 2002). Current approaches: backpropagation, possibly with unsupervised pre-training and other bells and whistles. No theoretical guarantees, and often requires manual tweaking.
11 Outline: How to circumvent hardness? 1) Over-specification: extreme over-specification eliminates local (non-global) minima; hardness of improperly learning a two-layer network with k = ω(1) hidden neurons. 2) Change the activation function (sum-product networks): efficiently learning sum-product networks of depth 2 using Forward Greedy Selection; hardness of learning deep sum-product networks.
12 Circumventing Hardness using Over-specification. Yann LeCun: fix a network architecture and generate data according to it; backpropagation fails to recover the parameters. However, if we enlarge the network size, backpropagation works just fine.
13 Circumventing Hardness using Over-specification. Yann LeCun: fix a network architecture and generate data according to it; backpropagation fails to recover the parameters. However, if we enlarge the network size, backpropagation works just fine. Maybe we can efficiently learn neural networks using over-specification?
14 Extremely over-specified Networks have no local (non-global) minima. Let X ∈ R^{d×m} be a data matrix of m examples. Consider a network with: N internal neurons; v the weights of all but the last layer; F(v; X) the evaluations of the internal neurons over the data matrix X; w the weights connecting the internal neurons to the output neuron. The output of the network is w^T F(v; X). Theorem: if N ≥ m, and under mild conditions on F, the optimization problem min_{w,v} ‖w^T F(v; X) − y‖^2 has no local (non-global) minima.
15 Extremely over-specified Networks have no local (non-global) minima. Let X ∈ R^{d×m} be a data matrix of m examples. Consider a network with: N internal neurons; v the weights of all but the last layer; F(v; X) the evaluations of the internal neurons over the data matrix X; w the weights connecting the internal neurons to the output neuron. The output of the network is w^T F(v; X). Theorem: if N ≥ m, and under mild conditions on F, the optimization problem min_{w,v} ‖w^T F(v; X) − y‖^2 has no local (non-global) minima. Proof idea: w.h.p. over a perturbation of v, F(v; X) has full rank. For such v, if we are not at a global minimum, then just by changing w we can decrease the objective.
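The proof idea can be checked numerically; the following is a hedged sketch (mine, using an arbitrary tanh activation and random data) showing that once N ≥ m and F(v; X) has full rank, the objective is driven to (numerically) zero by a least-squares solve over the output weights w alone, without touching v.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, N = 10, 50, 60                        # N >= m internal neurons
X = rng.normal(size=(d, m))                 # data matrix, one example per column
y = rng.normal(size=m)                      # arbitrary targets

V = rng.normal(size=(N, d))                 # first-layer weights v (random, then held fixed)
F = np.tanh(V @ X)                          # F(v; X): N x m matrix of internal-neuron outputs

# With N >= m, F has rank m w.h.p., so min_w ||F^T w - y||^2 is already zero:
w, *_ = np.linalg.lstsq(F.T, y, rcond=None)
print(np.linalg.norm(F.T @ w - y))          # ~1e-13: the global minimum is reached
```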
16 Is over-specification enough? But such large networks will lead to overfitting. Maybe there's a clever trick that circumvents overfitting (regularization, dropout, ...)?
17 Is over-specification enough? But such large networks will lead to overfitting. Maybe there's a clever trick that circumvents overfitting (regularization, dropout, ...)? Theorem (Daniely, Linial, S.): even if the data is perfectly generated by a neural network of depth 2 with only k = ω(1) neurons in the hidden layer, there is no efficient algorithm that can achieve small test error. Corollary: over-specification alone is not enough for efficient learnability.
18 Proof Idea: Hardness of Improper Learning. Improper learning: the learner tries to learn some hypothesis h ∈ H but is not restricted to output a hypothesis from H.
19 Proof Idea: Hardness of Improper Learning. Improper learning: the learner tries to learn some hypothesis h ∈ H but is not restricted to output a hypothesis from H. How to show hardness?
20 Proof Idea: Hardness of Improper Learning. Improper learning: the learner tries to learn some hypothesis h ∈ H but is not restricted to output a hypothesis from H. How to show hardness? Technical novelty: a new method for deriving lower bounds for improper learning, which relies on average-case complexity assumptions. The technique yields new hardness results for improper learning of: DNFs (open problem since Kearns & Valiant 1989); intersection of ω(1) halfspaces (Klivans & Sherstov 2006 showed hardness for d^c halfspaces); constant approximation ratio for agnostically learning halfspaces (previously, only hardness of exact learning was known).
21 Proof Idea: Hardness of Improper Learning. Improper learning: the learner tries to learn some hypothesis h ∈ H but is not restricted to output a hypothesis from H. How to show hardness? Technical novelty: a new method for deriving lower bounds for improper learning, which relies on average-case complexity assumptions. The technique yields new hardness results for improper learning of: DNFs (open problem since Kearns & Valiant 1989); intersection of ω(1) halfspaces (Klivans & Sherstov 2006 showed hardness for d^c halfspaces); constant approximation ratio for agnostically learning halfspaces (previously, only hardness of exact learning was known). Can also be used to establish Computational-Statistical Tradeoffs (Daniely, Linial, S., NIPS 13).
22 Outline: How to circumvent hardness? 1) Over-specification: extreme over-specification eliminates local (non-global) minima; hardness of improperly learning a two-layer network with k = ω(1) hidden neurons. 2) Change the activation function (sum-product networks): efficiently learning sum-product networks of depth 2 using Forward Greedy Selection; hardness of learning deep sum-product networks.
23 Circumventing hardness: sum-product networks. Simpler non-linearity: replace the sigmoidal activation function by the square function σ(a) = a^2. The network implements polynomials, where the depth corresponds to the degree. The size of the network (number of neurons) determines generalization properties and evaluation time. Can we efficiently learn the class of polynomial networks of small size?
24 Depth 2 polynomial network. [Figure: input layer x_1, ..., x_5; hidden layer computing ⟨v_1, x⟩^2, ⟨v_2, x⟩^2, ⟨v_3, x⟩^2; output layer computing Σ_i w_i ⟨v_i, x⟩^2.]
25 Depth 2 polynomial networks. Corresponding hypothesis class: H = { x ↦ Σ_{i=1}^r w_i ⟨v_i, x⟩^2 : ‖w‖ = O(1), ∀i ‖v_i‖ = 1 }.
26 Depth 2 polynomial networks. Corresponding hypothesis class: H = { x ↦ Σ_{i=1}^r w_i ⟨v_i, x⟩^2 : ‖w‖ = O(1), ∀i ‖v_i‖ = 1 }. ERM is still NP-hard. But here, a slight over-specification works!
27 Depth 2 polynomial networks. Corresponding hypothesis class: H = { x ↦ Σ_{i=1}^r w_i ⟨v_i, x⟩^2 : ‖w‖ = O(1), ∀i ‖v_i‖ = 1 }. ERM is still NP-hard. But here, a slight over-specification works! Using d^2 hidden neurons suffices (trivial). Can we do better?
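For reference, a small sketch (my code, not the talk's) of evaluating a hypothesis from H: the prediction Σ_i w_i ⟨v_i, x⟩^2 is exactly the quadratic form xᵀAx with A = Σ_i w_i v_i v_iᵀ, which is presumably why on the order of d^2 hidden neurons trivially suffice: with that budget one can simply fit the full d × d matrix A.

```python
import numpy as np

def poly_net(x, w, V):
    """Depth-2 polynomial network: sum_i w_i * <v_i, x>^2 (rows of V are the unit vectors v_i)."""
    return float(np.sum(w * (V @ x) ** 2))

rng = np.random.default_rng(0)
d, r = 5, 3
V = rng.normal(size=(r, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)    # enforce ||v_i|| = 1
w = rng.normal(size=r)
x = rng.normal(size=d)

A = (V.T * w) @ V                                # A = sum_i w_i v_i v_i^T
print(poly_net(x, w, V), x @ A @ x)              # the two evaluations agree
```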
28 Forward Greedy Selection. Goal: minimize R(w) s.t. ‖w‖_0 ≤ r. Define, for I ⊆ [d]: w_I = argmin_{w : supp(w) ⊆ I} R(w). Forward Greedy Selection (FGS): start with I = {}. For t = 1, 2, ...: pick j s.t. |∇_j R(w_I)| ≥ (1 − τ) max_i |∇_i R(w_I)|; let J = I ∪ {j}; replacement step: let I be any set s.t. R(w_I) ≤ R(w_J) and |I| ≤ |J|.
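Below is a hedged Python sketch of the FGS template just described (the callback names `grad` and `min_over_support` are mine, and the toy objective is a generic sparse least-squares problem rather than the network-training loss; the replacement step is omitted, which the analysis allows since keeping J itself is a valid replacement).

```python
import numpy as np

def fgs(grad, min_over_support, d, num_iters):
    """Forward Greedy Selection: grow the support of w one coordinate at a time.

    grad(w): gradient of the objective R at w.
    min_over_support(I): argmin of R over vectors supported on the index set I.
    """
    I, w = [], np.zeros(d)
    for _ in range(num_iters):
        g = np.abs(grad(w))
        j = int(np.argmax(g))     # any j within a (1 - tau) factor of the max would also do
        if j not in I:
            I.append(j)
        w = min_over_support(I)   # fully corrective step: w_I = argmin_{supp(w) in I} R(w)
    return w, I

# Toy example: R(w) = ||Xw - y||^2 / (2m) with a 3-sparse ground truth
rng = np.random.default_rng(0)
m, d = 100, 20
X = rng.normal(size=(m, d))
w_true = np.zeros(d)
w_true[[3, 7, 11]] = [1.0, -2.0, 0.5]
y = X @ w_true

grad = lambda w: X.T @ (X @ w - y) / m

def min_over_support(I):
    w = np.zeros(d)
    sol, *_ = np.linalg.lstsq(X[:, I], y, rcond=None)
    w[I] = sol
    return w

w_hat, I = fgs(grad, min_over_support, d, num_iters=3)
print(sorted(I))                  # typically recovers the true support {3, 7, 11}
```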
29 Analysis of Forward Greedy Selection. Theorem: assume that R is β-smooth w.r.t. some norm for which ‖e_j‖ = 1 for all j. Then, for every ε and w̄, if FGS is run for k ≥ 2β‖w̄‖_1^2 / ((1 − τ)^2 ε) iterations, it outputs w with R(w) ≤ R(w̄) + ε and ‖w‖_0 ≤ k.
30 Analysis of Forward Greedy Selection. Theorem: assume that R is β-smooth w.r.t. some norm for which ‖e_j‖ = 1 for all j. Then, for every ε and w̄, if FGS is run for k ≥ 2β‖w̄‖_1^2 / ((1 − τ)^2 ε) iterations, it outputs w with R(w) ≤ R(w̄) + ε and ‖w‖_0 ≤ k. Remarks: the dimension of w has no effect on the theorem! w̄ is any vector (not necessarily the optimum). If ‖w̄‖_2 = O(1) then ‖w̄‖_1^2 ≤ ‖w̄‖_0 ‖w̄‖_2^2 = O(‖w̄‖_0). The bound depends nicely on τ.
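The norm comparison in the second remark is just Cauchy-Schwarz restricted to the nonzero coordinates of w̄; spelled out:

```latex
\|\bar{w}\|_1 \;=\; \sum_{i \in \mathrm{supp}(\bar{w})} |\bar{w}_i| \cdot 1
\;\le\; \sqrt{|\mathrm{supp}(\bar{w})|} \cdot \sqrt{\sum_i \bar{w}_i^2}
\;=\; \sqrt{\|\bar{w}\|_0}\,\|\bar{w}\|_2,
\qquad\text{hence}\qquad
\|\bar{w}\|_1^2 \;\le\; \|\bar{w}\|_0\,\|\bar{w}\|_2^2 .
```

Substituting this into the iteration bound gives k = O(β‖w̄‖_0 / ((1 − τ)^2 ε)), i.e., the number of iterations scales with the sparsity of the competitor rather than with the dimension.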
31 Forward Greedy Selection for Sum-Product Networks. Let S be the Euclidean sphere of R^d. Think of w as a mapping from S to R. Each hypothesis in H corresponds to some w̄ with ‖w̄‖_0 = r and ‖w̄‖_2 = O(1); R(w) is the training loss of the hypothesis corresponding to w. Applying FGS for k = Ω(r/ε) iterations would yield a network with O(r/ε) hidden neurons and loss of at most R(w̄) + ε. Main caveat: at each iteration we need to find v s.t. |∇_v R(w)| ≥ (1 − τ) max_{u∈S} |∇_u R(w)|. Luckily, this is an eigenvalue problem: ∇_v R(w) = v^T E_{(x,y)}[ ℓ'( Σ_{u∈supp(w)} w_u ⟨u, x⟩^2, y ) x x^T ] v.
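Since the greedy criterion maximizes |vᵀMv| over the unit sphere, the new direction is (up to sign) the eigenvector of M with largest-magnitude eigenvalue. Here is a hedged sketch (my code and naming; in practice a Lanczos-type routine would be preferable) showing that this can be approximated in time linear in the data, without ever forming M.

```python
import numpy as np

def leading_direction(X, c, num_iters=100, seed=0):
    """Power iteration for the largest-magnitude eigenvector of M = X^T diag(c) X.

    X: (m, d) data matrix; c: per-example weights, e.g. c_t = loss derivative at example t.
    M is never materialized: each step is one pass over the data, O(m * d).
    """
    rng = np.random.default_rng(seed)
    v = rng.normal(size=X.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = X.T @ (c * (X @ v))      # compute M v via two matrix-vector products
        v /= np.linalg.norm(v)
    return v
```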
32 The resulting algorithm. Greedy Efficient Component Optimization (GECO): initialize V = [ ], w = [ ]. For t = 1, 2, ..., T: let M = E_{(x,y)}[ ℓ'( Σ_i w_i ⟨v_i, x⟩^2, y ) x x^T ]; set V = [V v], where v is an approximate leading eigenvector of M; let B = argmin_{B ∈ R^{t×t}} E_{(x,y)}[ ℓ( (x^T V) B (V^T x), y ) ]; update w = eigenvalues(B) and V = V · eigenvectors(B).
33 The resulting algorithm. Greedy Efficient Component Optimization (GECO): initialize V = [ ], w = [ ]. For t = 1, 2, ..., T: let M = E_{(x,y)}[ ℓ'( Σ_i w_i ⟨v_i, x⟩^2, y ) x x^T ]; set V = [V v], where v is an approximate leading eigenvector of M; let B = argmin_{B ∈ R^{t×t}} E_{(x,y)}[ ℓ( (x^T V) B (V^T x), y ) ]; update w = eigenvalues(B) and V = V · eigenvectors(B). Remarks: finding an approximate leading eigenvector takes linear time; the overall complexity depends linearly on the size of the data.
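Putting the pieces together, the following is a hedged end-to-end sketch of the GECO loop for the squared loss (my implementation of the pseudocode above, not the authors' code: the expectation is replaced by a training-set average, the eigenvector step uses plain power iteration, the directions are stored as rows of V, and the refit of B is the linear least-squares problem it reduces to under the squared loss).

```python
import numpy as np

def geco(X, y, T, power_iters=100, seed=0):
    """GECO sketch (squared loss): greedily add a direction, refit a t x t matrix B,
    then re-diagonalize so the model keeps the form sum_i w_i <v_i, x>^2."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    V = np.zeros((0, d))                          # rows are the directions v_i
    w = np.zeros(0)

    def predict(V, w):
        return ((X @ V.T) ** 2) @ w if len(w) else np.zeros(m)

    for t in range(1, T + 1):
        # M = (1/m) sum_t l'(pred_t, y_t) x_t x_t^T, with l'(p, y) = 2 (p - y)
        c = 2.0 * (predict(V, w) - y) / m
        v = rng.normal(size=d)
        v /= np.linalg.norm(v)
        for _ in range(power_iters):              # approximate leading eigenvector of M
            v = X.T @ (c * (X @ v))
            v /= np.linalg.norm(v)
        V = np.vstack([V, v])

        # refit: B = argmin_B sum_t ((V x_t)^T B (V x_t) - y_t)^2, least squares in B's entries
        Z = X @ V.T                               # (m, t) projections V x_t
        Phi = np.einsum('mi,mj->mij', Z, Z).reshape(m, t * t)
        b, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        B = 0.5 * (b.reshape(t, t) + b.reshape(t, t).T)

        # re-diagonalize: B = Q diag(w) Q^T, so the model is sum_i w_i <(Q^T V)_i, x>^2
        w, Q = np.linalg.eigh(B)
        V = Q.T @ V

    return w, V

# Hypothetical usage: fit a rank-2 quadratic target
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))
u1, u2 = rng.normal(size=6), rng.normal(size=6)
y = (X @ u1) ** 2 - 0.5 * (X @ u2) ** 2
w, V = geco(X, y, T=4)
print(np.mean((((X @ V.T) ** 2) @ w - y) ** 2))   # mean squared training error of the fit
```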
34 Deeper sum-product networks? Learning depth-2 sigmoidal networks is hard even if we allow over-specification. Learning depth-2 sum-product networks is tractable if we allow slight over-specification. What about higher degrees? Theorem (Livni, Shamir, S.): it is hard to learn polynomial networks of depth poly(d) even if their size is poly(d). Proof idea: it is possible to approximate the sigmoid function with a polynomial of degree poly(d).
35 Deeper sum-product networks? Distributional assumptions: is it easier to learn under certain distributional assumptions? Vanishing Component Analysis (Livni et al. 2013): efficiently finding the generating ideal of the data; can be used to efficiently construct deep features; theoretical guarantees still not satisfactory.
36 Summary. Sigmoidal deep networks are great statistically but cannot be trained efficiently. Sum-product networks of depth 2 can be trained efficiently using forward greedy selection. Very deep sum-product networks cannot be trained efficiently. Open problems: is it possible to train sum-product networks of depth 3? What about depth log(d)? Find a combination of network architecture and distributional assumptions that is useful in practice and leads to efficient algorithms.
37 Thanks! Collaborators. Search for efficient algorithms for deep learning: Ohad Shamir. GECO: Alon Gonen and Ohad Shamir; based on a previous paper with Tong Zhang and Nati Srebro. Lower bounds: Amit Daniely and Nati Linial. VCA: Livni, Lehavi, Nachlieli, Schein, Globerson.
38 Shameless Advertisement.