Efficiently Training Sum-Product Neural Networks using Forward Greedy Selection


Slide 1: Efficiently Training Sum-Product Neural Networks using Forward Greedy Selection
Shai Shalev-Shwartz, School of CS and Engineering, The Hebrew University of Jerusalem
Greedy Algorithms, Frank-Wolfe and Friends: A Modern Perspective, Lake Tahoe, December 2013
Based on joint work with Ohad Shamir

Slide 2: Neural Networks
A single neuron with activation function σ : R → R computes x ↦ σ(⟨v, x⟩).
[Figure: a neuron with inputs x_1, ..., x_5, weights v_1, ..., v_5, and output σ(⟨v, x⟩).]
Usually, σ is taken to be a sigmoidal function.
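As a quick illustration (not part of the slides), a single sigmoidal neuron can be written in a few lines of Python; the names below are hypothetical:

```python
import numpy as np

def sigmoid(z):
    # A sigmoidal activation: sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def neuron(v, x):
    # A single neuron: apply the activation to the inner product <v, x>.
    return sigmoid(np.dot(v, x))

# Example with 5 inputs, as in the slide's figure.
v = np.array([0.5, -1.0, 0.3, 0.0, 2.0])
x = np.array([1.0, 0.2, -0.7, 0.4, 0.1])
print(neuron(v, x))
```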

Slide 3: Neural Networks
A multilayer neural network of depth 3 and size 6.
[Figure: input layer x_1, ..., x_5, two hidden layers, and an output layer.]

Slides 4-6: Why Are Deep Neural Networks Great?
Because A used it to do B.
Classic explanation: neural networks are universal approximators; every Lipschitz function f : [−1, 1]^d → [−1, 1] can be approximated by a neural network.
Not convincing, because:
- It can be shown that the size of the network must be exponential in d, so why should we care about such large networks?
- Many other universal approximators exist (nearest neighbor, boosting with decision stumps, SVM with RBF kernels), so why should we prefer neural networks?

Slide 7: Why Are Deep Neural Networks Great? A Statistical Learning Perspective
Goal: learn a function h : X → Y based on training examples S = ((x_1, y_1), ..., (x_m, y_m)) ∈ (X × Y)^m.
No-Free-Lunch Theorem: for any algorithm A and any sample size m, there exists a distribution D over X × Y and a function h such that h is perfect w.r.t. D, but with high probability over S ~ D^m the output of A is very bad.
Prior knowledge: we must bias the learner toward reasonable functions, i.e. choose a hypothesis class H ⊆ Y^X.
What should H be?

Slides 8-9: Why Are Deep Neural Networks Great? A Statistical Learning Perspective
Consider all functions over {0, 1}^d that can be executed in time at most T(d).
Theorem: the class H_NN of neural networks of depth O(T(d)) and size O(T(d)^2) contains all functions that can be executed in time at most T(d).
A great hypothesis class: with a sufficiently large network depth and size, we can express all functions we would ever want to learn, and the sample complexity behaves nicely and is well understood (see Anthony & Bartlett 1999).
End of story? The computational barrier: but how do we train neural networks?

Slide 10: Neural Networks: The Computational Barrier
It is NP-hard to implement ERM for a depth-2 network with k ≥ 3 hidden neurons whose activation function is sigmoidal or sign (Blum and Rivest 1992; Bartlett and Ben-David 2002).
Current approaches: backpropagation, possibly with unsupervised pre-training and other bells and whistles. There are no theoretical guarantees, and it often requires manual tweaking.

Slide 11: Outline: How to Circumvent Hardness?
1. Over-specification: extreme over-specification eliminates local (non-global) minima; hardness of improperly learning a two-layer network with k = ω(1) hidden neurons.
2. Change the activation function (sum-product networks): efficiently learning sum-product networks of depth 2 using Forward Greedy Selection; hardness of learning deep sum-product networks.

Slides 12-13: Circumventing Hardness using Over-specification
Yann LeCun: fix a network architecture and generate data according to it. Backpropagation fails to recover the parameters; however, if we enlarge the network size, backpropagation works just fine.
Maybe we can efficiently learn neural networks using over-specification?

Slides 14-15: Extremely Over-specified Networks Have No Local (Non-global) Minima
Let X ∈ R^{d,m} be a data matrix of m examples, and consider a network with:
- N internal neurons,
- v: the weights of all but the last layer,
- F(v; X): the evaluations of the internal neurons over the data matrix X,
- w: the weights connecting the internal neurons to the output neuron.
The output of the network is w^T F(v; X).
Theorem: if N ≥ m, then under mild conditions on F, the optimization problem min_{w,v} ‖w^T F(v; X) − y‖² has no local (non-global) minima.
Proof idea: with high probability over a perturbation of v, F(v; X) has full rank. For such v, if we are not at a global minimum, we can decrease the objective just by changing w.
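A minimal numeric sketch of the proof idea (an illustration added here, not the speaker's code): with N ≥ m and a randomly perturbed v, the matrix F(v; X) has full rank, so optimizing the output weights w alone already drives the squared loss to zero; tanh is used as a stand-in internal activation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, N = 5, 20, 30          # input dimension, #examples, #internal neurons (N >= m)

X = rng.normal(size=(d, m))  # data matrix, one example per column
y = rng.normal(size=m)       # arbitrary targets

V = rng.normal(size=(N, d))  # randomly perturbed first-layer weights
F = np.tanh(V @ X)           # F(v; X): N x m evaluations of the internal neurons

# With N >= m, F has full column rank w.h.p., so min_w ||w^T F - y||^2 reaches 0.
w, *_ = np.linalg.lstsq(F.T, y, rcond=None)
print(np.linalg.matrix_rank(F), np.linalg.norm(F.T @ w - y))  # rank m, loss ~ 0
```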

Slides 16-17: Is Over-specification Enough?
But such large networks will lead to overfitting. Maybe there is a clever trick that circumvents overfitting (regularization, dropout, ...)?
Theorem (Daniely, Linial, S.): even if the data is perfectly generated by a neural network of depth 2 with only k = ω(1) neurons in the hidden layer, there is no efficient algorithm that can achieve small test error.
Corollary: over-specification alone is not enough for efficient learnability.

Slides 18-21: Proof Idea: Hardness of Improper Learning
Improper learning: the learner tries to learn some hypothesis h ∈ H but is not restricted to output a hypothesis from H. How do we show hardness?
Technical novelty: a new method for deriving lower bounds for improper learning, which relies on average-case complexity assumptions. The technique yields new hardness results for improper learning of:
- DNFs (an open problem since Kearns & Valiant 1989),
- intersections of ω(1) halfspaces (Klivans & Sherstov 2006 showed hardness for d^c halfspaces),
- constant approximation ratios for agnostically learning halfspaces (previously, only hardness of exact learning was known).
It can also be used to establish computational-statistical tradeoffs (Daniely, Linial, S., NIPS 2013).

Slide 22: Outline (revisited)
Next: change the activation function (sum-product networks): efficiently learning sum-product networks of depth 2 using Forward Greedy Selection; hardness of learning deep sum-product networks.

Slide 23: Circumventing Hardness: Sum-Product Networks
Simpler non-linearity: replace the sigmoidal activation function by the square function σ(a) = a².
The network then implements polynomials, where the depth corresponds to the degree, and the size of the network (number of neurons) determines generalization properties and evaluation time.
Can we efficiently learn the class of polynomial networks of small size?

Slide 24: Depth-2 Polynomial Network
[Figure: input layer x_1, ..., x_5; hidden layer computing ⟨v_1, x⟩², ⟨v_2, x⟩², ⟨v_3, x⟩²; output layer computing Σ_i w_i ⟨v_i, x⟩².]

Slides 25-27: Depth-2 Polynomial Networks
Corresponding hypothesis class:
H = { x ↦ Σ_{i=1}^{r} w_i ⟨v_i, x⟩² : ‖w‖ = O(1), ∀i ‖v_i‖ = 1 }.
ERM is still NP-hard, but here a slight over-specification works! Using d² hidden neurons suffices (trivial). Can we do better?
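A minimal sketch (illustrative, not from the slides) of evaluating a hypothesis from this class: the hidden layer squares the projections ⟨v_i, x⟩ and the output layer takes a weighted sum; the helper name is hypothetical.

```python
import numpy as np

def depth2_poly_net(w, V, x):
    """Evaluate h(x) = sum_i w_i * <v_i, x>^2.

    V : r x d matrix whose rows are the unit-norm hidden weights v_i
    w : length-r vector of output weights
    """
    hidden = (V @ x) ** 2   # hidden layer: <v_i, x>^2 for each i
    return w @ hidden       # output layer: weighted sum

# Example with r = 3 hidden neurons in dimension d = 5.
rng = np.random.default_rng(0)
V = rng.normal(size=(3, 5))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # enforce ||v_i|| = 1
w = np.array([0.5, -0.2, 1.0])
x = rng.normal(size=5)
print(depth2_poly_net(w, V, x))
```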

Slide 28: Forward Greedy Selection
Goal: minimize R(w) subject to ‖w‖₀ ≤ r.
Define, for I ⊆ [d]: w^I = argmin_{w : supp(w) ⊆ I} R(w).
Forward Greedy Selection (FGS):
- Start with I = {}.
- For t = 1, 2, ...:
  - Pick j such that |∇_j R(w^I)| ≥ (1 − τ) max_i |∇_i R(w^I)|.
  - Let J = I ∪ {j}.
  - Replacement step: let I be any set such that R(w^I) ≤ R(w^J) and |I| ≤ |J|.
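A minimal sketch of FGS for a coordinate-wise setting (a squared loss over R^d), added for illustration; the replacement step below simply keeps J, which always satisfies the condition, and all names are hypothetical.

```python
import numpy as np

def fgs(grad_R, solve_on_support, d, r, tau=0.0):
    """Forward Greedy Selection: greedily grow the support of w up to size r."""
    I, w = [], np.zeros(d)
    for _ in range(r):
        g = np.abs(grad_R(w))
        # pick any coordinate whose |gradient| is within a (1 - tau) factor of the largest
        eligible = np.flatnonzero(g >= (1 - tau) * g.max())
        j = int(eligible[0])
        if j not in I:
            I.append(j)
        w = solve_on_support(I)   # fully corrective step: w^I = argmin over supp(w) in I
    return w, I

# Example: R(w) = ||A w - b||^2 / (2 m) with a 3-sparse planted solution.
rng = np.random.default_rng(0)
m, d = 100, 50
A = rng.normal(size=(m, d))
w_star = np.zeros(d); w_star[:3] = [1.0, -2.0, 0.5]
b = A @ w_star

grad_R = lambda w: A.T @ (A @ w - b) / m

def solve_on_support(I):
    w = np.zeros(d)
    sol, *_ = np.linalg.lstsq(A[:, I], b, rcond=None)
    w[I] = sol
    return w

w_hat, I = fgs(grad_R, solve_on_support, d, r=3)
print(sorted(I), np.round(w_hat[I], 3))
```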

Slides 29-30: Analysis of Forward Greedy Selection
Theorem: assume that R is β-smooth w.r.t. some norm for which ‖e_j‖ = 1 for all j. Then, for every ε and w̄, if FGS is run for k ≥ 2β‖w̄‖₁² / ((1 − τ)² ε) iterations, it outputs w with R(w) ≤ R(w̄) + ε and ‖w‖₀ ≤ k.
Remarks:
- The dimension of w has no effect on the theorem!
- w̄ is any vector (not necessarily the optimum).
- If ‖w̄‖₂ = O(1) then ‖w̄‖₁² ≤ ‖w̄‖₀ ‖w̄‖₂² = O(‖w̄‖₀).
- The bound depends nicely on τ.
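The third remark is just Cauchy-Schwarz restricted to the support of w̄; a short derivation, added here for completeness:

```latex
\|\bar w\|_1
  = \sum_{i \in \mathrm{supp}(\bar w)} 1 \cdot |\bar w_i|
  \le \Big(\sum_{i \in \mathrm{supp}(\bar w)} 1\Big)^{1/2}
      \Big(\sum_{i \in \mathrm{supp}(\bar w)} \bar w_i^2\Big)^{1/2}
  = \sqrt{\|\bar w\|_0}\,\|\bar w\|_2,
\qquad\text{hence}\qquad
\|\bar w\|_1^2 \le \|\bar w\|_0 \,\|\bar w\|_2^2 .
```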

Slide 31: Forward Greedy Selection for Sum-Product Networks
Let S be the Euclidean sphere of R^d and think of w as a mapping from S to R. Each hypothesis in H corresponds to some w̄ with ‖w̄‖₀ = r and ‖w̄‖₂ = O(1), and R(w̄) is the training loss of the hypothesis in H corresponding to w̄.
Applying FGS for k = Ω(r/ε) iterations would yield a network with O(r/ε) hidden neurons and loss of at most R(w̄) + ε.
Main caveat: at each iteration we need to find v ∈ S such that |∇_v R(w)| ≥ (1 − τ) max_{u ∈ S} |∇_u R(w)|.
Luckily, this is an eigenvalue problem:
∇_v R(w) = v^T E_{(x,y)} [ ℓ′( Σ_{u ∈ supp(w)} w_u ⟨u, x⟩², y ) x x^T ] v.
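A small sketch of this step under the squared loss ℓ(ŷ, y) = (ŷ − y)²/2, so that ℓ′(ŷ, y) = ŷ − y (an illustrative assumption; the function names are hypothetical): the empirical gradient matrix M is built from the sample, and power iteration returns an approximate leading eigenvector, which maximizes |v^T M v| over the sphere.

```python
import numpy as np

def gradient_matrix(X, y, V, w):
    """M = (1/m) sum_j l'(yhat_j, y_j) x_j x_j^T for the squared loss (l' = yhat - y).

    X : d x m data, y : length-m targets, V : r x d hidden weights, w : length-r output weights.
    The greedy criterion |grad_v R(w)| = |v^T M v| is maximized by the leading eigenvector of M.
    """
    yhat = w @ (V @ X) ** 2 if len(w) else np.zeros(X.shape[1])
    return (X * (yhat - y)) @ X.T / X.shape[1]

def power_iteration(M, iters=100, seed=0):
    # Converges to the eigenvector whose eigenvalue has the largest magnitude.
    v = np.random.default_rng(seed).normal(size=M.shape[0])
    for _ in range(iters):
        v = M @ v
        v /= np.linalg.norm(v)
    return v
```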

Slides 32-33: The Resulting Algorithm
Greedy Efficient Component Optimization (GECO):
- Initialize V = [ ], w = [ ].
- For t = 1, 2, ..., T:
  - Let M = E_{(x,y)} [ ℓ′( Σ_i w_i ⟨v_i, x⟩², y ) x x^T ].
  - Set V = [V v], where v is an approximate leading eigenvector of M.
  - Let B = argmin_{B ∈ R^{t,t}} E_{(x,y)} ℓ( (x^T V) B (V^T x), y ).
  - Update w = eigenvalues(B) and V = V · eigenvectors(B).
Remarks:
- Finding an approximate leading eigenvector takes linear time.
- The overall complexity depends linearly on the size of the data.
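A self-contained end-to-end sketch of GECO under the same squared-loss assumption (an illustration of the pseudocode above, not the authors' released implementation; here the rows of V hold the hidden weights, which is the transpose of the slide's convention):

```python
import numpy as np

def geco(X, y, T, power_iters=100, seed=0):
    """GECO sketch for a depth-2 polynomial network with the squared loss.

    X : d x m data matrix, y : length-m targets, T : number of greedy iterations.
    Returns output weights w (length T) and hidden weight rows V (T x d).
    """
    rng = np.random.default_rng(seed)
    d, m = X.shape
    V, w = np.zeros((0, d)), np.zeros(0)
    for t in range(1, T + 1):
        # Gradient matrix M = (1/m) sum_j l'(yhat_j, y_j) x_j x_j^T, with l' = yhat - y.
        yhat = w @ (V @ X) ** 2 if len(w) else np.zeros(m)
        M = (X * (yhat - y)) @ X.T / m
        # Approximate leading eigenvector of M via power iteration.
        v = rng.normal(size=d)
        for _ in range(power_iters):
            v = M @ v
            v /= np.linalg.norm(v)
        V = np.vstack([V, v])
        # Fully corrective step: fit B in x^T V^T B V x by least squares on the sample.
        Z = V @ X                                                # t x m projections
        Phi = np.einsum('im,km->ikm', Z, Z).reshape(t * t, m).T  # per-example features z_i z_k
        b, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        B = (b.reshape(t, t) + b.reshape(t, t).T) / 2            # symmetrize
        # Diagonalize B: w <- eigenvalues(B), V <- eigenvectors(B)^T V.
        w, Q = np.linalg.eigh(B)
        V = Q.T @ V
    return w, V
```

With this convention, predictions on the training matrix are w @ (V @ X) ** 2; the slide's update w = eigenvalues(B), V = V · eigenvectors(B) is the same step written for column-stacked V.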

Slide 34: Deeper Sum-Product Networks?
Learning depth-2 sigmoidal networks is hard even if we allow over-specification, while learning depth-2 sum-product networks is tractable if we allow slight over-specification. What about higher degrees?
Theorem (Livni, Shamir, S.): it is hard to learn polynomial networks of depth poly(d) even if their size is poly(d).
Proof idea: it is possible to approximate the sigmoid function with a polynomial of degree poly(d).
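A toy numerical illustration of the proof idea (added here; it fits a fixed low-degree polynomial to the sigmoid on a bounded interval, not the poly(d)-degree construction used in the actual reduction):

```python
import numpy as np

# Least-squares fit of a degree-9 Chebyshev polynomial to the sigmoid on [-5, 5].
z = np.linspace(-5, 5, 200)
sig = 1.0 / (1.0 + np.exp(-z))
coeffs = np.polynomial.chebyshev.chebfit(z, sig, deg=9)
approx = np.polynomial.chebyshev.chebval(z, coeffs)
print(np.max(np.abs(approx - sig)))   # maximum approximation error on the grid
```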

Slide 35: Deeper Sum-Product Networks?
Distributional assumptions: is it easier to learn under certain distributional assumptions?
Vanishing Component Analysis (Livni et al. 2013) efficiently finds the generating ideal of the data and can be used to efficiently construct deep features, but its theoretical guarantees are still not satisfactory.

Slide 36: Summary
- Sigmoidal deep networks are great statistically but cannot be trained efficiently.
- Sum-product networks of depth 2 can be trained efficiently using forward greedy selection.
- Very deep sum-product networks cannot be trained efficiently.
Open problems:
- Is it possible to train sum-product networks of depth 3? What about depth log(d)?
- Find a combination of network architecture and distributional assumptions that is useful in practice and leads to efficient algorithms.

Slide 37: Thanks! Collaborators
- Search for efficient algorithms for deep learning: Ohad Shamir.
- GECO: Alon Gonen and Ohad Shamir; based on a previous paper with Tong Zhang and Nati Srebro.
- Lower bounds: Amit Daniely and Nati Linial.
- VCA: Livni, Lehavi, Nachlieli, Schein, Globerson.

Slide 38: Shameless Advertisement
