Learning Neural Networks


Learning Neural Networks

Neural networks can represent complex decision boundaries:
- Variable size. Any Boolean function can be represented.
- Hidden units can be interpreted as new features.
- Deterministic.
- Continuous parameters.

Learning algorithms for neural networks:
- Local search. The same algorithm as for sigmoid threshold units.
- Eager.
- Batch or online.

Neural Network Hypothesis Space

(Network diagram: inputs x_1, x_2, x_3, x_4 feed hidden units a_6, a_7, a_8 through weight vectors W_6, W_7, W_8; the hidden units feed the output ŷ through W_9.)

Each unit a_6, a_7, a_8, and ŷ computes a sigmoid function of its inputs:

a_6 = σ(W_6 · X)
a_7 = σ(W_7 · X)
a_8 = σ(W_8 · X)
ŷ = σ(W_9 · A)

where A = [1, a_6, a_7, a_8] is called the vector of hidden unit activations.

Original motivation: a differentiable approximation to multi-layer LTUs.
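A minimal NumPy sketch of this forward pass, keeping the slide's unit numbering (a_6 through a_8, W_6 through W_9); the array shapes, the example input values, and the random weights are assumptions chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W6, W7, W8, W9):
    """Forward pass for the network sketched above. X = [1, x1, x2, x3, x4]."""
    a6 = sigmoid(W6 @ X)
    a7 = sigmoid(W7 @ X)
    a8 = sigmoid(W8 @ X)
    A = np.array([1.0, a6, a7, a8])   # hidden activations with a leading bias term
    y_hat = sigmoid(W9 @ A)
    return y_hat, A

# Example with random weights (values are arbitrary)
rng = np.random.default_rng(0)
X = np.array([1.0, 0.2, -0.5, 0.3, 0.8])
W6, W7, W8 = rng.normal(size=(3, 5))   # each hidden unit sees bias + 4 inputs
W9 = rng.normal(size=4)                # output unit sees bias + 3 hidden units
y_hat, A = forward(X, W6, W7, W8, W9)
print(y_hat)
```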

Representational Power

Any Boolean formula. Consider a formula in disjunctive normal form:

(x_1 ∧ x_2) ∨ (x_2 ∧ x_4) ∨ (¬x_3 ∧ x_5)

Each AND can be represented by a hidden unit and the OR can be represented by the output unit. Representing arbitrary Boolean functions can require exponentially many hidden units, however.

Bounded functions. Suppose we make the output linear: ŷ = W_9 · A. It can be proved that any bounded continuous function can be approximated to arbitrary accuracy with enough hidden units.

Arbitrary functions. Any function can be approximated to arbitrary accuracy with two hidden layers of sigmoid units and a linear output unit.
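The following sketch illustrates the DNF construction with hand-picked weights rather than learned ones: a hidden sigmoid unit with large weights approximates one AND term, and the output unit approximates the OR. The particular weight values (10 and the bias terms) are assumptions chosen only to push the sigmoids toward 0 or 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def and_unit(x1, x2):
    # Fires (output near 1) only when both inputs are 1
    return sigmoid(-15 + 10 * x1 + 10 * x2)

def or_unit(h):
    # Fires when any hidden unit is on
    return sigmoid(-5 + 10 * np.sum(h))

# Two of the AND terms, (x1 AND x2) and (x2 AND x4), combined by the output OR
for x1, x2, x4 in [(1, 1, 0), (0, 1, 1), (0, 0, 0)]:
    h = np.array([and_unit(x1, x2), and_unit(x2, x4)])
    print((x1, x2, x4), round(float(or_unit(h)), 3))
```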

Fixed versus Variable Size

In principle, a network has a fixed number of parameters and therefore can only represent a fixed hypothesis space (if the number of hidden units is fixed). However, we will initialize the weights to values near zero and use gradient descent. The more steps of gradient descent we take, the more functions can be reached from the starting weights. So it turns out to be more accurate to treat networks as having a variable hypothesis space that depends on the number of steps of gradient descent.

Backpropagation: Gradient Descent for Multi-Layer Networks

It is traditional to train neural networks to minimize the squared error. This is really a mistake; they should be trained to maximize the log likelihood instead. But we will study the squared error first.

ŷ = σ(W_9 · [1, σ(W_6 · X), σ(W_7 · X), σ(W_8 · X)])

J_i(W) = ½ (ŷ_i − y_i)²

We must apply the chain rule many times to compute the gradient. We will number the units from 0 to U and index them by u and v; w_{v,u} will be the weight connecting unit u to unit v. (Note: this seems backwards. It is the u-th input to node v.)
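A short sketch of the per-example objective J_i(W) under these definitions; the function name and weight shapes are assumptions, and X is taken to carry a leading 1 for the bias as above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def J_i(X, y, W6, W7, W8, W9):
    """Squared-error objective 1/2 (y_hat_i - y_i)^2 for a single example."""
    A = np.array([1.0, sigmoid(W6 @ X), sigmoid(W7 @ X), sigmoid(W8 @ X)])
    y_hat = sigmoid(W9 @ A)
    return 0.5 * (y_hat - y) ** 2
```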

Derivation: Output Unit

Suppose w_{9,6} is the component of W_9, the output weight vector, connecting a_6 to the output.

∂J_i(W)/∂w_{9,6} = ∂/∂w_{9,6} [½ (ŷ_i − y_i)²]
                = ½ · 2 (ŷ_i − y_i) · ∂/∂w_{9,6} (σ(W_9 · A_i) − y_i)
                = (ŷ_i − y_i) · σ(W_9 · A_i)(1 − σ(W_9 · A_i)) · ∂/∂w_{9,6} (W_9 · A_i)
                = (ŷ_i − y_i) · ŷ_i (1 − ŷ_i) · a_6

The Delta Rule

Define δ_9 = (ŷ_i − y_i) ŷ_i (1 − ŷ_i). Then

∂J_i(W)/∂w_{9,6} = (ŷ_i − y_i) ŷ_i (1 − ŷ_i) a_6 = δ_9 a_6

Derivation: Hidden Units

∂J_i(W)/∂w_{6,2} = (ŷ_i − y_i) · σ(W_9 · A_i)(1 − σ(W_9 · A_i)) · ∂/∂w_{6,2} (W_9 · A_i)
                = δ_9 · w_{9,6} · ∂/∂w_{6,2} σ(W_6 · X)
                = δ_9 · w_{9,6} · σ(W_6 · X)(1 − σ(W_6 · X)) · ∂/∂w_{6,2} (W_6 · X)
                = δ_9 · w_{9,6} · a_6 (1 − a_6) · x_2

Define δ_6 = δ_9 w_{9,6} a_6 (1 − a_6) and rewrite as ∂J_i(W)/∂w_{6,2} = δ_6 x_2.
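A quick finite-difference check of the two delta-rule gradients just derived, for a single output and one example. The specific input values, random weights, and indexing (a_6 stored at A[1], x_2 at X[2]) are assumptions of this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = np.array([1.0, 0.4, -0.2, 0.7, 0.1])   # [1, x1, x2, x3, x4]
y = 1.0
W_hidden = rng.normal(size=(3, 5))         # rows play the role of W6, W7, W8
W9 = rng.normal(size=4)

def loss(W_hidden, W9):
    A = np.concatenate(([1.0], sigmoid(W_hidden @ X)))   # A = [1, a6, a7, a8]
    y_hat = sigmoid(W9 @ A)
    return 0.5 * (y_hat - y) ** 2, A, y_hat

J, A, y_hat = loss(W_hidden, W9)
delta9 = (y_hat - y) * y_hat * (1 - y_hat)
delta6 = delta9 * W9[1] * A[1] * (1 - A[1])              # w_{9,6} = W9[1], a6 = A[1]

eps = 1e-6
W9_p = W9.copy()
W9_p[1] += eps                                           # perturb w_{9,6}
num_96 = (loss(W_hidden, W9_p)[0] - J) / eps
Wh_p = W_hidden.copy()
Wh_p[0, 2] += eps                                        # perturb w_{6,2}
num_62 = (loss(Wh_p, W9)[0] - J) / eps

print(delta9 * A[1], num_96)   # dJ/dw_{9,6}: analytic vs numeric
print(delta6 * X[2], num_62)   # dJ/dw_{6,2}: analytic vs numeric
```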

Networks with Multiple Output Units

(Network diagram: inputs x_1, x_2, x_3, x_4 feed hidden units a_6, a_7, a_8 through W_6, W_7, W_8; the hidden units feed two output units ŷ_1 and ŷ_2.)

We get a separate contribution to the gradient from each output unit. Hence, for the input-to-hidden weights, we must sum up the contributions:

δ_6 = a_6 (1 − a_6) Σ_{u=9}^{10} w_{u,6} δ_u

The Backpropagation Algorithm

Forward Pass. Compute a_u and ŷ_v for the hidden units u and output units v.
Compute Errors. Compute ε_v = (ŷ_v − y_v) for each output unit v.
Compute Output Deltas. Compute δ_v = ε_v ŷ_v (1 − ŷ_v) for each output unit v.
Compute Hidden Deltas. Compute δ_u = a_u (1 − a_u) Σ_v w_{v,u} δ_v for each hidden unit u.
Compute Gradient. Compute ∂J_i/∂w_{u,j} = δ_u x_{ij} for the input-to-hidden weights and ∂J_i/∂w_{v,u} = δ_v a_{iu} for the hidden-to-output weights.
Take Gradient Step. W := W − η ∇_W J(x_i).
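A sketch of one backpropagation update for a single example, following the steps above with sigmoid hidden and output units and squared error. The matrix names (W for input-to-hidden, V for hidden-to-output), the shapes, and the learning rate are assumptions, not the slide's notation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W, V, eta=0.1):
    # Forward pass
    a = sigmoid(W @ x)                 # hidden activations a_u
    A = np.concatenate(([1.0], a))     # prepend the bias term
    y_hat = sigmoid(V @ A)             # output activations y_hat_v

    # Errors and output deltas
    eps_v = y_hat - y
    delta_v = eps_v * y_hat * (1 - y_hat)

    # Hidden deltas: sum the contributions from every output unit
    delta_u = a * (1 - a) * (V[:, 1:].T @ delta_v)

    # Gradients and gradient step
    V -= eta * np.outer(delta_v, A)    # hidden-to-output: delta_v * a_u
    W -= eta * np.outer(delta_u, x)    # input-to-hidden:  delta_u * x_j
    return W, V

# Example: 4 inputs (plus bias), 3 hidden units, 2 output units
rng = np.random.default_rng(0)
x = np.array([1.0, 0.5, -0.3, 0.2, 0.9])
y = np.array([1.0, 0.0])
W = rng.normal(scale=0.1, size=(3, 5))
V = rng.normal(scale=0.1, size=(2, 4))
W, V = backprop_step(x, y, W, V)
```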

Proper Initialization

Start in the linear regions. Keep all weights near zero, so that all sigmoid units are in their linear regions. This makes the whole net the equivalent of one linear threshold unit, a very simple function.

Break symmetry. Ensure that each hidden unit has different input weights so that the hidden units move in different directions. Set each weight to a random number in the range [−1, +1] scaled by 1/fan-in, where the fan-in of weight w_{v,u} is the number of inputs to unit v.
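A small sketch of this heuristic: uniform random weights in [−1, +1], scaled by the reciprocal of the receiving unit's fan-in, so the sigmoids start near their linear regions while symmetry is still broken. The function name and layer sizes are assumptions.

```python
import numpy as np

def init_weights(n_units, fan_in, rng):
    # Random weights in [-1, +1], scaled by 1/fan-in of the receiving unit
    return rng.uniform(-1.0, 1.0, size=(n_units, fan_in)) / fan_in

rng = np.random.default_rng(0)
W_hidden = init_weights(3, 5, rng)   # 3 hidden units, fan-in 5 (bias + 4 inputs)
W_output = init_weights(1, 4, rng)   # 1 output unit, fan-in 4 (bias + 3 hidden units)
```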

Batch, Online, and Online with Momentum

Batch. Sum the ∇_W J(x_i) over each example i, then take a gradient descent step.
Online. Take a gradient descent step with each ∇_W J(x_i) as it is computed.
Momentum. Maintain an exponentially weighted moving sum of recent gradients:

ΔW^(t+1) := µ ΔW^(t) + ∇_W J(x_i)
W^(t+1) := W^(t) − η ΔW^(t+1)

Typical values of µ are in the range [0.7, 0.95].
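A minimal sketch of the momentum update as written above; grad_fn is an assumed callable returning ∇_W J(x_i), the default µ lies in the quoted [0.7, 0.95] range, and the default η is an arbitrary choice.

```python
def momentum_update(W, dW, grad_fn, x_i, mu=0.9, eta=0.1):
    """One online step with momentum: dW tracks the weighted sum of recent gradients."""
    dW = mu * dW + grad_fn(W, x_i)   # Delta W^(t+1)
    W = W - eta * dW                 # W^(t+1)
    return W, dW
```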

Softmax Output Layer

(Network diagram: inputs x_1, x_2, x_3, x_4 feed hidden units a_6, a_7, a_8 through W_6, W_7, W_8; the hidden units feed output activations a_9 and a_10 through W_9 and W_10, which a softmax converts into ŷ_1 and ŷ_2.)

Let a_9 and a_10 be the output activations: a_9 = W_9 · A and a_10 = W_10 · A. Then define

ŷ_1 = exp(a_9) / (exp(a_9) + exp(a_10))
ŷ_2 = exp(a_10) / (exp(a_9) + exp(a_10))

The objective function is the negative log likelihood:

J(W) = − Σ_i Σ_k I[y_i = k] log ŷ_k

where I[expr] is 1 if expr is true and 0 otherwise.
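A sketch of the softmax outputs and the negative log likelihood above; the function names, the max-shift for numerical stability, and the array shapes are assumptions.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))          # shift for numerical stability
    return e / np.sum(e)

def nll(A, Y, W_out):
    """A: hidden activation vectors, one row per example (including the bias 1);
    Y: integer class labels; W_out: output weight matrix (one row per class)."""
    total = 0.0
    for A_i, y_i in zip(A, Y):
        y_hat = softmax(W_out @ A_i)
        total -= np.log(y_hat[y_i])    # -sum_k I[y_i = k] log y_hat_k
    return total
```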

Neural Network Evaluation

(Comparison table: Perceptron, Logistic Regression, LDA, Trees, and Neural Nets rated against the criteria mixed data, missing values, outliers, monotone transformations, scalability, irrelevant inputs, linear combinations, interpretability, and accuracy.)