Inf2b Learning and Data
Lecture 11: Single layer Neural Networks (1)
(Credit: Hiroshi Shimodaira, Iain Murray and Steve Renals)
Centre for Speech Technology Research (CSTR)
School of Informatics, University of Edinburgh
http://www.inf.ed.ac.uk/teaching/courses/inf2b/
https://piazza.com/ed.ac.uk/spring2019/infr08009inf2blearning
Office hours: Wednesdays at 4:00-5:00 in IF-3.04
Jan-Mar 2019

Today's Schedule
1. Discriminant functions (recap)
2. Decision boundary of linear discriminants (recap)
3. Discriminative training of linear discriminants (Perceptron)
4. Structures and decision boundaries of the Perceptron
5. LSE training of linear discriminants
6. Appendix: calculus, gradient descent, linear regression

Discriminant functions (recap)

$y_k(x) = \ln\big( p(x|C_k)\, P(C_k) \big)$
$\quad = -\frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k) - \frac{1}{2}\ln|\Sigma_k| + \ln P(C_k)$
$\quad = -\frac{1}{2} x^T \Sigma_k^{-1} x + \mu_k^T \Sigma_k^{-1} x - \frac{1}{2}\mu_k^T \Sigma_k^{-1} \mu_k - \frac{1}{2}\ln|\Sigma_k| + \ln P(C_k)$

[Figures (a), (b), (c): examples of the resulting decision boundaries]

Linear discriminants for a 2-class problem

$y_1(x) = w_1^T x + w_{10}$
$y_2(x) = w_2^T x + w_{20}$

Combined discriminant function:
$y(x) = y_1(x) - y_2(x) = (w_1 - w_2)^T x + (w_{10} - w_{20}) = w^T x + w_0$

Decision: C = 1 if $y(x) \ge 0$; C = 2 if $y(x) < 0$
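
The decision rule above is easy to state in code. A minimal sketch (the function names and the example weights are ours, not from the slides):

```python
import numpy as np

# Two-class linear discriminant: y(x) = w^T x + w0.
def linear_discriminant(x, w, w0):
    return w @ x + w0

# Decide class 1 if y(x) >= 0, class 2 otherwise.
def classify(x, w, w0):
    return 1 if linear_discriminant(x, w, w0) >= 0 else 2

# Example: w = (1, -1)^T, w0 = 0 puts the boundary on the line x1 = x2.
w = np.array([1.0, -1.0])
print(classify(np.array([2.0, 0.5]), w, 0.0))  # -> 1
print(classify(np.array([0.0, 1.0]), w, 0.0))  # -> 2
```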

Decision boundary of linear discriminants

Decision boundary: $y(x) = w^T x + w_0 = 0$

Dimension | Decision boundary
2 | line: $w_1 x_1 + w_2 x_2 + w_0 = 0$
3 | plane: $w_1 x_1 + w_2 x_2 + w_3 x_3 + w_0 = 0$
D | hyperplane: $\left( \sum_{i=1}^{D} w_i x_i \right) + w_0 = 0$

NB: $w$ is a normal vector to the hyperplane

Decision boundary of linear discriminant (2D)

$y(x) = w_1 x_1 + w_2 x_2 + w_0 = 0$
$\left( x_2 = -\frac{w_1}{w_2}\, x_1 - \frac{w_0}{w_2}, \text{ when } w_2 \ne 0 \right)$

[Figure: the boundary line in the $(x_1, x_2)$ plane separating classes $C_1$ and $C_2$, with normal vector $w = (w_1, w_2)^T$, slope $-w_1/w_2$ and intercept $-w_0/w_2$]

Decision boundary of linear discriminant (3D)

$y(x) = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_0 = 0$

[Figure: a plane in $(x_1, x_2, x_3)$ space with normal vector $w = (w_1, w_2, w_3)^T$]

Approaches to linear discriminant functions

Generative models: $p(x|C_k)$
- Discriminant function based on Bayes decision rule: $y_k(x) = \ln p(x|C_k) + \ln P(C_k)$
- Gaussian pdf (model): $y_k(x) = -\frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k) - \frac{1}{2}\ln|\Sigma_k| + \ln P(C_k)$
- Equal covariance assumption: $y_k(x) = w^T x + w_0$

Why not estimate the decision boundary or $P(C_k|x)$ directly?
→ Discriminative training / models (logistic regression, Perceptron / neural networks, SVM)

Training linear discriminant functions directly

A discriminant for a two-class problem:
$y(x) = y_1(x) - y_2(x) = (w_1 - w_2)^T x + (w_{10} - w_{20}) = w^T x + w_0$

[Figure: two classes of training points in the $(x_1, x_2)$ plane with a separating boundary and its normal vector $(w_1, w_2)^T$]

Perceptron error correction algorithm

$a(\dot{x}) = w^T x + w_0 = \dot{w}^T \dot{x}$, where $\dot{w} = (w_0, w^T)^T$, $\dot{x} = (1, x^T)^T$
Let's just use $w$ and $x$ to denote $\dot{w}$ and $\dot{x}$ from now on!

$y(x) = g(a(x)) = g(w^T x)$, where $g(a) = 1$ if $a \ge 0$, and $0$ if $a < 0$
$g(a)$: activation / transfer function

Training set: $D = \{(x_1, t_1), \dots, (x_N, t_N)\}$, where $t_i \in \{0, 1\}$: target value

Modify $w$ if $x_i$ was misclassified:
$w^{(new)} \leftarrow w + \eta\,(t_i - y(x_i))\,x_i$, where $0 < \eta < 1$: learning rate
NB: $(w^{(new)})^T x_i = w^T x_i + \eta\,(t_i - y(x_i))\,\|x_i\|^2$
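
As a sketch in code (our own minimal version of the update, with illustrative numbers):

```python
import numpy as np

# Step activation g(a).
def step(a):
    return 1 if a >= 0 else 0

# One error-correction step: w <- w + eta * (t - y(x)) * x.
# x is augmented with a leading 1 so w[0] acts as the bias w0.
def perceptron_update(w, x, t, eta=0.1):
    y = step(w @ x)
    return w + eta * (t - y) * x

w = np.zeros(3)
x = np.array([1.0, 2.0, -1.0])      # (1, x1, x2)
w = perceptron_update(w, x, t=0)    # y(x) = 1 but t = 0: w moves away from x
print(w)                            # -> [-0.1, -0.2, 0.1]
```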

Geometry of Perceptron's error correction

$y(x_i) = g(w^T x_i)$
$w^{(new)} \leftarrow w + \eta\,(t_i - y(x_i))\,x_i$,  $(0 < \eta < 1)$

$y(x_i)$ | $t_i$ | $t_i - y(x_i)$
1 | 0 | $-1$  ($w$ moves away from $x_i$)
0 | 1 | $+1$  ($w$ moves towards $x_i$)

Since $w^T x_i = \|w\|\,\|x_i\| \cos\theta$, where $\theta$ is the angle between $w$ and $x_i$, adding or subtracting $\eta\,x_i$ rotates $w$ so that the sign of $w^T x_i$ moves towards the target.

[Figures: $w$, $x_i$ and the decision boundary before and after an update; for a misclassified example the new vector is $w^{(new)} = w \pm \eta\,x_i$]

The Perceptron learning algorithm

Incremental (online) Perceptron algorithm:
for i = 1, ..., N
    $w \leftarrow w + \eta\,(t_i - y(x_i))\,x_i$

Batch Perceptron algorithm:
$v_{sum} = 0$
for i = 1, ..., N
    $v_{sum} \leftarrow v_{sum} + (t_i - y(x_i))\,x_i$
$w \leftarrow w + \eta\,v_{sum}$

What about convergence? The Perceptron learning algorithm terminates if the training samples are linearly separable.
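
A runnable sketch of the incremental algorithm (the epoch cap is our addition; without it the loop would never terminate on linearly non-separable data):

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, max_epochs=100):
    X = np.hstack([np.ones((len(X), 1)), X])   # augment inputs with a leading 1
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for xi, ti in zip(X, t):
            yi = 1 if w @ xi >= 0 else 0
            if yi != ti:                       # update on misclassification only
                w += eta * (ti - yi) * xi
                errors += 1
        if errors == 0:                        # converged: everything classified
            break
    return w

# OR is linearly separable, so training converges.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 1, 1, 1])
print(train_perceptron(X, t))                  # one separating (w0, w1, w2)
```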

Linearly separable vs linearly non-separable

[Figures: (a1), (a2) two linearly separable data sets; (b) a linearly non-separable data set]

Background of Perceptron

[Figures: a biological neuron (https://en.wikipedia.org/wiki/File:Neuron_Hand-tuned.svg) and (a) a function unit with inputs $x_1, x_2, x_3$, weights $w_1, w_2, w_3$ and output $y$]

1940s  Warren McCulloch and Walter Pitts: threshold logic
       Donald Hebb: Hebbian learning
1957   Frank Rosenblatt: Perceptron

Character recognition by Perceptron

[Figure: a W×H pixel image, with pixels indexed from (1,1) to (W,H) feeding the Perceptron inputs]

Perceptron structures and decision boundaries

$y(x) = g(a(x)) = g(w^T x)$
$w = (w_0, w_1, \dots, w_D)^T$, $x = (1, x_1, \dots, x_D)^T$
where $g(a) = 1$ if $a \ge 0$, and $0$ if $a < 0$

Example: $a(x) = -1 + x_1 + x_2 = w_0 + w_1 x_1 + w_2 x_2$, i.e. $w_0 = -1$, $w_1 = 1$, $w_2 = 1$

[Figure: the node with inputs $1, x_1, x_2$ and weights $w_0, w_1, w_2$, and its decision boundary $x_1 + x_2 = 1$ in the $(x_1, x_2)$ plane]

NB: A single node/neuron constructs a decision boundary which splits the input space into two regions

Perceptron as a logical function

NOT:
x1 | y
0  | 1
1  | 0

OR:
x1 x2 | y
0  0  | 0
0  1  | 1
1  0  | 1
1  1  | 1

NAND:
x1 x2 | y
0  0  | 1
0  1  | 1
1  0  | 1
1  1  | 0

XOR:
x1 x2 | y
0  0  | 0
0  1  | 1
1  0  | 1
1  1  | 0

[Figures: the corresponding points in the $(x_1, x_2)$ plane]

Question: find the weights for each function
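
A small checker for this exercise; the weights shown for OR are one possible choice of ours, and no choice passes the check for XOR, since XOR is not linearly separable:

```python
import numpy as np

# Does a weight vector (w0, w1, w2) reproduce a truth table through g(a)?
def perceptron(w, x):
    return 1 if np.dot(w, np.r_[1.0, np.asarray(x, float)]) >= 0 else 0

def check(w, table):
    return all(perceptron(w, x) == y for x, y in table)

OR_TABLE = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

# One possible choice of weights for OR (many others work too):
print(check(np.array([-0.5, 1.0, 1.0]), OR_TABLE))  # -> True
```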

Perceptron structures and decision boundaries (cont.)

[Figures: further Perceptron structures and their decision boundaries in the $(x_1, x_2)$ plane]

Problems with the Perceptron learning algorithm

- No training algorithm for multi-layer Perceptrons
- Non-convergence for linearly non-separable data
- Weights $w$ are adjusted for misclassified data only (correctly classified data are not considered at all)

→ Consider not only misclassification (on the training data), but also the optimality of the decision boundary:
- Least squares error training
- Large margin classifiers (e.g. SVM)

Training with least squares

Squared error function:
$E(w) = \frac{1}{2} \sum_{n=1}^{N} \left( w^T x_n - t_n \right)^2$

Optimisation problem: $\min_w E(w)$

One way to solve this is to apply gradient descent (steepest descent):
$w \leftarrow w - \eta\,\nabla_w E(w)$
where $\eta$: step size (a small positive constant)
$\nabla_w E(w) = \left( \frac{\partial E}{\partial w_0}, \dots, \frac{\partial E}{\partial w_D} \right)^T$

Training with least squares (cont.)

$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i}\, \frac{1}{2} \sum_{n=1}^{N} \left( w^T x_n - t_n \right)^2$
$\quad = \sum_{n=1}^{N} \left( w^T x_n - t_n \right) \frac{\partial}{\partial w_i}\, w^T x_n$
$\quad = \sum_{n=1}^{N} \left( w^T x_n - t_n \right) x_{ni}$

- Trainable in the linearly non-separable case
- Not robust (sensitive) against erroneous data (outliers) far away from the boundary
- Still more or less a linear discriminant
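
A sketch of this training scheme in code (batch gradient descent with the gradient derived above; the step size and iteration count are our illustrative choices):

```python
import numpy as np

def train_lse(X, t, eta=0.01, n_iters=1000):
    X = np.hstack([np.ones((len(X), 1)), X])   # augment inputs with a leading 1
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ w - t)               # grad E = sum_n (w^T x_n - t_n) x_n
        w -= eta * grad                        # steepest-descent step
    return w

# Works even for linearly non-separable targets, unlike the Perceptron rule.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0.0, 1.0, 1.0, 0.0])             # XOR targets
print(train_lse(X, t))                         # a least-squares compromise
```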

Appendix: derivatives

Derivatives of functions of one variable:
$\frac{df}{dx} = f'(x) = \lim_{\epsilon \to 0} \frac{f(x+\epsilon) - f(x)}{\epsilon}$
e.g., $f(x) = 4x^3$, $f'(x) = 12x^2$

Partial derivatives of functions of more than one variable:
$\frac{\partial f}{\partial x} = \lim_{\epsilon \to 0} \frac{f(x+\epsilon, y) - f(x, y)}{\epsilon}$
e.g., $f(x, y) = y^3 x^2$, $\frac{\partial f}{\partial x} = 2 y^3 x$

Derivative rules

Function | Derivative
Constant: $c$ | $0$
Power: $x^n$ | $n\,x^{n-1}$  (e.g., $x^2 \to 2x$)
Exponential: $e^x$ | $e^x$
Logarithm: $\ln(x)$ | $1/x$
Sum rule: $f(x) + g(x)$ | $f'(x) + g'(x)$
Product rule: $f(x)\,g(x)$ | $f'(x)\,g(x) + f(x)\,g'(x)$
Reciprocal rule: $\frac{1}{f(x)}$ | $-\frac{f'(x)}{f^2(x)}$
Quotient rule: $\frac{f(x)}{g(x)}$ | $\frac{f'(x)\,g(x) - f(x)\,g'(x)}{g^2(x)}$
Chain rule: $f(g(x))$ | $f'(g(x))\,g'(x)$; with $z = f(y)$, $y = g(x)$: $\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}$

Vectors of derivatives

Consider $f(x)$, where $x = (x_1, \dots, x_D)^T$
Notation: all partial derivatives put in a vector:
$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_D} \right)^T$

Example: $f(x) = x_1^3\, x_2^2$
$\nabla f(x) = \left( 3 x_1^2 x_2^2,\; 2 x_1^3 x_2 \right)^T$

Fact: $f(x)$ changes most quickly in the direction $\nabla f(x)$

Gradient descent (steepest descent)

First-order optimisation algorithm using $\nabla f(x)$
Optimisation problem: $\min_x f(x)$
Useful when analytic solutions (closed forms) are not available or are difficult to find

Algorithm:
1. Set an initial value $x_0$ and set $t = 0$
2. If $\nabla f(x_t) \approx 0$, then stop. Otherwise, do the following.
3. $x_{t+1} = x_t - \eta\,\nabla f(x_t)$ for $\eta > 0$
4. $t = t + 1$, and go to step 2.

Problem: it stops at a local minimum (so finding the global minimum is difficult).
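
The loop above, sketched in code (the quadratic test function and its gradient are our illustrative choices):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, tol=1e-6, max_iters=1000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:    # step 2: stop when the gradient ~ 0
            break
        x = x - eta * g                # step 3: move against the gradient
    return x

# f(x) = (x1 - 1)^2 + 2 x2^2 has its minimum at (1, 0).
grad_f = lambda x: np.array([2.0 * (x[0] - 1.0), 4.0 * x[1]])
print(gradient_descent(grad_f, x0=[5.0, -3.0]))   # -> approx [1. 0.]
```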

Linear regression (one variable): least squares line fitting

Training set: $D = \{(x_n, t_n)\}_{n=1}^{N}$
Linear regression: $\hat{t}_n = a x_n + b$
Objective function: $E = \sum_{n=1}^{N} \left( t_n - (a x_n + b) \right)^2$
Optimisation problem: $\min_{a,b} E$

Partial derivatives:
$\frac{\partial E}{\partial a} = -2 \sum_{n=1}^{N} \left( t_n - (a x_n + b) \right) x_n = 2a \sum_{n=1}^{N} x_n^2 + 2b \sum_{n=1}^{N} x_n - 2 \sum_{n=1}^{N} t_n x_n$
$\frac{\partial E}{\partial b} = -2 \sum_{n=1}^{N} \left( t_n - (a x_n + b) \right) = 2a \sum_{n=1}^{N} x_n + 2bN - 2 \sum_{n=1}^{N} t_n$

Linear regression (one variable) (cont.)

Letting $\frac{\partial E}{\partial a} = 0$ and $\frac{\partial E}{\partial b} = 0$:

$\begin{pmatrix} \sum_{n=1}^{N} x_n^2 & \sum_{n=1}^{N} x_n \\ \sum_{n=1}^{N} x_n & N \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} \sum_{n=1}^{N} t_n x_n \\ \sum_{n=1}^{N} t_n \end{pmatrix}$
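
Solving that 2×2 system directly gives the fitted line; a small sketch (the function name and the data are ours):

```python
import numpy as np

def fit_line(x, t):
    N = len(x)
    A = np.array([[np.sum(x**2), np.sum(x)],
                  [np.sum(x),    N       ]])
    rhs = np.array([np.sum(t * x), np.sum(t)])
    a, b = np.linalg.solve(A, rhs)     # the 2x2 normal equations above
    return a, b

x = np.array([0.0, 1.0, 2.0, 3.0])
t = 2.0 * x + 1.0                      # noise-free points on t = 2x + 1
print(fit_line(x, t))                  # -> approx (2.0, 1.0)
```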

Linear regression (multiple variables)

[Figure 3.1 of The Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman, 2009: linear least-squares fitting with $X \in \mathbb{R}^2$; we seek the linear function of $X$ that minimises the sum of squared residuals from $Y$]

Training set: $D = \{(x_n, t_n)\}_{n=1}^{N}$, where $x_n = (1, x_{n1}, \dots, x_{nD})^T$
Linear regression: $\hat{t}_n = w^T x_n$
Objective function: $E = \sum_{n=1}^{N} \left( t_n - w^T x_n \right)^2$
Optimisation problem: $\min_w E$

Linear regression (multiple variables) (cont.)

$E = \sum_{n=1}^{N} \left( t_n - w^T x_n \right)^2$
Partial derivatives: $\frac{\partial E}{\partial w_i} = -2 \sum_{n=1}^{N} \left( t_n - w^T x_n \right) x_{ni}$

Vector/matrix representation:
$T = \begin{pmatrix} t_1 \\ \vdots \\ t_N \end{pmatrix}, \quad X = \begin{pmatrix} x_1^T \\ \vdots \\ x_N^T \end{pmatrix} = \begin{pmatrix} x_{10} & \dots & x_{1d} \\ \vdots & & \vdots \\ x_{N0} & \dots & x_{Nd} \end{pmatrix}$

$E = (T - Xw)^T (T - Xw)$
$\frac{\partial E}{\partial w} = -2 X^T (T - Xw)$

Letting $\frac{\partial E}{\partial w} = 0$:
$X^T (T - Xw) = 0 \;\Rightarrow\; X^T X\, w = X^T T \;\Rightarrow\; w = (X^T X)^{-1} X^T T$
← an analytic solution, if the inverse exists
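
The closed form above in code; a sketch using np.linalg.lstsq rather than an explicit inverse, which solves the same least-squares problem more stably:

```python
import numpy as np

def fit_linear(X, t):
    X = np.hstack([np.ones((len(X), 1)), X])   # prepend the bias input 1
    w, *_ = np.linalg.lstsq(X, t, rcond=None)  # minimises ||X w - t||^2
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
t = X @ np.array([2.0, -1.0, 0.5]) + 3.0       # true weights, bias 3
print(fit_linear(X, t))                        # -> approx [3, 2, -1, 0.5]
```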

Summary

- Training discriminant functions directly (discriminative training)
- Perceptron training algorithm
  - Perceptron error correction algorithm
  - Least squares error + gradient descent algorithm
- Linearly separable vs linearly non-separable
- Perceptron structures and decision boundaries
- See Notes for a Perceptron with multiple output nodes