Perceptron. Inner-product scalar Perceptron. XOR problem. Gradient descent. Stochastic approximation to gradient descent. 5/10/10


Perceptron. Inner-product scalar. Perceptron. Perceptron learning rule. XOR problem: linearly separable patterns. Gradient descent. Stochastic approximation to gradient descent: LMS, Adaline. 1

Inner product: net = ⟨w, x⟩ = ‖w‖ ‖x‖ cos(θ) = Σ_{i=1}^{n} w_i x_i. A measure of the projection of one vector onto another. Example. 2
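
As a quick numeric check of the two expressions for net above, a minimal NumPy sketch; the vectors w and x are made-up illustrative values:

```python
import numpy as np

w = np.array([2.0, 1.0])   # weight vector (illustrative)
x = np.array([1.0, 3.0])   # input vector (illustrative)

# net as the componentwise sum  sum_i w_i * x_i
net_sum = np.dot(w, x)

# net as ||w|| ||x|| cos(theta)
cos_theta = np.dot(w, x) / (np.linalg.norm(w) * np.linalg.norm(x))
net_geom = np.linalg.norm(w) * np.linalg.norm(x) * cos_theta

print(net_sum, net_geom)   # both are approximately 5.0
```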

Activation function: o = f(net) = f(Σ_{i=1}^{n} w_i x_i). f(x) := sgn(x) = 1 if x ≥ 0, −1 if x < 0. f(x) := φ(x) = 1 if x ≥ 0, 0 if x < 0. f(x) := φ(x) = 1 if x ≥ 0.5, x if −0.5 < x < 0.5, 0 if x ≤ −0.5. Sigmoid function: f(x) := σ(x) = 1 / (1 + e^(−ax)). 3
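
A small Python sketch of the four activation functions listed above; the middle branch of the piecewise-linear unit and the slope parameter a follow the slide as transcribed, so treat the exact ramp shape as an assumption:

```python
import numpy as np

def sgn(x):
    """Threshold activation: +1 for x >= 0, -1 otherwise."""
    return 1.0 if x >= 0 else -1.0

def step(x):
    """Binary step: 1 for x >= 0, 0 otherwise."""
    return 1.0 if x >= 0 else 0.0

def piecewise_linear(x):
    """Ramp unit: saturates at 1 above 0.5 and at 0 below -0.5."""
    if x >= 0.5:
        return 1.0
    if x <= -0.5:
        return 0.0
    return x   # middle branch as written on the slide (assumption)

def sigmoid(x, a=1.0):
    """Logistic sigmoid with slope parameter a."""
    return 1.0 / (1.0 + np.exp(-a * x))
```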

Graphical representation. Similarity to real neurons... 4

The brain: on the order of 10^10 neurons, 10^4 to 10^5 connections per neuron. Perceptron: linear threshold unit (LTU). [Diagram: inputs x_1, x_2, ..., x_n with weights w_1, w_2, ..., w_n, bias input x_0 = 1 with weight w_0, feeding a summation Σ and a threshold that produces the output o.] net = Σ_{i=0}^{n} w_i x_i; o = sgn(net) = 1 if net ≥ 0, −1 if net < 0. The bias: a constant term that does not depend on any input value. McCulloch-Pitts model of a neuron. 5

The goal of a perceptron is to correctly classify the set of patterns D = {x_1, x_2, ..., x_m} into one of the classes C_1 and C_2. The output for class C_1 is o = 1 and for C_2 it is o = −1. For n = 2: linearly separable patterns. o = sgn(Σ_{i=0}^{n} w_i x_i), with Σ_{i=0}^{n} w_i x_i > 0 for C_1 and Σ_{i=0}^{n} w_i x_i ≤ 0 for C_2. x_0 = 1 is the bias. 6
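
A minimal sketch of this decision rule in Python, with the bias folded in as x_0 = 1; the weight and input values are illustrative, not from the slides:

```python
import numpy as np

def perceptron_output(w, x):
    """o = sgn(sum_i w_i x_i), with x already including the bias input x_0 = 1."""
    net = np.dot(w, x)
    return 1 if net > 0 else -1   # class C1 if the weighted sum is positive, else C2

w = np.array([-1.0, 2.0, 1.0])    # [w0 (bias weight), w1, w2], illustrative
x = np.array([1.0, 0.5, 1.5])     # [x0 = 1, x1, x2]

print(perceptron_output(w, x))    # -> 1, i.e. class C1
```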

Perceptron learning rule. Consider linearly separable problems. How do we find appropriate weights? Check whether the output pattern o belongs to the desired class, i.e. has the desired value d. w_new = w_old + Δw, with Δw = η(d − o)x, where η is called the learning rate, 0 < η ≤ 1. In supervised learning the network has its output compared with known correct answers: supervised learning is learning with a teacher, and (d − o) plays the role of the error signal. 7
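
A minimal sketch of the perceptron learning rule on a toy linearly separable problem (logical AND); the data set, learning rate, and epoch count are illustrative assumptions:

```python
import numpy as np

def train_perceptron(X, d, eta=0.1, epochs=20):
    """Perceptron learning rule: w <- w + eta * (d - o) * x.
    X: inputs with a leading bias column of 1s; d: desired outputs in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, d):
            o = 1 if np.dot(w, x) >= 0 else -1    # current output
            w += eta * (target - o) * x           # no change when o == target
    return w

# Toy data: logical AND of x1, x2 in {0, 1}, with bias input x0 = 1 prepended
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
d = np.array([-1, -1, -1, 1])

print(train_perceptron(X, d))   # a separating weight vector for AND
```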

Perceptron: the algorithm converges to the correct classification if the training data is linearly separable and η is sufficiently small. When assigning a value to η we must keep in mind two conflicting requirements: averaging of past inputs to provide stable weight estimates, which requires a small η; and fast adaptation with respect to real changes in the underlying distribution of the process responsible for the generation of the input vector x, which requires a large η. Several nodes: o_1 = sgn(Σ_{i=0}^{n} w_{1i} x_i), o_2 = sgn(Σ_{i=0}^{n} w_{2i} x_i). 8

o_j = sgn(Σ_{i=0}^{n} w_{ji} x_i). Constructions. 9
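
With several nodes the weights w_{ji} form a matrix with one row per output node, so all outputs can be computed in one step; a small sketch with illustrative values:

```python
import numpy as np

def layer_outputs(W, x):
    """o_j = sgn(sum_i W[j, i] * x_i) for every output node j at once."""
    return np.where(W @ x >= 0, 1, -1)

W = np.array([[-0.5, 1.0, 0.0],    # weights of node 1: w_10, w_11, w_12 (illustrative)
              [ 0.5, -1.0, 1.0]])  # weights of node 2: w_20, w_21, w_22
x = np.array([1.0, 0.8, 0.3])      # input with bias x_0 = 1

print(layer_outputs(W, x))         # -> [1 1]
```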

Frank Rosenblatt, 1928-1971. 10

Rosenblatt's bitter rival and professional nemesis was Marvin Minsky of MIT. Minsky despised Rosenblatt, hated the concept of the perceptron, and wrote several polemics against him. For years Minsky crusaded against Rosenblatt on a very nasty and personal level, including contacting every group who funded Rosenblatt's research to denounce him as a charlatan, hoping to ruin Rosenblatt professionally and to cut off all funding for his research in neural nets. The XOR problem and the perceptron: Minsky and Papert, mid-1960s. 12

Gradient descent. To understand it, consider a simpler linear unit, where o = Σ_{i=0}^{n} w_i x_i. Let's learn the w_i that minimize the squared error over D = {(x_0, t_0), (x_1, t_1), ..., (x_n, t_n)} (t for target); no bias any more. 13

Error for different hypotheses, plotted over w_0 and w_1 (dim 2). We want to move the weight vector in the direction that decreases E: w_i = w_i + Δw_i, i.e. w = w + Δw. 14

The gradient: ∇E[w] = [∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_n], and the training rule is Δw = −η ∇E[w], i.e. Δw_i = −η ∂E/∂w_i. Differentiating E = ½ Σ_{d∈D} (t_d − o_d)² gives ∂E/∂w_i = −Σ_{d∈D} (t_d − o_d) x_{id}. 15
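
For reference, a worked version of that differentiation, assuming the standard squared-error definition for the linear unit introduced above:

```latex
\begin{aligned}
E(\mathbf{w}) &= \tfrac{1}{2}\sum_{d\in D}\bigl(t_d - o_d\bigr)^2,
  \qquad o_d = \sum_{i=0}^{n} w_i x_{id} \\
\frac{\partial E}{\partial w_i}
  &= \tfrac{1}{2}\sum_{d\in D} 2\,(t_d - o_d)\,
     \frac{\partial}{\partial w_i}\bigl(t_d - o_d\bigr)
   = \sum_{d\in D} (t_d - o_d)\,(-x_{id}) \\
\Delta w_i &= -\eta\,\frac{\partial E}{\partial w_i}
            = \eta \sum_{d\in D} (t_d - o_d)\,x_{id}
\end{aligned}
```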

Update rule for gradient descent: Δw_i = η Σ_{d∈D} (t_d − o_d) x_{id}.

Gradient-Descent(training_examples, η): each training example is a pair of the form ⟨(x_1, ..., x_n), t⟩, where (x_1, ..., x_n) is the vector of input values and t is the target output value; η is the learning rate (e.g. 0.1).
- Initialize each w_i to some small random value
- Until the termination condition is met, do:
  - Initialize each Δw_i to zero
  - For each ⟨(x_1, ..., x_n), t⟩ in training_examples, do:
    - Input the instance (x_1, ..., x_n) to the linear unit and compute the output o
    - For each linear unit weight w_i: Δw_i = Δw_i + η (t − o) x_i
  - For each linear unit weight w_i: w_i = w_i + Δw_i
16
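
A runnable sketch of the Gradient-Descent procedure above for a linear unit; the toy data, the learning rate, and the fixed epoch count used as the termination condition are illustrative choices, not part of the slides:

```python
import numpy as np

def gradient_descent(X, t, eta=0.1, epochs=100):
    """Batch gradient descent for a linear unit o = sum_i w_i x_i.
    X: examples as rows, with a bias column of 1s included; t: target outputs."""
    w = np.random.uniform(-0.05, 0.05, X.shape[1])   # small random initial weights
    for _ in range(epochs):                          # termination: fixed number of passes
        delta_w = np.zeros_like(w)
        for x, target in zip(X, t):
            o = np.dot(w, x)                         # linear-unit output
            delta_w += eta * (target - o) * x        # accumulate over all examples in D
        w += delta_w                                 # one batch update per pass
    return w

# Toy data: t = 1 + 2*x1, with a bias column of 1s (illustrative)
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
t = np.array([1.0, 3.0, 5.0, 7.0])
print(gradient_descent(X, t))                        # approaches [1, 2]
```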

Stochastic approximation to gradient descent: Δw_i = η (t − o) x_i. The gradient descent training rule updates by summing over all the training examples in D; stochastic gradient descent approximates it by updating the weights incrementally, calculating the error for each example. Known as the delta rule or LMS (least mean squares) weight update; the Adaline rule, used for adaptive filters; Widrow and Hoff (1960). 17
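
The same linear unit trained with the incremental delta/LMS rule, updating the weights after every single example rather than after a full pass over D; the data and learning rate are again illustrative:

```python
import numpy as np

def lms_train(X, t, eta=0.05, epochs=100):
    """Delta / LMS rule: w <- w + eta * (t - o) * x after each example."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, t):
            o = np.dot(w, x)              # linear-unit output for this example
            w += eta * (target - o) * x   # incremental (stochastic) weight update
    return w

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # bias column + x1
t = np.array([1.0, 3.0, 5.0, 7.0])                              # t = 1 + 2*x1
print(lms_train(X, t))                                           # close to [1, 2]
```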

LMS estimate of the weight vector: no steepest descent; no well-defined trajectory in weight space, instead a random trajectory (stochastic gradient descent); converges only asymptotically toward the minimum error; can approximate gradient descent arbitrarily closely if η is made small enough. Summary: the perceptron training rule is guaranteed to succeed if the training examples are linearly separable and the learning rate η is sufficiently small. The linear unit training rule uses gradient descent or LMS and is guaranteed to converge to the hypothesis with minimum squared error, given a sufficiently small learning rate η, even when the training data contains noise and even when the training data is not separable by H. 18

Inner-product scalar. Perceptron. Perceptron learning rule. XOR problem: linearly separable patterns. Gradient descent. Stochastic approximation to gradient descent: LMS, Adaline. XOR? Multi-layer networks: output layer, hidden layer, input layer. 19