Learning From Data, Lecture 2: The Perceptron
The Learning Setup. A Simple Learning Algorithm: PLA. Other Views of Learning. Is Learning Feasible? A Puzzle.
M. Magdon-Ismail, CSCI 4100/6100
Recap: The Plan
1. What is learning?
2. Can we do it?
3. How to do it?
4. How to do it well?
5. General principles?
6. Advanced techniques.
7. Other learning paradigms.
(The plan runs from concepts, through theory, to practice.) Our language will be mathematics... our sword will be computer algorithms.
Recap: The Key Players
- Input $x \in \mathbb{R}^d = \mathcal{X}$ (salary, debt, years in residence, ...).
- Output $y \in \{-1,+1\} = \mathcal{Y}$ (approve credit or not).
- Target function $f : \mathcal{X} \to \mathcal{Y}$, the true relationship between $x$ and $y$. (The target $f$ is unknown.)
- Data set $\mathcal{D} = (x_1, y_1), \ldots, (x_N, y_N)$, data on customers, with $y_n = f(x_n)$.
$\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{D}$ are given by the learning problem; the target $f$ is fixed but unknown. We learn the function $f$ from the data $\mathcal{D}$.
Recap: Summary of the Learning Setup
The unknown target function $f : \mathcal{X} \to \mathcal{Y}$ (the ideal credit approval formula) generates the training examples $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$ with $y_n = f(x_n)$ (historical records of credit customers). A learning algorithm $\mathcal{A}$ selects from the hypothesis set $\mathcal{H}$ (the set of candidate formulas) a final hypothesis $g \approx f$ (the learned credit approval formula).
A Simple Learning Model
Input vector $x = [x_1, \ldots, x_d]^t$. Give importance weights to the different inputs and compute a credit score:
\[ \text{Credit Score} = \sum_{i=1}^{d} w_i x_i. \]
Approve credit if the credit score is acceptable:
- Approve credit if $\sum_{i=1}^{d} w_i x_i > \text{threshold}$ (credit score is good);
- Deny credit if $\sum_{i=1}^{d} w_i x_i < \text{threshold}$ (credit score is bad).
How to choose the importance weights $w_i$?
- Input $x_i$ is important $\Rightarrow$ large weight $w_i$.
- Input $x_i$ beneficial for credit $\Rightarrow$ positive weight $w_i > 0$.
- Input $x_i$ detrimental for credit $\Rightarrow$ negative weight $w_i < 0$.
A Simple Learning Model
Approve credit if $\sum_{i=1}^{d} w_i x_i > \text{threshold}$; deny credit if $\sum_{i=1}^{d} w_i x_i < \text{threshold}$. This can be written formally as
\[ h(x) = \mathrm{sign}\!\left( \left( \sum_{i=1}^{d} w_i x_i \right) + w_0 \right). \]
The bias weight $w_0$ corresponds to the threshold. (How?)
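To spell out the (How?): set $w_0 = -\text{threshold}$, and the threshold comparison becomes a sign test,
\[ \sum_{i=1}^{d} w_i x_i > \text{threshold} \iff \sum_{i=1}^{d} w_i x_i + w_0 > 0, \]
so $h(x) = +1$ (approve) exactly when the credit score exceeds the threshold.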
The Perceptron Hypothesis Set
We have defined a hypothesis set $\mathcal{H}$:
\[ \mathcal{H} = \{\, h(x) = \mathrm{sign}(w^t x) \,\} \quad \text{(uncountably infinite } \mathcal{H}\text{)}, \]
where
\[ w = [w_0, w_1, \ldots, w_d]^t \in \mathbb{R}^{d+1}, \qquad x = [1, x_1, \ldots, x_d]^t \in \{1\} \times \mathbb{R}^d. \]
This hypothesis set is called the perceptron or linear separator.
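As an illustration, a minimal sketch of one such hypothesis in Python (the function and variable names here are my own, not from the lecture): each hypothesis is determined by a weight vector $w \in \mathbb{R}^{d+1}$, applied to an input augmented with a leading 1.

    import numpy as np

    def perceptron_h(w, x):
        # h(x) = sign(w^t x), with the raw input augmented by x_0 = 1
        # so that w_0 plays the role of the bias/threshold weight.
        x_aug = np.concatenate(([1.0], x))
        score = np.dot(w, x_aug)
        return 1 if score > 0 else -1  # boundary case mapped to -1 by convention

For example, perceptron_h(np.array([-0.5, 0.2, 0.8]), np.array([1.0, 0.3])) computes the score $-0.5 + 0.2 \cdot 1.0 + 0.8 \cdot 0.3 = -0.06$ and returns $-1$ (deny).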
Geometry of the Perceptron
$h(x) = \mathrm{sign}(w^t x)$ defines a linear separator (Problem 1.2 in LFD). [Figure: a data set of $+1$ and $-1$ points with several candidate separating lines.] Which one should we pick?
Use the Data to Pick a Line
A perceptron fits the data by using a line to separate the $+1$ from the $-1$ data. Fitting the data: how do we find a hyperplane that separates the data? ("It's obvious, just look at the data and draw the line" is not a valid solution.)
How to Learn a Final Hypothesis g from H
We want to select $g \in \mathcal{H}$ so that $g \approx f$. We certainly want $g \approx f$ on the data set $\mathcal{D}$; ideally, $g(x_n) = y_n$. How do we find such a $g$ in the infinite hypothesis set $\mathcal{H}$, if it exists? Idea: start with some weight vector and try to improve it.
The Perceptron Learning Algorithm (PLA)
A simple iterative method:
1: $w(1) = 0$
2: for iteration $t = 1, 2, 3, \ldots$
3:   the weight vector is $w(t)$.
4:   From $(x_1, y_1), \ldots, (x_N, y_N)$, pick any misclassified example.
5:   Call the misclassified example $(x_*, y_*)$: $\mathrm{sign}(w(t)^t x_*) \neq y_*$.
6:   Update the weight: $w(t+1) = w(t) + y_* x_*$.
7:   $t \leftarrow t+1$.
[Figures: the update rotates $w(t)$ toward $x_*$ when $y_* = +1$ and away from $x_*$ when $y_* = -1$.]
PLA implements our idea: start at some weights and try to improve. ("Incremental learning" on a single example at a time.)
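A compact sketch of PLA in Python, a straightforward transcription of the pseudocode above (the data layout and names are my own; the lecture does not prescribe them):

    import numpy as np

    def pla(X, y, max_iters=10_000):
        # X: (N, d+1) array of inputs, each row already augmented with x_0 = 1
        # y: (N,) array of labels in {-1, +1}
        w = np.zeros(X.shape[1])                    # 1: start with w(1) = 0
        for _ in range(max_iters):                  # 2: iterate t = 1, 2, 3, ...
            preds = np.sign(X @ w)
            misclassified = np.flatnonzero(preds != y)  # 4: find misclassified examples
            if misclassified.size == 0:             # no mistakes: a separator was found
                return w
            i = misclassified[0]                    # 5: pick any misclassified (x*, y*)
            w = w + y[i] * X[i]                     # 6: update w(t+1) = w(t) + y* x*
        return w                                    # may not separate non-separable data

Note that np.sign(0) = 0 never equals a $\pm 1$ label, so points exactly on the boundary count as misclassified, which matches the intent of the update.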
Does PLA Work?
Theorem. If the data can be fit by a linear separator, then after some finite number of steps, PLA will find one.
After how long? What if the data cannot be fit by a perceptron?
[Figures: snapshots of PLA on a linearly separable data set at iterations 1 through 6, ending with a separating line; a final frame shows a data set that no perceptron can fit.]
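A quick experiment illustrating the theorem, reusing the pla sketch above. The data are generated from a hidden random linear target, so they are separable by construction (this synthetic setup is my own, not from the lecture):

    rng = np.random.default_rng(0)
    N, d = 100, 2
    X = np.column_stack([np.ones(N), rng.uniform(-1, 1, size=(N, d))])  # augmented inputs
    w_target = rng.normal(size=d + 1)          # a hidden linear target f
    y = np.sign(X @ w_target)
    y[y == 0] = 1                              # break boundary ties, keep labels in {-1, +1}

    w_learned = pla(X, y)
    print(np.all(np.sign(X @ w_learned) == y))  # True: PLA found a separator in finite steps

On non-separable data the same loop would simply run until max_iters without ever reaching zero mistakes.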
We Can Fit the Data
We can find an h that fits the data, out of infinitely many candidates (for the perceptron). So computationally, things seem good. Ultimately, remember that we want to predict. We don't care about the data; we care about what happens outside the data. Can a limited data set reveal enough information to pin down an entire target function, so that we can predict outside the data?
Other Views of Learning
- Design: learning is from data; design is from specs and a model.
- Statistics, function approximation.
- Data mining: find patterns in massive data (typically unsupervised).
Three learning paradigms:
- Supervised: the data is $(x_n, f(x_n))$; you are told the answer.
- Reinforcement: you get feedback on potential answers you try: $x \to$ try something $\to$ get feedback.
- Unsupervised: only given $x_n$; learn to organize the data.
Supervised Learning: Classifying Coins
[Figures: coins plotted by size (horizontal axis) and mass (vertical axis); the left panel shows the data labeled by denomination (1, 5, 10, 25), the right panel shows the learned classification regions.]
Unsupervised Learning: Categorizing Coins
[Figures: the same size-versus-mass measurements without labels; the left panel shows the raw unlabeled points, the right panel shows them organized into clusters (type 1, type 2, type 3, type 4).]
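For contrast with the supervised case, a minimal unsupervised sketch: cluster unlabeled (size, mass) points into four types with k-means. The lecture does not prescribe a clustering algorithm; k-means, scikit-learn as a dependency, and the synthetic data below are all my own assumptions for illustration.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(1)
    # Hypothetical unlabeled (size, mass) measurements from four coin types
    centers = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [2.5, 3.0]])
    coins = np.vstack([c + 0.05 * rng.normal(size=(25, 2)) for c in centers])

    # No labels y are given; k-means organizes the points into 4 clusters ("types")
    types = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(coins)

The output assigns each coin a type index, mirroring the right panel of the slide: structure is recovered from the inputs alone.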
Outside the Data Set: A Puzzle
[Figures: sample images of dogs ($f = -1$) and trees ($f = +1$); then a new image to classify: tree or dog? ($f = ?$)]