Learning From Data Lecture 2 The Perceptron

Learning From Data Lecture 2: The Perceptron
The Learning Setup
A Simple Learning Algorithm: PLA
Other Views of Learning
Is Learning Feasible: A Puzzle
M. Magdon-Ismail, CSCI 4100/6100

Recap: The Plan
1. What is learning?
2. Can we do it?
3. How to do it?
4. How to do it well?
5. General principles?
   (concepts, theory, practice)
6. Advanced techniques.
7. Other learning paradigms.
Our language will be mathematics... our sword will be computer algorithms.

Recap: The Key Players
- Salary, debt, years in residence, ...: input $x \in \mathbb{R}^d = \mathcal{X}$.
- Approve credit or not: output $y \in \{-1,+1\} = \mathcal{Y}$.
- True relationship between $x$ and $y$: target function $f : \mathcal{X} \to \mathcal{Y}$. (The target $f$ is unknown.)
- Data on customers: data set $\mathcal{D} = (x_1,y_1),\dots,(x_N,y_N)$, with $y_n = f(x_n)$.
$\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{D}$ are given by the learning problem; the target $f$ is fixed but unknown. We learn the function $f$ from the data $\mathcal{D}$.

Recap: Summary of the Learning Setup
- Unknown target function $f : \mathcal{X} \to \mathcal{Y}$ (the ideal credit approval formula), with $y_n = f(x_n)$.
- Training examples $(x_1,y_1), (x_2,y_2), \dots, (x_N,y_N)$ (historical records of credit customers).
- Learning algorithm $\mathcal{A}$, which searches the hypothesis set $\mathcal{H}$ (the set of candidate formulas).
- Final hypothesis $g \approx f$ (the learned credit approval formula).

A Simple Learning Model
Input vector $x = [x_1, \dots, x_d]^t$. Give importance weights to the different inputs and compute a Credit Score:
$$\text{Credit Score} = \sum_{i=1}^{d} w_i x_i.$$
Approve credit if the Credit Score is acceptable:
- Approve credit if $\sum_{i=1}^{d} w_i x_i > \text{threshold}$ (Credit Score is good).
- Deny credit if $\sum_{i=1}^{d} w_i x_i < \text{threshold}$ (Credit Score is bad).
How to choose the importance weights $w_i$?
- Input $x_i$ is important $\Rightarrow$ large weight $w_i$.
- Input $x_i$ is beneficial for credit $\Rightarrow$ positive weight $w_i > 0$.
- Input $x_i$ is detrimental for credit $\Rightarrow$ negative weight $w_i < 0$.

A Simple Learning Model
Approve credit if $\sum_{i=1}^{d} w_i x_i > \text{threshold}$; deny credit if $\sum_{i=1}^{d} w_i x_i < \text{threshold}$.
This can be written formally as
$$h(x) = \text{sign}\!\left(\left(\sum_{i=1}^{d} w_i x_i\right) + w_0\right).$$
The bias weight $w_0$ corresponds to the threshold. (How?)

The Perceptron Hypothesis Set
We have defined a hypothesis set $\mathcal{H}$:
$$\mathcal{H} = \{\, h(x) = \text{sign}(w^t x) \,\} \qquad \text{(an uncountably infinite } \mathcal{H}\text{)},$$
where
$$w = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{bmatrix} \in \mathbb{R}^{d+1}, \qquad x = \begin{bmatrix} 1 \\ x_1 \\ \vdots \\ x_d \end{bmatrix} \in \{1\} \times \mathbb{R}^{d}.$$
This hypothesis set is called the perceptron or linear separator.
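To make the hypothesis concrete, here is a minimal Python/NumPy sketch (my own illustration, not from the slides) of $h(x) = \text{sign}(w^t x)$ with the input augmented by a leading 1; the weights, threshold, and feature values below are hypothetical.

```python
import numpy as np

def perceptron_hypothesis(w, x):
    """h(x) = sign(w^t x), with x augmented to [1, x_1, ..., x_d]."""
    x_aug = np.concatenate(([1.0], x))        # prepend the constant 1
    return 1 if np.dot(w, x_aug) > 0 else -1  # sign of the weighted sum

# The bias weight w_0 plays the role of the threshold:
#   sum_i w_i x_i > threshold  <=>  w_0 + sum_i w_i x_i > 0   with   w_0 = -threshold.
threshold = 2.0                          # hypothetical credit-score threshold
w = np.array([-threshold, 0.5, 1.5])     # [w_0, w_1, w_2], hypothetical weights
x = np.array([3.0, 1.0])                 # toy feature values (e.g. salary, debt)
print(perceptron_hypothesis(w, x))       # +1: approve credit, -1: deny
```

The comment in the middle also answers the slide's "(How?)": folding the negative threshold into $w_0$ turns the threshold comparison into a sign.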

Geometry of The Perceptron
$h(x) = \text{sign}(w^t x)$ (Problem 1.2 in LFD). (Figure: candidate separating lines for the same data set.) Which one should we pick?

Use the Data to Pick a Line
A perceptron fits the data by using a line to separate the $+1$ data from the $-1$ data.
Fitting the data: how do we find a hyperplane that separates the data?
("It's obvious - just look at the data and draw the line" is not a valid solution.)

How to Learn a Final Hypothesis g from H
We want to select $g \in \mathcal{H}$ so that $g \approx f$. We certainly want $g \approx f$ on the data set $\mathcal{D}$; ideally, $g(x_n) = y_n$.
How do we find such a $g$ in the infinite hypothesis set $\mathcal{H}$, if it exists?
Idea! Start with some weight vector and try to improve it.

The Perceptron Learning Algorithm (PLA)
A simple iterative method.
1: $w(1) = 0$
2: for iteration $t = 1, 2, 3, \dots$
3:     the weight vector is $w(t)$.
4:     From $(x_1,y_1),\dots,(x_N,y_N)$ pick any misclassified example.
5:     Call the misclassified example $(x_*, y_*)$: $\text{sign}(w(t) \cdot x_*) \ne y_*$.
6:     Update the weight: $w(t+1) = w(t) + y_* x_*$.
7:     $t \leftarrow t + 1$
(The slide's figures show the update for $y_* = +1$ and $y_* = -1$: in each case $w(t+1)$ moves toward $y_* x_*$, rotating the separator toward correctly classifying $(x_*, y_*)$.)
PLA implements our idea: start at some weights and try to improve.
"Incremental learning" on a single example at a time.
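Below is a minimal Python/NumPy sketch of PLA as described on the slide (my own, not from the lecture), assuming inputs already augmented with a leading 1; "pick any misclassified example" is implemented as "pick the first one found", which is one arbitrary but valid choice.

```python
import numpy as np

def pla(X, y, max_iters=1000):
    """Perceptron Learning Algorithm.

    X: N x (d+1) array whose first column is all ones (augmented inputs).
    y: length-N array of +/-1 labels.
    Returns a weight vector; on separable data it classifies every point correctly.
    """
    w = np.zeros(X.shape[1])                      # step 1: w(1) = 0
    for _ in range(max_iters):                    # step 2: iterate t = 1, 2, 3, ...
        preds = np.where(X @ w > 0, 1, -1)        # current hypothesis on all examples
        misclassified = np.where(preds != y)[0]   # step 4: find misclassified examples
        if len(misclassified) == 0:
            return w                              # no mistakes left: done
        i = misclassified[0]                      # step 5: pick one misclassified (x*, y*)
        w = w + y[i] * X[i]                       # step 6: w(t+1) = w(t) + y* x*
    return w                                      # stopped after max_iters (e.g. non-separable data)
```

Each update nudges $w$ toward $y_* x_*$, which is exactly the rotation illustrated in the slide's figures.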

Does PLA Work?
Theorem. If the data can be fit by a linear separator, then after some finite number of steps, PLA will find one.
After how long?
What if the data cannot be fit by a perceptron?
(Figures in the slides: PLA stepping through iterations 1-6 on a linearly separable data set.)
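As a hedged illustration of the theorem (toy data made up for this example, not from the slides), the loop below runs the same update on a small linearly separable set and stops once every point is classified correctly; it condenses the `pla` sketch above into a few lines.

```python
import numpy as np

# Toy, linearly separable data: two points of each class (illustrative only).
X_raw = np.array([[0.0, 0.0], [0.2, 0.3], [1.0, 1.0], [0.9, 0.8]])
y = np.array([-1, -1, +1, +1])
X = np.hstack([np.ones((len(X_raw), 1)), X_raw])    # augment with the constant 1

w = np.zeros(X.shape[1])                            # w(1) = 0
for _ in range(1000):
    preds = np.where(X @ w > 0, 1, -1)
    bad = np.where(preds != y)[0]                   # misclassified examples
    if len(bad) == 0:
        break                                       # a separator has been found
    w = w + y[bad[0]] * X[bad[0]]                   # w(t+1) = w(t) + y* x*

print(w, np.all(np.where(X @ w > 0, 1, -1) == y))   # final weights, and True
```

On this particular toy set the loop stops after only a few updates; the theorem guarantees it always stops for separable data, though it does not yet say how long that takes.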

We Can Fit the Data
We can find an $h$ that works, from among infinitely many (for the perceptron). (So computationally, things seem good.)
Ultimately, remember that we want to predict. We don't care about the data; we care about outside the data.
Can a limited data set reveal enough information to pin down an entire target function, so that we can predict outside the data?

Other Views of Learning
- Design: learning is from data; design is from specs and a model.
- Statistics, Function Approximation.
- Data Mining: find patterns in massive data (typically unsupervised).
Three Learning Paradigms
- Supervised: the data is $(x_n, f(x_n))$; you are told the answer.
- Reinforcement: you get feedback on potential answers you try: given $x$, try something, get feedback.
- Unsupervised: only given $x_n$; learn to organize the data.

Supervised Learning - Classifying Coins
(Figures: coins of denominations 1, 5, 10 and 25 plotted by Size and Mass; the labeled data and the resulting classification regions.)

Unsupervised Learning - Categorizing Coins
(Figures: the same coin data plotted by Size and Mass without labels; the coins group into clusters marked type 1 through type 4.)

Outside the Data Set - A Puzzle
Dogs ($f = -1$). Trees ($f = +1$). Tree or Dog? ($f = ?$)