CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18


Formal Setup
Unknown Target Function $f: \mathcal{X} \to \mathcal{Y}$
Training Data $D = (x_1, y_1), \dots, (x_N, y_N)$
Learning Algorithm $\mathcal{A}$
Hypothesis Set $\mathcal{H}$
Learned Hypothesis $g \approx f$

Formal Setup
Unknown Target Function $f: \mathcal{X} \to \mathcal{Y}$
Probability Distribution $P$ on $\mathcal{X}$
Training Data $D = (x_1, y_1), \dots, (x_N, y_N)$
Learning Algorithm $\mathcal{A}$
Hypothesis Set $\mathcal{H}$
Learned Hypothesis $g \approx f$

Formal Setup
Unknown Target Distribution $P(y \mid x)$ (target function $f: \mathcal{X} \to \mathcal{Y}$ plus noise)
Probability Distribution $P$ on $\mathcal{X}$
Training Data $D = (x_1, y_1), \dots, (x_N, y_N)$
Error Measure
Learning Algorithm $\mathcal{A}$
Hypothesis Set $\mathcal{H}$
Learned Hypothesis $g \approx f$

Hoeffding's Inequality (Validation)
For a given hypothesis $h$:
$E_{in}(h)$ = in-sample error of $h$
$E_{out}(h)$ = out-of-sample error of $h$
$P\big[\,|E_{in}(h) - E_{out}(h)| > \epsilon\,\big] \le 2e^{-2\epsilon^2 N}$
As $N$ increases, the RHS decreases. As $\epsilon$ decreases, the RHS increases.

Hoeffding's Inequality (Corrected)
For a given finite hypothesis set $\mathcal{H} = \{h_1, \dots, h_M\}$:
$E_{in}(g)$ = in-sample error of the best hypothesis in $\mathcal{H}$
$E_{out}(g)$ = out-of-sample error of the best hypothesis in $\mathcal{H}$
$P\big[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,\big] \le 2Me^{-2\epsilon^2 N}$
As $N$ increases, the RHS decreases. As $\epsilon$ decreases, the RHS increases. As $M$ increases, the RHS increases.
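To see the effect of the extra factor of $M$, here is a worked instance of the corrected bound with illustrative numbers that are not from the slides ($N = 1000$, $\epsilon = 0.05$, $M = 100$):

```latex
\[
  P\big[\,|E_{in}(g) - E_{out}(g)| > 0.05\,\big]
  \;\le\; 2M e^{-2\epsilon^2 N}
  = 2(100)\,e^{-2(0.05)^2(1000)}
  = 200\,e^{-5}
  \approx 1.35 .
\]
```

With $M = 1$ the same bound would be $2e^{-5} \approx 0.013$; the union bound over $100$ hypotheses inflates it past $1$ and makes it vacuous, which motivates replacing $M$ with the growth function in the slides that follow.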

! " $ =! " +! $!{" $} The Union Bound is bad A B 7

Dichotomies
Given some finite sample of points $x_1, \dots, x_N$ from the input space and a single hypothesis $h \in \mathcal{H}$, applying $h$ to each point in $x_1, \dots, x_N$ results in a dichotomy.
A dichotomy $\big(h(x_1), \dots, h(x_N)\big)$ is a vector of $N$ $+1$'s and $-1$'s.
Given $x_1, \dots, x_N$, each hypothesis in $\mathcal{H}$ generates a dichotomy, but not necessarily a unique dichotomy!
The set of dichotomies induced by $\mathcal{H}$ on $x_1, \dots, x_N$ is $\mathcal{H}(x_1, \dots, x_N) = \big\{ \big(h(x_1), \dots, h(x_N)\big) : h \in \mathcal{H} \big\}$.

Growth Function
The growth function of $\mathcal{H}$ is the largest number of dichotomies $\mathcal{H}$ can induce across all data sets of size $N$:
$m_{\mathcal{H}}(N) = \max_{x_1, \dots, x_N} \big|\mathcal{H}(x_1, \dots, x_N)\big|$

Growth Function (Shattering)
Observe that $m_{\mathcal{H}}(N) \le 2^N$ for all $\mathcal{H}$ and $N$.
Given $\mathcal{H}$, if $\exists\, x_1, \dots, x_N$ s.t. $\big|\mathcal{H}(x_1, \dots, x_N)\big| = 2^N$, then $\mathcal{H}$ shatters $x_1, \dots, x_N$.
If $\exists\, x_1, \dots, x_N$ that is shattered by $\mathcal{H}$, then $m_{\mathcal{H}}(N) = 2^N$.

Growth Function (Break Points)
If $m_{\mathcal{H}}(k) < 2^k$, then $k$ is a break point for $\mathcal{H}$.
If there is at least one break point for $\mathcal{H}$, then $m_{\mathcal{H}}(N)$ is polynomial in $N$.

Growth Function: Example
$\mathcal{X} = \mathbb{R}$ and $\mathcal{H}$ = positive rays: $h(x) = \mathrm{sign}(x - a)$
Growth function: $m_{\mathcal{H}}(N) = N + 1$
[Figure: $N$ points $x_1, \dots, x_N$ on a line; the threshold $a$ can fall in any one of the $N + 1$ gaps between and around them.]

Growth Function: Example
$\mathcal{X} = \mathbb{R}$ and $\mathcal{H}$ = positive intervals
Growth function: $m_{\mathcal{H}}(N) = \binom{N+1}{2} + 1 = \frac{1}{2}N^2 + \frac{1}{2}N + 1$
[Figure: $N$ points on a line; the interval's two endpoints fall in two of the $N + 1$ gaps.]

Growth Function: Example
$\mathcal{X} = \mathbb{R}^2$ and $\mathcal{H}$ = convex sets
Growth function: $m_{\mathcal{H}}(N) = 2^N$
[Figure: $N$ points placed on a circle; any subset of them can be captured by a convex set, so every dichotomy is achievable.]

! " #$ % " '() % > + 4. H 21 2 34 5 67 $ Vapornik- Chervonenkis (VC)-Bound Or " '() % " #$ % + 5 $ log <= H >$? with probability at least 1 A 15

! "# H = the largest value of & s.t. ' H & = 2 ) The VC-dimension is the greatest number of points that can be shattered by H VC-Dimension If * is the smallest breakpoint for H, then! "# H = * 1 ' H & & / 01 + 1 3 456 7 3 8) 7 + 9! "# :;< ) ) 16

Sample Complexity
How many samples do we need in our training data to say that the generalization error is less than $\epsilon$ with probability at least $1 - \delta$?
Set $\epsilon \ge \sqrt{\frac{8}{N} \ln \frac{4\, m_{\mathcal{H}}(2N)}{\delta}}$
Conclude that we need $N \ge \frac{8}{\epsilon^2} \ln \frac{4\, m_{\mathcal{H}}(2N)}{\delta}$
As $\delta$ decreases, the RHS increases. As $\epsilon$ decreases, the RHS increases. As $d_{VC}$ decreases, the RHS decreases.
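Because $N$ appears on both sides (inside $m_{\mathcal{H}}(2N)$), the inequality is usually solved by iterating until $N$ stabilizes. A minimal sketch, using the polynomial bound $m_{\mathcal{H}}(2N) \le (2N)^{d_{VC}} + 1$ and illustrative values of $d_{VC}$, $\epsilon$, and $\delta$ that are my own, not from the lecture:

```python
import math

def sample_complexity(d_vc, epsilon, delta, n_init=1000.0, iters=50):
    """Iterate N >= (8 / eps^2) * ln(4 * m_H(2N) / delta) until it settles,
    bounding the growth function by m_H(2N) <= (2N)^d_vc + 1."""
    n = n_init
    for _ in range(iters):
        m_h = (2 * n) ** d_vc + 1
        n = (8 / epsilon ** 2) * math.log(4 * m_h / delta)
    return math.ceil(n)

# Hypothetical values for illustration only:
print(sample_complexity(d_vc=3, epsilon=0.05, delta=0.05))
```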

Penalty for Model Complexity
Given $N$ samples, how good can we say our learned hypothesis will do with confidence at least $1 - \delta$?
Conclude that $E_{out}(g) \le E_{in}(g) + \sqrt{\frac{8}{N} \ln \frac{4\, m_{\mathcal{H}}(2N)}{\delta}}$.

Approximation-Generalization Tradeoff
$E_{out}(g) \le E_{in}(g) + \sqrt{\frac{8}{N} \ln \frac{4\, m_{\mathcal{H}}(2N)}{\delta}}$
The penalty term answers: how well does $g$ generalize? The $E_{in}(g)$ term answers: how well does $g$ approximate $f$?

Approximation-Generalization Tradeoff
$E_{out}(g) \le E_{in}(g) + \sqrt{\frac{8}{N} \ln \frac{4\, m_{\mathcal{H}}(2N)}{\delta}}$
The penalty term increases as $d_{VC}$ increases; $E_{in}(g)$ decreases as $d_{VC}$ increases.

Bias-Variance Tradeoff
$\mathbb{E}_D\big[E_{out}\big(g^{(D)}\big)\big] = \mathbb{E}_x\Big[\mathbb{E}_D\big[\big(g^{(D)}(x) - \bar{g}(x)\big)^2\big] + \big(\bar{g}(x) - f(x)\big)^2\Big]$
The first (variance) term asks: how variable is $g$? The second (bias) term asks: how well, on average, does $\bar{g}$ approximate $f$?

Bias-Variance Tradeoff
$\mathbb{E}_D\big[E_{out}\big(g^{(D)}\big)\big] = \mathbb{E}_x\Big[\mathbb{E}_D\big[\big(g^{(D)}(x) - \bar{g}(x)\big)^2\big] + \big(\bar{g}(x) - f(x)\big)^2\Big]$
The variance term increases as $\mathcal{H}$ becomes more complex; the bias term decreases as $\mathcal{H}$ becomes more complex.
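The slides state the decomposition without showing why the cross term disappears; a short derivation for a single point $x$, using $\bar{g}(x) = \mathbb{E}_D\big[g^{(D)}(x)\big]$ and a noiseless target $f$, is:

```latex
\begin{aligned}
\mathbb{E}_D\!\left[\big(g^{(D)}(x) - f(x)\big)^2\right]
  &= \mathbb{E}_D\!\left[\big(g^{(D)}(x) - \bar{g}(x) + \bar{g}(x) - f(x)\big)^2\right] \\
  &= \mathbb{E}_D\!\left[\big(g^{(D)}(x) - \bar{g}(x)\big)^2\right]
     + \big(\bar{g}(x) - f(x)\big)^2
     + 2\big(\bar{g}(x) - f(x)\big)
       \underbrace{\mathbb{E}_D\!\left[g^{(D)}(x) - \bar{g}(x)\right]}_{=\,0} \\
  &= \underbrace{\mathbb{E}_D\!\left[\big(g^{(D)}(x) - \bar{g}(x)\big)^2\right]}_{\text{variance at } x}
     + \underbrace{\big(\bar{g}(x) - f(x)\big)^2}_{\text{bias at } x} .
\end{aligned}
```

Taking $\mathbb{E}_x$ of both sides recovers the expression on the slide.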

! "#$! "#$ Expected error! %& Expected error! %& Number of training points, ' Number of training points, ' Simple model Complex model 23

[Figure: two views of the learning curves, expected error vs. number of training points $N$. Left (VC analysis): the gap between $E_{out}$ and $E_{in}$ is the generalization error. Right (bias-variance analysis): expected error is split into bias and variance.]

Test Sets
Instead of bounding $E_{out}(g)$ using $E_{in}(g)$, estimate $E_{out}(g)$ using the error on some test dataset $D_{test}$: $E_{test}(g)$.
If $D_{test}$ is not involved in the training process, then we are validating $g$ using $D_{test}$.
Therefore, Hoeffding's bound applies with $M = |\mathcal{H}| = 1$: $P\big[\,|E_{test}(g) - E_{out}(g)| > \epsilon\,\big] \le 2e^{-2\epsilon^2 K}$, where $K = |D_{test}|$.
As $K$ increases, $2e^{-2\epsilon^2 K}$ decreases. But as $K$ increases, fewer points are left for training, so $E_{test}(g)$ increases.

3 Learning Problems
Classification: $\mathcal{Y} = \{-1, +1\}$
Predicting Probabilities: $\mathcal{Y} = [0, 1]$
Regression: $\mathcal{Y} = \mathbb{R}$

Linear Models
$h(x)$ = some function of $w^T x$, where $x = (1, x_1, x_2, \dots, x_d)^T$

3 Learning Solutions
Linear Classification: $h(x) = \mathrm{sign}(w^T x)$
Logistic Regression: $h(x) = \theta(w^T x)$
Linear Regression: $h(x) = w^T x$

Linear Classification: Perceptron
Given some input $x = (x_0 = 1, x_1, \dots, x_d)$:
$h(x) = \mathrm{sign}\left(\sum_{i=0}^{d} w_i x_i\right)$

Perceptron Learning Algorithm
PLA finds a linear separator in finite time, if the data is linearly separable.
Given: training data $D = (x_1, y_1), \dots, (x_N, y_N)$
Initialize $w$ to all zeros or (small) random numbers
While some misclassified training example exists, i.e. $\exists\, (x_i, y_i) \in D$ s.t. $h(x_i) = \mathrm{sign}(w^T x_i) \ne y_i$:
  Randomly pick a misclassified training example $(x, y)$
  Update $w$: $w = w + y\,x$

Perceptron Learning Algorithm
Suppose $(x, y)$ is a misclassified training example and $y = +1$, so $w^T x$ is negative.
After updating $w \leftarrow w + y\,x$: $(w + y\,x)^T x = w^T x + y\,x^T x = w^T x + x^T x$ is less negative than $w^T x$, because $y > 0$ and $x^T x > 0$.
A similar argument holds if $y = -1$.
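A minimal NumPy sketch of PLA as described above, assuming a constant coordinate $x_0 = 1$ has already been prepended to every example; the function name, the iteration cap, and the RNG are my own choices, not part of the lecture:

```python
import numpy as np

def pla(X, y, max_iters=10_000, seed=0):
    """Perceptron Learning Algorithm.
    X: (N, d+1) array with x_0 = 1 prepended; y: (N,) array of +/-1 labels."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):                    # cap in case data is not separable
        misclassified = np.flatnonzero(np.sign(X @ w) != y)
        if misclassified.size == 0:               # every point classified correctly
            break
        i = rng.choice(misclassified)             # randomly pick a misclassified point
        w = w + y[i] * X[i]                       # PLA update: w <- w + y x
    return w
```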

#! "# $ = 1 ' ( ")* $ +, ". " / Linear Regression: Squared Error # = 1 ' ( + /, " $." ")* = 1 ' 0$. / where 6 = ( 6 / " = 6 + 6 7 ")* = 1 ' 0$. 8 0$. 32

Minimizing Error
Find the gradient. Set it equal to zero. Solve. (Check that the solution is a minimum.)

! "# $ = 1 ' ($ +, ($ + = 1 ' ($ 2 +, ($ + Minimizing Error = 1 ' $2 ( 2 ($ 2$ 2 ( + + + 2 + /! "# $ = 1 ' 2(2 ($ 2( 2 + = 0 2( 2 ($ 2( 2 + = 0 ( 2 ($ = ( 2 + $ = ( 2 ( 56 ( 2 + 34

" # $% & = 1 ) 2+, +& 2+, / Checking 0 " # $% & = 1 ) 2+, + 0 " # $% & is (almost always) positive definite & = +, + 34 +, / is a unique global minimum 35

Linear Regression Algorithm
Input: $D = (x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)$
1. Construct $X$ and $y$
2. Compute the pseudo-inverse of $X$: $X^{\dagger} = (X^T X)^{-1} X^T$
3. Compute $w = X^{\dagger} y$
Output: $w$
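A minimal NumPy sketch of this algorithm; `np.linalg.pinv` computes the pseudo-inverse (and remains well defined even when $X^T X$ is singular), and the column of ones supplies the $x_0 = 1$ coordinate. The data and names below are illustrative, not from the slides:

```python
import numpy as np

def linear_regression(X_raw, y):
    """One-shot linear regression: w = X^dagger y.
    X_raw: (N, d) features without the constant coordinate; y: (N,) targets."""
    X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])  # prepend x_0 = 1
    return np.linalg.pinv(X) @ y                           # w = X^dagger y

# Made-up data roughly following y = 2x + 1:
X_raw = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.1, 4.9, 7.2])
print(linear_regression(X_raw, y))   # weights close to [1, 2]
```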

Linear Regression for Classification
Key observation: $\{-1, +1\} \subset \mathbb{R}$.
Use linear regression to find $w = (X^T X)^{-1} X^T y$; this $w$ minimizes $E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} \big(w^T x_n - y_n\big)^2$.
In general, $\mathrm{sign}(w^T x_n) \approx y_n$.
Use $w$ for linear classification: $g(x) = \mathrm{sign}(w^T x)$.

The Pocket Algorithm
Input: $D = (x_1, y_1), \dots, (x_N, y_N)$, $T$
1. Initialize $w$ to all zeros; set $\hat{w} = w$ and $E_{best} = E_{in}(w)$
2. For $t = 1, 2, \dots, T$:
  a. Randomly pick a misclassified training example $(x, y)$
  b. Update $w$: $w = w + y\,x$
  c. If $E_{in}(w) < E_{best}$:
    I. $E_{best} = E_{in}(w)$
    II. $\hat{w} = w$
Output: $\hat{w}$
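A sketch of the pocket algorithm that reuses the PLA update but keeps the best in-sample weights seen so far; as with the PLA sketch above, the names and defaults are mine:

```python
import numpy as np

def in_sample_error(w, X, y):
    """Fraction of misclassified points (0/1 error)."""
    return np.mean(np.sign(X @ w) != y)

def pocket(X, y, T=1000, seed=0):
    """Pocket algorithm: run PLA updates for T steps, return the best w seen.
    X: (N, d+1) with x_0 = 1 prepended; y: (N,) of +/-1."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    w_best, e_best = w.copy(), in_sample_error(w, X, y)
    for _ in range(T):
        misclassified = np.flatnonzero(np.sign(X @ w) != y)
        if misclassified.size == 0:
            return w                           # data separated, nothing left to fix
        i = rng.choice(misclassified)
        w = w + y[i] * X[i]                    # PLA update
        e = in_sample_error(w, X, y)
        if e < e_best:                         # pocket the improvement
            w_best, e_best = w.copy(), e
    return w_best
```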

Logistic Regression
Training data does not consist of probabilities; observations are still binary: $y_n = \pm 1$.
Goal is to learn $f(x) = P(y = +1 \mid x)$.
$h(x) = \theta(w^T x)$, where $\theta(s) = \frac{e^{s}}{1 + e^{s}} = \frac{1}{1 + e^{-s}} \in [0, 1]$.
Note that $1 - \theta(w^T x) = \theta(-w^T x)$.

Cross-entropy Error
Some hypothesis $h$ is good if the probability of the training data $D$ given $h$ is high.
$E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\big(1 + e^{-y_n w^T x_n}\big)$

Gradient Descent: Intuition
Iterative method for minimizing functions. Requires the gradient to exist everywhere. Particularly useful for minimizing convex functions, like the cross-entropy error.

Gradient Descent: Intuition
Suppose the current location is $w^{(t)}$. Move some distance, $\eta$, in the most downhill direction possible, $\hat{v}$:
$w^{(t+1)} = w^{(t)} + \eta\,\hat{v}$

Fix $\eta$ and choose $\hat{v}$ to minimize $\Delta E_{in}$ after making the update $w^{(t+1)} = w^{(t)} + \eta\,\hat{v}$:
$\Delta E_{in}(\hat{v}) = E_{in}\big(w^{(t)} + \eta\,\hat{v}\big) - E_{in}\big(w^{(t)}\big) \approx \eta\,\hat{v}^T \nabla E_{in}\big(w^{(t)}\big) \ge -\eta\,\big\|\nabla E_{in}\big(w^{(t)}\big)\big\|$
with equality when $\hat{v} = -\dfrac{\nabla E_{in}\big(w^{(t)}\big)}{\big\|\nabla E_{in}\big(w^{(t)}\big)\big\|}$

! " Small! Large! Variable! " Set! " =! $ & ' () * "! " decreases as + increases, because & ' () * " decreases as ' () * " approaches its minimum 44

Gradient Descent
Input: $D = (x_1, y_1), \dots, (x_N, y_N)$, $\eta$
1. Initialize $w_0$ to all zeros and set $t = 0$
2. While the termination condition is not satisfied:
  a. Compute $\nabla E_{in}(w_t)$
  b. Update $w$: $w_{t+1} = w_t - \eta\,\nabla E_{in}(w_t)$
  c. Increment $t$: $t = t + 1$
Output: $w_t$, giving $g(x) = P(y = +1 \mid x) = \frac{1}{1 + e^{-w_t^T x}}$

Stochastic Gradient Descent (SGD)
Input: $D = (x_1, y_1), \dots, (x_N, y_N)$, $\eta$
1. Initialize $w_0$ to all zeros and set $t = 0$
2. While the termination condition is not satisfied:
  a. Pick a random data point $(x, y)$ in $D$
  b. Compute $\nabla e(w_t, x, y) = \dfrac{-y\,x}{1 + e^{y\,w_t^T x}}$
  c. Update $w$: $w_{t+1} = w_t - \eta\,\nabla e(w_t, x, y) = w_t + \eta\,\dfrac{y\,x}{1 + e^{y\,w_t^T x}}$
  d. Increment $t$: $t = t + 1$
Output: $w_t$, giving $g(x) = P(y = +1 \mid x) = \frac{1}{1 + e^{-w_t^T x}}$
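A compact sketch of SGD for logistic regression as described above; the fixed learning rate, the step cap used in place of a termination condition, and the names are my own choices. The batch version on the previous slide would simply average this per-example gradient over all of $D$ at each step:

```python
import numpy as np

def sgd_logistic(X, y, eta=0.1, n_steps=10_000, seed=0):
    """SGD on E_in(w) = mean(ln(1 + exp(-y_n w^T x_n))).
    X: (N, d+1) with x_0 = 1 prepended; y: (N,) of +/-1."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):                 # simple cap instead of a termination test
        i = rng.integers(len(y))             # pick one random example
        grad = -y[i] * X[i] / (1.0 + np.exp(y[i] * (w @ X[i])))
        w = w - eta * grad                   # w <- w + eta * y x / (1 + e^{y w^T x})
    return w

def predict_proba(w, x):
    """g(x) = P(y = +1 | x) = 1 / (1 + exp(-w^T x))."""
    return 1.0 / (1.0 + np.exp(-(w @ x)))
```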

Logistic Regression for Classification
Use logistic regression to find $w$.
Use $w$ for classification: if $P(y = +1 \mid x) = \theta(w^T x) \ge \frac{1}{2}$, then classify $x$ as $+1$; otherwise, classify $x$ as $-1$.
Equivalently, $g(x) = \mathrm{sign}\left(\frac{1}{1 + e^{-w^T x}} - \frac{1}{2}\right)$.

Error: Classification
Fingerprint recognition: inputs are fingerprints; outputs: +1 means you, -1 means not you.
For personalized coupons:
                 f(x) = +1   f(x) = -1
  h(x) = +1          0            1
  h(x) = -1        100            0
For unlocking phones:
                 f(x) = +1   f(x) = -1
  h(x) = +1          0         1000
  h(x) = -1          1            0
(The costly mistake is rejecting the true user for coupons, but accepting an impostor for unlocking phones.)

Nonlinear Models
Decide on a transformation $\Phi: \mathcal{X} \to \mathcal{Z}$.
Convert $D = (x_1, y_1), \dots, (x_N, y_N)$ to $\tilde{D} = \big(\Phi(x_1) = z_1, y_1\big), \dots, \big(\Phi(x_N) = z_N, y_N\big)$.
Fit a linear model $\tilde{g}(z)$ using $\tilde{D}$.
Return the corresponding predictor in the original space: $g(x) = \tilde{g}\big(\Phi(x)\big)$.
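A sketch that combines this recipe with the earlier linear-regression-for-classification idea, using a quadratic transform $\Phi(x_1, x_2) = (1, x_1, x_2, x_1^2, x_2^2, x_1 x_2)$; the specific transform, the circular toy data, and the names are illustrative assumptions of mine:

```python
import numpy as np

def phi(X_raw):
    """Quadratic feature transform Phi: R^2 -> R^6 (illustrative choice)."""
    x1, x2 = X_raw[:, 0], X_raw[:, 1]
    return np.column_stack([np.ones(len(X_raw)), x1, x2, x1**2, x2**2, x1 * x2])

def fit_nonlinear_classifier(X_raw, y):
    """Fit a linear model in Z-space, then classify in the original space."""
    Z = phi(X_raw)                        # transform the data into Z-space
    w_tilde = np.linalg.pinv(Z) @ y       # linear regression on the transformed data
    return lambda X_new: np.sign(phi(X_new) @ w_tilde)   # g(x) = g~(Phi(x))

# Toy data: points inside the unit circle are labeled +1, others -1.
rng = np.random.default_rng(0)
X_raw = rng.uniform(-2, 2, size=(200, 2))
y = np.where((X_raw ** 2).sum(axis=1) < 1.0, 1.0, -1.0)
g = fit_nonlinear_classifier(X_raw, y)
print(np.mean(g(X_raw) == y))             # in-sample accuracy; should be high
```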

Tradeoffs
                    Low-Dimensional Transformations    High-Dimensional Transformations
  $d_{VC}$          Low                                 High
  Generalization    Good                                Bad