Machine Learning Foundations (機器學習基石)
Lecture 8: Noise and Error

Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering, National Taiwan University (國立台灣大學資訊工程系)

Roadmap

1 When Can Machines Learn?
2 Why Can Machines Learn?
   Lecture 7: The VC Dimension
   learning happens if finite d_VC, large N, and low E_in
   Lecture 8: Noise and Error
      Noise and Probabilistic Target
      Error Measure
      Algorithmic Error Measure
      Weighted Classification
3 How Can Machines Learn?
4 How Can Machines Learn Better?

Noise and Probabilistic Target / Recap: The Learning Flow

unknown target function f: X → Y, plus noise (ideal credit approval formula)
unknown distribution P on X, generating x_1, x_2, ..., x_N
training examples D: (x_1, y_1), ..., (x_N, y_N) (historical records in the bank)
learning algorithm A, searching the hypothesis set H (set of candidate formulas)
final hypothesis g ≈ f (learned formula to be used)

What if there is noise?

Noise and Probabilistic Target / Noise

Noise was briefly introduced before the pocket algorithm, with the credit-approval example:
   age: 23 years; gender: female; annual salary: NTD 1,000,000; years in residence: 1; years in job: 0.5; current debt: 200,000; credit? {no (-1), yes (+1)}

But there is more:
   noise in y: a good customer mislabeled as bad?
   noise in y: the same customers given different labels?
   noise in x: inaccurate customer information?

Does the VC bound still work under noise?

Noise and Probabilistic Target / Probabilistic Marbles

One key of the VC bound: the bin of marbles.
   deterministic marbles: marble x ~ P(x), deterministic color determined by whether h(x) = f(x)
   probabilistic (noisy) marbles: marble x ~ P(x), probabilistic color determined by whether h(x) = y, with y ~ P(y|x)

Same nature: we can still estimate P[orange] if the marbles are sampled i.i.d.

The VC bound holds for x i.i.d. ~ P(x) and y i.i.d. ~ P(y|x), that is, (x, y) i.i.d. ~ P(x, y).

Noise and Probabilistic Target / Target Distribution P(y|x)

P(y|x) characterizes the behavior of the mini-target on one x; it can be viewed as an ideal mini-target plus noise, e.g.
   P(+1|x) = 0.7, P(-1|x) = 0.3: ideal mini-target f(x) = +1, flipping noise level = 0.3

A deterministic target f is a special case of the target distribution:
   P(y|x) = 1 for y = f(x)
   P(y|x) = 0 for y ≠ f(x)

Goal of learning: predict the ideal mini-target (w.r.t. P(y|x)) on often-seen inputs (w.r.t. P(x)).
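
As an illustration, here is a minimal Python sketch (not from the lecture) of drawing noisy examples from such a target distribution: a deterministic ideal mini-target is flipped with probability 0.3, matching the P(+1|x) = 0.7 example above. The choice of f, the input distribution, and the sample size are assumptions made only for this demo.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # a deterministic "ideal mini-target": sign of the first coordinate
    return np.where(x[:, 0] >= 0, 1, -1)

def sample_noisy_data(N, flip_level=0.3):
    # x ~ P(x): here simply uniform on [-1, 1]^2
    x = rng.uniform(-1, 1, size=(N, 2))
    # y ~ P(y|x): equals f(x) with probability 0.7, flipped with probability 0.3
    flips = rng.random(N) < flip_level
    y = np.where(flips, -f(x), f(x))
    return x, y

x, y = sample_noisy_data(1000)
# even the ideal target f now "errs" on roughly the noise level of the points
print("error of f itself on noisy labels:", np.mean(f(x) != y))  # about 0.3
```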

Noise and Probabilistic Target / The New Learning Flow

unknown target distribution P(y|x), containing f(x) + noise (ideal credit approval formula)
unknown distribution P on X, jointly generating x_1, x_2, ..., x_N and y_1, y_2, ..., y_N
training examples D: (x_1, y_1), ..., (x_N, y_N) (historical records in the bank)
learning algorithm A, searching the hypothesis set H (set of candidate formulas)
final hypothesis g ≈ f (learned formula to be used)

The VC bound still works, and the pocket algorithm is now explained. :-)

Noise and Probabilistic Target / Fun Time

Let us revisit PLA/pocket. Which of the following claims is true?
1 In practice, we should try to check whether D is linearly separable before deciding to use PLA.
2 If we know that D is not linearly separable, then the target function f must not be a linear function.
3 If we know that D is linearly separable, then the target function f must be a linear function.
4 None of the above

Reference answer: 4
1 Checking whether D is linearly separable would already produce a separating w, so there would be no need to run PLA.
2 What about noise?
3 What about sampling luck? :-)

Error Measure / Error Measure

How well does the final hypothesis g approximate f? Previously, we considered the out-of-sample measure
   E_out(g) = E_{x∼P} [g(x) ≠ f(x)]

More generally, an error measure E(g, f) is naturally considered
   out-of-sample: averaged over unknown x
   pointwise: evaluated on one x
   classification: whether prediction ≠ target

The classification error is often also called the 0/1 error.

Error Measure / Pointwise Error Measure

We can often express E(g, f) as an average of err(g(x), f(x)), like
   E_out(g) = E_{x∼P} [g(x) ≠ f(x)], where the bracketed quantity is err(g(x), f(x))

err is called a pointwise error measure.
   in-sample:      E_in(g) = (1/N) Σ_{n=1}^{N} err(g(x_n), f(x_n))
   out-of-sample:  E_out(g) = E_{x∼P} err(g(x), f(x))

We will mainly consider pointwise err for simplicity.

Error Measure / Two Important Pointwise Error Measures

Write ỹ = g(x) for the prediction and y = f(x) for the target in err(g(x), f(x)).

0/1 error: err(ỹ, y) = [ỹ ≠ y]
   correct or incorrect? often used for classification

squared error: err(ỹ, y) = (ỹ - y)^2
   how far is ỹ from y? often used for regression

How does err guide learning?
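
For concreteness, here is a minimal Python sketch (not part of the slides) of the two pointwise error measures and the in-sample average E_in they induce; the toy predictions and targets are invented for illustration.

```python
import numpy as np

def err_01(y_tilde, y):
    # 0/1 error: 1 if the prediction differs from the target, else 0
    return (y_tilde != y).astype(float)

def err_sq(y_tilde, y):
    # squared error: how far the prediction is from the target
    return (y_tilde - y) ** 2

def E_in(g_values, y_values, err):
    # in-sample error: average pointwise err over the N training examples
    return np.mean(err(g_values, y_values))

# toy illustration
y = np.array([+1, -1, +1, +1])        # targets
g = np.array([+1, +1, +1, -1])        # predictions of some hypothesis g
print(E_in(g, y, err_01))             # 0.5
print(E_in(g, y, err_sq))             # (0 + 4 + 0 + 4) / 4 = 2.0
```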

Error Measure / Ideal Mini-Target

Interplay between noise and error: P(y|x) and err together define the ideal mini-target f(x).

Example: P(y = 1|x) = 0.2, P(y = 2|x) = 0.7, P(y = 3|x) = 0.1

0/1 error, err(ỹ, y) = [ỹ ≠ y]:
   ỹ = 1: avg. err 0.8
   ỹ = 2: avg. err 0.3 (best)
   ỹ = 3: avg. err 0.9
   ỹ = 1.9: avg. err 1.0 (really? :-))
   ideal mini-target: f(x) = argmax_{y ∈ Y} P(y|x)

squared error, err(ỹ, y) = (ỹ - y)^2:
   ỹ = 1: avg. err 1.1
   ỹ = 2: avg. err 0.3
   ỹ = 3: avg. err 1.5
   ỹ = 1.9: avg. err 0.29 (best)
   ideal mini-target: f(x) = Σ_{y ∈ Y} y · P(y|x)
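
A quick sanity check of the table above, written as a small Python sketch (not from the slides): it recomputes the average errors for the candidate predictions and confirms the two ideal mini-targets.

```python
import numpy as np

ys = np.array([1.0, 2.0, 3.0])          # possible target values
p  = np.array([0.2, 0.7, 0.1])          # P(y|x) for this particular x

def avg_err(y_tilde, err):
    # expected pointwise error of predicting y_tilde under P(y|x)
    return np.sum(p * err(y_tilde, ys))

err_01 = lambda y_tilde, y: (y_tilde != y).astype(float)
err_sq = lambda y_tilde, y: (y_tilde - y) ** 2

for y_tilde in [1.0, 2.0, 3.0, 1.9]:
    print(y_tilde, avg_err(y_tilde, err_01), round(avg_err(y_tilde, err_sq), 2))

# 0/1 error is minimized by the mode:      argmax_y P(y|x) = 2
print("argmax:", ys[np.argmax(p)])
# squared error is minimized by the mean:  sum_y y * P(y|x) = 1.9
print("weighted mean:", np.sum(ys * p))
```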

Error Measure / Learning Flow with Error Measure

unknown target distribution P(y|x), containing f(x) + noise (ideal credit approval formula)
unknown distribution P on X, jointly generating x_1, x_2, ..., x_N and y_1, y_2, ..., y_N
training examples D: (x_1, y_1), ..., (x_N, y_N) (historical records in the bank)
learning algorithm A, searching the hypothesis set H (set of candidate formulas)
error measure err
final hypothesis g ≈ f (learned formula to be used)

The extended VC theory/philosophy works for most H and err.

Error Measure / Fun Time

Consider the following P(y|x) and err(ỹ, y) = |ỹ - y|. Which of the following is the ideal mini-target f(x)?
   P(y = 1|x) = 0.10, P(y = 2|x) = 0.35, P(y = 3|x) = 0.15, P(y = 4|x) = 0.40

1 2 = weighted median from P(y|x)
2 2.5 = average within Y = {1, 2, 3, 4}
3 2.85 = weighted mean from P(y|x)
4 4 = argmax_y P(y|x)

Reference answer: 1
For the absolute error, the weighted median provably results in the minimum average err.

Algorithmic Error Measure / Choice of Error Measure

Fingerprint verification: f(x) = +1 means "you", f(x) = -1 means "intruder".

Two types of error: false accept and false reject.

             g = +1         g = -1
   f = +1    no error       false reject
   f = -1    false accept   no error

The 0/1 error penalizes both types equally.

Algorithmic Error Measure / Fingerprint Verification for a Supermarket

Fingerprint verification: f(x) = +1 means "you", f(x) = -1 means "intruder"; two types of error, false accept and false reject.

Supermarket: fingerprint used for a discount.

             g = +1   g = -1
   f = +1    0        10
   f = -1    1        0

   false reject: a very unhappy customer, lose future business
   false accept: give away a minor discount, and the intruder left a fingerprint :-)

Algorithmic Error Measure / Fingerprint Verification for the CIA

Fingerprint verification: f(x) = +1 means "you", f(x) = -1 means "intruder"; two types of error, false accept and false reject.

CIA: fingerprint used for entrance.

             g = +1   g = -1
   f = +1    0        1
   f = -1    1000     0

   false accept: very serious consequences!
   false reject: an unhappy employee, but so what? :-)

Algorithmic Error Measure / Take-Home Message for Now

err is application/user-dependent.

Algorithmic error measures êrr (the error actually optimized by A):
   true: just use err itself
   plausible:
      0/1: minimum flipping noise, but NP-hard to optimize, remember? :-)
      squared: minimum Gaussian noise
   friendly: easy for A to optimize, e.g. a closed-form solution or a convex objective function

More about êrr in the next lectures.

Algorithmic Error Measure / Learning Flow with Algorithmic Error Measure

unknown target distribution P(y|x), containing f(x) + noise (ideal credit approval formula)
unknown distribution P on X, jointly generating x_1, x_2, ..., x_N and y_1, y_2, ..., y_N
training examples D: (x_1, y_1), ..., (x_N, y_N) (historical records in the bank)
learning algorithm A, guided internally by êrr, searching the hypothesis set H (set of candidate formulas)
error measure err
final hypothesis g ≈ f (learned formula to be used)

err: the application goal; êrr: a key part of many algorithms A.

Algorithmic Error Measure / Fun Time

Consider the CIA cost matrix below. What is E_in(g) when using this err?

             g = +1   g = -1
   y = +1    0        1
   y = -1    1000     0

1 (1/N) Σ_{n=1}^{N} [y_n ≠ g(x_n)]
2 (1/N) ( Σ_{y_n = +1} [y_n ≠ g(x_n)] + 1000 Σ_{y_n = -1} [y_n ≠ g(x_n)] )
3 (1/N) ( 1000 Σ_{y_n = +1} [y_n ≠ g(x_n)] + Σ_{y_n = -1} [y_n ≠ g(x_n)] )
4 (1/N) ( 1000 Σ_{y_n = +1} [y_n ≠ g(x_n)] + 1000 Σ_{y_n = -1} [y_n ≠ g(x_n)] )

Reference answer: 2
When y_n = -1, a false positive made on such an (x_n, y_n) is penalized 1000 times more!

Weighted Classification / Weighted Classification

CIA cost (error, loss, ...) matrix:

             h = +1   h = -1
   y = +1    0        1
   y = -1    1000     0

out-of-sample: E_out(h) = E_{(x,y)∼P} { 1 if y = +1; 1000 if y = -1 } · [y ≠ h(x)]
in-sample:     E_in(h) = (1/N) Σ_{n=1}^{N} { 1 if y_n = +1; 1000 if y_n = -1 } · [y_n ≠ h(x_n)]

Weighted classification: a different weight for different (x, y).
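
A minimal Python sketch (not from the slides) of the weighted in-sample error above, using the CIA costs; the toy labels and predictions are made up so that one false reject and one false accept appear.

```python
import numpy as np

def weighted_E_in(y, predictions, weight_pos=1.0, weight_neg=1000.0):
    # E_in^w(h): each mistake is weighted by the cost of its true class
    weights = np.where(y == +1, weight_pos, weight_neg)
    mistakes = (y != predictions).astype(float)
    return np.mean(weights * mistakes)

# toy data: one false reject (cost 1), one false accept (cost 1000), two correct
y    = np.array([+1, +1, -1, -1])
pred = np.array([+1, -1, +1, -1])
print(weighted_E_in(y, pred))   # (0 + 1 + 1000 + 0) / 4 = 250.25
```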

Weighted Classification / Minimizing E_in for Weighted Classification

E_in^w(h) = (1/N) Σ_{n=1}^{N} { 1 if y_n = +1; 1000 if y_n = -1 } · [y_n ≠ h(x_n)]

Naïve thoughts:
   PLA: the weights do not matter if D is linearly separable. :-)
   pocket: modify the pocket-replacement rule so that if w_{t+1} reaches a smaller E_in^w than ŵ, replace ŵ by w_{t+1}.

But the pocket algorithm has some guarantee on E_in^{0/1}; does the modified pocket have a similar guarantee on E_in^w?

Weighted Classification / Systematic Route: Connecting E_in^w and E_in^{0/1}

Original problem, cost matrix:
             h = +1   h = -1
   y = +1    0        1
   y = -1    1000     0
with D: (x_1, +1), (x_2, -1), (x_3, -1), ..., (x_{N-1}, +1), (x_N, +1)

Equivalent problem, cost matrix:
             h = +1   h = -1
   y = +1    0        1
   y = -1    1        0
with D: (x_1, +1), (x_2, -1) copied 1000 times, (x_3, -1) copied 1000 times, ..., (x_{N-1}, +1), (x_N, +1)

After copying each -1 example 1000 times, E_in^w for the original problem is equivalent to E_in^{0/1} for the copied problem!
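
A small Python sketch, not from the slides, illustrating the virtual-copying idea: the weighted error on the original data and the plain 0/1 error on the copied data share the same error count and differ only in normalization, so they rank hypotheses identically. The toy labels, predictions, and cost value are invented for the demo.

```python
import numpy as np

y    = np.array([+1, -1, -1, +1, +1])
pred = np.array([-1, +1, -1, +1, +1])   # predictions of some hypothesis h
COST_NEG = 1000                         # weight for mistakes on y = -1

# weighted 0/1 error on the original data
w = np.where(y == +1, 1, COST_NEG)
E_w = np.mean(w * (y != pred))

# plain 0/1 error on the data with every -1 example virtually copied 1000 times
reps      = np.where(y == +1, 1, COST_NEG)
y_copy    = np.repeat(y, reps)
pred_copy = np.repeat(pred, reps)
E_01_copy = np.mean(y_copy != pred_copy)

# same total error count, different denominator: hypothesis ranking is unchanged
print(E_w * len(y), E_01_copy * len(y_copy))   # both print 1001.0
```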

Weighted Classification / Weighted Pocket Algorithm

Cost matrix:
             h = +1   h = -1
   y = +1    0        1
   y = -1    1000     0

Using virtual copying, the weighted pocket algorithm includes:
   weighted PLA: randomly check the -1 example mistakes with 1000 times more probability
   weighted pocket replacement: if w_{t+1} reaches a smaller E_in^w than ŵ, replace ŵ by w_{t+1}

This systematic route (called "reduction") can be applied to many other algorithms!
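
Below is a minimal Python sketch of what such a weighted pocket algorithm could look like; it is one plausible implementation of the two bullet points above (probability-weighted mistake sampling plus weighted pocket replacement), not the course's reference code, and the data generation, iteration budget, and noise level are assumptions made for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)

def weighted_E_in(w, X, y, weights):
    # weighted 0/1 in-sample error of the linear hypothesis sign(X w)
    return np.mean(weights * (np.sign(X @ w) != y))

def weighted_pocket(X, y, weights, T=2000):
    # X is assumed to have a leading column of ones for the bias term
    w = np.zeros(X.shape[1])
    w_hat, best = w.copy(), weighted_E_in(w, X, y, weights)
    for _ in range(T):
        mistakes = np.flatnonzero(np.sign(X @ w) != y)
        if mistakes.size == 0:
            break
        # weighted PLA step: sample a mistake with probability proportional to its weight
        p = weights[mistakes] / weights[mistakes].sum()
        n = rng.choice(mistakes, p=p)
        w = w + y[n] * X[n]
        # weighted pocket replacement: keep the best weights seen so far w.r.t. E_in^w
        e = weighted_E_in(w, X, y, weights)
        if e < best:
            w_hat, best = w.copy(), e
    return w_hat

# toy, slightly noisy data
N = 200
X = np.hstack([np.ones((N, 1)), rng.uniform(-1, 1, (N, 2))])
y = np.sign(X[:, 1] - X[:, 2])
y[rng.random(N) < 0.05] *= -1                      # a little label noise
weights = np.where(y == +1, 1.0, 1000.0)           # CIA-style costs
w_hat = weighted_pocket(X, y, weights)
print("E_in^w of pocket weights:", weighted_E_in(w_hat, X, y, weights))
```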

Weighted Classification / Fun Time

Consider the CIA cost matrix. If there are 10 examples with y_n = -1 (intruder) and 999,990 examples with y_n = +1 (you), what would E_in^w(h) be for a constant h(x) that always returns +1?

             h = +1   h = -1
   y = +1    0        1
   y = -1    1000     0

1 0.001
2 0.01
3 0.1
4 1

Reference answer: 2
While the quiz is a simple evaluation, it is not uncommon for the data to be very unbalanced in such an application. Properly setting the weights can be used to avoid the lazy constant prediction.
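
For completeness, here is the arithmetic behind the reference answer, worked out here rather than on the slide; w_n denotes the per-example cost (1000 when y_n = -1, 1 when y_n = +1).

```latex
E_{\text{in}}^{\mathbf{w}}(h)
  = \frac{1}{N}\sum_{n=1}^{N} w_n \,\mathbf{1}\!\left[ y_n \neq h(x_n) \right]
  = \frac{1000 \cdot 10 + 1 \cdot 0}{999{,}990 + 10}
  = \frac{10{,}000}{1{,}000{,}000}
  = 0.01 .
```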

Summary

1 When Can Machines Learn?
2 Why Can Machines Learn?
   Lecture 7: The VC Dimension
   Lecture 8: Noise and Error
      Noise and Probabilistic Target: can replace f(x) by P(y|x)
      Error Measure: affects the ideal target
      Algorithmic Error Measure: user-dependent, so use a plausible or friendly substitute
      Weighted Classification: easily done by virtual example copying
   next: more algorithms, please? :-)
3 How Can Machines Learn?
4 How Can Machines Learn Better?