Machine Learning Foundations (機器學習基石)
Lecture 8: Noise and Error

Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering, National Taiwan University (國立台灣大學資訊工程系)
Roadmap

1 When Can Machines Learn?
2 Why Can Machines Learn?
  Lecture 7: The VC Dimension
    learning happens if finite $d_{\text{VC}}$, large $N$, and low $E_{\text{in}}$
  Lecture 8: Noise and Error
    Noise and Probabilistic Target
    Error Measure
    Algorithmic Error Measure
    Weighted Classification
3 How Can Machines Learn?
4 How Can Machines Learn Better?
Noise and Probabilistic Target
Recap: The Learning Flow

unknown target function $f: \mathcal{X} \to \mathcal{Y}$ + noise (ideal credit approval formula)
unknown $P$ on $\mathcal{X}$, generating $x_1, x_2, \ldots, x_N$
training examples $\mathcal{D}: (x_1, y_1), \ldots, (x_N, y_N)$ (historical records in bank)
learning algorithm $\mathcal{A}$, using hypothesis set $\mathcal{H}$ (set of candidate formulas), produces final hypothesis $g \approx f$ (learned formula to be used)

what if there is noise?
Noise and Probabilistic Target
Noise

briefly introduced noise before pocket algorithm:

  age: 23 years
  gender: female
  annual salary: NTD 1,000,000
  year in residence: 1 year
  year in job: 0.5 year
  current debt: 200,000
  credit? {no(-1), yes(+1)}

but more!
  noise in y: good customer, "mislabeled" as bad?
  noise in y: same customers, different labels?
  noise in x: inaccurate customer information?

does VC bound work under noise?
Noise and Probabilistic Target
Probabilistic Marbles

one key of VC bound: marbles!

deterministic marbles: marble $x \sim P(x)$, deterministic color $[\![ f(x) \neq h(x) ]\!]$
probabilistic (noisy) marbles: marble $x \sim P(x)$, probabilistic color $[\![ y \neq h(x) ]\!]$ with $y \sim P(y|x)$

same nature: can estimate $\mathbb{P}[\text{orange}]$ if i.i.d.

VC holds for $x \overset{\text{i.i.d.}}{\sim} P(x)$ and $y \overset{\text{i.i.d.}}{\sim} P(y|x)$, i.e. $(x, y) \overset{\text{i.i.d.}}{\sim} P(x, y)$
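To see the "same nature" claim in action, here is a minimal simulation, a sketch of my own rather than lecture material (the uniform $P(x)$, the 0.2 flipping noise, and all names are assumptions): the observed error rate of a fixed hypothesis concentrates around its true error even when labels are noisy.

```python
# Hypothetical simulation of probabilistic (noisy) marbles; not lecture code.
import random

random.seed(0)

def sample_example():
    """Draw x ~ P(x) uniform on [-1, 1]; label sign(x) flips with prob 0.2."""
    x = random.uniform(-1.0, 1.0)
    y = 1 if x >= 0 else -1
    if random.random() < 0.2:      # P(y != sign(x) | x) = 0.2, the noise level
        y = -y
    return x, y

def h(x):
    """A fixed hypothesis: the marble's color is [y != h(x)]."""
    return 1 if x >= 0 else -1

N = 100_000
mistakes = sum(1 for x, y in (sample_example() for _ in range(N)) if h(x) != y)
print(mistakes / N)                # close to 0.2 = E_out(h) under this P(y|x)
```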
Noise and Probabilistic Target
Target Distribution $P(y|x)$

characterizes behavior of "mini-target" on one $x$
can be viewed as "ideal mini-target" + noise, e.g.
  $P(\circ|x) = 0.7$, $P(\times|x) = 0.3$
  ideal mini-target $f(x) = \circ$
  "flipping" noise level = 0.3

deterministic target $f$: special case of target distribution
  $P(y|x) = 1$ for $y = f(x)$
  $P(y|x) = 0$ for $y \neq f(x)$

goal of learning:
predict ideal mini-target (w.r.t. $P(y|x)$) on often-seen inputs (w.r.t. $P(x)$)
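As a tiny illustration (my own; representing $P(y|x)$ for one fixed $x$ as a dictionary is an assumption), the ideal mini-target and the flipping-noise level can be read directly off the target distribution:

```python
# P(y|x) for one fixed x, written as {label: probability}; hypothetical values
# mirroring the slide's 0.7-vs-0.3 example.
p_y_given_x = {+1: 0.7, -1: 0.3}

# ideal mini-target: the most likely label under P(y|x)
f_x = max(p_y_given_x, key=p_y_given_x.get)

# "flipping" noise level: probability that the observed y differs from f(x)
noise_level = 1.0 - p_y_given_x[f_x]

print(f_x, noise_level)   # 1 0.3
```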
Noise and Probabilistic Target
The New Learning Flow

unknown target distribution $P(y|x)$ containing $f(x)$ + noise (ideal credit approval formula)
unknown $P$ on $\mathcal{X}$, generating $x_1, x_2, \ldots, x_N$ with labels $y_1, y_2, \ldots, y_N$
training examples $\mathcal{D}: (x_1, y_1), \ldots, (x_N, y_N)$ (historical records in bank)
learning algorithm $\mathcal{A}$, using hypothesis set $\mathcal{H}$ (set of candidate formulas), produces final hypothesis $g \approx f$ (learned formula to be used)

VC still works, pocket algorithm explained :-)
Noise and Probabilistic Target
Fun Time

Let's revisit PLA/pocket. Which of the following claims is true?
1 In practice, we should try to compute whether $\mathcal{D}$ is linearly separable before deciding to use PLA.
2 If we know that $\mathcal{D}$ is not linearly separable, then the target function $f$ must not be a linear function.
3 If we know that $\mathcal{D}$ is linearly separable, then the target function $f$ must be a linear function.
4 None of the above

Reference Answer: 4
1 Determining whether $\mathcal{D}$ is linearly separable amounts to finding a separating $w$, after which there would be no need to run PLA.
2 What about noise?
3 What about sampling luck? :-)
Error Measure
Error Measure

final hypothesis $g \approx f$: how well?

previously, considered out-of-sample measure
$E_{\text{out}}(g) = \underset{x \sim P}{\mathbb{E}} [\![ g(x) \neq f(x) ]\!]$

more generally, error measure $E(g, f)$, naturally considered
  out-of-sample: averaged over unknown $x$
  pointwise: evaluated on one $x$
  classification: $[\![ \text{prediction} \neq \text{target} ]\!]$

classification error: often also called "0/1 error"
Error Measure
Pointwise Error Measure

can often express $E(g, f)$ = averaged $\text{err}(g(x), f(x))$, like
$E_{\text{out}}(g) = \underset{x \sim P}{\mathbb{E}} \underbrace{[\![ g(x) \neq f(x) ]\!]}_{\text{err}(g(x), f(x))}$

err: called pointwise error measure

in-sample: $E_{\text{in}}(g) = \frac{1}{N} \sum_{n=1}^{N} \text{err}(g(x_n), f(x_n))$
out-of-sample: $E_{\text{out}}(g) = \underset{x \sim P}{\mathbb{E}}\, \text{err}(g(x), f(x))$

will mainly consider pointwise err for simplicity
Error Measure
Two Important Pointwise Error Measures

$\text{err}(\underbrace{g(x)}_{\tilde{y}}, \underbrace{f(x)}_{y})$

0/1 error: $\text{err}(\tilde{y}, y) = [\![ \tilde{y} \neq y ]\!]$
  correct or incorrect?
  often for classification

squared error: $\text{err}(\tilde{y}, y) = (\tilde{y} - y)^2$
  how far is $\tilde{y}$ from $y$?
  often for regression

how does err guide learning?
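A short sketch of the two pointwise measures and the in-sample average they induce (the function names and toy data are my own, not from the lecture):

```python
# Pointwise error measures and the in-sample average E_in they induce.
def err_01(y_tilde, y):
    """0/1 error: was the prediction correct or incorrect?"""
    return int(y_tilde != y)

def err_sq(y_tilde, y):
    """Squared error: how far is y_tilde from y?"""
    return (y_tilde - y) ** 2

def E_in(g, data, err):
    """E_in(g) = (1/N) * sum_n err(g(x_n), y_n) over examples (x_n, y_n)."""
    return sum(err(g(x), y) for x, y in data) / len(data)

data = [(0.5, +1), (-0.2, -1), (0.1, -1)]     # toy examples
g = lambda x: +1 if x >= 0 else -1
print(E_in(g, data, err_01))                  # 1/3: one mistake out of three
```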
Error Measure
Ideal Mini-Target

interplay between noise and error: $P(y|x)$ and err define ideal mini-target $f(x)$

$P(y=1|x) = 0.2$, $P(y=2|x) = 0.7$, $P(y=3|x) = 0.1$

0/1 error $\text{err}(\tilde{y}, y) = [\![ \tilde{y} \neq y ]\!]$:
  $\tilde{y} = 1$: avg. err 0.8
  $\tilde{y} = 2$: avg. err 0.3 (★)
  $\tilde{y} = 3$: avg. err 0.9
  $\tilde{y} = 1.9$: avg. err 1.0 (really? :-))
  $f(x) = \underset{y \in \mathcal{Y}}{\text{argmax}}\, P(y|x)$

squared error $\text{err}(\tilde{y}, y) = (\tilde{y} - y)^2$:
  $\tilde{y} = 1$: avg. err 1.1
  $\tilde{y} = 2$: avg. err 0.3
  $\tilde{y} = 3$: avg. err 1.5
  $\tilde{y} = 1.9$: avg. err 0.29 (★)
  $f(x) = \sum_{y \in \mathcal{Y}} y \cdot P(y|x)$
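The slide's averages can be checked mechanically; this sketch (my own) recomputes them from $P(y|x)$ and confirms that the argmax minimizes the average 0/1 error while the weighted mean minimizes the average squared error:

```python
# Recompute the slide's averages for P(y=1|x)=0.2, P(y=2|x)=0.7, P(y=3|x)=0.1.
P = {1: 0.2, 2: 0.7, 3: 0.1}

def avg_err(y_tilde, err):
    """Expected pointwise error of predicting y_tilde under P(y|x)."""
    return sum(p * err(y_tilde, y) for y, p in P.items())

err_01 = lambda yt, y: int(yt != y)
err_sq = lambda yt, y: (yt - y) ** 2

for yt in (1, 2, 3, 1.9):
    print(yt, round(avg_err(yt, err_01), 2), round(avg_err(yt, err_sq), 2))
# 0/1 error is minimized by argmax_y P(y|x) = 2 (avg. 0.3);
# squared error is minimized by sum_y y*P(y|x) = 1.9 (avg. 0.29).
```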
Error Measure
Learning Flow with Error Measure

unknown target distribution $P(y|x)$ containing $f(x)$ + noise (ideal credit approval formula)
unknown $P$ on $\mathcal{X}$, generating $x_1, x_2, \ldots, x_N$ with labels $y_1, y_2, \ldots, y_N$
training examples $\mathcal{D}: (x_1, y_1), \ldots, (x_N, y_N)$ (historical records in bank)
learning algorithm $\mathcal{A}$, using hypothesis set $\mathcal{H}$ (set of candidate formulas) and error measure err, produces final hypothesis $g \approx f$ (learned formula to be used)

extended VC theory/"philosophy" works for most $\mathcal{H}$ and err
Error Measure
Fun Time

Consider the following $P(y|x)$ and $\text{err}(\tilde{y}, y) = |\tilde{y} - y|$. Which of the following is the ideal mini-target $f(x)$?
$P(y=1|x) = 0.10$, $P(y=2|x) = 0.35$, $P(y=3|x) = 0.15$, $P(y=4|x) = 0.40$

1 2 = weighted median from $P(y|x)$
2 2.5 = average within $\mathcal{Y} = \{1, 2, 3, 4\}$
3 2.85 = weighted mean from $P(y|x)$
4 4 = $\text{argmax}\, P(y|x)$

Reference Answer: 1
For the absolute error, the weighted median provably results in the minimum average err.
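For intuition, here is a sketch of a weighted median (my own code on a made-up distribution, not the quiz's numbers): the smallest $y$ at which the cumulative probability reaches one half.

```python
# Weighted median sketch: minimizes the expected absolute error under P(y|x).
def weighted_median(p):
    """Smallest y whose cumulative probability under p reaches 1/2."""
    cum = 0.0
    for y in sorted(p):
        cum += p[y]
        if cum >= 0.5:
            return y

# hypothetical distribution for illustration (not the quiz's)
print(weighted_median({1: 0.2, 2: 0.5, 3: 0.3}))   # 2
```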
Algorithmic Error Measure
Choice of Error Measure

Fingerprint Verification: $f = +1$ means you, $f = -1$ means intruder
two types of error: false accept and false reject

            g = +1          g = -1
f = +1      no error        false reject
f = -1      false accept    no error

0/1 error penalizes both types equally
Algorithmic Error Measure
Fingerprint Verification for Supermarket

Fingerprint Verification: $f = +1$ means you, $f = -1$ means intruder
two types of error: false accept and false reject

supermarket: fingerprint for discount; cost matrix

            g = +1    g = -1
f = +1      0         10
f = -1      1         0

false reject: very unhappy customer, lose future business
false accept: give away a minor discount, intruder left fingerprint :-)
Algorithmic Error Measure
Fingerprint Verification for CIA

Fingerprint Verification: $f = +1$ means you, $f = -1$ means intruder
two types of error: false accept and false reject

CIA: fingerprint for entrance; cost matrix

            g = +1    g = -1
f = +1      0         1
f = -1      1000      0

false accept: very serious consequences!
false reject: unhappy employee, but so what? :-)
Algorithmic Error Measure
Take-home Message for Now

err is application/user-dependent

Algorithmic Error Measures $\hat{\text{err}}$:
  true: just err
  plausible:
    0/1: minimum "flipping noise" (NP-hard to optimize, remember? :-))
    squared: minimum Gaussian noise
  friendly: easy to optimize for $\mathcal{A}$
    closed-form solution
    convex objective function

$\hat{\text{err}}$: more in next lectures
Algorithmic Error Measure
Learning Flow with Algorithmic Error Measure

unknown target distribution $P(y|x)$ containing $f(x)$ + noise (ideal credit approval formula)
unknown $P$ on $\mathcal{X}$, generating $x_1, x_2, \ldots, x_N$ with labels $y_1, y_2, \ldots, y_N$
training examples $\mathcal{D}: (x_1, y_1), \ldots, (x_N, y_N)$ (historical records in bank)
learning algorithm $\mathcal{A}$ with $\hat{\text{err}}$, using hypothesis set $\mathcal{H}$ (set of candidate formulas) and error measure err, produces final hypothesis $g \approx f$ (learned formula to be used)

err: application goal; $\hat{\text{err}}$: a key part of many $\mathcal{A}$
Algorithmic Error Measure
Fun Time

Consider the err below for CIA. What is $E_{\text{in}}(g)$ when using this err?

            g = +1    g = -1
f = +1      0         1
f = -1      1000      0

1 $\frac{1}{N} \sum_{n=1}^{N} [\![ y_n \neq g(x_n) ]\!]$
2 $\frac{1}{N} \Big( \sum_{y_n = +1} [\![ y_n \neq g(x_n) ]\!] + 1000 \sum_{y_n = -1} [\![ y_n \neq g(x_n) ]\!] \Big)$
3 $\frac{1}{N} \Big( 1000 \sum_{y_n = +1} [\![ y_n \neq g(x_n) ]\!] + \sum_{y_n = -1} [\![ y_n \neq g(x_n) ]\!] \Big)$
4 $\frac{1}{N} \Big( 1000 \sum_{y_n = +1} [\![ y_n \neq g(x_n) ]\!] + 1000 \sum_{y_n = -1} [\![ y_n \neq g(x_n) ]\!] \Big)$

Reference Answer: 2
When $y_n = -1$, a false positive on such $(x_n, y_n)$ is penalized 1000 times more!
Weighted Classification
Weighted Classification

CIA cost (error, loss, ...) matrix:

            h(x) = +1    h(x) = -1
y = +1      0            1
y = -1      1000         0

out-of-sample: $E_{\text{out}}(h) = \underset{(x,y) \sim P}{\mathbb{E}} \begin{cases} 1 & \text{if } y = +1 \\ 1000 & \text{if } y = -1 \end{cases} \cdot [\![ y \neq h(x) ]\!]$

in-sample: $E_{\text{in}}(h) = \frac{1}{N} \sum_{n=1}^{N} \begin{cases} 1 & \text{if } y_n = +1 \\ 1000 & \text{if } y_n = -1 \end{cases} \cdot [\![ y_n \neq h(x_n) ]\!]$

weighted classification: different "weight" for different $(x, y)$
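In code, the weighted in-sample error is a one-line change from the plain 0/1 version; a minimal sketch (names my own), assuming the CIA weights of 1 and 1000:

```python
# Weighted in-sample error for the CIA cost matrix: false accepts (y = -1
# predicted as +1) cost 1000, false rejects (y = +1 predicted as -1) cost 1.
def E_in_w(h, data):
    total = 0
    for x, y in data:
        if h(x) != y:
            total += 1 if y == +1 else 1000
    return total / len(data)
```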
Weighted Classification
Minimizing $E_{\text{in}}$ for Weighted Classification

$E_{\text{in}}^{w}(h) = \frac{1}{N} \sum_{n=1}^{N} \begin{cases} 1 & \text{if } y_n = +1 \\ 1000 & \text{if } y_n = -1 \end{cases} \cdot [\![ y_n \neq h(x_n) ]\!]$

Naïve Thoughts
  PLA: the weights don't matter if $\mathcal{D}$ is linearly separable (a perfect separator makes no mistakes, weighted or not) :-)
  pocket: modify pocket-replacement rule: if $w_{t+1}$ reaches smaller $E_{\text{in}}^{w}$ than $\hat{w}$, replace $\hat{w}$ by $w_{t+1}$

pocket: some guarantee on $E_{\text{in}}^{0/1}$; modified pocket: similar guarantee on $E_{\text{in}}^{w}$?
Weighted Classification
Systematic Route: Connecting $E_{\text{in}}^{w}$ and $E_{\text{in}}^{0/1}$

original problem (weights 1 and 1000):

            h(x) = +1    h(x) = -1
y = +1      0            1
y = -1      1000         0

$\mathcal{D}$: $(x_1, +1)$, $(x_2, -1)$, $(x_3, -1)$, ..., $(x_{N-1}, +1)$, $(x_N, +1)$

equivalent problem (all weights 1):

            h(x) = +1    h(x) = -1
y = +1      0            1
y = -1      1            0

$(x_1, +1)$; $(x_2, -1)$ copied 1000 times; $(x_3, -1)$ copied 1000 times; ...; $(x_{N-1}, +1)$; $(x_N, +1)$

after copying each $-1$ example 1000 times, $E_{\text{in}}^{w}$ for LHS = $E_{\text{in}}^{0/1}$ for RHS!
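A quick check of the virtual-copying equivalence (my own sketch with toy data): the weighted cost on $\mathcal{D}$ equals the plain 0/1 mistake count on the dataset where every $-1$ example is copied 1000 times.

```python
# Virtual copying: weighted cost on D == unweighted mistake count on copied D.
data = [(0.3, +1), (-0.4, -1), (0.8, +1)]          # toy dataset
copied = [ex for ex in data for _ in range(1000 if ex[1] == -1 else 1)]

h = lambda x: +1                                    # always accept: 1 false accept

cost_weighted = sum(1 if y == +1 else 1000 for x, y in data if h(x) != y)
mistakes_copied = sum(1 for x, y in copied if h(x) != y)
print(cost_weighted, mistakes_copied)               # 1000 1000: same objective
```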
Weighted Classification
Weighted Pocket Algorithm

            h(x) = +1    h(x) = -1
y = +1      0            1
y = -1      1000         0

using "virtual copying", weighted pocket algorithm includes:
  weighted PLA: randomly check $-1$ example mistakes with 1000 times more probability
  weighted pocket replacement: if $w_{t+1}$ reaches smaller $E_{\text{in}}^{w}$ than $\hat{w}$, replace $\hat{w}$ by $w_{t+1}$

systematic route (called "reduction"): can be applied to many other algorithms!
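A rough sketch of one weighted-PLA correction under the virtual-copying view (my own; the helper names, the probability-proportional sampling via random.choices, and the data format are assumptions): mistaken $-1$ examples are chosen for the update 1000 times more often.

```python
# One weighted PLA correction: sample a mistake with probability proportional
# to its virtual-copy count (1 for y = +1, 1000 for y = -1), then update w.
import random

def weighted_pla_step(w, data):
    sign = lambda s: 1 if s >= 0 else -1
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    mistakes = [(x, y) for x, y in data if sign(dot(w, x)) != y]
    if not mistakes:
        return w                                  # w already separates the data
    weights = [1 if y == +1 else 1000 for _, y in mistakes]
    x, y = random.choices(mistakes, weights=weights, k=1)[0]
    return [wi + y * xi for wi, xi in zip(w, x)]  # PLA update: w <- w + y*x
```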
Weighted Classification
Fun Time

Consider the CIA cost matrix. If there are 10 examples with $y_n = -1$ (intruder) and 999,990 examples with $y_n = +1$ (you), what would $E_{\text{in}}^{w}(h)$ be for a constant $h(x)$ that always returns $+1$?

            h(x) = +1    h(x) = -1
y = +1      0            1
y = -1      1000         0

1 0.001
2 0.01
3 0.1
4 1

Reference Answer: 2
While the quiz is a simple evaluation, it is not uncommon that the data is very unbalanced for such an application. Properly setting the weights can be used to avoid the lazy constant prediction.
Weighted Classification
Summary

1 When Can Machines Learn?
2 Why Can Machines Learn?
  Lecture 7: The VC Dimension
  Lecture 8: Noise and Error
    Noise and Probabilistic Target: can replace $f(x)$ by $P(y|x)$
    Error Measure: affects "ideal" target
    Algorithmic Error Measure: user-dependent $\Rightarrow$ plausible or friendly $\hat{\text{err}}$
    Weighted Classification: easily done by virtual example copying
3 How Can Machines Learn?
  next: more algorithms, please? :-)
4 How Can Machines Learn Better?