outline Nonlinear transformation Error measures Noisy targets Preambles to the theory

Size: px

Start display at page:

Download "outline Nonlinear transformation Error measures Noisy targets Preambles to the theory"

Miranda Barker
5 years ago
Views:

1 Error and Noise

2 outline Nonlinear transformation Error measures Noisy targets Preambles to the theory

3 Linear is limited Data Hypothesis

4 Linear in what? Linear regression implements Linear classification implements Both are functions linear in x s. j= 0 More importantly from the learning point of view, they are linear in w s (parameters in learning). d w j x j d sign w x j j j= 0 Both algorithms, PLA and Regression, work because of linearity in the weights. So, we can apply any nonlinear transformation to data (x s are constant) and still remain in the realm of linear models.

5 Transform data nonlinearly Φ (x 1, x 2 ) (z 1, z 2 ) = (x 12, x 22 ) x Z

6 None Linear transformation None-linear transforms give the ability to linearly separate any data points by moving to a new space with sufficiently larger dimensions. A tool to get more sophisticated surfaces in the initial space while we are still able to use simple linear techniques. Any transformation Φ: X Z on data preserves the linearity of the model, w T x.

Transformed data z n = Φ(x n ) Z Φ -1 4.

7 The cycle of nonlinear transformation Φ 1. Original data x n X 2. Transformed data z n = Φ(x n ) Z Φ Classify in X-space g(x) = g (Φ(x)) = sign (w T Φ(x)) 3. Separate data in Z-space g (z) = sign (w T z)

8 What transforms to what? Φ x = (x 0, x 1,..., x d ) z = (z 0, z 1,..., z d ) x 1, x 2,, x n Φ z 1, z 2,, z n y 1, y 2,, y n Φ y 1, y 2,, y n No weights in X w = (w 0, w 1,, w d ) Hypothesis: g(z) = sign (w T z) g(x) = sign (w T Φ(x))

9 The learning diagram so far x 1,..., x N LEARNING ALGORITHM A

10 Error measures Error measures try to answer the following question: What does h f mean? Error measure: E (h, f) Pointwise definition e (h(x), f(x)) Examples Squared error e (h(x), f(x)) = ( h(x) f(x) ) 2 Binary error e (h(x), f(x)) = [ h(x) f(x) ]

11 Overall error E (h, f) = average of pointwise errors e(h(x), f(x)) In-sample error: E in n 1 ( h) = e( h( xi ), f ( x n i= 1 i )) Out-of- sample error: [ e( h( x), f ( ))] E out ( h) = E x x Expected value w.r.t. x. (x is a general point in X space)

12 The learning diagram with pointwise error x 1,..., x N x g(x) f(x) LEARNING ALGORITHM A

13 Fingerprint verification: The error cost Two types of error: false accept (false positive - FP) and false reject (false negative - FN) +1 f 1 h +1 1 no error false reject false accept no error Confusion Matrix How to penalize each type?

14 The error cost for supermarkets Supermarket verifies fingerprints for discounts False reject is costly; risk of loosing a customer! False accept is minor; gave away a discount, and intruder left their fingerprint! +1 f 1 h

15 The error cost for CIA CIA verifies fingerprints for security False accept is a disaster! False reject can be tolerated. Try again! h +1 1 f In practical problems: the error measure and the cost should be specified by the user.

16 The learning diagram with error measure LEARNING ALGORITHM A x 1,..., x N x ERROR MEASURE e() g(x) f(x)

17 Noisy targets The target function is not always a function. Consider the credit-card approval age gender Annual salary Years in residence Years in job Current debt 23 male $30, $15, male $30, $15,000-1 Class Two identical customers two different behaviors

18 Target distribution Instead of target function y = f(x) we use target distribution P(y x) (x, y) is now generated by the joint distribution (assuming independence): P(x) P(y x) Noisy target = deterministic target + noise f(x) = E (y x) y f(x) Deterministic target is a special case of noisy target, where P(y x) is zero except for y = f(x)

19 The learning diagram including noisy target LEARNING ALGORITHM A x 1,..., x N x ERROR MEASURE e() g(x) f(x)

20 Distinction between P(y x) and P(x) Both convey probabilistic aspects of x and y The input distribution P(x) quantifies relative importance of x The target distribution P(y x) is what we are trying to learn Merging P(x) P(y x) as P(x, y) mixes two concepts inherently different

21 Preamble to the theory What we know so far: Learning is feasible, in a probabilistic sense. It is likely that E out (g) E in (g) Is this learning? Indeed, we need to learn a g such that g f which means E out (g) 0 (good) Generalization (good) Learning

22 The 2 questions of learning E out (g) 0 (the learning) is achieved through two conditions 1. E out (g) E in (g) (developed using Hoeffding) and 2. E in (g) 0 Theoretical result (Lecture 2) Practical result (Lecture 3) Learning reduces to 2 questions: 1. Can we make sure that E out (g) is close enough to E in (g)? 2. Can we make E in (g) small enough?

23 What the theory will achieve Two important things the theory does: 1. Characterize the feasibility of learning for infinite M (M is the number of hypothesis) 2. Characterizing the tradeoff: Model complexity Model complexity E in E in E out

Machine Learning Foundations

Machine Learning Foundations ( 機器學習基石 ) Lecture 8: Noise and Error Hsuan-Tien Lin ( 林軒田 ) htlin@csie.ntu.edu.tw Department of Computer Science & Information Engineering National Taiwan University ( 國立台灣大學資訊工程系