Machine Learning Foundations


Machine Learning Foundations (機器學習基石)
Lecture 13: Hazard of Overfitting
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering, National Taiwan University (國立台灣大學資訊工程系)

Roadmap
1 When Can Machines Learn?  2 Why Can Machines Learn?  3 How Can Machines Learn?
Lecture 12: Nonlinear Transform: nonlinear models via a nonlinear feature transform Φ plus a linear model, with the price of model complexity
4 How Can Machines Learn Better?
Lecture 13: Hazard of Overfitting
What is Overfitting?
The Role of Noise and Data Size
Deterministic Noise
Dealing with Overfitting

What is Overfitting? Bad Generalization
regression for x ∈ R with N = 5 examples
target f(x) = 2nd-order polynomial
labels y_n = f(x_n) + very small noise
linear regression in Z-space with Φ = 4th-order polynomial transform
unique solution passing through all examples => E_in(g) = 0, E_out(g) huge
[figure: Data, Target, Fit]
bad generalization: low E_in, high E_out
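Below is a minimal sketch of this N = 5 setup (not from the lecture; the quadratic target, noise level, and sample locations are made up): a 4th-order polynomial fit interpolates five slightly noisy points, so E_in is essentially zero while E_out is typically much larger.

# A minimal sketch, assuming a made-up quadratic target and noise level.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 2 * x**2 - x + 1                      # hypothetical 2nd-order target

x_train = rng.uniform(-1, 1, 5)
y_train = f(x_train) + rng.normal(0, 0.05, 5)       # very small noise

g = np.polynomial.Polynomial.fit(x_train, y_train, deg=4)   # 4th-order fit

x_test = rng.uniform(-1, 1, 100_000)
E_in = np.mean((g(x_train) - y_train) ** 2)
E_out = np.mean((g(x_test) - f(x_test)) ** 2)
print(f"E_in  = {E_in:.2e}")    # essentially zero: the fit passes through all 5 points
print(f"E_out = {E_out:.2e}")   # typically much larger, especially when the x's cluster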

What is Overfitting? Bad Generalization and Overfitting
take d_VC = 1126 for learning: bad generalization, i.e. (E_out - E_in) large
switch from d_VC = d_VC* to d_VC = 1126: overfitting (E_in goes down, E_out goes up)
switch from d_VC = d_VC* to d_VC = 1: underfitting (E_in goes up, E_out goes up)
[figure: error versus VC dimension d_VC, with the in-sample error decreasing, the model complexity increasing, and the out-of-sample error minimized at d_VC*]
bad generalization: low E_in, high E_out; overfitting: lower E_in, higher E_out

What is Overfitting? Cause of Overfitting: A Driving Analogy
[figure: Data, Target, Fit; a good fit versus an overfit]

learning: overfit                  | driving: commit a car accident
use excessive d_VC                 | drive too fast
noise                              | bumpy road
limited data size N                | limited observations about the road condition

next: how do noise & data size affect overfitting?

What is Overfitting? Fun Time
Based on our discussion, for data of fixed size, which of the following situations has relatively the lowest risk of overfitting?
1 small noise, fitting from small d_VC to medium d_VC
2 small noise, fitting from small d_VC to large d_VC
3 large noise, fitting from small d_VC to medium d_VC
4 large noise, fitting from small d_VC to large d_VC

Reference Answer: 1
Two causes of overfitting are noise and excessive d_VC. If both are relatively under control, the risk of overfitting is smaller.

The Role of Noise and Data Size: Case Study (1/2)
left: 10th-order target function + noise; right: 50th-order target function, noiseless
[figures: Data and Target for each case]
overfitting when going from the best g_2 ∈ H_2 to the best g_10 ∈ H_10?

The Role of Noise and Data Size: Case Study (2/2)
[figures: Data, 2nd Order Fit, 10th Order Fit for each case]

10th-order target function + noise:
          g_2 ∈ H_2   g_10 ∈ H_10
E_in      0.050       0.034
E_out     0.127       9.00

50th-order target function, noiseless:
          g_2 ∈ H_2   g_10 ∈ H_10
E_in      0.029       0.00001
E_out     0.120       7680

overfitting from g_2 to g_10? both yes!
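A hedged sketch of the noisy case: fit 2nd- and 10th-order polynomials to a small noisy sample from a randomly generated 10th-order target. The lecture's exact targets and normalization are not reproduced, so the numbers will differ, but the pattern (g_10 has the lower E_in and the much larger E_out) typically shows up.

# A rough sketch with made-up target coefficients, noise level, and sample size.
import numpy as np

rng = np.random.default_rng(1)
target = np.polynomial.Polynomial(rng.normal(size=11))   # random 10th-order target

N = 15
x = rng.uniform(-1, 1, N)
y = target(x) + rng.normal(0, 0.5, N)                    # stochastic noise

x_test = rng.uniform(-1, 1, 100_000)
for deg in (2, 10):
    g = np.polynomial.Polynomial.fit(x, y, deg)
    E_in = np.mean((g(x) - y) ** 2)
    E_out = np.mean((g(x_test) - target(x_test)) ** 2)
    print(f"degree {deg:2d}: E_in = {E_in:.4f}, E_out = {E_out:.4f}")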

The Role of Noise and Data Size: Irony of Two Learners
[figure: Data, 2nd Order Fit, 10th Order Fit]
learner Overfit (O): picks g_10 ∈ H_10; learner Restrict (R): picks g_2 ∈ H_2, even when both know that the target is 10th-order
R gives up the ability to fit the data and target exactly, but R wins in E_out by a lot!
philosophy: concession for advantage? :-)

The Role of Noise and Data Size: Learning Curves Revisited
[figure: expected E_in and E_out versus number of data points N, for H_2 (left) and H_10 (right)]
H_10: lower E_out when N is large, but much larger generalization error when N is small
gray area: O overfits! (E_in down, E_out up)
R always wins in E_out if N is small!

The Role of Noise and Data Size: The No-Noise Case
[figure: Data, 2nd Order Fit, 10th Order Fit; Data, Target]
learner Overfit (O): picks g_10 ∈ H_10; learner Restrict (R): picks g_2 ∈ H_2, even when both know that there is no noise
R still wins. Is there really no noise? target complexity acts like noise

The Role of Noise and Data Size: Fun Time
When having limited data, in which of the following cases would learner R perform better than learner O?
1 limited data from a 10th-order target function with some noise
2 limited data from a 1126th-order target function with no noise
3 limited data from a 1126th-order target function with some noise
4 all of the above

Reference Answer: 4
We discussed cases 1 and 2, but you should be able to generalize :-) that R also wins in the more difficult case 3.

Deterministic Noise: A Detailed Experiment
y = f(x) + ε ~ Gaussian( Σ_{q=0}^{Q_f} α_q x^q , σ² ), where the sum inside is f(x)
Gaussian i.i.d. noise ε with level σ²
some uniform distribution over targets f(x) with complexity level Q_f
data size N
[figure: Data, Target]
goal: overfit level for different (N, σ²) and (N, Q_f)?

Deterministic Noise: The Overfit Measure
[figure: Data, 2nd Order Fit, 10th Order Fit; Data, Target]
fit both g_2 ∈ H_2 and g_10 ∈ H_10: E_in(g_10) ≤ E_in(g_2) for sure
overfit measure: E_out(g_10) - E_out(g_2)
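A rough sketch of how such an overfit measure could be estimated for one setting (N, σ²); the lecture averages many runs over a grid of settings, and its target distribution and normalization are not reproduced here.

# Sketch only: polynomial targets in the standard basis with made-up coefficients.
import numpy as np

def overfit_measure(N, sigma2, Qf=20, n_test=100_000, seed=None):
    rng = np.random.default_rng(seed)
    f = np.polynomial.Polynomial(rng.normal(size=Qf + 1))   # random target of order Qf
    x = rng.uniform(-1, 1, N)
    y = f(x) + rng.normal(0, np.sqrt(sigma2), N)
    x_test = rng.uniform(-1, 1, n_test)
    g2 = np.polynomial.Polynomial.fit(x, y, 2)
    g10 = np.polynomial.Polynomial.fit(x, y, 10)
    e_out = lambda g: np.mean((g(x_test) - f(x_test)) ** 2)
    return e_out(g10) - e_out(g2)                            # positive => overfitting

# average over repeated runs for one grid point, e.g. N = 80, sigma^2 = 0.1
print(np.mean([overfit_measure(80, 0.1, seed=s) for s in range(50)]))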

Deterministic Noise: The Results
[figure: overfit measure as a function of (noise level σ², N) with fixed Q_f = 20, and as a function of (target complexity Q_f, N) with fixed σ² = 0.1; color encodes the overfit measure from -0.2 to 0.2]
ring a bell? :-)

Deterministic Noise: Impact of Noise and Data Size
[figure: overfit measure for σ² versus N (stochastic noise) and for Q_f versus N (deterministic noise)]

four reasons for serious overfitting:
data size N down          => overfit up
stochastic noise up       => overfit up
deterministic noise up    => overfit up
excessive power up        => overfit up

overfitting easily happens

Deterministic Noise
if f ∉ H: something about f cannot be captured by H
deterministic noise: the difference between the best h* ∈ H and f
acts like stochastic noise (not new to CS: think of a pseudo-random generator)
difference from stochastic noise: depends on H, and is fixed for a given x
[figure: f versus h*]
philosophy: when teaching a kid, perhaps better not to use examples from a complicated target function? :-)

Deterministic Noise: Fun Time
Consider the target function f(x) = sin(1126x) for x ∈ [0, 2π]. When x is uniformly sampled from this range and we use all possible linear hypotheses h(x) = wx to approximate the target function with respect to the squared error, what is the level of deterministic noise for each x?
1 sin(1126x)
2 sin(1126x) - x
3 sin(1126x) + x
4 sin(1126x) - 1126x

Reference Answer: 1
You can try a few different w and convince yourself that the best hypothesis h* is h*(x) = 0. The deterministic noise is the difference between f and h*.
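A quick numeric sanity check (not part of the lecture) that the least-squares w for h(x) = wx is essentially 0 here, so the deterministic noise at each x is essentially sin(1126x) itself.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 2 * np.pi, 1_000_000)
y = np.sin(1126 * x)

w_star = np.sum(x * y) / np.sum(x * x)    # least-squares solution for h(x) = w*x
print(w_star)                             # very close to 0
print(np.mean((y - w_star * x) ** 2))     # about 0.5, the mean squared deterministic noise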

Dealing with Overfitting: Driving Analogy Revisited

learning                    | driving
overfit                     | commit a car accident
use excessive d_VC          | drive too fast
noise                       | bumpy road
limited data size N         | limited observations about the road condition
start from a simple model   | drive slowly
data cleaning/pruning       | use more accurate road information
data hinting                | exploit more road information
regularization              | put on the brakes
validation                  | monitor the dashboard

all very practical techniques to combat overfitting

Dealing with Overfitting: Data Cleaning/Pruning
if we detect the outlier 5 at the top, e.g. by being too close to the other class, too far from its own class, or wrongly classified by the current classifier...
possible action 1: correct the label (data cleaning)
possible action 2: remove the example (data pruning)
possibly helps, but the effect varies

Dealing with Overfitting: Data Hinting
slightly shifted/rotated digits carry the same meaning
possible action: add virtual examples by shifting/rotating the given digits (data hinting)
possibly helps, but watch out: virtual examples are not i.i.d. from P(x, y)!
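A minimal sketch of the shifting idea, assuming digit images are stored as 2-D arrays (the helper name and pixel shifts are made up for illustration):

import numpy as np

def shifted_virtual_examples(images, labels, shifts=((0, 1), (0, -1), (1, 0), (-1, 0))):
    """Create virtual examples by shifting each image one pixel in four directions."""
    virtual_x, virtual_y = [], []
    for img, lab in zip(images, labels):
        for dy, dx in shifts:
            # np.roll wraps pixels around the border, a simplification of a true shift
            virtual_x.append(np.roll(img, shift=(dy, dx), axis=(0, 1)))
            virtual_y.append(lab)    # a slightly shifted digit keeps its label
    return np.array(virtual_x), np.array(virtual_y)

As the slide warns, such virtual examples are no longer i.i.d. samples from P(x, y).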

Dealing with Overfitting: Fun Time
Assume we know that f(x) is symmetric for some 1D regression application. That is, f(x) = f(-x). One possibility for using this knowledge is to consider symmetric hypotheses only. On the other hand, you can also generate virtual examples from the original data {(x_n, y_n)} as hints. Which virtual examples suit our needs best?
1 {(x_n, -y_n)}
2 {(-x_n, -y_n)}
3 {(-x_n, y_n)}
4 {(2x_n, 2y_n)}

Reference Answer: 3
We want the virtual examples to encode the invariance of y when x is replaced by -x.
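In code, this hint amounts to one line of augmentation (the sample arrays below are made up):

import numpy as np

x = np.array([0.5, 1.0, 2.0])
y = np.array([0.3, 0.9, 1.7])

# add the reflected virtual examples {(-x_n, y_n)}: mirrored inputs, same labels
x_aug = np.concatenate([x, -x])
y_aug = np.concatenate([y, y])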

Summary
1 When Can Machines Learn?  2 Why Can Machines Learn?  3 How Can Machines Learn?
Lecture 12: Nonlinear Transform
4 How Can Machines Learn Better?
Lecture 13: Hazard of Overfitting
What is Overfitting? lower E_in but higher E_out
The Role of Noise and Data Size: overfitting easily happens!
Deterministic Noise: what H cannot capture acts like noise
Dealing with Overfitting: data cleaning/pruning/hinting, and more
next: putting on the brakes with regularization