Machine Learning Brett Bernstein

Week 2 Lecture: Concept Check Exercises

Starred problems are optional.

Excess Risk Decomposition

1. Let $\mathcal{X} = \mathcal{Y} = \{1, 2, \dots, 10\}$, $\mathcal{A} = \{1, \dots, 10, 11\}$, and suppose the data distribution has marginal distribution $X \sim \mathrm{Unif}\{1, \dots, 10\}$. Furthermore, assume $Y = X$ (i.e., $Y$ always has the exact same value as $X$). In the questions below we use the square loss function $\ell(a, x) = (a - x)^2$.

   (a) What is the Bayes risk?

   (b) What is the approximation error when using the hypothesis space of constant functions?

   (c) Suppose we use the hypothesis space $\mathcal{F}$ of affine functions.

       i. What is the approximation error?

       ii. Consider the function $\hat{f}(x) = x + 1$. Compute $R(\hat{f}) - R(f_{\mathcal{F}})$.

   Solution:

   (a) The best decision function is $f^*(x) = x$. The associated risk is 0.

   (b) The best constant function is $f(x) = \mathbb{E}[Y] = \mathbb{E}[X] = 5.5$. This has risk $\mathbb{E}[(Y - 5.5)^2] = \mathrm{Var}(Y) = \frac{33}{4}$, by using (or deriving) the formula for the variance of a discrete uniform distribution. Thus the approximation error is $33/4$.

   (c) i. The Bayes decision function is affine, so the approximation error is 0.

       ii. The risk is
           $$R(\hat{f}) = \mathbb{E}[(Y - \hat{f}(X))^2] = \mathbb{E}[(X - (X + 1))^2] = 1.$$
           Thus the answer is 1.
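The closed-form answers above are easy to sanity-check by simulation. Below is a small Monte Carlo sketch (not part of the original exercise) that estimates the three risks; it assumes NumPy is available.

```python
import numpy as np

# Monte Carlo check of Problem 1: X ~ Unif{1,...,10}, Y = X, square loss.
rng = np.random.default_rng(0)
x = rng.integers(1, 11, size=1_000_000)
y = x  # Y always equals X

risk_bayes = np.mean((y - x) ** 2)          # f*(x) = x, Bayes risk 0
risk_const = np.mean((y - 5.5) ** 2)        # best constant 5.5, risk 33/4 = 8.25
risk_fhat  = np.mean((y - (x + 1)) ** 2)    # f_hat(x) = x + 1, risk 1

print(risk_bayes, risk_const, risk_fhat)    # ~0.0, ~8.25, 1.0
```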

2. (*) Let $\mathcal{X} = [-10, 10]$, $\mathcal{Y} = \mathcal{A} = \mathbb{R}$, and suppose the data distribution has marginal distribution $X \sim \mathrm{Unif}(-10, 10)$ and $Y \mid X = x \sim \mathcal{N}(a + bx, 1)$. Throughout we assume the square loss function $\ell(a, x) = (a - x)^2$.

   (a) What is the Bayes risk?

   (b) What is the approximation error when using the hypothesis space of constant functions (in terms of $a$ and $b$)?

   (c) Suppose we use the hypothesis space of affine functions.

       i. What is the approximation error?

       ii. Suppose you have a fixed data set and compute the empirical risk minimizer $\hat{f}_n(x) = c + dx$. What is the estimation error (in terms of $a, b, c, d$)?

   Solution: Throughout we use the fact that $\mathrm{Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2$.

   (a) The best decision function is $f^*(x) = \mathbb{E}[Y \mid X = x] = a + bx$. This has risk
       $$\mathbb{E}[(Y - a - bX)^2] = \mathbb{E}\big[\mathbb{E}[(Y - a - bX)^2 \mid X]\big] = \mathbb{E}[1] = 1.$$

   (b) The best constant function is given by $\mathbb{E}[Y] = \mathbb{E}[\mathbb{E}[Y \mid X]] = a + b\,\mathbb{E}[X] = a$. This has risk
       $$\mathbb{E}[(Y - a)^2] = \mathbb{E}\big[\mathbb{E}[(Y - a)^2 \mid X]\big] = \mathbb{E}[1 + b^2 X^2] = 1 + b^2\,\mathbb{E}[X^2],$$
       where
       $$\mathbb{E}[X^2] = \int_{-10}^{10} \frac{x^2}{20}\, dx = \frac{2000}{3 \cdot 20} = \frac{100}{3}.$$
       Thus the approximation error is $100 b^2 / 3$.

   (c) i. There is an affine Bayes decision function, so the approximation error is 0.

       ii. Note that
           $$R(\hat{f}_n) = \mathbb{E}[(Y - c - dX)^2] = \mathbb{E}\big[\mathbb{E}[(Y - c - dX)^2 \mid X]\big] = \mathbb{E}\big[1 + ((a - c) + (b - d)X)^2\big] = 1 + (a - c)^2 + 100(b - d)^2/3,$$
           since $\mathbb{E}[X] = 0$ kills the cross term. Thus the estimation error is $(a - c)^2 + 100(b - d)^2/3$.
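As a quick numerical sanity check (not in the original notes), the sketch below estimates the three risks for one arbitrary choice of $a, b, c, d$ and compares them to the closed-form expressions; it assumes NumPy.

```python
import numpy as np

# Monte Carlo check of Problem 2 with arbitrary a, b, c, d.
rng = np.random.default_rng(0)
a, b, c, d = 2.0, 0.5, 1.5, 0.7
n = 2_000_000

x = rng.uniform(-10, 10, size=n)
y = a + b * x + rng.standard_normal(n)          # Y | X = x ~ N(a + bx, 1)

risk_bayes  = np.mean((y - (a + b * x)) ** 2)   # should be ~1
risk_const  = np.mean((y - a) ** 2)             # ~1 + 100 b^2 / 3
risk_affine = np.mean((y - (c + d * x)) ** 2)   # ~1 + (a-c)^2 + 100 (b-d)^2 / 3

print(risk_bayes, risk_const, risk_affine)
print(1 + 100 * b**2 / 3, 1 + (a - c)**2 + 100 * (b - d)**2 / 3)
```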

3. Try to best characterize each of the following in terms of one or more of optimization error, approximation error, and estimation error.

   (a) Overfitting.

   (b) Underfitting.

   (c) Precise empirical risk minimization for your hypothesis space is computationally intractable.

   (d) Not enough data.

   Solution:

   (a) High estimation error due to insufficient data relative to the complexity of your hypothesis space. Can be accompanied by low approximation error, indicating a complex hypothesis space.

   (b) High approximation error due to an overly simplistic hypothesis space. Can be accompanied by low estimation error due to the large amount of data relative to the (low) complexity of the hypothesis space.

   (c) Increased optimization error.

   (d) High estimation error.

4. (a) We sometimes look at $R(\hat{f}_n)$ as random, and other times as deterministic. What causes this difference?

   (b) True or False: Increasing the size of our hypothesis space can shift risk from approximation error to estimation error, but always leaves the quantity $R(\hat{f}_n) - R(f^*)$ constant.

   (c) True or False: Assume we treat our data set as a random sample and not a fixed quantity. Then the estimation error and the approximation error are random and not deterministic.

   (d) True or False: The empirical risk of the ERM, $\hat{R}_n(\hat{f}_n)$, is an unbiased estimator of the risk of the ERM, $R(\hat{f}_n)$.

   (e) In each of the following situations, there is an implicit sample space in which the given expectation is computed. Give that space.

       i. When we say the empirical risk $\hat{R}_n(f)$ is an unbiased estimator of the risk $R(f)$ (where $f$ is independent of the training data used to compute the empirical risk).

       ii. When we compute the expected empirical risk $\mathbb{E}[\hat{R}_n(\hat{f}_n)]$ (i.e., the outer expectation).

       iii. When we say the minibatch gradient is an unbiased estimator of the full training set gradient.

   Solution:

   (a) The quantity is random when we consider the training data as a random sample of size $n$. If we focus on a fixed set of training data then the quantity is deterministic.

   (b) False. Note that $\hat{f}_n$ depends on which hypothesis space you have chosen. As an example, imagine having an affine Bayes decision function, and changing the hypothesis space from the set of affine functions to the set of all decision functions. This can cause empirical risk minimization to overfit the training data, thus creating a sharp rise in $R(\hat{f}_n) - R(f^*)$.

   (c) False; approximation error is a deterministic quantity.

   (d) False. The empirical risk of the ERM will often be biased low. This is why we use a test set to approximate its true risk. The issue is that $\hat{f}_n$ depends on the training data, so
       $$\mathbb{E}\,\ell(\hat{f}_n(x_i), y_i) \neq \mathbb{E}\,\ell(\hat{f}_n(x), y),$$
       where $(x, y)$ is a new random draw from the data distribution that isn't in the training data.

   (e) i. The space of training sets (i.e., samples of size $n$ from the data generating distribution).

       ii. The space of training sets (i.e., samples of size $n$ from the data generating distribution).

       iii. The space of all minibatches chosen from the full training set (i.e., samples of the batch size from the empirical distribution on the full training set).
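Part 4(d) is easy to see empirically. The sketch below (not from the original notes) fits least squares, the ERM over affine functions, on many small training sets drawn from the distribution of Problem 2, and compares the average training error to the average error on fresh data; NumPy is assumed.

```python
import numpy as np

# Empirical illustration of 4(d): the ERM's training error is biased low.
rng = np.random.default_rng(0)
n, trials = 10, 5000
train_errs, test_errs = [], []

for _ in range(trials):
    x = rng.uniform(-10, 10, size=n)
    y = 2.0 + 0.5 * x + rng.standard_normal(n)
    A = np.column_stack([np.ones(n), x])
    w = np.linalg.lstsq(A, y, rcond=None)[0]        # ERM over affine functions
    train_errs.append(np.mean((A @ w - y) ** 2))

    x_new = rng.uniform(-10, 10, size=n)            # fresh draws from the same distribution
    y_new = 2.0 + 0.5 * x_new + rng.standard_normal(n)
    test_errs.append(np.mean((w[0] + w[1] * x_new - y_new) ** 2))

# Training error is systematically below the error on new data (about 0.8 vs 1.2 here,
# while the Bayes risk is 1), so it is not an unbiased estimate of R(f_hat_n).
print(np.mean(train_errs), np.mean(test_errs))
```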

5. For each of the following, use $\leq$, $\geq$, or $=$ to determine the relationship between the two quantities, or state that the relationship cannot be determined. Throughout, assume $\mathcal{F}_1, \mathcal{F}_2$ are hypothesis spaces with $\mathcal{F}_1 \subseteq \mathcal{F}_2$, and assume we are working with a fixed loss function $\ell$.

   (a) The estimation errors of two decision functions $f_1, f_2$ that minimize the empirical risk over the same hypothesis space, where $f_2$ uses 5 extra data points.

   (b) The approximation errors of the two decision functions $f_1, f_2$ that minimize risk with respect to $\mathcal{F}_1, \mathcal{F}_2$, respectively (i.e., $f_1 = f_{\mathcal{F}_1}$ and $f_2 = f_{\mathcal{F}_2}$).

   (c) The empirical risks of two decision functions $f_1, f_2$ that minimize the empirical risk over $\mathcal{F}_1, \mathcal{F}_2$, respectively. Both use the same fixed training data.

   (d) The estimation errors (for $\mathcal{F}_1, \mathcal{F}_2$, respectively) of two decision functions $f_1, f_2$ that minimize the empirical risk over $\mathcal{F}_1, \mathcal{F}_2$, respectively.

   (e) The risks of two decision functions $f_1, f_2$ that minimize the empirical risk over $\mathcal{F}_1, \mathcal{F}_2$, respectively.

   Solution:

   (a) Roughly speaking, more data is better, so we would tend to expect that $f_2$ will have lower estimation error. That said, this is not always the case, so the relationship cannot be determined.

   (b) The approximation error of $f_1$ will be at least as large ($\geq$): since $\mathcal{F}_1 \subseteq \mathcal{F}_2$, minimizing risk over the larger space can only do better.

   (c) The empirical risk of $f_1$ will be at least as large ($\geq$), by the same reasoning applied to the empirical risk.

   (d) Roughly speaking, enlarging the hypothesis space should increase the estimation error, since the approximation error will decrease and we expect to need more data. That said, this is not always the case, so the relationship cannot be determined.

   (e) Cannot be determined.
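To make (c) and (e) concrete, here is a small simulation (not part of the original notes) with nested hypothesis spaces chosen for illustration, $\mathcal{F}_1$ the degree-1 polynomials and $\mathcal{F}_2$ the degree-9 polynomials: the ERM over the larger space always has training error at most that of the smaller space, while its risk, estimated on fresh data, may be better or worse. NumPy is assumed.

```python
import numpy as np

# F1 = degree-1 polynomials, F2 = degree-9 polynomials (F1 is a subset of F2).
rng = np.random.default_rng(0)
n = 20
x = rng.uniform(-1, 1, size=n)
y = np.sin(3 * x) + 0.3 * rng.standard_normal(n)          # training data

f1 = np.polynomial.Polynomial.fit(x, y, deg=1)             # ERM over F1
f2 = np.polynomial.Polynomial.fit(x, y, deg=9)             # ERM over F2

train1, train2 = np.mean((f1(x) - y) ** 2), np.mean((f2(x) - y) ** 2)

x_test = rng.uniform(-1, 1, size=100_000)                  # fresh data to estimate risk
y_test = np.sin(3 * x_test) + 0.3 * rng.standard_normal(100_000)
risk1, risk2 = np.mean((f1(x_test) - y_test) ** 2), np.mean((f2(x_test) - y_test) ** 2)

print(train1, train2)   # part (c): train2 <= train1, since F2 contains F1
print(risk1, risk2)     # part (e): either ordering can occur
```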

6. In the excess risk decomposition lecture, we introduced the decision tree classifier spaces $\mathcal{F}$ (the space of all decision trees) and $\mathcal{F}_d$ (the space of decision trees of depth $d$) and went through some examples. The following questions are based on those slides. Recall that $P_X = \mathrm{Unif}([0,1]^2)$, $\mathcal{Y} = \{\text{blue}, \text{orange}\}$, orange occurs with probability .9 below the line $y = x$, and blue occurs with probability .9 above the line $y = x$.

   (a) Prove that the Bayes error rate is 0.1.

   (b) Is the Bayes decision function in $\mathcal{F}$?

   (c) For the hypothesis space $\mathcal{F}_3$ the slide states that $R(\hat{f}) = 0.176 \pm .004$ for $n = 1024$. Assuming you had access to the training code that produces $\hat{f}$ from a set of $n$ data points, and to random draws from the data generating distribution, give an algorithm (pseudocode) to compute (or estimate) the values 0.176 and .004.

   Solution:

   (a) Since the output space is discrete and we are using the 0-1 loss, our best prediction is the highest-probability output conditional on the input. By choosing orange below the line $y = x$ and blue above it, we predict incorrectly with probability .1 at every input, so the overall probability of error is .1. For the 0-1 loss the probability of error is the risk, so the Bayes error rate is 0.1.

   (b) No. Any decision tree in $\mathcal{F}$ has finite depth, and thus divides $[0,1]^2$ into a finite number of rectangles. Thus we cannot produce the decision boundary $y = x$ used by the Bayes decision function.

   (c) Pseudocode follows:

       i. Initialize $L$ to be an empty list of risks.

       ii. Repeat the following $M$ times for some sufficiently large $M$:

          A. Draw a random sample $(x_1, y_1), \dots, (x_n, y_n)$ from the data generating distribution.

          B. Obtain a decision function $\hat{f}$ by running our training algorithm on the generated sample.

          C. Draw a new random sample $(x_1', y_1'), \dots, (x_S', y_S')$ of size $S$, where $S$ is sufficiently large.

          D. Compute $e = |\{i \mid \hat{f}(x_i') \neq y_i'\}|$, that is, the number of times $\hat{f}$ is incorrect on our new sample.

          E. Add $e/S$ to the list $L$.

       iii. Compute the sample average and standard deviation of the values in $L$. Above, 0.176 would be the average and .004 would be the standard deviation.

       Instead of drawing the sample of size $S$, we could have computed the risk analytically. (A runnable sketch of this procedure is given below.)
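One possible realization of that pseudocode in Python, with scikit-learn's DecisionTreeClassifier standing in for the depth-3 training code (an assumption; the exact numbers depend on the training algorithm actually used for the slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def draw_sample(m):
    """Draw m points from the data generating distribution on [0,1]^2."""
    X = rng.uniform(size=(m, 2))
    below = X[:, 1] < X[:, 0]                          # point lies below the line y = x
    p_orange = np.where(below, 0.9, 0.1)               # orange w.p. .9 below, .1 above
    y = (rng.uniform(size=m) < p_orange).astype(int)   # 1 = orange, 0 = blue
    return X, y

n, S, M = 1024, 50_000, 100
risks = []
for _ in range(M):
    X_train, y_train = draw_sample(n)
    f_hat = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)   # stand-in for F_3 training
    X_test, y_test = draw_sample(S)
    risks.append(np.mean(f_hat.predict(X_test) != y_test))              # e / S

print(np.mean(risks), np.std(risks))   # compare with the slide's 0.176 and .004
```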

$L_1$ and $L_2$ Regularization

1. Consider the following two minimization problems:
   $$\operatorname*{arg\,min}_{w} \; \Omega(w) + \lambda \sum_{i=1}^{n} L(f_w(x_i), y_i)$$
   and
   $$\operatorname*{arg\,min}_{w} \; C\,\Omega(w) + \frac{1}{n}\sum_{i=1}^{n} L(f_w(x_i), y_i),$$
   where $\Omega(w)$ is the penalty function (for regularization) and $L$ is the loss function. Give sufficient conditions under which these two give the same minimizer.

   Solution: Let $C = \frac{1}{n\lambda}$. Then the two objectives differ by a constant factor (the second is $\frac{1}{n\lambda}$ times the first), so they have the same minimizers.

2. (*) Let $f : \mathbb{R}^n \to \mathbb{R}$ be a differentiable function. Prove that $\|\nabla f(x)\|_2 \leq L$ for all $x$ if and only if $f$ is Lipschitz with constant $L$.

   Solution: First suppose $\|\nabla f(x)\|_2 \leq L$ for some $L \geq 0$ and all $x \in \mathbb{R}^n$. By the mean value theorem we have, for any $x, y \in \mathbb{R}^n$,
   $$f(y) - f(x) = \nabla f(x + \xi(y - x))^T (y - x),$$
   where $\xi$ is some value between 0 and 1. Taking absolute values on each side we have
   $$|f(y) - f(x)| = |\nabla f(x + \xi(y - x))^T (y - x)| \leq \|\nabla f(x + \xi(y - x))\|_2\, \|y - x\|_2$$
   by Cauchy-Schwarz. Applying our bound on the gradient norm proves $f$ is Lipschitz with constant $L$.

   Conversely, suppose $f$ is Lipschitz with constant $L$. Note that
   $$\nabla f(x)^T v = f'(x; v) = \lim_{t \to 0^+} \frac{f(x + tv) - f(x)}{t} \leq \lim_{t \to 0^+} \frac{L \|tv\|_2}{t} = L \|v\|_2.$$
   Letting $v = \nabla f(x)$ we obtain $\|\nabla f(x)\|_2^2 \leq L \|\nabla f(x)\|_2$, giving the result.

3. (*) Let $\hat{w}$ denote the minimizer for
   $$\text{minimize}_{w} \;\; \|Xw - y\|_2^2 \quad \text{subject to} \;\; \|w\|_1 \leq r.$$
   Prove that $f(x) = \hat{w}^T x$ is Lipschitz with constant $r$.

   Solution: Note that $\|\hat{w}\|_2 \leq \|\hat{w}\|_1 \leq r$, so the argument from class gives the result. To see the inequality, note that
   $$\|w\|_1^2 = (|w_1| + \cdots + |w_n|)^2 \geq |w_1|^2 + \cdots + |w_n|^2 = \|w\|_2^2.$$

4. Two of the plots in the lecture slides use the fact that $\|\hat{\beta}\| / \|\beta\|$ is always between 0 and 1. Here $\hat{\beta}$ is the parameter vector of the linear model resulting from the regularized least squares problem, and, analogously, $\beta$ is the parameter vector from the unregularized problem. Why does this quotient lie in $[0, 1]$?

   Solution: We assume Ivanov regularization (since Tikhonov is equivalent). We know that
   $$\frac{1}{n}\sum_{i=1}^{n} (\beta^T x_i - y_i)^2 \leq \frac{1}{n}\sum_{i=1}^{n} (\hat{\beta}^T x_i - y_i)^2,$$
   since $\beta$ is the solution to the unconstrained minimization. But if $\|\beta\| \leq \|\hat{\beta}\|$, then $\beta$ is feasible for the regularized problem (its norm is at most that of $\hat{\beta}$, which satisfies the constraint), so $\hat{\beta} = \beta$. Thus in either case $\|\hat{\beta}\| \leq \|\beta\|$, and the quotient lies in $[0, 1]$.

5. Explain why feature normalization is important if you are using $L_1$ or $L_2$ regularization.

   Solution: Suppose you have a model $y = w^T x$ where $x_1$ is highly correlated with $y$, but the feature is measured in meters. Then $w_1 = 4$ would mean each increase in $x_1$ by 1 meter yields an increase in $y$ by 4. Now suppose we change the units of $x_1$ to kilometers. This would require us to change $w_1$ to 4000 to achieve the same decision function. While this has no effect on the loss $(y - w^T x)^2$, it has a significant effect on $\lambda \|w\|_2^2$ or $\lambda \|w\|_1$. For example, even if $x_2, \dots, x_n$ had very little relationship with $y$, we would still undervalue $w_1$ due to the regularization.
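As a quick illustration of this point (not from the original notes), the sketch below fits ridge regression to the same synthetic data twice, once with $x_1$ in meters and once in kilometers, and shows how the coefficient on the rescaled feature gets shrunk far below the value it would need; it assumes scikit-learn's Ridge, and the data and penalty strength are made up for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Effect of feature scale on L2 regularization: same data, x1 in meters vs. kilometers.
rng = np.random.default_rng(0)
n = 200
x1_m = rng.uniform(0, 10, size=n)                  # informative feature, measured in meters
x2 = rng.standard_normal(n)                        # uninformative feature
y = 4 * x1_m + 0.1 * rng.standard_normal(n)        # y increases by 4 per meter of x1

X_meters = np.column_stack([x1_m, x2])
X_km     = np.column_stack([x1_m / 1000, x2])      # same feature, now in kilometers

lam = 1.0
print(Ridge(alpha=lam).fit(X_meters, y).coef_)     # w1 stays close to 4
print(Ridge(alpha=lam).fit(X_km, y).coef_)         # w1 should be ~4000 but is shrunk heavily
```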