Machine Learning Brett Bernstein


Week 1 Lecture: Concept Check Exercises

Starred problems are optional.

Statistical Learning Theory

1. Suppose $\mathcal{A} = \mathcal{Y} = \mathbb{R}$ and $\mathcal{X}$ is some other set. Furthermore, assume $P_{\mathcal{X} \times \mathcal{Y}}$ is a discrete joint distribution. Compute a Bayes decision function when the loss function $\ell : \mathcal{A} \times \mathcal{Y} \to \mathbb{R}$ is given by $\ell(a, y) = \mathbf{1}(a \neq y)$, the 0-1 loss.

Solution. The Bayes decision function $f^*$ satisfies
$$f^* = \arg\min_f R(f) = \arg\min_f \mathbb{E}[\mathbf{1}(f(X) \neq Y)] = \arg\min_f P(f(X) \neq Y),$$
where $(X, Y) \sim P_{\mathcal{X} \times \mathcal{Y}}$. Let
$$f^*(x) = \arg\max_y P(Y = y \mid X = x),$$
the maximum a posteriori estimate of $Y$. If there is a tie, we choose any of the maximizers. If $f_2$ is another decision function we have
$$\begin{aligned}
P(f^*(X) \neq Y) &= \sum_x P(f^*(x) \neq Y \mid X = x)\, P(X = x)\\
&= \sum_x \bigl(1 - P(f^*(x) = Y \mid X = x)\bigr)\, P(X = x)\\
&\leq \sum_x \bigl(1 - P(f_2(x) = Y \mid X = x)\bigr)\, P(X = x) \qquad \text{(definition of } f^*\text{)}\\
&= \sum_x P(f_2(x) \neq Y \mid X = x)\, P(X = x)\\
&= P(f_2(X) \neq Y).
\end{aligned}$$
Thus $f^*$ is a Bayes decision function.

2. ($\star$) Suppose $\mathcal{A} = \mathcal{Y} = \mathbb{R}$, $\mathcal{X}$ is some other set, and $\ell : \mathcal{A} \times \mathcal{Y} \to \mathbb{R}$ is given by $\ell(a, y) = (a - y)^2$, the square error loss. What is the Bayes risk and how does it compare with the variance of $Y$?

Solution. From Homework we know that the Bayes decision function is given by $f^*(x) = \mathbb{E}[Y \mid X = x]$. Thus the Bayes risk is given by
$$\mathbb{E}[(f^*(X) - Y)^2] = \mathbb{E}[(\mathbb{E}[Y \mid X] - Y)^2] = \mathbb{E}\bigl[\mathbb{E}[(\mathbb{E}[Y \mid X] - Y)^2 \mid X]\bigr] = \mathbb{E}[\operatorname{Var}(Y \mid X)],$$
where we applied the law of iterated expectations. The law of total variance states that
$$\operatorname{Var}(Y) = \mathbb{E}[\operatorname{Var}(Y \mid X)] + \operatorname{Var}[\mathbb{E}(Y \mid X)].$$
This proves the Bayes risk satisfies
$$\mathbb{E}[\operatorname{Var}(Y \mid X)] = \operatorname{Var}(Y) - \operatorname{Var}[\mathbb{E}(Y \mid X)] \leq \operatorname{Var}(Y).$$
Recall from Homework that $\operatorname{Var}(Y)$ is the Bayes risk when we estimate $Y$ without any input $X$. This shows that using $X$ in our estimation reduces the Bayes risk, and that the improvement is measured by $\operatorname{Var}[\mathbb{E}(Y \mid X)]$. As a sanity check, note that if $X, Y$ are independent then $\mathbb{E}(Y \mid X) = \mathbb{E}(Y)$, so $\operatorname{Var}[\mathbb{E}(Y \mid X)] = 0$. If $X = Y$ then $\mathbb{E}(Y \mid X) = Y$ and $\operatorname{Var}[\mathbb{E}(Y \mid X)] = \operatorname{Var}(Y)$. The prominent role of variance in our analysis above is due to the fact that we are using the square loss.
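As a numerical sanity check of problem 2, the following minimal NumPy sketch (illustrative only; it borrows the joint distribution from problem 3 below purely for concreteness) estimates the squared-loss Bayes risk by Monte Carlo and compares it with $\operatorname{Var}(Y) - \operatorname{Var}[\mathbb{E}(Y \mid X)]$.

```python
import numpy as np

# Monte Carlo sanity check of problem 2: for the square loss, the Bayes risk
# E[Var(Y|X)] should match Var(Y) - Var(E[Y|X]) by the law of total variance.
# Toy joint distribution (same as problem 3): X ~ Unif{1,...,10}, Y|X=x ~ Unif{1,...,x}.
rng = np.random.default_rng(0)
n = 1_000_000

x = rng.integers(1, 11, size=n)
y = rng.integers(1, x + 1)                   # Y | X = x is uniform on {1, ..., x}

f_star = (x + 1) / 2                         # Bayes prediction E[Y | X = x]
bayes_risk = np.mean((f_star - y) ** 2)      # estimates E[Var(Y | X)]
print(bayes_risk, np.var(y) - np.var(f_star))  # the two values should agree closely
```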

3. Let $\mathcal{X} = \{1, \ldots, 10\}$, let $\mathcal{Y} = \{1, \ldots, 10\}$, and let $\mathcal{A} = \mathcal{Y}$. Suppose the data generating distribution, $P$, has marginal $X \sim \operatorname{Unif}\{1, \ldots, 10\}$ and conditional distribution $Y \mid X = x \sim \operatorname{Unif}\{1, \ldots, x\}$. For each loss function below give a Bayes decision function.

(a) $\ell(a, y) = (a - y)^2$,
(b) $\ell(a, y) = |a - y|$,
(c) $\ell(a, y) = \mathbf{1}(a \neq y)$.

Solution.

(a) From Homework we know that $f^*(x) = \mathbb{E}[Y \mid X = x] = (x + 1)/2$.

(b) From Homework, we know that $f^*(x)$ is the conditional median of $Y$ given $X = x$. If $x$ is odd, then $f^*(x) = (x + 1)/2$. If $x$ is even, then we can choose any value in the interval $\left[\tfrac{x}{2}, \tfrac{x}{2} + 1\right]$.

(c) From question 1 above, we know that $f^*(x) = \arg\max_y P(Y = y \mid X = x)$. Thus we can choose any integer between $1$ and $x$, inclusive, for $f^*(x)$.

4. Show that the empirical risk is an unbiased and consistent estimator of the Bayes risk. You may assume the Bayes risk is finite.

Solution. We assume a given loss function $\ell$, a fixed decision function $f$, and an i.i.d. sample $(x_1, y_1), \ldots, (x_n, y_n)$. To show the empirical risk is unbiased, note that
$$\begin{aligned}
\mathbb{E}[\hat{R}_n(f)] &= \mathbb{E}\left[\frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i)\right]\\
&= \frac{1}{n} \sum_{i=1}^n \mathbb{E}[\ell(f(x_i), y_i)] \qquad \text{(linearity of } \mathbb{E}\text{)}\\
&= \mathbb{E}[\ell(f(x_1), y_1)] \qquad \text{(i.i.d.)}\\
&= R(f).
\end{aligned}$$
For consistency, we must show that as $n \to \infty$ we have $\hat{R}_n(f) \to R(f)$ with probability 1. Letting $z_i = \ell(f(x_i), y_i)$, we see that the $z_i$ are i.i.d. with finite mean. Thus consistency follows by applying the strong law of large numbers. Taking $f$ to be a Bayes decision function $f^*$, so that $R(f^*)$ is the Bayes risk, gives the stated claim.
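The closed forms in problem 3 are easy to confirm numerically. Here is a small brute-force sketch (the action grid and loop below are just one way to set this up, chosen for illustration) that minimizes each conditional expected loss directly.

```python
import numpy as np

# Brute-force check of problem 3: for each x, Y | X = x is uniform on {1,...,x},
# so we can minimize the conditional expected loss over candidate actions.
actions = np.linspace(0, 11, 2201)                  # grid of candidate actions a

for x in range(1, 11):
    ys = np.arange(1, x + 1)                        # support of Y | X = x (each prob 1/x)
    sq = [np.mean((a - ys) ** 2) for a in actions]  # square loss
    ab = [np.mean(np.abs(a - ys)) for a in actions] # absolute loss
    zo = [np.mean(a != ys) for a in range(1, 11)]   # 0-1 loss over integer actions
    print(x,
          actions[np.argmin(sq)],                   # should equal (x + 1) / 2
          actions[np.argmin(ab)],                   # a conditional median of Unif{1,...,x}
          1 + int(np.argmin(zo)))                   # any integer in {1,...,x} is optimal
```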

5. Let $\mathcal{X} = [0, 1]$ and $\mathcal{Y} = \mathcal{A} = \mathbb{R}$. Suppose you receive the $(x, y)$ data points $(0, 5)$, $(0.2, 3)$, $(0.37, 4.2)$, $(0.9, 3)$, $(1, 5)$. Throughout assume we are using the 0-1 loss.

(a) Suppose we restrict our decision functions to the hypothesis space $\mathcal{F}_1$ of constant functions. Give a decision function that minimizes the empirical risk over $\mathcal{F}_1$ and the corresponding empirical risk. Is the empirical risk minimizing function unique?

(b) Suppose we restrict our decision functions to the hypothesis space $\mathcal{F}_2$ of piecewise-constant functions with at most 1 discontinuity. Give a decision function that minimizes the empirical risk over $\mathcal{F}_2$ and the corresponding empirical risk. Is the empirical risk minimizing function unique?

Solution.

(a) We can let $\hat{f}(x) = 5$ or $\hat{f}(x) = 3$ and obtain the minimal empirical risk of $3/5$. Thus the empirical risk minimizer is not unique.

(b) One solution is to let $\hat{f}(x) = 5$ for $x \in [0, 0.1]$ and $\hat{f}(x) = 3$ for $x \in (0.1, 1]$, giving an empirical risk of $2/5$. There are uncountably many empirical risk minimizers, so again we do not have uniqueness.
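As a quick check of problem 5, the short sketch below (illustrative only; the lambdas are just one way to encode the candidate minimizers) evaluates the empirical 0-1 risk of the functions found above.

```python
import numpy as np

# Empirical 0-1 risks of the minimizers found in problem 5.
x = np.array([0.0, 0.2, 0.37, 0.9, 1.0])
y = np.array([5.0, 3.0, 4.2, 3.0, 5.0])

def emp_risk(f):
    """Empirical risk under the 0-1 loss for a decision function f."""
    return np.mean(f(x) != y)

const_5   = lambda t: np.full_like(t, 5.0)            # constant function 5
const_3   = lambda t: np.full_like(t, 3.0)            # constant function 3
piecewise = lambda t: np.where(t <= 0.1, 5.0, 3.0)    # single discontinuity at x = 0.1

print(emp_risk(const_5), emp_risk(const_3), emp_risk(piecewise))   # 0.6, 0.6, 0.4
```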

6. ($\star$) Let $\mathcal{X} = [-10, 10]$, $\mathcal{Y} = \mathcal{A} = \mathbb{R}$ and suppose the data generating distribution has marginal distribution $X \sim \operatorname{Unif}[-10, 10]$ and conditional distribution $Y \mid X = x \sim \mathcal{N}(a + bx, 1)$ for some fixed $a, b \in \mathbb{R}$. Suppose you are also given the following data points: $(0, 1)$, $(0, 2)$, $(1, 3)$, $(2.5, 3.1)$, $(-4, -2.1)$.

(a) Assuming the 0-1 loss, what is the Bayes risk?

(b) Assuming the square error loss $\ell(a, y) = (a - y)^2$, what is the Bayes risk?

(c) Using the full hypothesis space of all (measurable) functions, what is the minimum achievable empirical risk for the square error loss?

(d) Using the hypothesis space of all affine functions (i.e., of the form $f(x) = cx + d$ for some $c, d \in \mathbb{R}$), what is the minimum achievable empirical risk for the square error loss?

(e) Using the hypothesis space of all quadratic functions (i.e., of the form $f(x) = cx^2 + dx + e$ for some $c, d, e \in \mathbb{R}$), what is the minimum achievable empirical risk for the square error loss?

Solution.

(a) For any decision function $f$ the risk is given by
$$\mathbb{E}[\mathbf{1}(f(X) \neq Y)] = P(f(X) \neq Y) = 1 - P(f(X) = Y) = 1.$$
To see this note that
$$P(f(X) = Y) = \frac{1}{20\sqrt{2\pi}} \int_{-10}^{10} \int_{-\infty}^{\infty} \mathbf{1}(f(x) = y)\, e^{-(y - a - bx)^2/2} \, dy \, dx = \frac{1}{20\sqrt{2\pi}} \int_{-10}^{10} 0 \, dx = 0.$$
Thus every decision function is a Bayes decision function, and the Bayes risk is 1.

(b) By problem 2 above we know the Bayes risk is given by
$$\mathbb{E}[\operatorname{Var}(Y \mid X)] = \mathbb{E}[1] = 1,$$
since $\operatorname{Var}(Y \mid X = x) = 1$.

(c) We choose $\hat{f}$ such that
$$\hat{f}(0) = 1.5, \quad \hat{f}(1) = 3, \quad \hat{f}(2.5) = 3.1, \quad \hat{f}(-4) = -2.1,$$
and $\hat{f}(x) = 0$ otherwise. Then we achieve the minimum empirical risk of $1/10$.

(d) Letting
$$A = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 1 \\ 1 & 2.5 \\ 1 & -4 \end{pmatrix}, \qquad y = \begin{pmatrix} 1 \\ 2 \\ 3 \\ 3.1 \\ -2.1 \end{pmatrix},$$
we obtain (using a computer)
$$\hat{w} = \begin{pmatrix} \hat{d} \\ \hat{c} \end{pmatrix} = (A^T A)^{-1} A^T y \approx \begin{pmatrix} 1.4856 \\ 0.8556 \end{pmatrix}.$$
This gives
$$\hat{R}_5(\hat{f}) = \frac{1}{5} \|A\hat{w} - y\|_2^2 \approx 0.2473.$$
[Aside: In general, to solve systems like the one above on a computer you shouldn't actually invert the matrix $A^T A$, but use something like w=A\y in Matlab, which performs a QR factorization of $A$.]

(e) Letting
$$A = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 2.5 & 6.25 \\ 1 & -4 & 16 \end{pmatrix}, \qquad y = \begin{pmatrix} 1 \\ 2 \\ 3 \\ 3.1 \\ -2.1 \end{pmatrix},$$
we obtain (using a computer)
$$\hat{w} = \begin{pmatrix} \hat{e} \\ \hat{d} \\ \hat{c} \end{pmatrix} = (A^T A)^{-1} A^T y \approx \begin{pmatrix} 1.7175 \\ 0.7545 \\ -0.0521 \end{pmatrix}.$$
This gives
$$\hat{R}_5(\hat{f}) = \frac{1}{5} \|A\hat{w} - y\|_2^2 \approx 0.1928.$$
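The numbers in (d) and (e) are straightforward to reproduce. The NumPy sketch below (an illustration; np.linalg.lstsq plays the role of the Matlab backslash mentioned in the aside) fits both models and prints the coefficients and empirical risks.

```python
import numpy as np

# Reproduce parts (d) and (e): least-squares fits of an affine and a quadratic model.
# Following the aside, we use a least-squares solver instead of forming (A^T A)^{-1}.
x = np.array([0.0, 0.0, 1.0, 2.5, -4.0])
y = np.array([1.0, 2.0, 3.0, 3.1, -2.1])

# (d) affine model f(x) = c x + d, design matrix columns [1, x], w = (d, c)
A1 = np.column_stack([np.ones_like(x), x])
w1, *_ = np.linalg.lstsq(A1, y, rcond=None)
print(w1, np.mean((A1 @ w1 - y) ** 2))     # approx [1.4856, 0.8556] and 0.2473

# (e) quadratic model f(x) = c x^2 + d x + e, columns [1, x, x^2], w = (e, d, c)
A2 = np.column_stack([np.ones_like(x), x, x ** 2])
w2, *_ = np.linalg.lstsq(A2, y, rcond=None)
print(w2, np.mean((A2 @ w2 - y) ** 2))     # approx [1.7175, 0.7545, -0.0521] and 0.1928
```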

Stochastic Gradient Descent

1. When performing mini-batch gradient descent, we often randomly choose the mini-batch from the full training set without replacement. Show that the resulting mini-batch gradient is an unbiased estimate of the gradient of the full training set. Here we assume each decision function $f_w$ in our hypothesis space is determined by a parameter vector $w \in \mathbb{R}^d$.

Solution. Let $(x_{m_1}, y_{m_1}), \ldots, (x_{m_n}, y_{m_n})$ be our mini-batch selected uniformly without replacement from the full training set $(x_1, y_1), \ldots, (x_N, y_N)$. Then
$$\begin{aligned}
\mathbb{E}\left[\frac{1}{n} \sum_{i=1}^n \nabla_w \ell(f_w(x_{m_i}), y_{m_i})\right]
&= \frac{1}{n} \sum_{i=1}^n \mathbb{E}\left[\nabla_w \ell(f_w(x_{m_i}), y_{m_i})\right] && \text{(linearity of } \mathbb{E}\text{)}\\
&= \frac{1}{n} \sum_{i=1}^n \mathbb{E}\left[\nabla_w \ell(f_w(x_{m_1}), y_{m_1})\right] && \text{(marginals are the same)}\\
&= \mathbb{E}\left[\nabla_w \ell(f_w(x_{m_1}), y_{m_1})\right]\\
&= \frac{1}{N} \sum_{i=1}^N \nabla_w \ell(f_w(x_i), y_i)\\
&= \nabla_w \left[\frac{1}{N} \sum_{i=1}^N \ell(f_w(x_i), y_i)\right] && \text{(linearity of } \nabla\text{)}.
\end{aligned}$$

2. You want to estimate the average age of the people visiting your website. Over a fixed week we will receive a total of $N$ visitors (which we will call our full population). Suppose the population mean $\mu$ is unknown but the variance $\sigma^2$ is known. Since we don't want to bother every visitor, we will ask a small sample what their ages are. How many visitors must we randomly sample so that our estimator $\hat{\mu}$ has variance at most $\epsilon > 0$?

Solution. Let $x_1, \ldots, x_n$ denote our randomly sampled ages, and let $\hat{x}$ denote the sample mean $\frac{1}{n} \sum_{i=1}^n x_i$. Then
$$\operatorname{Var}(\hat{x}) = \frac{\sigma^2}{n}.$$
Thus we require $n \geq \sigma^2/\epsilon$. Note that this doesn't depend on $N$, the full population size.
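To see problem 1 in action, here is a minimal NumPy sketch (the linear model, squared loss, and random data below are arbitrary choices made purely for illustration) that averages many without-replacement mini-batch gradients and compares the result to the full-batch gradient.

```python
import numpy as np

# Empirical check of problem 1: averaging many without-replacement mini-batch
# gradients recovers the full-batch gradient. As a concrete example we use the
# squared loss l(f_w(x), y) = (w^T x - y)^2, whose gradient is 2 (w^T x - y) x.
rng = np.random.default_rng(0)
N, d, n_batch = 1000, 5, 20
X = rng.normal(size=(N, d))
y = rng.normal(size=N)
w = rng.normal(size=d)

def avg_grad(idx):
    """Average gradient of the squared loss over the examples indexed by idx."""
    r = X[idx] @ w - y[idx]
    return 2 * (X[idx].T @ r) / len(idx)

full_grad = avg_grad(np.arange(N))
mini_grads = [avg_grad(rng.choice(N, size=n_batch, replace=False))
              for _ in range(20000)]
print(full_grad)
print(np.mean(mini_grads, axis=0))    # should be close to full_grad
```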

3. ($\star$) Suppose you have been successfully running mini-batch gradient descent with a full training set size of $10^5$ and a mini-batch size of $100$. After receiving more data your full training set size increases to $10^9$. Give a heuristic argument as to why the mini-batch size need not increase even though we have $10{,}000$ times more data.

Solution. Throughout we assume our gradient lies in $\mathbb{R}^d$. Consider the empirical distribution on the full training set (i.e., each sample is chosen with probability $1/N$ where $N$ is the full training set size). Assume this distribution has mean vector $\mu \in \mathbb{R}^d$ (the full-batch gradient) and covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$. By the central limit theorem the mini-batch gradient will be approximately normally distributed with mean $\mu$ and covariance $\frac{1}{n}\Sigma$, where $n$ is the mini-batch size. As $N$ grows the entries of $\Sigma$ need not grow, and thus $n$ need not grow. In fact, as $N$ grows, the empirical mean and covariance matrix will converge to their true values. More precisely, the mean of the empirical distribution will converge to $\mathbb{E}[\nabla \ell(f(X), Y)]$ and the covariance will converge to
$$\mathbb{E}\bigl[(\nabla \ell(f(X), Y))(\nabla \ell(f(X), Y))^T\bigr] - \mathbb{E}[\nabla \ell(f(X), Y)]\,\mathbb{E}[\nabla \ell(f(X), Y)]^T,$$
where $(X, Y) \sim P_{\mathcal{X} \times \mathcal{Y}}$. The important takeaway here is that the size of the mini-batch depends on the speed of computation and on the characteristics of the distribution of the gradients (such as their moments), and thus may vary independently of the size of the full training set.
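A small simulation can make the heuristic concrete. In the sketch below (one-dimensional stand-ins for per-example gradients, drawn at random purely as an assumption for illustration, with $10^9$ scaled down to keep memory modest) the spread of the size-100 mini-batch mean is governed by $\sigma/\sqrt{100}$ and is essentially unaffected by $N$.

```python
import numpy as np

# Illustration of problem 3: the spread of a size-100 mini-batch mean is about
# sigma / sqrt(100), essentially independently of the training set size N.
# (Mini-batches are drawn with replacement here for simplicity; for N >> 100
# this is nearly identical to sampling without replacement.)
rng = np.random.default_rng(0)
n_batch, trials = 100, 5000

for N in (10**5, 10**7):                      # 10**9 scaled down to keep memory modest
    data = rng.standard_normal(N)             # stand-in for per-example scalar gradients
    idx = rng.integers(0, N, size=(trials, n_batch))
    batch_means = data[idx].mean(axis=1)
    print(N, batch_means.std(), data.std() / np.sqrt(n_batch))
```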