NYU Center for Data Science: DS-GA 1003 Machine Learning and Computational Statistics (Spring 2018)

Brett Bernstein, David Rosenberg, Ben Jakubowski

January 20, 2018

(Brett authored these concept checks for Spring 2017 DS-GA 1003, and the work is almost entirely his. Later (minor) modifications were made by David Rosenberg and Ben Jakubowski.)

Instructions: Following most lab and lecture sections, we will be providing concept checks for review. Each concept check will:

- List the lab/lecture learning objectives. You will be responsible for mastering these objectives, and for demonstrating mastery through homework assignments, exams (midterm and final), and the final course project.

- Include concept check questions. These questions are intended to reinforce the lab/lectures and to help you master the learning objectives. You are strongly encouraged to complete all concept check questions, and to discuss these (and related) problems on Piazza and at office hours. However, problems marked with a (*) are considered optional.

Lecture 1: Introduction to Statistical Learning Theory

Topic 1: Statistical Learning Theory

Learning Objectives

1. Identify the input, action, and outcome spaces for a given machine learning problem.

2. Provide an example for which the action space and outcome space are the same, and one for which they are different.

3. Explain the relationships between the decision function, the loss function, the input space, the action space, and the outcome space.

4. Define the risk of a decision function and a Bayes decision function.

5. Provide example decision problems for which the Bayes risk is 0 and for which the Bayes risk is nonzero.

6. Know the Bayes decision functions for square loss and multiclass 0/1 loss.

7. Define the empirical risk for a decision function and the empirical risk minimizer.

8. Explain what a hypothesis space is, and how it can be used with constrained empirical risk minimization to control overfitting.

Concept Check Questions

1. Suppose $\mathcal{A} = \mathcal{Y} = \mathbb{R}$ and $\mathcal{X}$ is some other set. Furthermore, assume $P_{\mathcal{X} \times \mathcal{Y}}$ is a discrete joint distribution. Compute a Bayes decision function when the loss function $\ell : \mathcal{A} \times \mathcal{Y} \to \mathbb{R}$ is given by $\ell(a, y) = 1(a \neq y)$, the 0-1 loss.

Solution. The Bayes decision function $f^*$ satisfies $f^* = \arg\min_f R(f)$, where
$$R(f) = E[1(f(X) \neq Y)] = P(f(X) \neq Y)$$
and $(X, Y) \sim P_{\mathcal{X} \times \mathcal{Y}}$. Let
$$f^*(x) = \arg\max_y P(Y = y \mid X = x),$$
the maximum a posteriori estimate of $Y$. If there is a tie, we choose any of the maximizers. If $f_2$ is another decision function, we have
$$\begin{aligned}
P(f^*(X) \neq Y) &= \sum_x P(f^*(x) \neq Y \mid X = x)\, P(X = x) \\
&= \sum_x \left(1 - P(f^*(x) = Y \mid X = x)\right) P(X = x) \\
&\leq \sum_x \left(1 - P(f_2(x) = Y \mid X = x)\right) P(X = x) \qquad \text{(def. of $f^*$)} \\
&= \sum_x P(f_2(x) \neq Y \mid X = x)\, P(X = x) \\
&= P(f_2(X) \neq Y).
\end{aligned}$$
Thus $f^*$ is a Bayes decision function.
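To make question 1 concrete, here is a minimal Python sketch (added for review; not part of the original solutions). The joint table p_xy is a made-up toy example; the function computes the MAP rule and its risk by direct enumeration.

    # Bayes decision function under 0-1 loss for a discrete joint distribution.
    # The joint table p_xy is a hypothetical example, not course data.
    p_xy = {
        # (x, y): P(X = x, Y = y)
        (0, 0): 0.30, (0, 1): 0.10,
        (1, 0): 0.15, (1, 1): 0.45,
    }

    def bayes_decision(x):
        """Return arg max_y P(Y = y | X = x); ties broken arbitrarily."""
        # P(Y = y | X = x) is proportional to P(X = x, Y = y), so we may
        # maximize the joint probability over y without normalizing.
        ys = {y for (xx, y) in p_xy if xx == x}
        return max(ys, key=lambda y: p_xy[(x, y)])

    # Risk of the MAP rule: P(f*(X) != Y).
    risk = sum(p for (x, y), p in p_xy.items() if bayes_decision(x) != y)
    print(bayes_decision(0), bayes_decision(1), risk)  # 0 1 0.25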

2. (*) Suppose $\mathcal{A} = \mathcal{Y} = \mathbb{R}$, $\mathcal{X}$ is some other set, and $\ell : \mathcal{A} \times \mathcal{Y} \to \mathbb{R}$ is given by $\ell(a, y) = (a - y)^2$, the square error loss. What is the Bayes risk and how does it compare with the variance of $Y$?

Solution. From Homework 1 we know that the Bayes decision function is given by $f^*(x) = E[Y \mid X = x]$. Thus the Bayes risk is given by
$$E[(f^*(X) - Y)^2] = E[(E[Y \mid X] - Y)^2] = E\left[E[(E[Y \mid X] - Y)^2 \mid X]\right] = E[\operatorname{Var}(Y \mid X)],$$
where we applied the law of iterated expectations. The law of total variance states that
$$\operatorname{Var}(Y) = E[\operatorname{Var}(Y \mid X)] + \operatorname{Var}[E(Y \mid X)].$$
This proves the Bayes risk satisfies
$$E[\operatorname{Var}(Y \mid X)] = \operatorname{Var}(Y) - \operatorname{Var}[E(Y \mid X)] \leq \operatorname{Var}(Y).$$
Recall from Homework 1 that $\operatorname{Var}(Y)$ is the Bayes risk when we estimate $Y$ without any input $X$. This shows that using $X$ in our estimation reduces the Bayes risk, and that the improvement is measured by $\operatorname{Var}[E(Y \mid X)]$. As a sanity check, note that if $X, Y$ are independent then $E(Y \mid X) = E(Y)$, so $\operatorname{Var}[E(Y \mid X)] = 0$. If $X = Y$ then $E(Y \mid X) = Y$ and $\operatorname{Var}[E(Y \mid X)] = \operatorname{Var}(Y)$. The prominent role of variance in our analysis above is due to the fact that we are using the square loss.

3. Let $\mathcal{X} = \{1, \ldots, 10\}$, let $\mathcal{Y} = \{1, \ldots, 10\}$, and let $\mathcal{A} = \mathcal{Y}$. Suppose the data generating distribution, $P$, has marginal $X \sim \operatorname{Unif}\{1, \ldots, 10\}$ and conditional distribution $Y \mid X = x \sim \operatorname{Unif}\{1, \ldots, x\}$. For each loss function below, give a Bayes decision function.

(a) $\ell(a, y) = (a - y)^2$,
(b) $\ell(a, y) = |a - y|$,
(c) $\ell(a, y) = 1(a \neq y)$.

Solution.
(a) From Homework 1 we know that $f^*(x) = E[Y \mid X = x] = (x + 1)/2$.
(b) From Homework 1, we know that $f^*(x)$ is the conditional median of $Y$ given $X = x$. If $x$ is odd, then $f^*(x) = (x + 1)/2$. If $x$ is even, then we can choose any value in the interval $\left[\frac{x}{2}, \frac{x}{2} + 1\right]$.
(c) From question 1 above, we know that $f^*(x) = \arg\max_y P(Y = y \mid X = x)$. Since $Y \mid X = x$ is uniform on $\{1, \ldots, x\}$, every such value ties for the maximum, so we can choose any integer between $1$ and $x$, inclusive, for $f^*(x)$.
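The mean/median/mode characterizations in question 3 can be checked by brute force. The sketch below (added for review; not part of the original solutions) scans a grid of candidate actions and minimizes the expected loss under $Y \mid X = x \sim \operatorname{Unif}\{1, \ldots, x\}$ directly.

    # Brute-force check of question 3: for each x, find the actions minimizing
    # E[loss(a, Y) | X = x] over a half-integer grid of candidate actions.
    def best_actions(x, loss):
        actions = [a / 2 for a in range(2, 21)]  # 1.0, 1.5, ..., 10.0
        exp_loss = {a: sum(loss(a, y) for y in range(1, x + 1)) / x
                    for a in actions}
        m = min(exp_loss.values())
        return [a for a in actions if abs(exp_loss[a] - m) < 1e-12]

    for x in [1, 2, 3, 4, 10]:
        print(x,
              best_actions(x, lambda a, y: (a - y) ** 2),   # mean: (x + 1) / 2
              best_actions(x, lambda a, y: abs(a - y)),     # median (interval)
              best_actions(x, lambda a, y: float(a != y)))  # mode: any of 1..x
    # e.g. x = 4 gives [2.5], [2.0, 2.5, 3.0], [1.0, 2.0, 3.0, 4.0].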

4. Show that the empirical risk is an unbiased and consistent estimator of the Bayes risk. You may assume the Bayes risk is finite.

Solution. We assume a given loss function $\ell$ and an i.i.d. sample $(x_1, y_1), \ldots, (x_n, y_n)$.

To show it is unbiased, note that
$$\begin{aligned}
E[\hat{R}_n(f)] &= E\left[\frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i)\right] \\
&= \frac{1}{n} \sum_{i=1}^n E[\ell(f(x_i), y_i)] \qquad \text{(linearity of $E$)} \\
&= E[\ell(f(x_1), y_1)] \qquad \text{(i.i.d.)} \\
&= R(f).
\end{aligned}$$
For consistency, we must show that as $n \to \infty$ we have $\hat{R}_n(f) \to R(f)$ with probability 1. Letting $z_i = \ell(f(x_i), y_i)$, we see that the $z_i$ are i.i.d. with finite mean. Thus consistency follows by applying the strong law of large numbers.

5. Let $\mathcal{X} = [0, 1]$ and $\mathcal{Y} = \mathcal{A} = \mathbb{R}$. Suppose you receive the $(x, y)$ data points $(0, 5)$, $(.2, 3)$, $(.37, 4.2)$, $(.9, 3)$, $(1, 5)$. Throughout assume we are using the 0-1 loss.

(a) Suppose we restrict our decision functions to the hypothesis space $\mathcal{F}_1$ of constant functions. Give a decision function that minimizes the empirical risk over $\mathcal{F}_1$ and the corresponding empirical risk. Is the empirical risk minimizing function unique?

(b) Suppose we restrict our decision functions to the hypothesis space $\mathcal{F}_2$ of piecewise-constant functions with at most 1 discontinuity. Give a decision function that minimizes the empirical risk over $\mathcal{F}_2$ and the corresponding empirical risk. Is the empirical risk minimizing function unique?

Solution.
(a) We can let $\hat{f}(x) = 5$ or $\hat{f}(x) = 3$ and obtain the minimal empirical risk of $3/5$. Thus the empirical risk minimizer is not unique.
(b) One solution is to let $\hat{f}(x) = 5$ for $x \in [0, .1]$ and $\hat{f}(x) = 3$ for $x \in (.1, 1]$, giving an empirical risk of $2/5$. There are uncountably many empirical risk minimizers, so again we do not have uniqueness.
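As a small illustration of question 4 (added for review; the distribution and decision function here are made-up assumptions, not from the original), the empirical risk of a fixed decision function converges to its risk as the sample grows:

    # Illustration of question 4: with X ~ Unif[0, 1], Y | X = x ~ N(x, 1),
    # and f(x) = E[Y | X = x] = x, the risk under square loss is
    # E[Var(Y | X)] = 1, and the empirical risk should approach it.
    import random

    random.seed(0)

    def empirical_risk(n):
        total = 0.0
        for _ in range(n):
            x = random.random()            # X ~ Unif[0, 1]
            y = random.gauss(x, 1.0)       # Y | X = x ~ N(x, 1)
            total += (x - y) ** 2          # square loss of f(x) = x
        return total / n

    for n in [10, 1000, 100000]:
        print(n, empirical_risk(n))        # approaches 1 as n grows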

6. (*) Let $\mathcal{X} = [-10, 10]$, $\mathcal{Y} = \mathcal{A} = \mathbb{R}$, and suppose the data generating distribution has marginal distribution $X \sim \operatorname{Unif}[-10, 10]$ and conditional distribution $Y \mid X = x \sim \mathcal{N}(a + bx, 1)$ for some fixed $a, b \in \mathbb{R}$. Suppose you are also given the following data points: $(0, 1)$, $(0, 2)$, $(1, 3)$, $(2.5, 3.1)$, $(-4, -2.1)$.

(a) Assuming the 0-1 loss, what is the Bayes risk?
(b) Assuming the square error loss $\ell(a, y) = (a - y)^2$, what is the Bayes risk?
(c) Using the full hypothesis space of all (measurable) functions, what is the minimum achievable empirical risk for the square error loss?
(d) Using the hypothesis space of all affine functions (i.e., of the form $f(x) = cx + d$ for some $c, d \in \mathbb{R}$), what is the minimum achievable empirical risk for the square error loss?
(e) Using the hypothesis space of all quadratic functions (i.e., of the form $f(x) = cx^2 + dx + e$ for some $c, d, e \in \mathbb{R}$), what is the minimum achievable empirical risk for the square error loss?

Solution.
(a) For any decision function $f$ the risk is given by
$$E[1(f(X) \neq Y)] = P(f(X) \neq Y) = 1 - P(f(X) = Y) = 1.$$
To see this, note that
$$P(f(X) = Y) = \frac{1}{20\sqrt{2\pi}} \int_{-10}^{10} \int_{-\infty}^{\infty} 1(f(x) = y)\, e^{-(y - a - bx)^2/2} \, dy \, dx = \frac{1}{20\sqrt{2\pi}} \int_{-10}^{10} 0 \, dx = 0,$$
since for each $x$ the inner integrand is nonzero only at the single point $y = f(x)$, a set of measure zero. Thus every decision function is a Bayes decision function, and the Bayes risk is 1.
(b) By problem 2 above we know the Bayes risk is given by
$$E[\operatorname{Var}(Y \mid X)] = E[1] = 1,$$
since $\operatorname{Var}(Y \mid X = x) = 1$.
(c) We choose $\hat{f}$ such that
$$\hat{f}(0) = 1.5, \quad \hat{f}(1) = 3, \quad \hat{f}(2.5) = 3.1, \quad \hat{f}(-4) = -2.1,$$
and $\hat{f}(x) = 0$ otherwise. Then we achieve the minimum empirical risk of $1/10$.
(d) Letting
$$A = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 1 \\ 1 & 2.5 \\ 1 & -4 \end{pmatrix}, \qquad y = \begin{pmatrix} 1 \\ 2 \\ 3 \\ 3.1 \\ -2.1 \end{pmatrix},$$
we obtain (using a computer)
$$\hat{w} = \begin{pmatrix} \hat{d} \\ \hat{c} \end{pmatrix} = (A^T A)^{-1} A^T y = \begin{pmatrix} 1.4856 \\ 0.8556 \end{pmatrix}.$$
This gives
$$\hat{R}_5(\hat{f}) = \frac{1}{5} \|A\hat{w} - y\|_2^2 = 0.2473.$$
[Aside: In general, to solve systems like the one above on a computer you shouldn't actually invert the matrix $A^T A$, but use something like w=A\y in Matlab, which performs a QR factorization of $A$. See the sketch after part (e).]

(e) Letting
$$A = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 2.5 & 6.25 \\ 1 & -4 & 16 \end{pmatrix}, \qquad y = \begin{pmatrix} 1 \\ 2 \\ 3 \\ 3.1 \\ -2.1 \end{pmatrix},$$
we obtain (using a computer)
$$\hat{w} = \begin{pmatrix} \hat{e} \\ \hat{d} \\ \hat{c} \end{pmatrix} = (A^T A)^{-1} A^T y = \begin{pmatrix} 1.7175 \\ 0.7545 \\ -0.0521 \end{pmatrix}.$$
This gives
$$\hat{R}_5(\hat{f}) = \frac{1}{5} \|A\hat{w} - y\|_2^2 = 0.1928.$$
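The least-squares fits in parts (d) and (e) can be reproduced in a few lines. The following sketch (added for review) uses NumPy's least-squares solver, which, like Matlab's backslash, relies on a factorization rather than forming $(A^T A)^{-1}$ explicitly:

    # Reproducing parts (d) and (e) of question 6 with a least-squares solver.
    import numpy as np

    x = np.array([0, 0, 1, 2.5, -4])
    y = np.array([1, 2, 3, 3.1, -2.1])

    for cols in [2, 3]:  # affine fit (d), then quadratic fit (e)
        A = np.vander(x, cols, increasing=True)   # columns: 1, x, (x^2)
        w, *_ = np.linalg.lstsq(A, y, rcond=None)
        print(w, np.mean((A @ w - y) ** 2))
    # approx: [1.4856 0.8556] 0.2473, then [1.7175 0.7545 -0.0521] 0.1928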

Topic 2: Stochastic Gradient Descent

Learning Objectives

1. Be able to write the empirical risk for a particular loss function over a particular parameterized hypothesis space, such as for square loss over a hypothesis space of linear functions.

2. Compare and contrast gradient descent, minibatch gradient descent, and stochastic gradient descent.

Concept Check Questions

1. When performing mini-batch gradient descent, we often randomly choose the mini-batch from the full training set without replacement. Show that the resulting mini-batch gradient is an unbiased estimate of the gradient of the full training set. Here we assume each decision function $f_w$ in our hypothesis space is determined by a parameter vector $w \in \mathbb{R}^d$.

Solution. Let $(x_{m_1}, y_{m_1}), \ldots, (x_{m_n}, y_{m_n})$ be our mini-batch, selected uniformly without replacement from the full training set $(x_1, y_1), \ldots, (x_N, y_N)$. Then
$$\begin{aligned}
E\left[\nabla_w \frac{1}{n} \sum_{i=1}^n \ell(f_w(x_{m_i}), y_{m_i})\right]
&= \frac{1}{n} \sum_{i=1}^n E[\nabla_w \ell(f_w(x_{m_i}), y_{m_i})] \qquad \text{(linearity of $\nabla$, $E$)} \\
&= \frac{1}{n} \sum_{i=1}^n E[\nabla_w \ell(f_w(x_{m_1}), y_{m_1})] \qquad \text{(marginals are the same)} \\
&= E[\nabla_w \ell(f_w(x_{m_1}), y_{m_1})] \\
&= \sum_{i=1}^N \frac{1}{N} \nabla_w \ell(f_w(x_i), y_i) \\
&= \nabla_w \frac{1}{N} \sum_{i=1}^N \ell(f_w(x_i), y_i) \qquad \text{(linearity of $\nabla$)},
\end{aligned}$$
which is the gradient of the empirical risk on the full training set.

2. You want to estimate the average age of the people visiting your website. Over a fixed week we will receive a total of $N$ visitors (which we will call our full population). Suppose the population mean $\mu$ is unknown but the variance $\sigma^2$ is known. Since we don't want to bother every visitor, we will ask a small sample what their ages are. How many visitors must we randomly sample so that our estimator $\hat{\mu}$ has variance at most $\epsilon > 0$?

Solution. Let $x_1, \ldots, x_n$ denote our randomly sampled ages, and let $\hat{x}$ denote the sample mean $\frac{1}{n} \sum_{i=1}^n x_i$. Then $\operatorname{Var}(\hat{x}) = \frac{\sigma^2}{n}$. Thus we require $n \geq \sigma^2/\epsilon$. Note that this doesn't depend on $N$, the full population size.
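Returning to question 1, the unbiasedness claim is easy to verify numerically: averaging the mini-batch gradient over all without-replacement mini-batches recovers the full-batch gradient exactly. A minimal sketch (added for review; the one-dimensional data, weight, and loss are made-up assumptions):

    # Check of question 1: the average mini-batch gradient over all
    # without-replacement mini-batches equals the full-batch gradient.
    # Uses square loss with f_w(x) = w * x in one dimension.
    from itertools import combinations

    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [1.0, 0.5, 2.0, 2.5]
    w = 0.3
    grad = lambda x, y: 2 * (w * x - y) * x    # d/dw of (w*x - y)^2

    full = sum(grad(x, y) for x, y in zip(xs, ys)) / len(xs)

    n = 2  # mini-batch size
    batches = list(combinations(range(len(xs)), n))
    avg = sum(sum(grad(xs[i], ys[i]) for i in b) / n
              for b in batches) / len(batches)
    print(full, avg)  # equal up to floating-point rounding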

3. (*) Suppose you have been successfully running mini-batch gradient descent with a full training set size of $10^5$ and a mini-batch size of $100$. After receiving more data, your full training set size increases to $10^9$. Give a heuristic argument as to why the mini-batch size need not increase even though we have $10000$ times more data.

Solution. Throughout we assume our gradient lies in $\mathbb{R}^d$. Consider the empirical distribution on the full training set (i.e., each sample is chosen with probability $1/N$, where $N$ is the full training set size). Assume this distribution has mean vector $\mu \in \mathbb{R}^d$ (the full-batch gradient) and covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$. By the central limit theorem, the mini-batch gradient will be approximately normally distributed with mean $\mu$ and covariance $\frac{1}{n}\Sigma$, where $n$ is the mini-batch size. As $N$ grows the entries of $\Sigma$ need not grow, and thus $n$ need not grow. In fact, as $N$ grows, the empirical mean and covariance matrix will converge to their true values. More precisely, the mean of the empirical distribution will converge to $E[\nabla \ell(f(X), Y)]$ and the covariance will converge to
$$E\left[(\nabla \ell(f(X), Y))(\nabla \ell(f(X), Y))^T\right] - E[\nabla \ell(f(X), Y)]\, E[\nabla \ell(f(X), Y)]^T,$$
where $(X, Y) \sim P_{\mathcal{X} \times \mathcal{Y}}$. The important takeaway here is that the size of the mini-batch depends on the speed of computation and on the characteristics of the distribution of the gradients (such as the moments), and thus may vary independently of the size of the full training set.
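A quick simulation makes the heuristic in question 3 concrete (added for review; the scalar "gradients" below are made-up draws from a standard normal): the variance of the mini-batch mean is governed by the batch size $n$, not the population size $N$.

    # The spread of a without-replacement mini-batch mean depends on the
    # batch size n, not on the population size N (for N >> n).
    import random, statistics

    random.seed(0)
    n = 100                                   # mini-batch size
    for N in [10**3, 10**5]:                  # full training set sizes
        data = [random.gauss(0, 1) for _ in range(N)]
        means = [statistics.fmean(random.sample(data, n))
                 for _ in range(2000)]
        # Stays near sigma^2 / n = 0.01 for both values of N.
        print(N, statistics.variance(means))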