Richard Socher, Henning Peters
Elements of Statistical Learning I


1 Problem (10P)

Show that if X is a random variable, then

    E[X] = arg min_b E[(X - b)^2]

Thus a good prediction for X is E[X] if the squared difference is used as the metric. The following rules are used in the proof:

1. E[X + Y] = E[X] + E[Y]
2. E[cX] = c E[X]
3. E[X] and E[X^2] are constants
4. the expected value of a constant is the constant itself

Proof.

    arg min_b E[(X - b)^2] = arg min_b E[X^2 - 2bX + b^2]
                           = arg min_b ( E[X^2] - E[2bX] + E[b^2] )
                           = arg min_b ( E[X^2] - 2b E[X] + b^2 )

In order to find the b for which the expression inside the arg min is minimal, we can view it as a function of b:

    f(b) = b^2 - 2b E[X] + E[X^2]

This function is obviously a polynomial of second degree in b with a positive leading coefficient, hence its only extremum is a global minimum. The derivative of this function is:

    f'(b) = 2b - 2 E[X]

To find the minimum, we set it to 0:

    0 = 2b - 2 E[X]

which gives:

    b = E[X]

Thus we have shown that the function is minimal at b = E[X], i.e. E[X] = arg min_b E[(X - b)^2]. Intuitively this also seems correct: if you predict the outcome of a random variable X with E[X], the squared error should be minimal. A quick numerical sanity check follows below.
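The following R sketch (our addition, not part of the original hand-in; the variable names are ours) draws samples from an arbitrary skewed distribution and evaluates the mean squared error over a grid of candidate predictions b. The minimizer lands at the sample mean, as the proof predicts.

    set.seed(42)
    x <- rexp(1e5, rate = 0.5)        # arbitrary skewed distribution with E[X] = 2
    b.grid <- seq(0, 4, by = 0.01)    # candidate predictions b
    mse <- sapply(b.grid, function(b) mean((x - b)^2))
    b.grid[which.min(mse)]            # close to 2 = E[X]
    mean(x)                           # the sample mean, i.e. the predicted minimizer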

2 Problem (20P)

Consider N data points uniformly distributed in a p-dimensional unit ball centered at the origin. Suppose we consider a nearest-neighbor estimate at the origin. Show that the median distance from the origin to the closest data point is given by the expression

    d(p, N) = (1 - (1/2)^(1/N))^(1/p)

(Exercise 2.3 in Hastie, Tibshirani, Friedman). What does this mean for the k-nearest-neighbor algorithm?

Hint: Consider that the volume of a p-dimensional sphere with radius r is given by V(r, p) = G(p) r^p, with G(p) a dimension-dependent constant. The probability that a point falls into a sphere of radius r is proportional to the sphere's volume, since the points are uniformly distributed.

The proof is separated into two parts. In the first part the equation is proven for one dimension; this mainly helped us to grasp the concept. The second part then only uses the hint and applies the same idea to p dimensions.

Proof. To calculate the median distance from the origin to the closest data point, N random variable samples (rvs) are needed. Based on these, the median of the point nearest to the origin can be defined. The rvs are independent of each other and uniformly distributed:

    X_1, ..., X_N ~ Unif([0, 1])

For each of them the cumulative distribution function (cdf) shows that the probability for a point to be smaller than 0 is 0; the probability is then equal for all points in [0, 1], so the cdf increases linearly up to 1 and stays at 1. Formally:

    F_X(x) = P(X ≤ x) = { 0  for x < 0
                          x  for 0 ≤ x ≤ 1
                          1  for x > 1 }

We need this cdf because we want the probability that a point lies in a certain interval [0, r] with r ≤ 1. Also P(X > x) = 1 - x, because the cdf is linear. Next, the random variable R = min_i {X_i} is defined. It denotes the smallest value among all samples X_i. Using F_X we can simplify the cdf of R:

    F_R(r) = P(R ≤ r) = P(min_i {X_i} ≤ r)
           = 1 - P(min_i {X_i} > r)
           = 1 - P(X_1 > r, ..., X_N > r)
           = 1 - prod_{i=1}^{N} P(X_i > r)      (independence)
           = 1 - prod_{i=1}^{N} (1 - r)
           = 1 - (1 - r)^N

The cdf of R gives the probability that the shortest distance from the origin over all rvs X_i is less than or equal to its argument r. Now we can use the fact that the median is the 0.5 quantile of the cdf, so F_R(r_0.5) = 0.5. Hence:

    1 - (1 - r_0.5)^N = 0.5
        1 - r_0.5     = 0.5^(1/N)
            r_0.5     = 1 - 0.5^(1/N) = d(1, N)

Thus, the expression is proven for the 1-dimensional case.
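A brief simulation (our addition) confirms the 1-d result: the empirical median of min_i X_i over many repetitions matches 1 - 0.5^(1/N).

    set.seed(1)
    N <- 10
    closest <- replicate(1e5, min(runif(N)))   # nearest of N uniform points, 1-d case
    median(closest)                            # empirical median
    1 - 0.5^(1 / N)                            # d(1, N), approximately 0.067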

For the p-dimensional case we use the hint: the probability that a point falls into a sphere of radius r is proportional to the sphere's volume, since the points are uniformly distributed. Since we are talking about the unit ball, its radius is 1 and both balls are centered at the origin. Thus, with S_p(r) being the p-dimensional sphere with radius r, the cdf is:

    P(X ∈ S_p(r)) = G(p) r^p / (G(p) 1^p) = r^p

If an element is inside a ball, its distance from the origin is smaller than the radius of this ball. With the Euclidean norm ||x|| of any p-dimensional vector x, it is true that:

    P(X ∈ S_p(r)) = P(||X|| ≤ r) = r^p

Similarly to the 1-dimensional case we now define another random variable on top of that:

    R = min_i {||X_i||}

We apply the same idea as in 1-d:

    F_R(r) = P(R ≤ r) = P(min_i {||X_i||} ≤ r)
           = 1 - P(min_i {||X_i||} > r)
           = 1 - prod_{i=1}^{N} (1 - P(||X_i|| ≤ r))
           = 1 - (1 - r^p)^N

Again we set this cdf to 0.5, because we are looking for the median:

    1 - (1 - r_0.5^p)^N = 0.5
    r_0.5 = (1 - 0.5^(1/N))^(1/p) = d(p, N)

This means that as p → ∞, all points are extremely close to the border of the sphere and far away from the origin. Defining nearest-neighbor areas for a specific class becomes harder, since one must extrapolate outside of the domain of the given neighboring points, rather than interpolate in the area between them. Furthermore the sampling density decreases, which is why the same number of training samples cannot cover a higher dimension as well as it could a lower one. The short sketch below makes the effect concrete.
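Evaluating d(p, N) in R (our addition) shows the median nearest-neighbor distance creeping toward the boundary as p grows:

    d <- function(p, N) (1 - 0.5^(1 / N))^(1 / p)
    d(1, 500)     # ~0.0014: the nearest of 500 points is essentially at the origin
    d(10, 500)    # ~0.52: in 10 dimensions it is more than halfway to the boundary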

3 Problem - R (20P)

3.1 Q&A

Males in the region of Western Cape, South Africa, are at high risk of heart disease. Measurements of systolic blood pressure (sbp), cumulative tobacco consumption (tobacco), low-density lipoprotein cholesterol (ldl), adiposity (adiposity), family history of heart disease (famhist), type-A behavior (typea), obesity (obesity), current alcohol consumption (alcohol) and age at onset (age) are available for a number of male patients, together with corresponding information on whether or not they suffer from coronary heart disease (chd). The goal is to analyze this data in order to understand how the above characteristics influence the presence or absence of heart disease, so as to make possible the prediction of the illness in new, unseen patients. Requirements:

1. Which are the input variables? sbp, tobacco, ldl, adiposity, famhist, typea, obesity, alcohol, age
2. Which is the target variable? chd
3. Which is the (training) data set? The 462 observations of the ten variables.
4. Which are the samples? Each line, denoted by a number between 1 and 462 (with 262 missing).
5. Is this a classification or a regression problem? Classification.
6. Use R to visualize the data. See 3.2 and 3.3 below.

Results from f)

1. How many samples are there? 462
2. For each feature, specify whether it is qualitative or quantitative. With the exception of famhist and chd, which are both qualitative, all other features are quantitative. See the output of str(ahd).
3. How many patients have coronary heart disease and how many don't? 160 have chd, 302 don't.
4. How many subjects of age 20 are there? 6

The commands behind these counts are sketched right below; the full session is in 3.2.
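As a compact recap (our addition; it assumes the ahd data frame has been loaded as in 3.2), the counts above can be reproduced with:

    nrow(ahd)            # 462 samples
    table(ahd$chd)       # 0: 302, 1: 160
    sum(ahd$age == 20)   # 6 subjects of age 20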

3.2 R - I/O

> load("ahd.rdata")
> str(ahd)
'data.frame': 462 obs. of 10 variables:
 $ sbp      : int 160 144 118 170 134 132 142 114 114 132 ...
 $ tobacco  : num 12 0.01 0.08 7.5 13.6 6.2 4.05 4.08 0 0 ...
 $ ldl      : num 5.73 4.41 3.48 6.41 3.5 6.47 3.38 4.59 3.83 5.8 ...
 $ adiposity: num 23.1 28.6 32.3 38.0 27.8 ...
 $ famhist  : Factor w/ 2 levels "Absent","Present": 2 1 2 2 2 2 1 2 2 2 ...
 $ typea    : int 49 55 52 51 60 62 59 62 49 69 ...
 $ obesity  : num 25.3 28.9 29.1 32.0 26.0 ...
 $ alcohol  : num 97.20 2.06 3.81 24.26 57.34 ...
 $ age      : int 52 63 46 58 49 45 38 58 29 53 ...
 $ chd      : int 1 1 0 1 1 0 0 1 0 1 ...
> names(ahd)
 [1] "sbp" "tobacco" "ldl" "adiposity" "famhist" "typea" "obesity" "alcohol" "age" "chd"
> dim(ahd)
[1] 462 10
> class(ahd$alcohol)
[1] "numeric"
> summary(ahd)
      sbp           tobacco            ldl           adiposity        famhist
 Min.   :101.0   Min.   : 0.0000   Min.   : 0.980   Min.   : 6.74   Absent :270
 1st Qu.:124.0   1st Qu.: 0.0525   1st Qu.: 3.283   1st Qu.:19.77   Present:192
 Median :134.0   Median : 2.0000   Median : 4.340   Median :26.11
 Mean   :138.3   Mean   : 3.6356   Mean   : 4.740   Mean   :25.41
 3rd Qu.:148.0   3rd Qu.: 5.5000   3rd Qu.: 5.790   3rd Qu.:31.23
 Max.   :218.0   Max.   :31.2000   Max.   :15.330   Max.   :42.49
     typea          obesity         alcohol            age             chd
 Min.   :13.0   Min.   :14.70   Min.   :  0.00   Min.   :15.00   Min.   :0.0000
 1st Qu.:47.0   1st Qu.:22.98   1st Qu.:  0.51   1st Qu.:31.00   1st Qu.:0.0000
 Median :53.0   Median :25.81   Median :  7.51   Median :45.00   Median :0.0000
 Mean   :53.1   Mean   :26.04   Mean   : 17.04   Mean   :42.82   Mean   :0.3463
 3rd Qu.:60.0   3rd Qu.:28.50   3rd Qu.: 23.89   3rd Qu.:55.00   3rd Qu.:1.0000
 Max.   :78.0   Max.   :46.58   Max.   :147.19   Max.   :64.00   Max.   :1.0000
> table(ahd$chd)
  0   1
302 160
> table(ahd$age)
15 16 17 18 19 20 21 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
 3 20 17  8  2  6  3  2  6  4  5 11  7  7  7  9 11  9  6  1  3  6 13
39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
10 12 10 13  8  8 14 11  3 13 14  8  8 10 14 10 16  9  8 17 16 15 16 12  8 13

3.3 Visualization and Results

Figure 1 shows that obesity and adiposity are somehow correlated. However, no real conclusions about the resulting chd can be drawn from it. Figure 2 shows that age and chd are not correlated. Figure 3 is a little more complex; the command that creates it is given after Figure 2 below.

Figure 1: pairs(ahd) - scatterplot of all pairs
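A note on the grouped scatterplots that follow (our addition): scatterplot() and scatterplot.matrix() are not base R functions; as far as we can tell they come from the car package, which must be loaded first:

    library(car)   # assumed source of scatterplot() and scatterplot.matrix()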

Figure 2: plot(ahd$age, ahd$chd) - scatterplot of age vs. chd

scatterplot(obesity ~ adiposity | famhist, reg.line=lm, smooth=FALSE, labels=FALSE,
            boxplots='xy', span=0.5, by.groups=TRUE, data=ahd)
# to print it directly to a file:
dev.print(png, filename="adiposity-obesity.png", width=500, height=500)

It shows that a family history of chd does not influence the relation between obesity and adiposity. Furthermore it demonstrates that the higher the adiposity, the higher the obesity. The box-and-whisker plots next to the axes show that the medians of both adiposity and obesity are slightly above 25.

Figure 3: scatterplot(obesity ~ adiposity | famhist, ...) - scatterplot of obesity vs. adiposity, with two groups that indicate famhist

Figure 4 is the most interesting. It is created by

scatterplot.matrix(~chd+ldl+obesity+sbp+tobacco+typea, reg.line=lm, smooth=FALSE,
                   span=0.5, diagonal='density', data=ahd)

It shows the density of each variable on its diagonal. We can extract some information:

- There are more samples without chd.
- obesity, typea and ldl seem to have a Gaussian distribution.
- The higher tobacco use and ldl are, the more likely the patient is to have chd.
- The same holds for the other features, but more weakly.

Figure 4: scatterplot.matrix(~chd+ldl+obesity+sbp+tobacco+typea, ...) - scatterplot matrix