Homework Assignment 3 Due in class, Thursday October 15


SDS 383C Statistical Modeling I

1 Ridge regression and Lasso

1. Get the prostate cancer data from http://statweb.stanford.edu/~tibs/elemstatlearn/datasets/prostate.data. More information about this dataset can be found in http://statweb.stanford.edu/~tibs/elemstatlearn/datasets/prostate.info.txt.

(a) (2.5 pts) In class we learned about Ridge regression with tuning parameter λ. Define

    df(λ) = tr( X (X^T X + λI)^{-1} X^T ).

Plot the coefficients of the covariates as λ is increased from 0 to 1000. A similar plot can be found in Figure 3.8 of H-T-F; that figure essentially plots the ridge regression coefficients of the covariates as df(λ) is varied.

(b) (2.5 pts) Now plot the coefficients learned by Lasso as λ is increased from 0 to 100. For this you can use the LARS package.

(c) (5 pts) Finally, reproduce columns 4 and 5 of Table 3.3, for Ridge regression and Lasso. Remember to reproduce the test-set prediction errors as well.

2 Discriminative vs. Generative Classifiers

A very common debate in statistical learning has been over generative versus discriminative models for classification. In this question we will explore this issue, both theoretically and practically. We will consider the Naive Bayes and logistic regression classification algorithms. To answer this question, you might want to read: On Discriminative vs. Generative Classifiers: A comparison of logistic regression and Naive Bayes, Andrew Y. Ng and Michael Jordan. In NIPS 14, 2002. http://www.robotics.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf

2.1 Logistic regression and Naive Bayes

(a) (3 pts) The discriminative analog of Naive Bayes is logistic regression. This means that the parametric form of P(Y | X) used by logistic regression is implied by the assumptions of a Naive Bayes classifier, for some specific class-conditional densities. In the reading you will see how to prove this for a Gaussian Naive Bayes classifier with continuous input values. Can you prove the same for binary inputs? Assume X and Y are both binary. Assume that X | Y = j is Bernoulli(θ_j), where j ∈ {0, 1}, and Y is Bernoulli(π).
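A sketch of the intended argument (added here for reference; it is not part of the original handout and uses only the stated Bernoulli assumptions): compute the posterior log-odds under the Naive Bayes model and observe that they are linear in X, which is exactly the parametric form assumed by logistic regression.

    log [ P(Y = 1 | X) / P(Y = 0 | X) ]
        = log [ π / (1-π) ] + log [ P(X | Y = 1) / P(X | Y = 0) ]
        = log [ π / (1-π) ] + X log(θ_1 / θ_0) + (1-X) log[ (1-θ_1) / (1-θ_0) ]
        = w_0 + w_1 X,

where w_1 = log[ θ_1 (1-θ_0) / (θ_0 (1-θ_1)) ] and w_0 = log[ π / (1-π) ] + log[ (1-θ_1) / (1-θ_0) ]. Inverting the log-odds gives

    P(Y = 1 | X) = 1 / (1 + exp(-(w_0 + w_1 X))),

which is the logistic regression form. With several conditionally independent binary inputs X_1, ..., X_d, the same calculation puts w_0 + Σ_i w_i X_i inside the exponent.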

2.2 (1+2+3+1 pts) Double counting the evidence

(a) Consider the two-class problem where the class label is y ∈ {T, F} and each training example X has 2 binary attributes X_1, X_2 ∈ {T, F}. How many parameters will you need to know/evaluate if you are to classify an example using the Naive Bayes classifier?

Let the class prior be P(Y = T) = 0.5, and let P(X_1 = T | Y = T) = 0.8, P(X_1 = F | Y = F) = 0.7, P(X_2 = T | Y = T) = 0.5 and P(X_2 = F | Y = F) = 0.9. So attribute X_1 provides slightly stronger evidence about the class label than X_2. Assume X_1 and X_2 are truly independent given Y.

i. Write down the Naive Bayes decision rule.

SOLUTION: Given an example with attribute values (x_1, x_2), predict Y = T if

    log [P(Y = T) / P(Y = F)] + log [P(X_1 = x_1 | Y = T) / P(X_1 = x_1 | Y = F)] + log [P(X_2 = x_2 | Y = T) / P(X_2 = x_2 | Y = F)] > 0,

else predict Y = F.

ii. Show that if Naive Bayes uses both attributes, X_1 and X_2, the error rate is 0.235. Is it better than using only a single attribute (X_1 or X_2)? Why? The error rate is defined as the probability that each class generates an observation where the decision rule is incorrect.

SOLUTION: The expected error rate is the probability that each class generates an observation where the decision rule is incorrect: if Y is the true label and Ỹ(X_1, X_2) is the predicted class label, then the expected error rate is

    E_e = Σ_{i ∈ {T,F}} Σ_{j ∈ {T,F}} Σ_{k ∈ {T,F}} P(X_1 = i, X_2 = j, Y = k) · I[k ≠ NB-Prediction(X_1 = i, X_2 = j)],   (1)

where I[z] is the indicator function: I[z] = 1 if condition z is true, else I[z] = 0.

Given X_1 alone, Naive Bayes makes the following predictions: X_1 = T ⟹ Y = T, and X_1 = F ⟹ Y = F. We only need to calculate P(X_1 = F, Y = T) and P(X_1 = T, Y = F), since these are the two cases in which NB makes a wrong prediction; in the other two cases the indicator in equation (1) is 0 and the whole term vanishes. We obtain

    E_e = P(X_1 = F, Y = T) + P(X_1 = T, Y = F) = P(Y = T) P(X_1 = F | Y = T) + P(Y = F) P(X_1 = T | Y = F) = 0.5·0.2 + 0.5·0.3 = 0.25.

Similarly, for X_2 alone we obtain E_e = 0.5·0.1 + 0.5·0.5 = 0.3.

Proceeding as in the single-attribute case, we first obtain the predictions of Naive Bayes using both attributes, which are:

    X_1 = T and X_2 = T ⟹ Y = T
    X_1 = T and X_2 = F ⟹ Y = T
    X_1 = F and X_2 = T ⟹ Y = T
    X_1 = F and X_2 = F ⟹ Y = F     (2)

Again, to make the calculation shorter, we only consider the cases in which the predicted class differs from the true class. We get

    E_e = P(X_1 = T, X_2 = T, Y = F) + P(X_1 = T, X_2 = F, Y = F) + P(X_1 = F, X_2 = T, Y = F) + P(X_1 = F, X_2 = F, Y = T).

Remembering that the probabilities factorize, P(X_1, X_2, Y) = P(Y) P(X_1 | Y) P(X_2 | Y), this is

    E_e = 0.5·0.3·0.1 + 0.5·0.3·0.9 + 0.5·0.7·0.1 + 0.5·0.2·0.5 = 0.235.   (3)

Since 0.235 < 0.25 < 0.3, using both attributes is indeed better than using either attribute alone.

iii. Now suppose that we create a new attribute X_3, which is an exact copy of X_2: for every training example, attributes X_2 and X_3 have the same value, X_2 = X_3. What is the error rate of Naive Bayes now?

SOLUTION: As in the previous case, we first obtain the predictions of Naive Bayes. Notice that NB now also considers attribute X_3, which it assumes to be conditionally independent of X_1 and X_2 given Y. The predictions change, since double counting the evidence from X_2 makes the difference:

    X_1 = T and X_2 = T ⟹ Y = T
    X_1 = T and X_2 = F ⟹ Y = F   (prediction changed!)
    X_1 = F and X_2 = T ⟹ Y = T
    X_1 = F and X_2 = F ⟹ Y = F     (4)

(The duplicated X_2 = F evidence, 2 log(0.5/0.9) ≈ -1.18, now outweighs the X_1 = T evidence, log(0.8/0.3) ≈ 0.98, so the prediction for X_1 = T, X_2 = F flips from T to F.)

However, the nature (the true model) remains the same. We did not introduce any new information into the system by creating X_3: P(X_1, X_2, X_3, Y) = 0 if X_2 ≠ X_3 (which never happens), and P(X_1, X_2, X_3, Y) = P(X_1, X_2, Y), since X_3 is an exact (deterministic) copy of X_2. So the calculation of the expected error is the same as in the previous part, but with the new predictions of NB:

    E_e = P(X_1 = T, X_2 = T, Y = F) + P(X_1 = T, X_2 = F, Y = T) + P(X_1 = F, X_2 = T, Y = F) + P(X_1 = F, X_2 = F, Y = T)
        = 0.5·0.3·0.1 + 0.5·0.8·0.5 + 0.5·0.7·0.1 + 0.5·0.2·0.5
        = 0.3.   (5)
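The numbers in parts ii and iii are easy to check by brute force. The following short Matlab snippet (an illustrative addition, not part of the original solution; variable names are arbitrary) enumerates the eight configurations of (X_1, X_2, Y) and recomputes the expected error, once with each attribute counted once and once with the X_2 factor squared to mimic the duplicated attribute X_3 = X_2; it prints 0.235 and then 0.300, matching the results above.

% Check the Naive Bayes error rates from parts ii and iii by enumeration.
% Encoding: 1 = T, 0 = F.  Table rows: Y = F, Y = T; columns: attribute = F, T.
pY  = [0.5, 0.5];               % [P(Y=F), P(Y=T)]
pX1 = [0.7, 0.3; 0.2, 0.8];     % P(X1 | Y)
pX2 = [0.9, 0.1; 0.5, 0.5];     % P(X2 | Y)
for dup = [1, 2]                % dup = 1: use X1, X2; dup = 2: X2 counted twice (X3 = X2)
  err = 0;
  for y = 0:1
    for x1 = 0:1
      for x2 = 0:1
        pjoint = pY(y+1) * pX1(y+1, x1+1) * pX2(y+1, x2+1);   % true joint P(x1, x2, y)
        scoreT = pY(2) * pX1(2, x1+1) * pX2(2, x2+1)^dup;      % NB score for Y = T
        scoreF = pY(1) * pX1(1, x1+1) * pX2(1, x2+1)^dup;      % NB score for Y = F
        yhat   = double(scoreT >= scoreF);
        err    = err + pjoint * (yhat ~= y);
      end
    end
  end
  fprintf('Expected error (X2 counted %d time(s)): %.3f\n', dup, err);   % 0.235, then 0.300
end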

COMMON MISTAKE 1: Many people calculated P(X_1, X_2, X_3, Y) as P(X_1, X_2, X_3, Y) = P(Y) P(X_1 | Y) P(X_2 | Y) P(X_3 | Y), but this is not true, since X_2 and X_3 are not conditionally independent given Y. Also, X_3 is a deterministic copy of X_2, so P(X_1, X_2, X_3, Y) = 0 whenever X_2 ≠ X_3.

The nature, i.e. the true model that generates the data, did not change. By duplicating an attribute we did not introduce any new information into the system, so it makes no sense for performance to improve. Some people got a result where the error decreased, which does not make much sense, since then we could drive the error rate to 0 just by duplicating attribute X_2 many times.

iv. Explain what is happening with Naive Bayes.

SOLUTION: NB assumes the attributes are conditionally independent given the class. This is no longer true once we introduce X_3, so NB over-counts the evidence from attribute X_2 and the error increases.

v. (extra credit, 10 pts) In spite of the above fact, we will see that in some examples Naive Bayes does not do too badly. Consider the above example, i.e. your features are X_1 and X_2, which are truly independent given Y, and a third feature X_3 = X_2. Suppose you are now given an example with X_1 = T and X_2 = F. You are also given the probabilities P(Y = T | X_1 = T) = p, P(Y = T | X_2 = F) = q, and P(Y = T) = 0.5. Prove that the Naive Bayes decision rule is

    Predict Y = T ⟺ p ≥ (1-q)^2 / (q^2 + (1-q)^2)

(Hint: use Bayes rule again). What is the true decision rule? Plot the two decision boundaries (varying q between 0 and 1) and show where Naive Bayes makes mistakes.

SOLUTION: Naive Bayes. The decision rule for Naive Bayes is: predict Y = T if

    P(Y = T) ∏_i P(X_i = x_i | Y = T) ≥ P(Y = F) ∏_i P(X_i = x_i | Y = F).

But we know that P(Y = T) = P(Y = F), so this reduces to

    ∏_i P(X_i = x_i | Y = T) ≥ ∏_i P(X_i = x_i | Y = F).

Using Bayes rule again on each factor, we get

    ∏_i P(Y = T | X_i = x_i) P(X_i = x_i) / P(Y = T) ≥ ∏_i P(Y = F | X_i = x_i) P(X_i = x_i) / P(Y = F).

Now the P(X_i = x_i) factors cancel from both sides, and P(Y = T) = P(Y = F). Hence we now have

    ∏_i P(Y = T | X_i = x_i) ≥ ∏_i P(Y = F | X_i = x_i).

Setting x_1 = T and x_3 = x_2 = F, we find

    P(Y = T | X_1 = T) P(Y = T | X_2 = F) P(Y = T | X_3 = F) ≥ P(Y = F | X_1 = T) P(Y = F | X_2 = F) P(Y = F | X_3 = F).

Plugging in the values given in the question, we now obtain

    p q^2 ≥ (1-p)(1-q)^2.

Thus the Naive Bayes decision rule is given by

    Predict Y = T ⟺ p ≥ (1-q)^2 / (q^2 + (1-q)^2).

Actual decision rule. The actual decision rule does not take X_3 into consideration, since we know that X_1 and X_2 are truly independent given Y and they are enough to predict Y. Therefore, we predict T if and only if P(Y = T | X_1 = T, X_2 = F) ≥ P(Y = F | X_1 = T, X_2 = F). Following similar calculations as before, we have

    p q ≥ (1-p)(1-q)  ⟺  p ≥ 1-q.

Thus the real decision rule is given by

    Predict Y = T ⟺ p ≥ 1-q.

[Figure 1: True and Naive-Bayes decision boundary, plotted in the (q, p) unit square with both axes running from 0 to 1.]

The curved line is for Naive Bayes, and the straight line shows the true decision boundary. The shaded part of Figure 1 shows the region where the Naive Bayes decision differs from the true decision.

COMMON MISTAKE 2: Some students started directly from P(Y = T | X_1) P(Y = T | X_2) P(Y = T | X_3) ≥ P(Y = F | X_1) P(Y = F | X_2) P(Y = F | X_3). Note that conditional independence does not imply this; it only comes out in this form here because the class prior is uniform, P(Y = T) = P(Y = F) = 0.5, as the derivation above shows.
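A short Matlab sketch (an illustrative addition, not part of the original handout; only base plotting functions are used) that reproduces the two boundaries of Figure 1:

% Naive Bayes vs. true decision boundary in the (q, p) plane,
% for the example with X1 = T, X2 = F and the duplicated attribute X3 = X2.
q = linspace(0, 1, 200);
p_nb   = (1 - q).^2 ./ (q.^2 + (1 - q).^2);   % NB predicts T iff p >= p_nb(q)
p_true = 1 - q;                                % true rule predicts T iff p >= 1 - q
plot(q, p_nb, '-', q, p_true, '--');
xlabel('q = P(Y = T | X_2 = F)');
ylabel('p = P(Y = T | X_1 = T)');
legend('Naive Bayes boundary', 'true boundary');
% Points (q, p) lying between the two curves are exactly where the two rules disagree.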

2.3 Learning Curves of Naive Bayes and Logistic Regression

Compare the two approaches on the Breast Cancer dataset, which you can download from the course webpage; a description of the dataset can be found there as well. We have removed the records with missing values for you. Obtain learning curves similar to Figure 1 in the paper. Implement a Naive Bayes classifier and a logistic regression classifier with the assumption that each attribute value for a particular record is independently generated.

For the NB classifier, assume that P(x_i | y), where x_i is a feature in the breast cancer data (that is, i is the number of the column in the data file) and y is the label, has the following multinomial distribution form: for x_i ∈ {v_1, v_2, ..., v_n},

    p(x_i = v_k | y = j) = θ_{ijk},  such that for each i and j: Σ_{k=1}^{n} θ_{ijk} = 1, where 0 ≤ θ_{ijk} ≤ 1.

It may be easier to think of this as a normalized histogram, or as a multi-valued extension of the Bernoulli.

Use 2/3 of the examples for training and the remaining 1/3 for testing. Be sure to use 2/3 of each class, not just the first 2/3 of the data points. For each algorithm:

1. (3 pts) Implement the IRLS algorithm for logistic regression.

2. (6 pts) Plot a learning curve: the accuracy vs. the size of the training data. Generate six points on the curve, using the fractions [.01 .02 .03 .125 .625 1] of your training set and testing on the full test set each time. Average your results over 5 random splits of the data into a training and a test set (always keep 2/3 of the data for training and 1/3 for testing, but randomize over which points go to the training set and which to the test set). This averaging will make your results less dependent on the order of records in the file. Plot both the Naive Bayes and logistic regression learning curves on the same plot. Use the plot(x,y) function in Matlab, since the training data fractions are not equally spaced. Specify your choice of prior/regularization parameters and keep those parameters constant for these tests. A typical choice of constants would be to add 1 to each bin before normalizing (for NB) and λ = 0 (for LR).

SOLUTION: Please look at Figure 2. Something to remember: many students treated the discrete values of each attribute categorically, i.e. with a different weight for each possible value of a feature. However, note that here the values of the features mean something, i.e. the strength of that feature, so it is better to use them numerically. I have not penalized any of these solutions.

3. (1 pt) What conclusions can you draw from your experiments? Specifically, what can you say about the speed of convergence of the classifiers? Are these consistent with the results in the NIPS paper that we have mentioned? If yes, state that. If no, explain why not.

SOLUTION: From Figure 2 it can be observed that NB gets to its asymptotic error much faster than LR. For some of you the two curves will cross, and for some of you they will come very close but not cross. Both outcomes are fine, because the data set is too small compared to the number of covariates to actually observe the effect reported in Ng et al.'s paper.

[Figure 2: learning curves for logistic regression (LR) and Naive Bayes (NB); the vertical axis runs from 0 to 0.7 and the horizontal axis from 0 to 1.]

Figure 2: SOLUTION: Your plot should look something similar to this.
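For item 1 of Section 2.3, the following is a minimal IRLS (iteratively reweighted least squares, i.e. Newton's method) sketch for binary logistic regression, added for illustration; it is not the reference solution. The function name, the iteration cap, the convergence tolerance, and the fact that the intercept is penalized along with the other weights are all assumptions of this sketch; setting lambda = 0 recovers the unregularized choice suggested in the text.

% irls_logistic.m -- minimal IRLS sketch for binary logistic regression.
% Assumptions: X is an n-by-d matrix of features (no intercept column),
% y is an n-by-1 vector of 0/1 labels, lambda >= 0 is an L2 penalty.
function w = irls_logistic(X, y, lambda)
    [n, d] = size(X);
    X = [ones(n, 1), X];                          % prepend an intercept column
    w = zeros(d + 1, 1);
    for iter = 1:100
        mu = 1 ./ (1 + exp(-X * w));              % current estimate of P(y = 1 | x)
        s  = mu .* (1 - mu) + 1e-10;              % IRLS weights (jitter avoids a singular Hessian)
        g  = X' * (y - mu) - lambda * w;          % gradient of the penalized log-likelihood
        H  = X' * bsxfun(@times, s, X) + lambda * eye(d + 1);   % negative Hessian
        step = H \ g;                             % Newton / IRLS update direction
        w  = w + step;
        if norm(step) < 1e-6, break; end          % stop when the update is tiny
    end
end

On the held-out third of the data, a record is then classified as positive whenever the fitted probability exceeds 0.5; repeating the fit on the listed fractions of the training split and averaging over the 5 random splits gives the logistic-regression points of the learning curve.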