CIS526: Machine Learning, Lecture 3 (Sept 16, 2003): Linear Regression. Preparation help: Xiaoying Huang.

Linear Regression

Linear regression can be represented by a functional form:

f(x; θ) = θ_0 x_0 + θ_1 x_1 + ... + θ_M x_M = Σ_{j=0}^{M} θ_j x_j

Note: x_0 is a dummy attribute and its value is a constant equal to 1.

Linear regression can also be represented in graphic form. [Figure: inputs x_0, x_1, ..., x_M are weighted by θ_0, θ_1, ..., θ_M and summed to produce the output.]

Goal: Minimize the Mean Square Error (MSE):

MSE = (1/N) Σ_{i=1}^{N} (y_i − f(x_i; θ))²

MSE is a quadratic function in the parameters θ. It is a convex function, so there is only one minimum, and it is the global minimum.

Solution: A sufficient condition is ∂MSE/∂θ_j = 0 for j = 0, 1, ..., M. Therefore, find θ such that

∂MSE/∂θ_j = −(2/N) Σ_{i=1}^{N} (y_i − Σ_{k=0}^{M} θ_k x_ik) x_ij = 0

There are M+1 linear equations with M+1 unknown variables, so we can get a closed-form solution.
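To make the setup concrete, here is a minimal numpy sketch (not from the lecture; the names add_dummy and mse are illustrative) that prepends the dummy attribute x_0 = 1 and evaluates the MSE of a linear model for a given parameter vector θ:

    import numpy as np

    def add_dummy(X):
        # X is the N x M attribute matrix; prepend the dummy attribute x_0 = 1 to every data point.
        return np.hstack([np.ones((X.shape[0], 1)), X])

    def mse(theta, X, y):
        # Mean square error of the linear model f(x; theta) = sum_j theta_j * x_j.
        residuals = y - add_dummy(X) @ theta
        return np.mean(residuals ** 2)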

Special Case: If some attribute is a linear combination of others, there is no unique solution.

Setting ∂MSE/∂θ_j = 0 for all j gives, in matrix form, X^T Y = X^T X θ, where:

X [N × (M+1)] = {x_ij}, i = 1:N, j = 1:(M+1) (x_ij is the j-th attribute of the i-th data point)
Y [N × 1] = {y_i}, i = 1:N
θ [(M+1) × 1] = {θ_j}, j = 1:(M+1)

Note: D = [X Y], i.e., [X Y] is what we defined previously as the data set.

The optimal parameter choice is then θ = (X^T X)^{-1} X^T Y, which is a closed-form solution.

Note: the above solution exists if X^T X is invertible, i.e., if its rank equals M+1, i.e., no attribute is a linear combination of others (in Matlab, use the function rank).

Note: using matrix derivatives we can do the optimization in a more elegant way by defining

MSE = (1/N) (Y − Xθ)^T (Y − Xθ)
∇_θ MSE = −(2/N) X^T (Y − Xθ) = 0   [an (M+1) × 1 system of equations]
θ = (X^T X)^{-1} X^T Y

Statistical results:

Assumption: the true data generating process (DGP) is y = Σ_{j=0}^{M} β_j x_j + e, where e is noise with E(e) = 0 and Var(e) = σ².

Note: This is a big assumption!

Question: How close is the estimate θ to the true value β?

Answer 1:
E[θ] = E[(X^T X)^{-1} X^T Y] = (X^T X)^{-1} X^T E[Y]   (remember, Y = Xβ + e)
E[θ] = (X^T X)^{-1} X^T X β + (X^T X)^{-1} X^T E[e]
E[θ] = β + 0 = β

Conclusion: if we repeat linear regression on different data sets sampled according to the true DGP, the average θ will equal β (i.e., E[θ] = β), which are the true parameters. Therefore, linear regression is an unbiased estimator.

Answer 2: The variance of the parameter estimate θ is Var[θ] = (after some calculation) = (X^T X)^{-1} σ².

Conclusion: Var[θ] is a measure of how different the estimate θ is from the true parameters β, i.e., how successful the linear regression is. Therefore, the quality of linear regression depends on the noise level (i.e., σ²) and on the data size. The variance increases linearly with σ² and decreases as 1/N with the size of the dataset.
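A minimal numpy sketch of the closed-form solution above (illustrative, not the lecture's own code): np.linalg.solve is used instead of an explicit matrix inverse, and the rank check mirrors the Matlab rank note.

    import numpy as np

    def fit_linear_regression(X, y):
        # Solve the normal equations X^T X theta = X^T Y in closed form.
        X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # dummy attribute x_0 = 1
        XtX = X1.T @ X1
        if np.linalg.matrix_rank(XtX) < X1.shape[1]:     # rank must equal M+1
            raise ValueError("some attribute is a linear combination of others")
        return np.linalg.solve(XtX, X1.T @ y)            # theta = (X^T X)^{-1} X^T Y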

A more stringent assumption: the true DGP is y = Σ_{j=0}^{M} β_j x_j + e, and e ~ N(0, σ²) (i.e., e is Gaussian additive noise).

If this assumption is valid, the estimate θ can be considered a multidimensional Gaussian variable with θ ~ N(β, (X^T X)^{-1} σ²). Therefore, we could do some nice things such as test the hypothesis that β_j = 0 (i.e., that attribute x_j is not influencing the target y).

Nonlinear Regression

Question: What if we know that f(x; θ) is a nonlinear parametric function? For example, f(x; θ) = θ_0 + θ_1 x^θ_2 is a function nonlinear in its parameters.

Solution: Minimize MSE = (1/N) Σ_{i=1}^{N} (y_i − f(x_i; θ))².

Start from the necessary condition for a minimum:

∂MSE/∂θ_j = −(2/N) Σ_{i=1}^{N} (y_i − f(x_i; θ)) ∂f(x_i; θ)/∂θ_j = 0

Again, we have to solve M+1 nonlinear equations with M+1 unknowns, but this time a closed-form solution is not easy to derive.

Math Background: Unconstrained Optimization

Problem: Given f(x), find its minimum.

Popular solution: Use the gradient descent algorithm.

Idea: The gradient of f(x) at the minimum is the zero vector. So:
1. start from an initial guess x_0;
2. calculate the gradient ∇f(x_0);
3. move in the direction opposite to the gradient, i.e., generate a new guess x_1 = x_0 − α ∇f(x_0), where α is a properly selected constant;
4. repeat this process until convergence to the minimum.
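A minimal sketch of the gradient descent loop described above (the step size α, tolerance, and iteration cap are illustrative choices, not values from the lecture):

    import numpy as np

    def gradient_descent(grad_f, x0, alpha=0.01, tol=1e-6, max_iter=10000):
        # Minimize f by repeatedly stepping against its gradient grad_f.
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            step = alpha * grad_f(x)          # move opposite to the gradient
            x = x - step
            if np.linalg.norm(step) < tol:    # stop once the updates become tiny
                break
        return x

Restarting this loop from several different initial guesses x_0 and keeping the best result is the simple remedy for local minima discussed next.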

Two problems with the gradient descent algorithm:
1. It accepts convergence to a local minimum. The simplest way to avoid a local minimum is to repeat the procedure starting from multiple initial guesses x_0.
2. Possibly slow convergence to the minimum. A number of algorithms provide faster convergence (e.g., conjugate gradient; second-order methods such as Newton or quasi-Newton; non-derivative methods).

Back to solving nonlinear regression using the gradient descent procedure:

Step 1: Start from an initial guess for the parameters, θ^0.
Step k: Update the parameters as θ^{k+1} = θ^k − α ∇MSE(θ^k).

Special Case: For linear prediction the update step would be θ^{k+1} = θ^k + α X^T (Y − X θ^k).

Logistic Regression by MSE Minimization

Remember: Classification can be solved by MSE minimization methods (E[y | x] can be used to derive the posteriors P(C_i | x)).

Question: What functional form f(x; θ) is an appropriate choice for representing posterior class probabilities?

Option 1: What about the linear model f(x; θ) = Σ_{j=0}^{M} θ_j x_j? The range of this function goes beyond [0, 1], so it is not a good choice.

Option 2: We can use a sigmoid function to squeeze the output of a linear model into the range between 0 and 1: f(x; θ) = g(Σ_{j=0}^{M} θ_j x_j). If g(z) = e^z / (1 + e^z), optimizing f(x; θ) is called logistic regression.
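A minimal sketch of Option 2, the sigmoid-squashed linear model (the function names are illustrative; 1/(1 + e^{-z}) is the same function as e^z/(1 + e^z)):

    import numpy as np

    def sigmoid(z):
        # g(z) = e^z / (1 + e^z); squeezes any real value into (0, 1).
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_model(theta, X):
        # f(x; theta) = g(sum_j theta_j * x_j), usable as a posterior class probability.
        X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # dummy attribute x_0 = 1
        return sigmoid(X1 @ theta)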

Solution: Logistic regression can be solved by minimizing MSE. The derivative ∂MSE/∂θ_j is

∂MSE/∂θ_j = −(2/N) Σ_{i=1}^{N} (y_i − f(x_i; θ)) g'(Σ_{k=0}^{M} θ_k x_ik) x_ij

Note: Solving ∇_θ MSE = 0 results in M+1 nonlinear equations with M+1 unknowns; the optimization can be done using the gradient descent algorithm.

Maximum Likelihood (ML) Algorithm

Basic Idea: Given a data set D and a parametric model with parameters θ that describes the data generating process, the best solution θ* is the one that maximizes P(D | θ), i.e.,

θ* = argmax_θ P(D | θ)

P(D | θ) is called the likelihood, so the algorithm that finds the optimal solution θ* is called the maximum likelihood algorithm. This idea can be applied to both unsupervised and supervised learning problems.

ML for Unsupervised Learning: Density Estimation

Given D = {x_i, i = 1, ..., N}, and assuming the functional form p(x | θ) of the data generating process, the goal is to estimate the optimal parameters θ that maximize the likelihood P(D | θ):

P(D | θ) = P(x_1, x_2, ..., x_N | θ)

By assuming that the data points are independent and identically distributed (iid),

P(D | θ) = Π_{i=1}^{N} p(x_i | θ)

(p is the probability density function.)

Since log(x) is a monotonically increasing function of x, maximization of P(D | θ) is equivalent to maximization of l = log(P(D | θ)); l is called the log-likelihood. So,

l = log Π_{i=1}^{N} p(x_i | θ) = Σ_{i=1}^{N} log p(x_i | θ)

Example: Data set D = {x_i, i = 1, ..., N} is drawn from a Gaussian distribution with mean µ and standard deviation σ, i.e., X ~ N(µ, σ²). Therefore,

p(x | µ, σ²) = (1 / √(2πσ²)) exp(−(x − µ)² / (2σ²))

l = Σ_{i=1}^{N} [ −log √(2πσ²) − (x_i − µ)² / (2σ²) ]

The values µ and σ that maximize the log-likelihood satisfy the necessary condition for a local optimum:

∂l/∂µ = 0  ⇒  µ̂ = (1/N) Σ_{i=1}^{N} x_i
∂l/∂σ = 0  ⇒  σ̂² = (1/N) Σ_{i=1}^{N} (x_i − µ̂)²
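A minimal sketch of the Gaussian ML estimates derived above (the sample below is synthetic and purely illustrative):

    import numpy as np

    def gaussian_mle(x):
        # ML estimates for iid Gaussian data: sample mean and (biased) sample variance.
        mu_hat = np.mean(x)                       # mu_hat = (1/N) * sum_i x_i
        sigma2_hat = np.mean((x - mu_hat) ** 2)   # sigma2_hat = (1/N) * sum_i (x_i - mu_hat)^2
        return mu_hat, sigma2_hat

    # Estimates approach the true parameters (mu = 2, sigma = 3) as N grows.
    sample = np.random.normal(loc=2.0, scale=3.0, size=10000)
    print(gaussian_mle(sample))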

ML for Supervised Learning

Given D = {(x_i, y_i), i = 1, ..., N}, and assuming the functional form p(y | x, θ) of the data generating process, the goal is to estimate the optimal parameters θ that maximize the likelihood P(D | θ):

P(D | θ) = P(y_1, y_2, ..., y_N | x_1, x_2, ..., x_N, θ) = (if the data are iid) = Π_{i=1}^{N} p(y_i | x_i, θ)

ML for Regression

Assume the data generating process corresponds to y = f(x, θ) + e, where e ~ N(0, σ²).

Note: this is a relatively strong assumption!

Then y ~ N(f(x, θ), σ²), so

p(y | x, θ) = (1 / √(2πσ²)) exp(−(y − f(x, θ))² / (2σ²))

l = log P(D | θ) = −N log √(2πσ²) − (1 / (2σ²)) Σ_{i=1}^{N} (y_i − f(x_i, θ))²

Since σ is a constant, maximization of l is equivalent to minimization of Σ_{i=1}^{N} (y_i − f(x_i, θ))².

Important conclusion: Regression using ML under the assumption of a DGP with additive Gaussian noise is equivalent to regression using MSE minimization!!
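As a small numerical illustration of this conclusion (an illustrative sketch, not the lecture's own code): for a fixed σ, the Gaussian log-likelihood differs from the negative sum of squared errors only by a constant shift and a positive scale, so the θ that maximizes one minimizes the other.

    import numpy as np

    def sse(theta, X, y, f):
        # Sum of squared errors for a generic model f(X, theta).
        return np.sum((y - f(X, theta)) ** 2)

    def log_likelihood(theta, X, y, f, sigma=1.0):
        # Gaussian log-likelihood: -N*log(sqrt(2*pi)*sigma) - SSE/(2*sigma^2).
        n = len(y)
        return -n * np.log(np.sqrt(2.0 * np.pi) * sigma) - sse(theta, X, y, f) / (2.0 * sigma ** 2)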