IAML: Logistic Regression — Decision Boundaries, Example Data

IAML: Logistic Regression
Charles Sutton and Victor Lavrenko
School of Informatics, Semester

Outline
- Logistic function
- Logistic regression
- Learning logistic regression
- Optimization
- The power of non-linear basis functions
- Least-squares classification
- Generative and discriminative models
- Relationships to generative models
- Multiclass classification
Reading: W & F 4.6 (but pairwise classification, the perceptron learning rule, and Winnow are not required).

Decision Boundaries
In this class we will discuss linear classifiers. For each class, there is a region of feature space in which the classifier assigns that class. The decision boundary is the boundary of this region (i.e., where the two classes are tied). In linear classifiers the decision boundary is a line.

Example Data
[Figure: plot of the two-class example data used throughout the lecture.]
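As a concrete illustration of "where the two classes are tied", here is a minimal sketch with made-up per-class linear scores (the weights and helper names are mine, not from the lecture); the boundary is the set of points where the two scores are equal:

    import numpy as np

    # Hypothetical per-class linear scores for a 2-D input x = (x1, x2).
    w_class0, b_class0 = np.array([1.0, -2.0]), 0.5
    w_class1, b_class1 = np.array([-1.0, 1.0]), 0.0

    def predict(x):
        """Return the class whose linear score is larger."""
        s0 = w_class0 @ x + b_class0
        s1 = w_class1 @ x + b_class1
        return int(s1 > s0)

    # The decision boundary is where the scores tie:
    # (w_class1 - w_class0) . x + (b_class1 - b_class0) = 0, i.e. a line in 2-D.
    print(predict(np.array([0.0, 0.0])))   # scores 0.5 vs 0.0 -> class 0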

Linear Classifiers: A Geometric View
In a two-class linear classifier, we learn a function F(x, w) = w^T x + w_0 that represents how aligned the instance is with y = 1. w are parameters of the classifier that we learn from data. To do prediction for an input x: predict (y = 1) if F(x, w) > 0.

Explanation of the Geometric View
The decision boundary in the previous case is {x : w^T x + w_0 = 0}. w is a normal vector to this surface. (Remember how lines can be written in terms of their normal vector.) Notice that in more than 2 dimensions, this boundary will be a hyperplane.

Two Class Discrimination
For now consider a two-class case: y ∈ {0, 1}. From now on we will write x = (1, x_1, x_2, ..., x_d) and w = (w_0, w_1, ..., w_d). We want a linear, probabilistic model. We could try P(y = 1 | x) = w^T x, but this is a poor choice: w^T x is not constrained to lie in [0, 1]. Instead what we will do is P(y = 1 | x) = f(w^T x), where f must be between 0 and 1; it will squash the real line into [0, 1]. Furthermore, the fact that probabilities sum to one means P(y = 0 | x) = 1 − f(w^T x).
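A minimal sketch of the geometric view with assumed toy weights (w, w_0, and the helper names are mine): the score F(x, w) = w^T x + w_0 is thresholded at 0, and points on the decision boundary score exactly 0.

    import numpy as np

    w = np.array([2.0, -1.0])   # assumed weights for a 2-D toy example
    w0 = 0.5                    # bias term

    def F(x):
        """Linear score F(x, w) = w.x + w0: how aligned x is with y = 1."""
        return w @ x + w0

    def predict(x):
        """Predict y = 1 iff the score is positive."""
        return int(F(x) > 0)

    # w is a normal vector to the decision boundary {x : w.x + w0 = 0}:
    # any point on the boundary scores exactly 0.
    x_on_boundary = np.array([0.0, 0.5])   # satisfies 2*0 - 1*0.5 + 0.5 = 0
    print(F(x_on_boundary))                # 0.0
    print(predict(np.array([1.0, 0.0])))   # score 2.5 > 0 -> predict 1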

The Logistic Function
We need a function that returns probabilities (i.e. stays between 0 and 1). The logistic function provides this:
f(z) = σ(z) = 1 / (1 + exp(−z)).
As z goes from −∞ to ∞, f goes from 0 to 1: a "squashing" function. It has a sigmoid shape (i.e. an S-like shape).
[Figure: plot of σ(z) for z from −6 to 6.]

Linear Weights
Linear weights + logistic squashing function = logistic regression. We model the class probabilities as
p(y = 1 | x) = σ( Σ_{j=0}^{D} w_j x_j ) = σ(w^T x).
σ(z) = 0.5 when z = 0; hence the decision boundary is given by w^T x = 0. The decision boundary is an (M − 1)-dimensional hyperplane for an M-dimensional problem.

Logistic Regression
For this slide write w = (w_1, w_2, ..., w_d) (i.e., exclude the bias w_0). The bias parameter w_0 shifts the position of the hyperplane but does not alter its angle. The direction of the vector w affects the angle of the hyperplane; the hyperplane is perpendicular to w. The magnitude of w affects how certain the classifications are: for small w, most of the probabilities within a region of the decision boundary will be near 0.5; for large w, probabilities in the same region will be close to 0 or 1.

Learning Logistic Regression
We want to set the parameters w using training data. As before:
- Write out the model and hence the likelihood.
- Find the derivatives of the log likelihood w.r.t. the parameters.
- Adjust the parameters to maximize the log likelihood.
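A small sketch (toy numbers of my own, not from the slides) of the logistic function and of how the magnitude of w controls confidence: scaling w leaves the boundary w^T x = 0 in place but pushes probabilities near it towards 0 or 1.

    import numpy as np

    def sigmoid(z):
        """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
        return 1.0 / (1.0 + np.exp(-z))

    print(sigmoid(0.0))                  # 0.5: points on the boundary are maximally uncertain
    print(sigmoid(-6.0), sigmoid(6.0))   # ~0.0025 and ~0.9975: the S-shaped squashing

    w = np.array([1.0, 1.0])             # assumed weight vector (bias omitted for brevity)
    x_near = np.array([0.1, 0.1])        # a point just on the positive side of the boundary

    for scale in (1.0, 10.0):
        p = sigmoid((scale * w) @ x_near)
        print(scale, p)                  # scale 1 -> ~0.55 (barely confident), scale 10 -> ~0.88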

Likelihood
Assume the data are independent and identically distributed. Call the data set D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}. The likelihood is
p(D | w) = Π_{i=1}^{n} p(y = y_i | x_i, w) = Π_{i=1}^{n} p(y = 1 | x_i, w)^{y_i} (1 − p(y = 1 | x_i, w))^{1 − y_i}.
Hence the log likelihood L(w) = log p(D | w) is given by
L(w) = Σ_{i=1}^{n} [ y_i log σ(w^T x_i) + (1 − y_i) log(1 − σ(w^T x_i)) ].

Optimization
It turns out that the likelihood has a unique optimum (given sufficient training examples): the log likelihood is concave (equivalently, the negative log likelihood is convex). How to maximize? Take the gradient:
∂L/∂w_j = Σ_{i=1}^{n} (y_i − σ(w^T x_i)) x_ij.
(Aside: something similar holds for linear regression, ∂E/∂w_j = Σ_{i=1}^{n} (w^T φ(x_i) − y_i) φ_j(x_i), where E is the squared error.)
Unfortunately, you cannot maximize L(w) in closed form as for linear regression. You need to use a numerical method (see next lecture).

Geometric Intuition of the Gradient
Let's say there is only one training point, D = {(x, y)}. Then
∂L/∂w_j = (y − σ(w^T x)) x_j.
Also assume y = 1 (it will be symmetric for y = 0). Note that (y − σ(w^T x)) is then always positive, because σ(z) < 1 for all z. There are three cases:
- x is classified as the right answer with high confidence, e.g., σ(w^T x) = 0.99.
- x is classified wrong, e.g., σ(w^T x) = 0.2.
- x is classified correctly, but only just barely, e.g., σ(w^T x) = 0.6.

Remember: the gradient is the direction of steepest increase. We want to maximize, so let's nudge the parameters in the direction ∇L.
- If σ(w^T x) is correct, e.g., 0.99: then (y − σ(w^T x)) is nearly 0, so we barely change w_j.
- If σ(w^T x) is wrong, e.g., 0.2: this means w^T x is negative when it should be positive. The gradient has the same sign as x_j, so if we nudge w_j along it, w_j will increase if x_j > 0 or decrease if x_j < 0. Either way w^T x goes up!
- If σ(w^T x) is just barely correct, e.g., 0.6: the same thing happens as if we were wrong, just more slowly.
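A compact sketch of the log likelihood and gradient formulas above, applied to made-up toy data with plain gradient ascent (the data, learning rate, and iteration count are arbitrary choices of mine; the next lecture covers proper numerical optimizers):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def log_likelihood(w, X, y):
        """L(w) = sum_i y_i log sigma(w.x_i) + (1 - y_i) log(1 - sigma(w.x_i))."""
        p = sigmoid(X @ w)
        return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    def gradient(w, X, y):
        """dL/dw_j = sum_i (y_i - sigma(w.x_i)) x_ij, written as one matrix product."""
        return X.T @ (y - sigmoid(X @ w))

    rng = np.random.default_rng(0)
    X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])  # prepend x_0 = 1 for the bias
    y = (X[:, 1] + X[:, 2] > 0).astype(float)                    # toy labels

    w = np.zeros(3)
    for _ in range(200):                 # plain gradient ascent, just to watch L(w) increase
        w += 0.1 * gradient(w, X, y)
    print(log_likelihood(w, X, y))       # negative, and climbing towards 0 on this separable toy data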

Fitting this into the general structure for learning algorithms:
- Define the task: classification, discriminative.
- Decide on the model structure: the logistic regression model.
- Decide on the score function: the log likelihood.
- Decide on the optimization/search method to optimize the score function: a numerical optimization routine. Note we have several choices here (stochastic gradient descent, conjugate gradient, BFGS).

XOR and Linear Separability
A problem is linearly separable if we can find weights so that
- w^T x + w_0 > 0 for all positive cases (where y = 1), and
- w^T x + w_0 ≤ 0 for all negative cases (where y = 0).
XOR is a classic failure case for the perceptron. XOR can be solved by a perceptron using a nonlinear transformation φ(x) of the input; can you find one? (One possibility is sketched at the end of this block.)

The Power of Non-linear Basis Functions
As for linear regression, we can transform the input space if we want: x → φ(x). For example, using two Gaussian basis functions φ_1(x) and φ_2(x), a problem that is not linearly separable in the original space can become separable in the (φ_1, φ_2) space.
[Figure credit: Chris Bishop, PRML]

Generative and Discriminative Models
Notice that we have done something very different here than with naive Bayes.
- Naive Bayes: we modelled how a class generated the feature vector, p(x | y), and could then classify using p(y | x) ∝ p(x | y) p(y). This is called a generative approach.
- Logistic regression: we model p(y | x) directly. This is a discriminative approach.
Discriminative advantage: why spend effort modelling p(x)? It seems a waste; we are always given it as input.
Generative advantage: it can be good with missing data (remember how naive Bayes handles missing data). It is also good for detecting outliers. Or, sometimes you really do want to generate the input.
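On the XOR question above: one well-known nonlinear transformation that works (a sketch under my own choice of φ, not necessarily the one intended in the lecture) is to append the product feature x_1 x_2:

    import numpy as np

    # XOR: not linearly separable in the original (x1, x2) space.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0, 1, 1, 0])

    def phi(x):
        """Nonlinear basis expansion: append the product x1 * x2."""
        return np.array([x[0], x[1], x[0] * x[1]])

    # In the expanded space a single linear rule works, e.g.
    # score = x1 + x2 - 2 * x1 * x2 - 0.5, thresholded at 0.
    w, w0 = np.array([1.0, 1.0, -2.0]), -0.5
    for x, target in zip(X, y):
        score = w @ phi(x) + w0
        print(x, int(score > 0), target)   # predictions match the XOR labels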

Generative Classifiers can be Linear Too
Two scenarios where naive Bayes gives you a linear classifier:
1. Gaussian data with equal covariance. If p(x | y = 0) ~ N(µ_1, Σ) and p(x | y = 1) ~ N(µ_2, Σ), then p(y = 1 | x) = σ(w^T x + w_0) for some (w_0, w) that depends on µ_1, µ_2, Σ and the class priors.
2. Binary data. Let each component x_j be a Bernoulli variable, i.e. x_j ∈ {0, 1}. Then a naive Bayes classifier has the form p(y = 1 | x) = σ(w^T x + w_0).
3. Exercise for keeners: prove these two results.

Multiclass Classification
Create a different weight vector w_k for each class, then use the softmax function
p(y = k | x) = exp(w_k^T x) / Σ_{j=1}^{C} exp(w_j^T x).
Note that p(y = k | x) ≥ 0 and Σ_{j=1}^{C} p(y = j | x) = 1. This is the natural generalization of logistic regression to more than 2 classes.

Least-squares Classification
Logistic regression is more complicated algorithmically than linear regression, so why not just use linear regression with 0/1 targets? The figure compares the two: the least-squares boundary is pulled away by points lying far from the boundary, whereas the logistic regression boundary is not.
[Figure: green, logistic regression; magenta, least-squares regression. Figure credit: Chris Bishop, PRML]

Summary
- The logistic function, logistic regression.
- Hyperplane decision boundary.
- The perceptron, linear separability.
- We still need to know how to compute the maximum of the log likelihood. Coming soon!
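A minimal sketch of the softmax generalization with made-up weights for C = 3 classes (W, x, and the helper name softmax_probs are mine); it checks that the class probabilities are non-negative and sum to one:

    import numpy as np

    def softmax_probs(W, x):
        """p(y = k | x) = exp(w_k.x) / sum_j exp(w_j.x), one row of W per class."""
        scores = W @ x
        scores -= scores.max()          # subtract the max for numerical stability
        e = np.exp(scores)
        return e / e.sum()

    W = np.array([[ 1.0,  0.0],         # assumed weight vectors for 3 classes
                  [ 0.0,  1.0],
                  [-1.0, -1.0]])
    x = np.array([2.0, 0.5])

    p = softmax_probs(W, x)
    print(p, p.sum())                   # probabilities over 3 classes, summing to 1
    print(p.argmax())                   # predicted class: the largest w_k.x wins -> 0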