Logistic Classifier CISC 5800 Professor Daniel Leeds

Transcription:

Classification strategy: generative vs. discriminative

Generative, e.g., Bayes/Naive Bayes:
- Identify a probability distribution for each class
- Determine the class with maximum probability for the data example

Discriminative, e.g., logistic regression:
- Identify the boundary between classes
- Determine which side of the boundary a new data example falls on

Linear algebra: data features

- Vector: a list of numbers; each number describes a data feature
- Matrix: a list of lists of numbers; the features for each data point

[Table: number of word occurrences of wolf, lion, monkey, broker, analyst, and dividend in Document 1, Document 2, and Document 3; the individual counts were not fully preserved in transcription.]

Feature space: each data feature defines a dimension in space, so each document becomes a point whose coordinates are its word counts.
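To make the vector/matrix view concrete, here is a minimal sketch; the documents and the resulting counts are made up for illustration and are not the values from the slide's table.

```python
# Represent each document as a vector of word counts over a fixed vocabulary.
# The vocabulary matches the slide; the documents are illustrative only.
vocabulary = ["wolf", "lion", "monkey", "broker", "analyst", "dividend"]

documents = [
    "the wolf chased the lion while the monkey watched the wolf",  # nature-themed
    "the broker told the analyst the dividend would grow",         # finance-themed
]

def count_features(text, vocab):
    """Return a list of word-occurrence counts, one per vocabulary word."""
    words = text.lower().split()
    return [words.count(v) for v in vocab]

# Each row is one document (a data point); each column is one feature (a dimension).
feature_matrix = [count_features(doc, vocabulary) for doc in documents]

for doc, row in zip(documents, feature_matrix):
    print(row, "<-", doc)
```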

The dot product

The dot product compares two vectors a = (a_1, ..., a_n) and b = (b_1, ..., b_n):
  a . b = sum_{i=1}^n a_i b_i = a^T b

The dot product, continued

- The squared magnitude of a vector is the sum of the squares of its elements: |a|^2 = sum_i a_i^2
- If a has unit magnitude, a . b is the projection of b onto a
  Example: a = (0.6, 0.8), b = (1.5, 1): a . b = 0.6*1.5 + 0.8*1 = 0.9 + 0.8 = 1.7
  With c = (0, 0.5): a . c = 0.6*0 + 0.8*0.5 = 0 + 0.4 = 0.4

Separating boundary, defined by w and b

- A separating hyperplane splits class 1 and class 2
- The hyperplane is defined by the vector w perpendicular to it and the offset b; points on the plane satisfy w^T x + b = 0
[Figure: a 2D feature space split by the line w^T x + b = 0, with w drawn perpendicular to the boundary.]
- Is data point x in class 1 or class 2?
  w^T x + b > 0 -> class 1
  w^T x + b < 0 -> class 2
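A small sketch of this decision rule; the weight vector and bias here are arbitrary illustrative values, not the (partially garbled) example from the slide.

```python
# Classify points by which side of the hyperplane w^T x + b = 0 they fall on.
def dot(a, b):
    """Dot product a . b = sum_i a_i * b_i."""
    return sum(ai * bi for ai, bi in zip(a, b))

w = [1.0, 2.0]   # vector perpendicular to the separating hyperplane (assumed)
b = -4.0         # offset (assumed)

def classify(x):
    """Return class 1 if w^T x + b > 0, else class 2."""
    return 1 if dot(w, x) + b > 0 else 2

print(classify([5.0, 1.0]))  # 1*5 + 2*1 - 4 =  3 > 0 -> class 1
print(classify([1.0, 1.0]))  # 1*1 + 2*1 - 4 = -1 < 0 -> class 2
```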

Notational simplification

Recall: w^T x = w . x = sum_{i=1}^n w_i x_i
Define x'_{1:n} = x_{1:n} and x'_{n+1} = 1 for all inputs x, and w'_{1:n} = w_{1:n} and w'_{n+1} = b.
Now w'^T x' = w^T x + b.
From here on, assume x_{n+1} = 1 always and w_{n+1} = b always, so the bias is folded into the weight vector.

From real-number projection to 0/1 label

- Binary classification: 0 is class A, 1 is class B
- Rather than modeling p(x|y) as a generative classifier would, the sigmoid function g(h) models the label probability directly:
  g(h) = 1 / (1 + e^{-h})
  p(y=1 | x; θ) = g(w^T x) = 1 / (1 + e^{-w^T x})
  p(y=0 | x; θ) = 1 - g(w^T x) = e^{-w^T x} / (1 + e^{-w^T x})
  where h = w^T x = sum_j w_j x_j + b
[Plot: g(h) vs. h; g rises from 0 toward 1 and passes through 0.5 at h = 0.]

Learning parameters for classification

Similar to MLE for the Bayes classifier: form the likelihood of the labels y_1, ..., y_n (different from the Bayesian likelihood).
- If y_i is in class A, y_i = 0: multiply in (1 - g(x_i; w))
- If y_i is in class B, y_i = 1: multiply in g(x_i; w)

argmax_w L(y | x; w) = argmax_w prod_i [1 - g(x_i; w)]^{(1 - y_i)} [g(x_i; w)]^{y_i}

Taking the log:
  sum_i (1 - y_i) log(1 - g(x_i; w)) + y_i log g(x_i; w)
  = sum_i y_i log[ g(x_i; w) / (1 - g(x_i; w)) ] + log(1 - g(x_i; w))
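A sketch of this probability model and log-likelihood, assuming each input vector already carries the constant feature x_{n+1} = 1 so that the bias b is folded into w; the toy data at the bottom is purely illustrative.

```python
import math

def sigmoid(h):
    """g(h) = 1 / (1 + e^{-h})."""
    return 1.0 / (1.0 + math.exp(-h))

def p_class_B(x, w):
    """p(y=1 | x; w) = g(w^T x); assumes x includes the constant 1 feature."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

def log_likelihood(X, y, w):
    """sum_i (1 - y_i) log(1 - g(x_i; w)) + y_i log g(x_i; w)."""
    total = 0.0
    for xi, yi in zip(X, y):
        g = p_class_B(xi, w)
        total += (1 - yi) * math.log(1 - g) + yi * math.log(g)
    return total

# Tiny illustrative dataset: one real feature plus the constant 1 feature.
X = [[2.0, 1.0], [-1.0, 1.0]]
y = [1, 0]
print(log_likelihood(X, y, [0.5, 0.0]))
```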

Learning parameters for classification, continued

  sum_i y_i log[ g(x_i; w) / (1 - g(x_i; w)) ] + log(1 - g(x_i; w))
  = sum_i y_i log[ (1 / (1 + e^{-w^T x_i})) * ((1 + e^{-w^T x_i}) / e^{-w^T x_i}) ] + log( e^{-w^T x_i} / (1 + e^{-w^T x_i}) )
  = sum_i y_i w^T x_i - w^T x_i - log(1 + e^{-w^T x_i})

Using g(h) = 1 / (1 + e^{-h}) and g'(h) = e^{-h} / (1 + e^{-h})^2, the derivative with respect to each weight w_j is:

  dL/dw_j = sum_i y_i x_{i,j} - x_{i,j} + x_{i,j} e^{-w^T x_i} / (1 + e^{-w^T x_i})
          = sum_i x_{i,j} ( y_i - g(w^T x_i) )

Iterative gradient ascent

- Begin with initial guessed weights w
- For each data point (y_i, x_i), update each weight w_j:
  w_j <- w_j + ε x_{i,j} ( y_i - g(w^T x_i) )
- Choose the step size ε so the change is not too big or too small

In the update term x_{i,j} ( y_i - g(w^T x_i) ), y_i is the true data label and g(w^T x_i) is the computed data label.

Intuition:
- If y_i = 1 and g(w^T x_i) = 0, and x_{i,j} > 0, make w_j larger and push w^T x_i to be larger
- If y_i = 0 and g(w^T x_i) = 1, and x_{i,j} > 0, make w_j smaller and push w^T x_i to be smaller

Iterative gradient ascent, big picture

- Initialize w with random values
- Loop across all training data points x_i, updating every feature weight w_j
- Repeat this loop many times (multiple passes over the training data)
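A minimal sketch of the per-example update rule w_j <- w_j + ε x_{i,j}(y_i - g(w^T x_i)); the step size, number of passes, and toy dataset are illustrative choices, and the weights start at zero rather than random values for reproducibility.

```python
import math

def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

def train(X, y, epsilon=0.1, num_passes=100):
    """Iterative gradient ascent for logistic regression.

    X: list of feature vectors (each ending with the constant 1 feature).
    y: list of 0/1 labels. epsilon and num_passes are illustrative values.
    """
    w = [0.0] * len(X[0])           # initial guessed weights (slide uses random)
    for _ in range(num_passes):     # repeat the loop over the data many times
        for xi, yi in zip(X, y):    # loop across all training data points
            g = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            for j in range(len(w)): # update each feature weight
                w[j] += epsilon * xi[j] * (yi - g)
    return w

# Toy data: class B (y=1) when the first feature is large.
X = [[3.0, 1.0], [2.5, 1.0], [-2.0, 1.0], [-3.5, 1.0]]
y = [1, 1, 0, 0]
w = train(X, y)
print(w)
print([round(sigmoid(sum(wj * xj for wj, xj in zip(w, xi))), 2) for xi in X])
```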

Gradient ascent for L(y | x; w)

Typical gradient ascent can get stuck in local maxima. Here,
  L(y | x; w) = prod_i [1 - g(x_i; w)]^{(1 - y_i)} [g(x_i; w)]^{y_i}
and its log are concave in w, so there is at most one maximum.

MAP for a discriminative classifier

- MLE: P(y=1 | x; w) ~ g(w^T x)
- MAP: P(y=1, w | x) ∝ P(y=1 | x; w) P(w) ~ g(w^T x) * P(w) (different from the Bayesian posterior)
- P(w) acts as a prior on the weights:
  - L2 regularization: keep the magnitudes of all weights small
  - L1 regularization: keep the number of non-zero weights small

L1 and L2 update rules

  L1: w_j <- w_j + ε [ x_{i,j} ( y_i - g(w^T x_i) ) - λ sgn(w_j) ]
  L2: w_j <- w_j + ε [ x_{i,j} ( y_i - g(w^T x_i) ) - λ w_j ]

  where sgn(w_j) = +1 if w_j > 0, and -1 if w_j < 0

Thinking about your data: numeric and non-numeric features

Data to be classified can have multiple features x = (x_1, ..., x_n). Each feature could be:
- Numeric: e.g., loudness of music, measured in decibels
- Non-numeric: e.g., an Action, including Laugh, Cry, Jump, Dance
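A sketch of the regularized per-example updates; the λ, ε, and example weights are illustrative. The only change from the unregularized update is the penalty term subtracted from the gradient.

```python
import math

def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

def sgn(v):
    """sgn(w_j) = +1 if w_j > 0, -1 if w_j < 0 (0 at exactly 0)."""
    return (v > 0) - (v < 0)

def regularized_update(w, xi, yi, epsilon=0.1, lam=0.01, penalty="L2"):
    """One per-example gradient-ascent step with an L1 or L2 penalty."""
    g = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
    new_w = []
    for wj, xj in zip(w, xi):
        grad = xj * (yi - g)
        if penalty == "L1":
            grad -= lam * sgn(wj)   # pushes individual weights toward exactly zero
        else:
            grad -= lam * wj        # shrinks the magnitude of every weight
        new_w.append(wj + epsilon * grad)
    return new_w

w = [0.5, -0.2, 0.0]
print(regularized_update(w, [1.0, 2.0, 1.0], 1, penalty="L1"))
print(regularized_update(w, [1.0, 2.0, 1.0], 1, penalty="L2"))
```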

Classifier choice

- Logistic regression only makes sense for numeric data
- Gaussian Bayes only makes sense for numeric data
- Multinomial Bayes makes sense for non-numeric data

Non-numeric features -> numeric

You may map non-numeric features to a continuous space.
Example: Mood = {Depressed, Disappointed, Neutral, Happy, Excited}
Switch to: HappinessLevel = {-2, -1, 0, 1, 2}
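A sketch of this Mood-to-HappinessLevel encoding, using the {-2, ..., 2} scale reconstructed above, so the resulting numeric value can be fed to a numeric classifier such as logistic regression.

```python
# Map a non-numeric Mood feature onto a numeric HappinessLevel scale.
happiness_level = {
    "Depressed": -2,
    "Disappointed": -1,
    "Neutral": 0,
    "Happy": 1,
    "Excited": 2,
}

def encode_mood(mood):
    """Return the numeric HappinessLevel for a Mood value."""
    return happiness_level[mood]

print(encode_mood("Disappointed"))  # -1
print(encode_mood("Excited"))       # 2
```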