

Direct Models for Classification
SCARF JHU Summer School, June 18, 2010
Patrick Nguyen (panguyen@microsoft.com)

Material presented
  - What is classification?
  - What is a linear classifier?
  - What are Direct Models?
  - How to deal with non-linear problems?
  - How to tackle structured problems?

Agenda (9AM)
  - Linear classification: probabilistic models; generative and direct models
  - Non-linear classification: mixture models; feature expansion; risk minimization
  - Structured classification: Conditional Random Fields; Segmental Conditional Random Fields

Classification
  - Assign a label Y to an observation X
  - {X}: observations (face images, from [Paul Ekman])
  - {Y}: Happiness, Anger, Surprise, Disgust, Sadness, Fear

Classification (2)
  - Which emotion (Y) is this face (X)?
  - INPUT: X = a face
  - OUTPUT: Y = one of Happiness, Anger, Surprise, Disgust, Sadness, Fear

Classification by machines
  - In applied machine learning, most of the intelligence is in the feature extraction
  - Pixels -> feature vector, e.g.:
      - Surface covered by teeth
      - Upward curvature of lips
      - Eyebrows-eyes distance
  - Abuse of notation: x ∈ R^D


Linear decision boundary
  - A hyperplane is drawn through the space of features
  - [Figure: decision boundary]

Features and Parameters
  - Feature vector φ is extracted from x
  - Parameter vector λ defines the direction
  - λ is independent of x
  - λ defines the classifier
  - Exercise: find the trigonometric functions (sin, cos, tan)

Probabilistic view
  - Log-linear family (exponential family)
  - Potential function
  - Partition function
  - It doesn't change anything
  - Feature functions?
  - It's a probability measure

Probabilities
  - Just a tool, but what a tool!
  - Parameter estimation (log-likelihood)
  - Interpretation (information theory)
  - Combination (Bayes Rule)
  - Inequalities and bounds

Maximum entropy principle (9:10AM)
  - Given the objective: [equation not preserved]
  - And the constraints: [equation not preserved]
  - Prove that p(z) is of log-linear form
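The probabilistic view can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log-linear form p(y|x) = exp(λ·φ(x,y)) / Z(x) is the standard one, but the function name, the three-label setup and all numbers below are hypothetical, not taken from the slides. The sketch shows that dividing by the partition function Z(x) turns potentials into a probability measure without changing which label scores highest.

```python
import numpy as np

def log_linear_posterior(lam, phi):
    """Return p(y|x) for every label y.

    lam: parameter vector, shape (D,)
    phi: one feature vector phi(x, y) per label, shape (num_labels, D)
    """
    scores = phi @ lam                  # log-potentials lam . phi(x, y)
    shifted = scores - scores.max()     # shift for numerical stability only
    potentials = np.exp(shifted)
    partition = potentials.sum()        # Z(x), sums over all labels
    return potentials / partition

# Hypothetical parameters and features for one observation x and 3 labels.
lam = np.array([1.0, -0.5, 0.25])
phi = np.array([[1.0, 0.0, 2.0],
                [0.0, 1.0, 0.0],
                [0.5, 0.5, 0.5]])

scores = phi @ lam
posterior = log_linear_posterior(lam, phi)

# Normalizing does not move the decision: both argmax calls agree.
print(scores.argmax(), posterior.argmax(), posterior.sum())
```

The decision only needs the linear scores; the normalization matters when probabilities are required, for example for parameter estimation or for combining models.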

Direct and generative models
  - Direct models: P(y|x)
  - Generative models: P(x|y) or P(x,y) (joint, or conditioned on y)
  - Just a change in the partition function

Direct Models (Engineers)
  - Solve problem directly
  - Y has smaller dimension
  - P(y|x) is easier to learn
  - When x is observed, y was intended
  - When lips are curved upward, subject must be happy

Generative Models (Academics)
  - Provide a mechanistic view of the whole world
  - P(x|y) is easier to understand and write down
  - When y is intended, x is produced
  - When happy, lips curve upward, and ...

Maximum likelihood estimate for λ
  - You are given a training set of T pairs (y_t, x_t)
  - You want λ for the following objective: [equation not preserved]

Convexity of the log likelihood function (9:25AM)
  - Prove that the log-likelihood (log p(y|x)) is concave in the parameters, i.e. that the negative log-likelihood is convex
  - What does this mean?

Agenda (9:40AM)
  - Linear classification: probabilistic models; generative and direct models
  - Non-linear classification: mixture models; feature expansion; risk minimization
  - Structured classification: Conditional Random Fields; Segmental Conditional Random Fields

Non-linear problems
  - Some problems are not linear
  - The exclusive or problem? [Minsky & Papert: Perceptrons]
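A minimal sketch of maximum likelihood estimation for a direct model follows. It assumes a multiclass log-linear model with one parameter row per label and plain gradient ascent; the function names, step size, number of steps and toy data are all hypothetical choices made for illustration, not the lecturer's procedure. Because the negative log-likelihood is convex in the parameters, a sufficiently small step size increases the log-likelihood at every iteration.

```python
import numpy as np

def log_likelihood_and_grad(lam, X, y):
    """Conditional log-likelihood sum_t log p(y_t | x_t) and its gradient.

    lam: parameters, shape (num_labels, D)
    X:   feature vectors, shape (T, D)
    y:   integer labels, shape (T,)
    """
    num_labels = lam.shape[0]
    scores = X @ lam.T
    scores = scores - scores.max(axis=1, keepdims=True)   # stability shift
    log_post = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    ll = log_post[np.arange(len(y)), y].sum()
    observed = np.eye(num_labels)[y]                      # one-hot labels
    # Gradient = observed feature counts minus expected feature counts.
    grad = (observed - np.exp(log_post)).T @ X
    return ll, grad

def fit(X, y, num_labels, steps=500, lr=0.2):
    lam = np.zeros((num_labels, X.shape[1]))
    history = []
    for _ in range(steps):
        ll, grad = log_likelihood_and_grad(lam, X, y)
        history.append(ll)
        lam = lam + lr * grad / len(y)                    # ascent step
    return lam, history

# Hypothetical toy training set: a bias feature plus one real feature.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0, 0, 1, 1])

lam, history = fit(X, y, num_labels=2)
predictions = (X @ lam.T).argmax(axis=1)
print(history[0], history[-1], predictions)
```

The toy data are linearly separable, so the likelihood keeps improving as the parameters grow; in practice a regularizer or a stopping rule would be added.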


Non-linear problems
  - Most problems are not quite linear
  - 90% rule

Mixture models
  - Implement the soft "or" operator
  - Linear interpolation of log-linear models

Feature expansion
  - Non-linear expansion of feature vector
  - No change in the function family of the model
  - E.g. decision trees (features: piecewise constant regions)

Feature expansion
  - Polynomials & splines
  - Gaussian mixture models (Parzen windows)
  - Potential

Empirical Bayes Risk
  - Not all errors are created equal
  - Financial risk [Carl Menger]
      - Can earn or lose some amount of money
      - Can be wiped out
  - Example:
      - Cost of administering medicine
      - Cost of refusing medicine
      - Finite amount of medicine
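Feature expansion can be illustrated on the exclusive-or problem raised under non-linear problems. The sketch below is an assumption-laden example, not material from the slides: the product feature x1*x2 and the hand-picked weights are hypothetical. It shows that the classifier stays linear in its parameters while the expanded feature vector makes XOR separable.

```python
import numpy as np

def expand(x):
    """Non-linear expansion: bias, the two raw inputs, and their product."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x2])

# Hand-picked (hypothetical) parameters; the model is still lam . phi(x).
lam = np.array([-0.5, 1.0, 1.0, -2.0])

def classify(x):
    """Return 1 when the linear score on expanded features is positive."""
    return int(lam @ expand(x) > 0)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
predictions = [classify(x) for x in inputs]
print(predictions)  # [0, 1, 1, 0], the XOR truth table
```

Without the product feature no choice of weights on (1, x1, x2) reproduces this table, which is the point of the exclusive-or example.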

Popular objective functions
  - Log-likelihood
  - Training Error
  - Empirical Bayes Risk
  - Margin (SVM)
  - Weighted likelihood

A word of caution
  - Non-linear models are complex
  - Always try the out-of-the-box solution first
      - Decision trees
      - Gaussian mixtures
      - Model mixtures
  - Most problems with underlying physical processes are approximately linear
  - It is impossible to move your tongue in quantum jumps
  - Don't go crazy with the model

It's all the same thing
  - To a large extent, model, features, and objective function are interchangeable
  - E.g.: Gaussian is a linear model with quadratic features
  - Least squares is maximum likelihood in a Gaussian model

Least squares as function optimization (9:50AM)
  - Gaussian and quadratic features: prove that a Gaussian is a log-quadratic model
  - Gaussian and least squares: prove that the ML estimate for the means is a least-squares problem

Agenda (>10AM)
  - Linear classification: probabilistic models; generative and direct models
  - Non-linear classification: mixture models; feature expansion; risk minimization
  - Structured classification: Conditional Random Fields; Segmental Conditional Random Fields
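The two Gaussian exercises can be sketched for the one-dimensional case. This is one possible line of argument, assuming a scalar observation x with mean μ and fixed variance σ²; the symbols λ₁, λ₂ and the feature vector φ(x) = (x, x²) are notation introduced here for illustration.

```latex
% A Gaussian is log-quadratic: its log-density is linear in the
% features \phi(x) = (x, x^2).
\begin{align*}
\log \mathcal{N}(x;\mu,\sigma^2)
  &= -\frac{(x-\mu)^2}{2\sigma^2} - \tfrac{1}{2}\log(2\pi\sigma^2) \\
  &= \underbrace{\frac{\mu}{\sigma^2}}_{\lambda_1}\, x
     \;+\; \underbrace{\Bigl(-\frac{1}{2\sigma^2}\Bigr)}_{\lambda_2}\, x^2
     \;-\; \underbrace{\Bigl(\frac{\mu^2}{2\sigma^2}
           + \tfrac{1}{2}\log(2\pi\sigma^2)\Bigr)}_{\log Z(\lambda)}
\end{align*}

% ML estimate of the mean is a least-squares problem (\sigma^2 fixed):
\begin{align*}
\hat{\mu}
  &= \arg\max_{\mu} \sum_{t=1}^{T} \log \mathcal{N}(x_t;\mu,\sigma^2)
   = \arg\min_{\mu} \sum_{t=1}^{T} (x_t-\mu)^2
   = \frac{1}{T}\sum_{t=1}^{T} x_t
\end{align*}
```

The terms that do not depend on μ drop out of the maximization, which is why only the sum of squared deviations remains.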

Structured classification
  - Structured refers to the output (y)
  - Structured input: anything which has variable size
      - Speech (long and short utterances)
  - Structured output
      - Word sentences
      - Parse tree

Conditional Random Fields

Graph classification
  - Let us have graphs for y and x
  - Need to make a decision about (y|x)
  - I want to know: p(y_1|x) < p(y_2|x)?

Graph classification
  - The answer is in the product space
  - [Figure: graph y with nodes 1, 2, 3, 4; graph x with nodes A, B, C, D; joint graph with nodes (1,A), (2,B), (3,B), (4,C), (4,D)]

The Markov assumption
  - Why not just enumerate all y's?
  - y_k does not depend on what happens elsewhere
  - The topology of y allows us to factorize the potential functions
  - For efficiency and cognitive tractability
  - Driven by problem topology

Markov assumption (2)
  - Product graph: [Figure: nodes (1,A), (2,B), (3,B), (4,C), (4,D) built from y (nodes 1 to 4) and x (nodes A to D)]
  - Unfactored (exponential # of paths):
      - 1,A; 2,B; 4,C
      - 1,A; 3,B; 4,C
      - 1,A; 2,B; 4,D
      - 1,A; 3,B; 4,D

Markov assumption (3)
  - Consider a sequence of two symbols y_1 and y_2
  - Unfactored: [equation not preserved]
  - Markov: [equation not preserved]
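The efficiency argument behind the Markov assumption can be sketched on a linear chain. The code below is a hedged illustration, not the lecture's formulation: it assumes per-position log-potentials `unary[t, y_t]` and a shared transition log-potential `pairwise[y_{t-1}, y_t]`, names and random values that are hypothetical. It computes the log partition function twice, once by enumerating every label sequence (exponential in the length) and once with the forward recursion that the factorization allows.

```python
import itertools
import numpy as np

def log_z_enumerate(unary, pairwise):
    """Unfactored view: score all K**n label sequences explicitly."""
    n, k = unary.shape
    totals = []
    for path in itertools.product(range(k), repeat=n):
        score = sum(unary[t, path[t]] for t in range(n))
        score += sum(pairwise[path[t - 1], path[t]] for t in range(1, n))
        totals.append(score)
    return np.logaddexp.reduce(np.array(totals))

def log_z_forward(unary, pairwise):
    """Markov view: forward recursion, cost O(n * K**2)."""
    alpha = unary[0].copy()
    for t in range(1, unary.shape[0]):
        # alpha_new[j] = logsum_i(alpha[i] + pairwise[i, j]) + unary[t, j]
        alpha = np.logaddexp.reduce(alpha[:, None] + pairwise, axis=0) + unary[t]
    return np.logaddexp.reduce(alpha)

# Hypothetical potentials: a chain of 4 positions with 3 labels each.
rng = np.random.default_rng(0)
unary = rng.normal(size=(4, 3))
pairwise = rng.normal(size=(3, 3))

brute = log_z_enumerate(unary, pairwise)
fast = log_z_forward(unary, pairwise)
print(brute, fast)
```

Both routines sum over the same 3**4 = 81 paths; the forward version never lists them because each step only needs the previous position's totals.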

Take-home message
  - Classification works by turning observations into a fixed-dimension feature vector
  - The linear classifier is the mother of all classifiers
  - Log-linear probabilistic interpretation is useful
  - Structured classification relies on Markov assumptions for efficiency
  - Off-the-shelf non-linear expansions are useful
  - The magic is in the features!
