Loss Functions, Decision Theory, and Linear Models

CMSC 678, UMBC
January 31st, 2018
Some slides adapted from Hamed Pirsiavash

Logistics Recap
Piazza (ask & answer questions): https://piazza.com/umbc/spring2018/cmsc678
Course site: https://www.csee.umbc.edu/courses/graduate/678/spring18/
Evaluation submission site: https://www.csee.umbc.edu/courses/graduate/678/spring18/submit

Course Announcement: Assignment 1
Due Wednesday, 2/7 (~7 days)
Math & programming review
Discuss with others, but write, implement, and complete on your own

Recap from last time

What does it mean to learn? Generalization

Machine Learning Framework: Learning
[Diagram: instances 1 to 4 feed a Machine Learning Predictor with scoring model $score_\theta(X)$; an Evaluator compares the predictions with gold/correct labels and extra-knowledge, produces a score, and gives feedback to the predictor through the objective $F(\theta)$.]
Instances are typically examined independently.

Gradient Ascent

Model, parameters and hyperparameters
Model: mathematical formulation of system (e.g., classifier)
Parameters: primary "knobs" of the model that are set by a learning algorithm
Hyperparameters: secondary "knobs"
Image: http://www.uiparade.com/wp-content/uploads/2012/01/ui-design-pure-css.jpg

A Terminology Buffet
the task (what kind of problem are you solving?): Classification, Regression, Clustering
the data (amount of human input/number of labeled examples): Fully-supervised, Semi-supervised, Un-supervised
the approach (how any data are being used): Probabilistic, Generative, Conditional, Spectral, Neural, Memory-based, Exemplar

Classification: Supervised Machine Learning
Examples: assigning subject categories, topics, or genres; spam detection; authorship identification; age/gender identification; language identification; sentiment analysis
Input:
* an instance d
* a fixed set of classes $C = \{c_1, c_2, \ldots, c_J\}$
* a training set of m hand-labeled instances $(d_1, c_1), \ldots, (d_m, c_m)$
Output: a learned classifier γ that maps instances to classes
γ learns to associate certain features of instances with their labels

Classification Example: Face Recognition
What is a good representation for images? Pixel values? Edges?
Courtesy Hamed Pirsiavash

Ingredients for classification
Inject your knowledge into a learning system
Feature representation
* Problem specific
* Difficult to learn from bad ones
Training data: labeled examples
* Labeling data == $$$
* Sometimes data is available for free
Model
* No single learning algorithm is always good ("no free lunch")
* Different learning algorithms work differently
Courtesy Hamed Pirsiavash

Sequence & Structured Prediction Courtesy Hamed Pirsiavash

Regression Like classification, but real-valued

Regression Example: Stock Market Prediction Courtesy Hamed Pirsiavash

Unsupervised learning: Clustering Courtesy Hamed Pirsiavash

Inductive Bias
What do we know before we see the data?
A, B, C, D: partition these into two groups
Courtesy Hamed Pirsiavash

Machine Learning Framework
[Diagram: instances 1 to 4 feed a Machine Learning Predictor; an Evaluator compares its output with gold/correct labels and extra-knowledge to produce a score. The predictor's output can also feed another ML task (or consumer-facing product).]
Instances are typically examined independently.

Today's Goal: Optimize Empirical Risk of Surrogate Loss
* formulate a linear prediction model: $\hat{y}_i = w^T x_i + b$
* learn about empirical risk minimization: $\arg\min_h \sum_{i=1}^{N} \ell(y_i, h_\theta(x_i))$, where the sum is the objective $F$, with gradient $\nabla_\theta F = \sum_i \frac{\partial \ell(y_i, \hat{y} = h_\theta(x_i))}{\partial \hat{y}} \nabla_\theta h_\theta(x_i)$
* approximate 0-1 classification loss in a computable way

(Most) Probability Axioms
* $p(\text{everything}) = 1$
* $p(\emptyset) = 0$
* $p(A) \le p(B)$, when $A \subseteq B$
* $p(A \cup B) = p(A) + p(B)$, when $A \cap B = \emptyset$
* in general, $p(A \cup B) \le p(A) + p(B)$
* $p(A \cup B) = p(A) + p(B) - p(A \cap B)$

Conditional Probability
$p(X \mid Y) = \frac{p(X \,\&\, Y)}{p(Y)}$
Conditional Probabilities are Probabilities
$p(Y) = \int p(X \,\&\, Y)\, dX$

Marginal(ized) Probability: The Discrete Case
[Diagram: the event y is partitioned into the pieces $x_1 \,\&\, y$, $x_2 \,\&\, y$, $x_3 \,\&\, y$, $x_4 \,\&\, y$.]
$p(y) = \sum_x p(x, y) = \sum_x p(x)\, p(y \mid x)$

Bayes Rule
$p(X \mid Y) = \frac{p(Y \mid X)\, p(X)}{p(Y)}$
* $p(X \mid Y)$: posterior probability
* $p(Y \mid X)$: likelihood
* $p(X)$: prior probability
* $p(Y)$: marginal likelihood (probability)
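As a small numeric illustration of Bayes rule and the discrete marginal, the sketch below uses made-up numbers: the labels `spam`/`ham`, the observations `offer`/`meeting`, and all probabilities are hypothetical and not from the lecture.

```python
# Hypothetical toy numbers (not from the lecture): labels X and observations Y.
prior = {"spam": 0.3, "ham": 0.7}                # p(X)
likelihood = {                                   # p(Y | X)
    "spam": {"offer": 0.8, "meeting": 0.2},
    "ham": {"offer": 0.1, "meeting": 0.9},
}

def marginal(y):
    """p(Y) = sum over x of p(Y | x) p(x): the discrete marginal."""
    return sum(likelihood[x][y] * prior[x] for x in prior)

def posterior(x, y):
    """Bayes rule: p(X | Y) = p(Y | X) p(X) / p(Y)."""
    return likelihood[x][y] * prior[x] / marginal(y)

p_offer = marginal("offer")
posteriors = {x: posterior(x, "offer") for x in prior}
print(p_offer, posteriors)
```

Because the posterior is itself a probability distribution over X, the values in `posteriors` sum to 1.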

Maximum A Posteriori Classification
Examples: assigning subject categories, topics, or genres; spam detection; authorship identification; age/gender identification; language identification; sentiment analysis
$p(X \mid Y) = \frac{p(Y \mid X)\, p(X)}{p(Y)}$
* $X$: class
* $Y$: observed data
* $p(Y \mid X)$: class-based likelihood (language model)
* $p(X)$: prior probability of class
* $p(Y)$: observation likelihood (averaged over all classes)

Classify with Bayes Rule
$\arg\max_X p(X \mid Y)$
$= \arg\max_X \frac{p(Y \mid X)\, p(X)}{p(Y)}$ (the denominator $p(Y)$ is constant with respect to X)
$= \arg\max_X p(Y \mid X)\, p(X)$
$= \arg\max_X \log p(Y \mid X) + \log p(X)$
* $\log p(X)$: how likely is label X overall?
* $\log p(Y \mid X)$: how well does text Y represent label X?
For simple or flat labels:
* iterate through labels
* evaluate score for each label, keeping only the best (n best)
* return the best (or n best) label and score
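The flat-label procedure can be sketched in a few lines. This is only an illustration under assumed inputs: the function name `classify`, the labels, and the probability tables are hypothetical, and a real system would estimate them from training data.

```python
import math

# Hypothetical model (not from the lecture): p(X) and p(Y | X) for two labels.
prior = {"spam": 0.3, "ham": 0.7}
likelihood = {
    "spam": {"offer": 0.8, "meeting": 0.2},
    "ham": {"offer": 0.1, "meeting": 0.9},
}

def classify(y, n_best=1):
    """Return the n_best (label, score) pairs, best first.

    score(X) = log p(Y | X) + log p(X); p(Y) is dropped because it is
    constant with respect to X and so cannot change the argmax.
    """
    scored = []
    for x in prior:                                  # iterate through labels
        score = math.log(likelihood[x][y]) + math.log(prior[x])
        scored.append((x, score))                    # evaluate each label
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n_best]                           # keep only the n best

print(classify("offer"))
print(classify("meeting", n_best=2))
```

Working in log space turns the product into a sum, which avoids numerical underflow when the likelihood is itself a product of many small terms.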

Conditional Probabilities: Changing the Left
What happens as we add conjuncts to the left?
[Diagram: on a scale from 1 down to 0, p(A), p(A, B), p(A, B, C), p(A, B, C, D), and p(A, B, C, D, E) sit at successively lower positions.]
Each added conjunct can only keep the probability the same or lower it: $1 \ge p(A) \ge p(A, B) \ge p(A, B, C) \ge p(A, B, C, D) \ge p(A, B, C, D, E) \ge 0$

Conditional Probabilities: Changing the Right
What happens as we add conjuncts to the right?
[Diagram: on a scale from 1 down to 0, $p(A \mid B)$ is shown once below $p(A)$ and once above it.]
Conditioning can move the probability in either direction: $p(A \mid B)$ may be smaller or larger than $p(A)$.

Conditional Probabilities: Bias vs. Variance
Lower bias: more specific to what we care about
Higher variance: for fixed observations, estimates become less reliable

Probability Chain Rule
$p(x_1, x_2, \ldots, x_S) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_S \mid x_1, \ldots, x_{S-1}) = \prod_{i=1}^{S} p(x_i \mid x_1, \ldots, x_{i-1})$
extension of Bayes rule
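The chain rule can be checked numerically on a small joint distribution. The sketch below assumes a made-up joint over three binary variables; the weights and the helper names `prefix_prob` and `chain_rule` are hypothetical.

```python
from itertools import product

# Hypothetical joint distribution over three binary variables (x1, x2, x3).
weights = [1, 2, 3, 4, 5, 6, 7, 8]
outcomes = list(product([0, 1], repeat=3))
joint = {o: w / sum(weights) for o, w in zip(outcomes, weights)}

def prefix_prob(prefix):
    """Marginal p(x_1, ..., x_k) of a prefix, summing out the remaining variables."""
    k = len(prefix)
    return sum(p for o, p in joint.items() if o[:k] == prefix)

def chain_rule(outcome):
    """Product over i of p(x_i | x_1, ..., x_{i-1})."""
    total = 1.0
    for i in range(1, len(outcome) + 1):
        # p(x_i | x_1..x_{i-1}) = p(x_1..x_i) / p(x_1..x_{i-1})
        total *= prefix_prob(outcome[:i]) / prefix_prob(outcome[:i - 1])
    return total

max_gap = max(abs(chain_rule(o) - joint[o]) for o in outcomes)
print(max_gap)
```

The product telescopes, so `chain_rule(o)` matches `joint[o]` up to floating-point rounding for every outcome.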

Expected Value
random variable: $X \sim p$
expected value (distribution p is implicit): $E[f(X)] = \sum_x f(x)\, p(x)$

Expected Value: Example
uniform distribution of number of cats I have: 1, 2, 3, 4, 5, 6, each with probability 1/6
1/6 * 1 + 1/6 * 2 + 1/6 * 3 + 1/6 * 4 + 1/6 * 5 + 1/6 * 6 = 3.5

Expected Value: Example
non-uniform distribution of number of cats I have: 1 with probability 1/2, and each of 2, 3, 4, 5, 6 with probability 1/10
1/2 * 1 + 1/10 * 2 + 1/10 * 3 + 1/10 * 4 + 1/10 * 5 + 1/10 * 6 = 2.5
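Both cat examples can be reproduced with a short helper. This is a minimal sketch; the function name `expected_value` and the dictionary representation of a distribution are choices made here for illustration.

```python
from fractions import Fraction

def expected_value(dist, f=lambda x: x):
    """E[f(X)] = sum over x of f(x) * p(x), for a discrete distribution dist."""
    return sum(f(x) * p for x, p in dist.items())

# The two distributions over "number of cats" from the examples.
uniform = {k: Fraction(1, 6) for k in range(1, 7)}
non_uniform = {1: Fraction(1, 2), **{k: Fraction(1, 10) for k in range(2, 7)}}

print(float(expected_value(uniform)))       # 3.5
print(float(expected_value(non_uniform)))   # 2.5
```

Passing a different `f`, for example `lambda x: x ** 2`, gives the expected value of a function of the random variable rather than of the variable itself.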

Probability Prerequisites
* Basic probability axioms and definitions
* Probabilistic Independence
* Definition of joint probability
* Definition of conditional probability
* Bayes rule
* Probability chain rule
* Expected Value (of a function) of a Random Variable

Loss Functions and Decision Theory FORMALIZING LEARNING

Decision Theory
"Decision theory is trivial, apart from the computational details" (MacKay, ITILA, Ch 36)
Input: x ("state of the world")
Output: a decision $\hat{y}$
Requirement 1: a decision (hypothesis) function h(x) to produce $\hat{y}$
Requirement 2: a function $\ell(y, \hat{y})$ telling us how wrong we are

Requirement 1: Decision Function
[Diagram: instances 1 to 4 feed the Machine Learning Predictor h(x); an Evaluator compares its output with gold/correct labels and extra-knowledge to produce a score.]
h(x) is our predictor (classifier, regression model, clustering model, etc.)

Requirement 2: Loss Function
$\ell(y, \hat{y}) \ge 0$
* $\ell$: "ell" (fancy l character)
* $y$: correct label/result
* $\hat{y}$: predicted label/result
loss: a function that tells you how much to penalize a prediction $\hat{y}$ from the correct answer $y$
Optimize $\ell$? Minimize it (do not maximize it).
$-\ell$ is called a utility or reward function

Decision Theory: Risk Minimization
minimize expected loss across any possible input
$\arg\min_{\hat{y}} E[\ell(y, \hat{y})]$
$= \arg\min_h E[\ell(y, h(x))]$
$= \arg\min_h E_{(x,y) \sim P}[\ell(y, h(x))]$
$= \arg\min_h \int \ell(y, h(x))\, P(x, y)\, d(x, y)$
The first two forms refer to a particular, unspecified input pair (x, y), but we want any possible pair.
Assumption: there exists some true (but likely unknown) distribution P over inputs x and outputs y.
We don't know this distribution!*
*We could try to approximate it analytically.

Empirical Risk Minimization
minimize expected loss across our observed input
$\arg\min_{\hat{y}} E[\ell(y, \hat{y})] = \arg\min_h E[\ell(y, h(x))] = \arg\min_h E_{(x,y) \sim P}[\ell(y, h(x))] \approx \arg\min_h \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, h(x_i))$
The constant factor $\frac{1}{N}$ does not change the minimizer, so this is equivalent to
$\arg\min_h \sum_{i=1}^{N} \ell(y_i, h(x_i))$
h is our classifier/predictor, controlled by our parameters θ: change θ → change the behavior of the classifier
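To make "change θ → change the behavior of the classifier" concrete, the sketch below computes the empirical risk of a one-parameter threshold classifier under 0-1 loss and picks the best of three candidate values of θ. The data set, the candidate thresholds, and the names `empirical_risk` and `make_threshold` are all hypothetical.

```python
def zero_one(y, y_hat):
    """0-1 loss: 0 if the prediction matches the label, 1 otherwise."""
    return 0 if y == y_hat else 1

def empirical_risk(h, data, loss):
    """Average loss of predictor h over the observed (x, y) pairs."""
    return sum(loss(y, h(x)) for x, y in data) / len(data)

def make_threshold(theta):
    """A classifier h_theta(x) controlled by the single parameter theta."""
    return lambda x: 1 if x > theta else 0

# Hypothetical observed data: (x, y) pairs with binary labels.
data = [(-2.0, 0), (-1.0, 0), (0.5, 1), (1.5, 1), (-0.5, 1)]

# Crude minimization: evaluate a few candidate parameters and keep the best.
risks = {theta: empirical_risk(make_threshold(theta), data, zero_one)
         for theta in (-1.5, -0.75, 0.0)}
best_theta = min(risks, key=risks.get)
print(risks, best_theta)
```

Searching over a fixed list of candidates is used here only because 0-1 loss gives no useful gradient; it is not meant as a general optimization method.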

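Best Case: Optimize Empirical Risk with Gradients
$\arg\min_h \sum_{i=1}^{N} \ell(y_i, h_\theta(x_i))$, where the sum is the objective $F$
$\nabla_\theta F = \sum_i \frac{\partial \ell(y_i, \hat{y} = h_\theta(x_i))}{\partial \hat{y}} \nabla_\theta h_\theta(x_i)$
differentiating might not always work: "... apart from the computational details"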
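One way to see the gradient formula in action is plain gradient descent on a one-dimensional linear model with squared loss. This is a sketch under assumptions: the data, the learning rate, the step count, and the function name `fit_linear` are hypothetical, and the gradient is averaged over the data (a constant rescaling that leaves the minimizer unchanged).

```python
def fit_linear(data, lr=0.05, steps=2000):
    """Fit y_hat = w * x + b by gradient descent on the mean squared loss."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        grad_w, grad_b = 0.0, 0.0
        for x, y in data:
            y_hat = w * x + b
            dl_dyhat = -2.0 * (y - y_hat)   # derivative of (y - y_hat)^2 wrt y_hat
            grad_w += dl_dyhat * x          # chain rule: d y_hat / d w = x
            grad_b += dl_dyhat              # chain rule: d y_hat / d b = 1
        # Step against the gradient because the loss is being minimized.
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

# Hypothetical noise-free data generated from y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = fit_linear(data)
print(w, b)
```

Each inner-loop term is the per-example product of the loss derivative with respect to the prediction and the gradient of the prediction with respect to the parameters, which is the structure of $\nabla_\theta F$.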

Loss Function Example: 0-1 Loss
$\ell(y, \hat{y}) = 0$ if $\hat{y} = y$, and $1$ if $\hat{y} \ne y$
Problem: not differentiable wrt $\hat{y}$ (or θ)
Solution 1: is the data linearly separable? Perceptron (next class) can work
Solution 2: is h(x) a conditional distribution $p(y \mid x)$? Use MAP

Loss Function Examples
squared loss: $\ell(y, \hat{y}) = (y - \hat{y})^2$
absolute loss: $\ell(y, \hat{y}) = |y - \hat{y}|$
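The loss functions from these slides translate directly into code. The function names below are chosen here for illustration; each takes the correct value `y` first and the prediction `y_hat` second.

```python
def zero_one_loss(y, y_hat):
    """0 if the prediction equals the correct label, 1 otherwise."""
    return 0 if y_hat == y else 1

def squared_loss(y, y_hat):
    """(y - y_hat)^2: penalizes large errors much more than small ones."""
    return (y - y_hat) ** 2

def absolute_loss(y, y_hat):
    """|y - y_hat|: penalty grows linearly with the size of the error."""
    return abs(y - y_hat)

# Same error of 2.0 under the two regression losses.
print(squared_loss(3.0, 1.0), absolute_loss(3.0, 1.0))   # 4.0 2.0
print(zero_one_loss(1, 1), zero_one_loss(1, 0))          # 0 1
```

All three are non-negative and equal zero exactly when the prediction matches the correct answer.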