Loss Functions, Decision Theory, and Linear Models. CMSC 678, UMBC. January 31st, 2018. Some slides adapted from Hamed Pirsiavash.
Logistics Recap. Piazza (ask & answer questions): https://piazza.com/umbc/spring2018/cmsc678 Course site: https://www.csee.umbc.edu/courses/graduate/678/spring18/ Evaluation submission site: https://www.csee.umbc.edu/courses/graduate/678/spring18/submit
Course Announcement: Assignment 1. Due Wednesday, 2/7 (~7 days). Math & programming review. Discuss with others, but write, implement and complete on your own.
Recap from last time
What does it mean to learn? Generalization
Machine Learning Framework: Learning. [Diagram: instances 1 to 4 (typically examined independently) go to a Machine Learning Predictor, a scoring model $\text{score}_\theta(X)$. An Evaluator compares the predictions against gold/correct labels, optionally using extra knowledge, and produces a score: the objective $F(\theta)$. The Evaluator gives feedback to the predictor.]
Gradient Ascent
Model, parameters and hyperparameters. Model: mathematical formulation of system (e.g., classifier). Parameters: primary "knobs" of the model that are set by a learning algorithm. Hyperparameters: secondary "knobs". Image credit: http://www.uiparade.com/wp-content/uploads/2012/01/ui-design-pure-css.jpg
A Terminology Buffet. The task (what kind of problem are you solving?): Classification, Regression, Clustering. The data (amount of human input/number of labeled examples): Fully-supervised, Semi-supervised, Un-supervised. The approach (how any data are being used): Probabilistic, Neural, Generative, Conditional, Memory-based, Exemplar, Spectral.
Classification: Supervised Machine Learning. Example tasks: assigning subject categories, topics, or genres; spam detection; authorship identification; age/gender identification; language identification; sentiment analysis. Input: an instance $d$; a fixed set of classes $C = \{c_1, c_2, \ldots, c_J\}$; a training set of $m$ hand-labeled instances $(d_1, c_1), \ldots, (d_m, c_m)$. Output: a learned classifier $\gamma$ that maps instances to classes. $\gamma$ learns to associate certain features of instances with their labels.
Classification Example: Face Recognition. What is a good representation for images? Pixel values? Edges? Courtesy of Hamed Pirsiavash
Ingredients for classification: inject your knowledge into a learning system. Feature representation: problem specific; difficult to learn from bad ones. Training data (labeled examples): labeling data == $$$; sometimes data is available for free. Model: no single learning algorithm is always good ("no free lunch"); different learning algorithms work differently. Courtesy Hamed Pirsiavash
Sequence & Structured Prediction Courtesy Hamed Pirsiavash
Regression Like classification, but real-valued
Regression Example: Stock Market Prediction Courtesy Hamed Pirsiavash
Unsupervised learning: Clustering Courtesy Hamed Pirsiavash
Inductive Bias: What do we know before we see the data? [Figure: four items labeled A, B, C, D.] Partition these into two groups. Courtesy Hamed Pirsiavash
Machine Learning Framework. [Diagram: instances 1 to 4 (typically examined independently) go to a Machine Learning Predictor. An Evaluator compares the predictions against gold/correct labels, optionally using extra knowledge, and produces a score. The predictions can also feed another ML task (or a consumer-facing product).]
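Today's Goal: Optimize Empirical Risk of Surrogate Loss. (1) Formulate a linear prediction model: $\hat{y}_i = w^T x_i + b$. (2) Learn about empirical risk minimization: $\arg\min_h \sum_{i=1}^{N} \ell(y_i, h_\theta(x_i))$, where the sum is the objective $F$, with gradient $\nabla_\theta F = \sum_i \frac{\partial \ell(y_i, \hat{y} = h_\theta(x_i))}{\partial \hat{y}} \nabla_\theta h_\theta(x_i)$. (3) Approximate 0-1 classification loss in a computable way.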
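(Most) Probability Axioms. $p(\text{everything}) = 1$. $p(\emptyset) = 0$. $p(A) \le p(B)$ when $A \subseteq B$. $p(A \cup B) = p(A) + p(B)$ when $A \cap B = \emptyset$. When $A$ and $B$ overlap, $p(A \cup B) \ne p(A) + p(B)$; in general, $p(A \cup B) = p(A) + p(B) - p(A \cap B)$.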
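As a quick sanity check of these axioms, here is a minimal sketch that treats events as sets of outcomes. The fair six-sided die, the helper `p`, and the events `A`, `B`, `C` are all illustrative assumptions, not part of the slides.

```python
from fractions import Fraction

# Hypothetical sample space: one roll of a fair six-sided die.
OUTCOMES = frozenset(range(1, 7))

def p(event):
    """Probability of an event (a set of outcomes) under the uniform distribution."""
    return Fraction(len(event & OUTCOMES), len(OUTCOMES))

A = {2, 4, 6}   # roll is even
B = {4, 5, 6}   # roll is at least 4 (overlaps with A)
C = {1, 3}      # disjoint from A

print(p(OUTCOMES), p(set()))             # p(everything) and p(empty set)
print(p(A | B), p(A) + p(B) - p(A & B))  # inclusion-exclusion: both sides agree
print(p(A | C), p(A) + p(C))             # additivity for disjoint events
```

Exact fractions are used so the equalities hold without floating-point tolerance.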
Conditional Probability: $p(X \mid Y) = \frac{p(X, Y)}{p(Y)}$. Conditional probabilities are probabilities. What is $p(Y)$? Marginalize the joint: $p(Y) = \int p(X, Y)\, dX$.
Marginal(ized) Probability: The Discrete Case. [Figure: the event $y$ is split into the disjoint pieces "$x_1$ & $y$", "$x_2$ & $y$", "$x_3$ & $y$", "$x_4$ & $y$".] $p(y) = \sum_x p(x, y) = \sum_x p(x)\, p(y \mid x)$
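The discrete marginalization can be sketched as a sum over a joint table. The joint distribution below is made up for illustration, and `marginal` is a hypothetical helper name.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint distribution p(x, y); the eight entries sum to 1.
JOINT = {
    ("x1", "y"): Fraction(1, 10), ("x1", "not_y"): Fraction(2, 10),
    ("x2", "y"): Fraction(1, 10), ("x2", "not_y"): Fraction(1, 10),
    ("x3", "y"): Fraction(2, 10), ("x3", "not_y"): Fraction(1, 10),
    ("x4", "y"): Fraction(1, 10), ("x4", "not_y"): Fraction(1, 10),
}

def marginal(joint, index):
    """Sum the joint over every variable except the one at position `index`."""
    totals = defaultdict(Fraction)
    for assignment, prob in joint.items():
        totals[assignment[index]] += prob
    return dict(totals)

p_x = marginal(JOINT, 0)
p_y = marginal(JOINT, 1)

# Same marginal, written as sum_x p(x) * p(y | x).
p_y_via_conditionals = sum(p_x[x] * (JOINT[(x, "y")] / p_x[x]) for x in p_x)

print(p_y["y"], p_y_via_conditionals)
```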
Bayes Rule: $p(X \mid Y) = \frac{p(Y \mid X)\, p(X)}{p(Y)}$, where $p(X \mid Y)$ is the posterior probability, $p(Y \mid X)$ is the likelihood, $p(X)$ is the prior probability, and $p(Y)$ is the marginal likelihood (probability).
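A small numeric sketch of Bayes rule, with the marginal likelihood obtained by summing over all values of X. The prior and likelihood values are invented for illustration, and `posterior` is a hypothetical helper.

```python
from fractions import Fraction

def posterior(prior, likelihood):
    """Return p(X | Y=y) for every X, given p(X) and p(Y=y | X)."""
    # Marginal likelihood p(Y=y) = sum_X p(Y=y | X) p(X).
    marginal = sum(prior[x] * likelihood[x] for x in prior)
    return {x: prior[x] * likelihood[x] / marginal for x in prior}

# Assumed numbers: prior over two classes, and likelihood of one observation.
prior = {"spam": Fraction(1, 5), "ham": Fraction(4, 5)}
likelihood = {"spam": Fraction(1, 2), "ham": Fraction(1, 20)}

post = posterior(prior, likelihood)
print(post["spam"], post["ham"])   # 5/7 2/7
```

Even though the prior favors "ham", the much higher likelihood under "spam" dominates the posterior in this made-up case.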
Maximum A Posteriori Classification. Example tasks: assigning subject categories, topics, or genres; spam detection; authorship identification; age/gender identification; language identification; sentiment analysis. $p(X \mid Y) = \frac{p(Y \mid X)\, p(X)}{p(Y)}$, where $X$ is the class, $Y$ is the observed data, $p(Y \mid X)$ is the class-based likelihood (language model), $p(X)$ is the prior probability of the class, and $p(Y)$ is the observation likelihood (averaged over all classes).
Classify with Bayes Rule: $\arg\max_X p(X \mid Y)$
$= \arg\max_X \frac{p(Y \mid X)\, p(X)}{p(Y)}$
$= \arg\max_X p(Y \mid X)\, p(X)$ (the denominator $p(Y)$ is constant with respect to $X$)
$= \arg\max_X \left[ \log p(Y \mid X) + \log p(X) \right]$ (the logarithm is monotonic, so it does not change the maximizer)
Here $\log p(Y \mid X)$ answers "how well does text Y represent label X?" and $\log p(X)$ answers "how likely is label X overall?"
For simple or "flat" labels: * iterate through labels * evaluate score for each label, keeping only the best (n best) * return the best (or n best) label and score
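The flat-label procedure above can be sketched as follows. The priors, the per-class word probabilities, and the assumption that words are independent given the label are all illustrative choices, not something the slides specify.

```python
import math

# Hypothetical toy model: class priors p(X) and per-class word probabilities.
PRIOR = {"spam": 0.4, "ham": 0.6}
WORD_PROBS = {
    "spam": {"free": 0.5, "money": 0.4, "meeting": 0.1},
    "ham": {"free": 0.1, "money": 0.2, "meeting": 0.7},
}

def log_likelihood(label, words):
    """log p(Y | X), assuming words are independent given the label."""
    return sum(math.log(WORD_PROBS[label][w]) for w in words)

def classify(words, n_best=1):
    """Score every label with log p(Y | X) + log p(X) and keep the n best."""
    scored = [
        (log_likelihood(label, words) + math.log(prior), label)
        for label, prior in PRIOR.items()
    ]
    scored.sort(reverse=True)  # highest score first
    return [(label, score) for score, label in scored[:n_best]]

print(classify(["free", "money"]))
print(classify(["meeting"], n_best=2))
```

Working in log space avoids underflow when many small probabilities are multiplied.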
Conditional Probabilities: Changing the Left. What happens as we add conjuncts to the left? [Figure: on a scale from 0 to 1, $p(A)$, $p(A, B)$, $p(A, B, C)$, $p(A, B, C, D)$, $p(A, B, C, D, E)$ appear in decreasing order.] Each added conjunct can only keep the probability the same or lower it.
Conditional Probabilities: Changing the Right. What happens as we add conjuncts to the right? [Figure: on a scale from 0 to 1, $p(A \mid B)$ is shown once below $p(A)$ and once above it.] Conditioning can move the probability in either direction.
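Both behaviors can be checked on a small example. This is a sketch under assumed events on a fair six-sided die; `p`, `p_given`, and the event names are hypothetical.

```python
from fractions import Fraction

# Hypothetical sample space: one roll of a fair six-sided die.
OUTCOMES = frozenset(range(1, 7))

def p(event):
    return Fraction(len(event), len(OUTCOMES))

def p_given(event, condition):
    """p(event | condition) = p(event, condition) / p(condition)."""
    return p(event & condition) / p(condition)

A = frozenset({2, 4, 6})     # roll is even
HIGH = frozenset({4, 5, 6})  # roll is at least 4
LOW = frozenset({1, 2, 3})   # roll is at most 3
SIX = frozenset({6})

# Adding conjuncts on the left never increases the probability.
print(p(A), p(A & HIGH), p(A & HIGH & SIX))     # 1/2 1/3 1/6
# Adding a condition on the right can move it either way.
print(p_given(A, HIGH), p(A), p_given(A, LOW))  # 2/3 1/2 1/3
```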
Conditional Probabilities: Bias vs. Variance. Conditioning on more gives lower bias (more specific to what we care about) but higher variance (for fixed observations, estimates become less reliable).
Probability Chain Rule: $p(x_1, x_2, \ldots, x_S) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_S \mid x_1, \ldots, x_{S-1}) = \prod_{i=1}^{S} p(x_i \mid x_1, \ldots, x_{i-1})$ (an extension of Bayes rule)
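The chain rule can be verified numerically on a small joint distribution. The sketch below uses an arbitrary, made-up joint over three binary variables; `prefix_prob` and `chain_rule_prob` are hypothetical helper names.

```python
from fractions import Fraction
from itertools import product

# Hypothetical joint over three binary variables; the weights are arbitrary.
WEIGHTS = dict(zip(product((0, 1), repeat=3), (1, 2, 3, 4, 5, 6, 7, 8)))
TOTAL = sum(WEIGHTS.values())
JOINT = {bits: Fraction(w, TOTAL) for bits, w in WEIGHTS.items()}

def prefix_prob(prefix):
    """Marginal p(x_1, ..., x_k) of a prefix, summing out the remaining variables."""
    k = len(prefix)
    return sum(prob for bits, prob in JOINT.items() if bits[:k] == prefix)

def chain_rule_prob(bits):
    """p(x_1) * p(x_2 | x_1) * ... * p(x_S | x_1, ..., x_{S-1})."""
    result = Fraction(1)
    for i in range(1, len(bits) + 1):
        # p(x_i | x_1..x_{i-1}) = p(x_1..x_i) / p(x_1..x_{i-1})
        result *= prefix_prob(bits[:i]) / prefix_prob(bits[:i - 1])
    return result

print(all(chain_rule_prob(bits) == JOINT[bits] for bits in JOINT))
```

The product telescopes, which is why each factor can be computed from two prefix marginals.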
Expected Value: for a random variable $X \sim p$, the expected value of $f(X)$ is $\mathbb{E}[f(X)] = \sum_x f(x)\, p(x)$ (the distribution $p$ is implicit in the notation).
Expected Value: Example. Uniform distribution over the number of cats I have (1, 2, 3, 4, 5, or 6, each with probability 1/6): 1/6 * 1 + 1/6 * 2 + 1/6 * 3 + 1/6 * 4 + 1/6 * 5 + 1/6 * 6 = 3.5
Expected Value: Example. Non-uniform distribution over the number of cats I have (1 with probability 1/2; 2, 3, 4, 5, or 6 with probability 1/10 each): 1/2 * 1 + 1/10 * 2 + 1/10 * 3 + 1/10 * 4 + 1/10 * 5 + 1/10 * 6 = 2.5
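The two cat examples can be reproduced with a short sketch of the definition. The function name `expected_value` and the dictionary representation of a distribution are illustrative choices.

```python
from fractions import Fraction

def expected_value(dist, f=lambda x: x):
    """E[f(X)] = sum over x of f(x) * p(x), for a discrete distribution dist[x] = p(x)."""
    return sum(f(x) * prob for x, prob in dist.items())

uniform = {n: Fraction(1, 6) for n in range(1, 7)}
non_uniform = {1: Fraction(1, 2), **{n: Fraction(1, 10) for n in range(2, 7)}}

print(expected_value(uniform))       # 7/2
print(expected_value(non_uniform))   # 5/2
print(expected_value(uniform, f=lambda x: x ** 2))
```

Passing a different `f` gives the expected value of a function of the random variable, as in the definition.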
Probability Prerequisites: basic probability axioms and definitions; probabilistic independence; definition of joint probability; definition of conditional probability; Bayes rule; probability chain rule; expected value (of a function) of a random variable.
Loss Functions and Decision Theory FORMALIZING LEARNING
Decision Theory. "Decision theory is trivial, apart from the computational details" (MacKay, ITILA, Ch 36). Input: $x$ ("state of the world"). Output: a decision $\hat{y}$. Requirement 1: a decision (hypothesis) function $h(x)$ to produce $\hat{y}$. Requirement 2: a loss function $\ell(y, \hat{y})$ telling us how wrong we are. Goal: minimize our expected loss across any possible input.
Requirement 1: Decision Function. [Diagram: the Machine Learning Framework (instances 1 to 4, Machine Learning Predictor, Evaluator with gold/correct labels and extra knowledge, score), with the predictor labeled $h(x)$.] $h(x)$ is our predictor (classifier, regression model, clustering model, etc.)
Requirement 2: Loss Function. $\ell(y, \hat{y}) \ge 0$, where $\ell$ is "ell" (a fancy l character), $y$ is the correct label/result, and $\hat{y}$ is the predicted label/result. Loss: a function that tells you how much to penalize a prediction $\hat{y}$ that differs from the correct answer $y$. Optimize $\ell$? Minimize it rather than maximize it; $-\ell$ is called a utility or reward function.
Decision Theory and Risk Minimization: minimize expected loss across any possible input.
$\arg\min_{\hat{y}} \mathbb{E}[\ell(y, \hat{y})] = \arg\min_h \mathbb{E}[\ell(y, h(x))]$ (this is written for a particular, unspecified input pair $(x, y)$, but we want any possible pair)
$= \arg\min_h \mathbb{E}_{(x,y) \sim P}[\ell(y, h(x))]$ (assumption: there exists some true, but likely unknown, distribution $P$ over inputs $x$ and outputs $y$)
$= \arg\min_h \int \ell(y, h(x))\, P(x, y)\, d(x, y)$
We don't know this distribution $P$! (We could try to approximate it analytically.)
Empirical Risk Minimization: minimize expected loss across our observed input. $\arg\min_{\hat{y}} \mathbb{E}[\ell(y, \hat{y})] = \arg\min_h \mathbb{E}[\ell(y, h(x))] = \arg\min_h \mathbb{E}_{(x,y) \sim P}[\ell(y, h(x))] \approx \arg\min_h \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, h(x_i))$
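Empirical Risk Minimization: minimize expected loss across our observed input. $\arg\min_h \sum_{i=1}^{N} \ell(y_i, h(x_i))$, where $h$ is our classifier/predictor, controlled by our parameters $\theta$: changing $\theta$ changes the behavior of the classifier. (A constant factor such as $1/N$ does not change the minimizer, so it can be dropped.)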
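To make "change θ, change the behavior of the classifier" concrete, here is a minimal sketch that evaluates the empirical risk of a one-parameter threshold classifier and picks the best θ from a small grid. The dataset, the threshold model, and the grid search are all assumptions made for illustration; the slides do not prescribe a particular model or search procedure.

```python
def zero_one_loss(y, y_hat):
    return 0 if y == y_hat else 1

def make_threshold_classifier(theta):
    """h_theta(x): predict 1 when x >= theta, else 0 (a hypothetical model)."""
    return lambda x: 1 if x >= theta else 0

def empirical_risk(h, data, loss):
    """Average loss of predictor h over the observed (x, y) pairs."""
    return sum(loss(y, h(x)) for x, y in data) / len(data)

# Made-up observations (x_i, y_i).
DATA = [(0.5, 0), (1.0, 0), (1.5, 0), (1.8, 1), (2.5, 1), (3.0, 1)]
CANDIDATE_THETAS = [0.0, 1.0, 2.0, 3.0]

risks = {
    theta: empirical_risk(make_threshold_classifier(theta), DATA, zero_one_loss)
    for theta in CANDIDATE_THETAS
}
best_theta = min(risks, key=risks.get)
print(risks, best_theta)
```

The grid search stands in for a real learning algorithm; it only illustrates that each θ yields a different predictor with a different empirical risk.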
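Best Case: Optimize Empirical Risk with Gradients. $\arg\min_h \sum_{i=1}^{N} \ell(y_i, h_\theta(x_i))$, where the sum is the objective $F$. By the chain rule, $\nabla_\theta F = \sum_i \frac{\partial \ell(y_i, \hat{y} = h_\theta(x_i))}{\partial \hat{y}} \nabla_\theta h_\theta(x_i)$. Differentiating might not always work: "... apart from the computational details".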
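The chain-rule gradient can be sketched for one concrete case: a scalar linear model $h_\theta(x) = wx + b$ with squared loss, minimized by plain gradient descent. The data, learning rate, and step count are assumptions chosen so that this toy problem converges; they are not taken from the slides.

```python
def predict(theta, x):
    """Linear model h_theta(x) = w * x + b with theta = (w, b)."""
    w, b = theta
    return w * x + b

def grad_objective(theta, data):
    """Gradient of F(theta) = sum_i (y_i - h_theta(x_i))^2 via the chain rule."""
    grad_w, grad_b = 0.0, 0.0
    for x, y in data:
        y_hat = predict(theta, x)
        dloss_dyhat = 2.0 * (y_hat - y)  # derivative of (y - y_hat)^2 w.r.t. y_hat
        grad_w += dloss_dyhat * x        # d h_theta / d w = x
        grad_b += dloss_dyhat * 1.0      # d h_theta / d b = 1
    return grad_w, grad_b

def gradient_descent(data, learning_rate=0.02, steps=2000):
    theta = (0.0, 0.0)
    for _ in range(steps):
        g_w, g_b = grad_objective(theta, data)
        theta = (theta[0] - learning_rate * g_w, theta[1] - learning_rate * g_b)
    return theta

# Made-up noise-free data generated from y = 2x + 1.
DATA = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = gradient_descent(DATA)
print(round(w, 4), round(b, 4))
```

Because the objective is minimized, the update subtracts the gradient; maximizing an objective would add it instead.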
Loss Function Example: 0-1 Loss. $\ell(y, \hat{y}) = 0$ if $y = \hat{y}$, and $1$ if $y \ne \hat{y}$. Problem: not differentiable with respect to $\hat{y}$ (or $\theta$). Solution 1: is the data linearly separable? Perceptron (next class) can work. Solution 2: is $h(x)$ a conditional distribution $p(y \mid x)$? Use MAP.
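The differentiability problem can be seen numerically: as a function of a parameter, the summed 0-1 loss is piecewise constant, so its slope is zero almost everywhere and it jumps at isolated points. The sketch below assumes a hypothetical one-parameter threshold classifier and made-up data.

```python
def zero_one_loss(y, y_hat):
    return 0 if y == y_hat else 1

# Made-up observations (x_i, y_i).
DATA = [(0.5, 0), (1.0, 0), (1.5, 0), (1.8, 1), (2.5, 1), (3.0, 1)]

def total_loss(theta, data):
    """Summed 0-1 loss of the threshold classifier h_theta(x) = 1 if x >= theta else 0."""
    return sum(zero_one_loss(y, 1 if x >= theta else 0) for x, y in data)

def finite_difference(theta, eps=1e-3):
    """Central-difference estimate of the slope of the total loss at theta."""
    return (total_loss(theta + eps, DATA) - total_loss(theta - eps, DATA)) / (2 * eps)

print(finite_difference(2.0))                        # 0.0: no signal about which way to move
print(total_loss(1.7, DATA), total_loss(1.9, DATA))  # 0 1: the loss jumps as theta crosses 1.8
```

A zero slope gives gradient-based optimization nothing to follow, which motivates replacing 0-1 loss with a surrogate.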
Loss Function Examples. Squared loss: $\ell(y, \hat{y}) = (y - \hat{y})^2$. Absolute loss: $\ell(y, \hat{y}) = |y - \hat{y}|$.
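A minimal sketch of both losses and their derivatives with respect to the prediction. The function names are illustrative, and returning 0 for the absolute loss at $\hat{y} = y$ is one conventional subgradient choice, since the derivative is undefined there.

```python
def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

def absolute_loss(y, y_hat):
    return abs(y - y_hat)

def squared_loss_grad(y, y_hat):
    """Derivative of (y - y_hat)^2 with respect to y_hat."""
    return 2 * (y_hat - y)

def absolute_loss_grad(y, y_hat):
    """A subgradient of |y - y_hat| with respect to y_hat (0 chosen at y_hat == y)."""
    if y_hat > y:
        return 1
    if y_hat < y:
        return -1
    return 0

# A small error and a large error, to compare how each loss penalizes them.
for y, y_hat in [(3.0, 2.5), (3.0, 13.0)]:
    print(squared_loss(y, y_hat), absolute_loss(y, y_hat))
```

Squared loss grows quadratically with the error, so it penalizes the large error far more heavily than absolute loss does.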