Loss Functions, Decision Theory, and Linear Models

Loss Functions, Decision Theory, and Linear Models CMSC 678 UMBC January 31st, 2018 Some slides adapted from Hamed Pirsiavash

Logistics Recap Piazza (ask & answer questions): https://piazza.com/umbc/spring2018/cmsc678 Course site: https://www.csee.umbc.edu/courses/graduate/678/spring18/ Evaluation submission site: https://www.csee.umbc.edu/courses/graduate/678/spring18/submit

Course Announcement: Assignment 1 Due Wednesday, 2/7 (~7 days) Math & programming review Discuss with others, but write, implement and complete on your own

Recap from last time

What does it mean to learn? Generalization

Machine Learning Framework: Learning. [Diagram: instances (typically examined independently) feed a scoring model score_θ(x) inside the Machine Learning Predictor; the Evaluator compares predictions against the gold/correct labels and returns a score, which gives feedback to the predictor through the objective F(θ); extra knowledge can also inform the predictor.]

Gradient Ascent

Model, parameters, and hyperparameters. Model: mathematical formulation of the system (e.g., a classifier). Parameters: primary knobs of the model, set by a learning algorithm. Hyperparameters: secondary knobs.

A Terminology Buffet. The task (what kind of problem are you solving?): Classification, Regression, Clustering. The data (amount of human input / number of labeled examples): Fully-supervised, Semi-supervised, Un-supervised. The approach (how the data are being used): Probabilistic, Generative, Neural, Conditional, Memory-based / Exemplar, Spectral.

Classification: Supervised Machine Learning. Assigning subject categories, topics, or genres; spam detection; authorship identification; age/gender identification; language identification; sentiment analysis. Input: an instance d, a fixed set of classes C = {c_1, c_2, ..., c_J}, and a training set of m hand-labeled instances (d_1, c_1), ..., (d_m, c_m). Output: a learned classifier γ that maps instances to classes; γ learns to associate certain features of instances with their labels.

Classification Example: Face Recognition. What is a good representation for images? Pixel values? Edges? Courtesy of Hamed Pirsiavash

Ingredients for classification Inject your knowledge into a learning system Feature representation Training data: labeled examples Model Courtesy Hamed Pirsiavash

Ingredients for classification. Inject your knowledge into a learning system. Feature representation: problem specific; difficult to learn from bad ones. Training data: labeled examples. Model. Courtesy Hamed Pirsiavash

Ingredients for classification. Inject your knowledge into a learning system. Feature representation: problem specific; difficult to learn from bad ones. Training data (labeled examples): labeling data == $$$; sometimes data is available for free. Model. Courtesy Hamed Pirsiavash

Ingredients for classification. Inject your knowledge into a learning system. Feature representation: problem specific; difficult to learn from bad ones. Training data (labeled examples): labeling data == $$$; sometimes data is available for free. Model: no single learning algorithm is always good ("no free lunch"); different learning algorithms work differently. Courtesy Hamed Pirsiavash

Sequence & Structured Prediction Courtesy Hamed Pirsiavash

Regression Like classification, but real-valued

Regression Example: Stock Market Prediction Courtesy Hamed Pirsiavash

Unsupervised learning: Clustering Courtesy Hamed Pirsiavash

Inductive Bias What do we know before we see the data? Courtesy Hamed Pirsiavash

Inductive Bias What do we know before we see the data? A B C D Partition these into two groups Courtesy Hamed Pirsiavash

Machine Learning Framework. [Diagram: instances (typically examined independently) feed the Machine Learning Predictor; the Evaluator compares predictions against the gold/correct labels and produces a score; extra knowledge can also inform the predictor.]

Machine Learning Framework. [Diagram: as above, with one addition: the predictor's output can also feed another ML task (or a consumer-facing product).]

Today's Goal: Optimize Empirical Risk of a Surrogate Loss. Formulate a linear prediction model: y_i = w^T x_i + b. Learn about empirical risk minimization: argmin_h Σ_{i=1}^N ℓ(y_i, h_θ(x_i)) = F(θ), with gradient ∇_θ F = Σ_i ∇_ŷ ℓ(y_i, ŷ = h_θ(x_i)) ∇_θ h_θ(x_i). Approximate the 0-1 classification loss in a computable way.
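As a concrete illustration of the linear prediction model above, here is a minimal Python/NumPy sketch; the weight vector, bias, and input are made-up toy values, not anything from the lecture.

    import numpy as np

    # Minimal sketch of the linear prediction model y_i = w^T x_i + b.
    # The weight vector, bias, and input below are made-up toy values.
    w = np.array([0.5, -1.0, 2.0])   # parameters (theta): weights
    b = 0.1                          # parameters (theta): bias

    def predict(x):
        # Linear score for a single feature vector x.
        return w @ x + b

    x = np.array([1.0, 0.0, 3.0])
    print(predict(x))                # 0.5*1 + (-1.0)*0 + 2.0*3 + 0.1 = 6.6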

(Most) Probability Axioms: p(everything) = 1; p(∅) = 0; p(A) ≤ p(B) when A ⊆ B; p(A ∪ B) = p(A) + p(B) when A ∩ B = ∅; p(A ∪ B) ≤ p(A) + p(B); p(A ∪ B) = p(A) + p(B) − p(A ∩ B).

Conditional Probability. p(X | Y) = p(X & Y) / p(Y). Conditional probabilities are probabilities.

Conditional Probability. p(X | Y) = p(X & Y) / p(Y). What is p(Y)?

Conditional Probability. p(X | Y) = p(X & Y) / p(Y), where p(Y) = ∫ p(x & Y) dx.

Marginal(ized) Probability: The Discrete Case. The event y decomposes into the disjoint events (x_1 & y), (x_2 & y), (x_3 & y), (x_4 & y), so p(y) = Σ_x p(x, y) = Σ_x p(x) p(y | x).
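A small code sketch of this marginalization, using a hypothetical joint table p(x, y) (the numbers are made up):

    # Discrete marginalization: p(y) = sum_x p(x, y).
    # The joint table is a made-up example; its entries sum to 1.
    joint = {
        ("x1", "y1"): 0.2, ("x1", "y2"): 0.3,
        ("x2", "y1"): 0.1, ("x2", "y2"): 0.4,
    }

    def marginal_y(y):
        # Sum the joint probability over all values of x for the given y.
        return sum(p for (x, y_val), p in joint.items() if y_val == y)

    print(marginal_y("y1"))   # 0.2 + 0.1 = 0.3
    print(marginal_y("y2"))   # 0.3 + 0.4 = 0.7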

Bayes Rule. p(X | Y) = p(Y | X) p(X) / p(Y), where p(X | Y) is the posterior probability, p(Y | X) the likelihood, p(X) the prior probability, and p(Y) the marginal likelihood (probability).
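A quick worked example with made-up numbers: suppose p(spam) = 0.2, p(word | spam) = 0.5, and p(word | not spam) = 0.05. Then p(word) = 0.5 · 0.2 + 0.05 · 0.8 = 0.14, and the posterior is p(spam | word) = (0.5 · 0.2) / 0.14 ≈ 0.71.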

Maximum A Posteriori Classification. Assigning subject categories, topics, or genres; spam detection; authorship identification; age/gender identification; language identification; sentiment analysis. p(X | Y) = p(Y | X) p(X) / p(Y), where X is the class, Y is the observed data, p(Y | X) is the class-based likelihood (language model), p(X) is the prior probability of the class, and p(Y) is the observation likelihood (averaged over all classes).

Classify with Bayes Rule: argmax_X p(X | Y)

Classify with Bayes Rule: argmax_X p(Y | X) p(X) / p(Y)

Classify with Bayes Rule: argmax_X p(Y | X) p(X) / p(Y), where p(Y) is constant with respect to X

Classify with Bayes Rule: argmax_X p(Y | X) p(X)

Classify with Bayes Rule: argmax_X [log p(Y | X) + log p(X)]

Classify with Bayes Rule: argmax_X [log p(Y | X) + log p(X)]. log p(X): how likely is label X overall? log p(Y | X): how well does text Y represent label X?

Classify with Bayes Rule: argmax_X [log p(Y | X) + log p(X)]. log p(X): how likely is label X overall? log p(Y | X): how well does text Y represent label X? For simple or flat labels: iterate through the labels, evaluate the score for each label, keeping only the best (or n best), and return the best (or n best) label and score, as in the sketch below.
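A minimal sketch of that loop in Python; the two labels, the prior, and the placeholder likelihood function are hypothetical, standing in for whatever model supplies log p(Y | X):

    import math

    # Hypothetical class priors p(X) for a toy two-label problem.
    log_prior = {"spam": math.log(0.2), "ham": math.log(0.8)}

    def log_likelihood(text, label):
        # Placeholder for log p(Y | X); a real model (e.g., Naive Bayes) would score the text here.
        return math.log(0.5) if label == "spam" else math.log(0.05)

    def classify(text, labels=("spam", "ham")):
        # Iterate through labels, score each with log p(Y | X) + log p(X), keep the best.
        scores = {c: log_likelihood(text, c) + log_prior[c] for c in labels}
        best = max(scores, key=scores.get)
        return best, scores[best]

    print(classify("win money now"))   # ('spam', log 0.5 + log 0.2)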

Conditional Probabilities: Changing the Left. What happens as we add conjuncts to the left? On a probability scale from 0 to 1, each added conjunct can only keep the value the same or push it down: p(A) ≥ p(A, B) ≥ p(A, B, C) ≥ p(A, B, C, D) ≥ p(A, B, C, D, E).

Conditional Probabilities: Changing the Right. What happens as we add conjuncts to the right? On the same 0-to-1 scale, conditioning can move the value in either direction: p(A | B) may be smaller or larger than p(A).

Conditional Probabilities: Bias vs. Variance. Conditioning on more gives lower bias (the estimate is more specific to what we care about) but higher variance (for a fixed set of observations, the estimates become less reliable).

Probability Chain Rule. p(x_1, x_2, ..., x_S) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) ··· p(x_S | x_1, ..., x_{S−1}) = Π_{i=1}^S p(x_i | x_1, ..., x_{i−1}); an extension of Bayes rule.
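A tiny numeric sketch of the factorization for S = 3, with made-up conditional probabilities:

    # Chain rule for a length-3 sequence: p(x1, x2, x3) = p(x1) * p(x2 | x1) * p(x3 | x1, x2).
    # The three factors below are arbitrary made-up values.
    p_x1 = 0.5
    p_x2_given_x1 = 0.4
    p_x3_given_x1_x2 = 0.25
    print(p_x1 * p_x2_given_x1 * p_x3_given_x1_x2)   # joint probability = 0.05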

Expected Value. For a random variable X ~ p: E[f(X)] = Σ_x f(x) p(x), the expected value (the distribution p is implicit in the notation).

Expected Value: Example. Uniform distribution over the number of cats I have (1 through 6): E[X] = 1/6 · 1 + 1/6 · 2 + 1/6 · 3 + 1/6 · 4 + 1/6 · 5 + 1/6 · 6 = 3.5

Expected Value: Example. Non-uniform distribution over the number of cats I have: E[X] = 1/2 · 1 + 1/10 · 2 + 1/10 · 3 + 1/10 · 4 + 1/10 · 5 + 1/10 · 6 = 2.5
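A short sketch that recomputes both of these examples:

    # E[X] = sum_x x * p(x) for a discrete distribution given as {value: probability}.
    def expected_value(dist):
        return sum(x * p for x, p in dist.items())

    uniform = {x: 1/6 for x in range(1, 7)}
    nonuniform = {1: 1/2, 2: 1/10, 3: 1/10, 4: 1/10, 5: 1/10, 6: 1/10}
    print(expected_value(uniform))      # 3.5
    print(expected_value(nonuniform))   # 2.5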

Probability Prerequisites: basic probability axioms and definitions; probabilistic independence; definition of joint probability; Bayes rule; probability chain rule; expected value (of a function) of a random variable; definition of conditional probability.

Loss Functions and Decision Theory FORMALIZING LEARNING

Decision Theory. "Decision theory is trivial, apart from the computational details." (MacKay, ITILA, Ch. 36) Input: x (the "state of the world"). Output: a decision ŷ.

Decision Theory. "Decision theory is trivial, apart from the computational details." (MacKay, ITILA, Ch. 36) Input: x (the "state of the world"). Output: a decision ŷ. Requirement 1: a decision (hypothesis) function h(x) to produce ŷ.

Decision Theory. "Decision theory is trivial, apart from the computational details." (MacKay, ITILA, Ch. 36) Input: x (the "state of the world"). Output: a decision ŷ. Requirement 1: a decision (hypothesis) function h(x) to produce ŷ. Requirement 2: a function ℓ(y, ŷ) telling us how wrong we are.

Decision Theory. "Decision theory is trivial, apart from the computational details." (MacKay, ITILA, Ch. 36) Input: x (the "state of the world"). Output: a decision ŷ. Requirement 1: a decision (hypothesis) function h(x) to produce ŷ. Requirement 2: a loss function ℓ(y, ŷ) telling us how wrong we are. Goal: minimize our expected loss across any possible input.

Requirement 1: Decision Function. [Diagram: the machine learning framework from before, with h(x) as the Machine Learning Predictor.] h(x) is our predictor (classifier, regression model, clustering model, etc.).

Requirement 2: Loss Function. ℓ (the fancy "ell" character) takes the predicted label/result ŷ and the correct label/result y: ℓ(y, ŷ) ≥ 0. Do we minimize or maximize ℓ? We minimize it. Loss: a function that tells you how much to penalize a prediction ŷ that differs from the correct answer y.

Requirement 2: Loss Function. ℓ(y, ŷ) ≥ 0, where ŷ is the predicted label/result and y is the correct label/result; −ℓ is called a utility or reward function. Loss: a function that tells you how much to penalize a prediction ŷ that differs from the correct answer y.

Decision Theory: minimize expected loss across any possible input: argmin_ŷ E[ℓ(y, ŷ)]

Risk Minimization: minimize expected loss across any possible input: argmin_ŷ E[ℓ(y, ŷ)] = argmin_h E[ℓ(y, h(x))]. So far the expectation refers to a particular, unspecified input pair (x, y), but we want any possible pair.

Decision Theory: minimize expected loss across any possible input: argmin_ŷ E[ℓ(y, ŷ)] = argmin_h E[ℓ(y, h(x))] = argmin_h E_{(x,y)~P}[ℓ(y, h(x))]. Assumption: there exists some true (but likely unknown) distribution P over inputs x and outputs y.

Risk Minimization: minimize expected loss across any possible input: argmin_ŷ E[ℓ(y, ŷ)] = argmin_h E[ℓ(y, h(x))] = argmin_h E_{(x,y)~P}[ℓ(y, h(x))] = argmin_h ∫ ℓ(y, h(x)) P(x, y) d(x, y)

Risk Minimization: minimize expected loss across any possible input: argmin_ŷ E[ℓ(y, ŷ)] = argmin_h E[ℓ(y, h(x))] = argmin_h E_{(x,y)~P}[ℓ(y, h(x))] = argmin_h ∫ ℓ(y, h(x)) P(x, y) d(x, y). We don't know this distribution*! (*We could try to approximate it analytically.)

Empirical Risk Minimization: minimize expected loss across our observed input: argmin_ŷ E[ℓ(y, ŷ)] = argmin_h E[ℓ(y, h(x))] = argmin_h E_{(x,y)~P}[ℓ(y, h(x))] ≈ argmin_h (1/N) Σ_{i=1}^N ℓ(y_i, h(x_i))

Empirical Risk Minimization: minimize expected loss across our observed input: argmin_h Σ_{i=1}^N ℓ(y_i, h(x_i)), where h is our classifier/predictor, controlled by our parameters θ; changing θ changes the behavior of the classifier.
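A minimal sketch of computing the empirical risk, assuming squared loss and a linear predictor h_θ(x) = θ^T x; the dataset and parameter values are made-up toys:

    import numpy as np

    # Toy dataset (made up): N = 3 examples with 2 features each.
    X = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]])
    y = np.array([2.0, 1.0, 0.0])
    theta = np.array([0.5, 0.5])           # parameters of h_theta(x) = theta^T x

    def empirical_risk(theta, X, y):
        # (1/N) * sum_i loss(y_i, h_theta(x_i)), here with squared loss.
        predictions = X @ theta
        return np.mean((y - predictions) ** 2)

    print(empirical_risk(theta, X, y))     # 0.5 for this toy data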

Best Case: Optimize Empirical Risk with Gradients. argmin_h Σ_{i=1}^N ℓ(y_i, h_θ(x_i)) = F(θ), with gradient ∇_θ F = Σ_i ∇_ŷ ℓ(y_i, ŷ = h_θ(x_i)) ∇_θ h_θ(x_i). Differentiating might not always work: "apart from the computational details."
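A sketch of gradient-based optimization of that objective, again assuming squared loss and a linear h_θ; the data, learning rate, and iteration count are arbitrary choices for illustration:

    import numpy as np

    # Toy data (made up); F(theta) = sum_i (y_i - theta^T x_i)^2.
    X = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]])
    y = np.array([2.0, 1.0, 0.0])
    theta = np.zeros(2)
    lr = 0.01                               # learning rate (arbitrary)

    for _ in range(500):
        residual = X @ theta - y            # h_theta(x_i) - y_i for each example
        grad = 2 * X.T @ residual           # chain rule: d loss/d y_hat times d y_hat/d theta
        theta -= lr * grad                  # step downhill, since we minimize F

    print(theta, np.sum((y - X @ theta) ** 2))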

Loss Function Example: 0-1 Loss. ℓ(y, ŷ) = 0 if ŷ = y, 1 if ŷ ≠ y

Loss Function Example: 0-1 Loss. ℓ(y, ŷ) = 0 if ŷ = y, 1 if ŷ ≠ y. Problem: not differentiable with respect to ŷ (or θ). Solution 1: is the data linearly separable? The perceptron (next class) can work. Solution 2: is h(x) a conditional distribution p(y | x)? Use MAP.

Loss Function Examples: squared loss ℓ(y, ŷ) = (y − ŷ)²; absolute loss ℓ(y, ŷ) = |y − ŷ|
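The three losses mentioned in this lecture, written out directly as a small sketch:

    # 0-1 loss, squared loss, and absolute loss, as defined above.
    def zero_one_loss(y, y_hat):
        return 0.0 if y_hat == y else 1.0

    def squared_loss(y, y_hat):
        return (y - y_hat) ** 2

    def absolute_loss(y, y_hat):
        return abs(y - y_hat)

    print(zero_one_loss(1, 1), zero_one_loss(1, 0))          # 0.0 1.0
    print(squared_loss(2.0, 0.5), absolute_loss(2.0, 0.5))   # 2.25 1.5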