Lecture 2 Machine Learning Review

CMSC 35246: Deep Learning. Shubhendu Trivedi & Risi Kondor, University of Chicago. March 29, 2017.

Things we will look at today
- Formal setup for supervised learning
- Empirical risk, risk, generalization
- Define and derive a linear model for regression
- Revise regularization
- Define and derive a linear model for classification
- (Time permitting) Start with feedforward networks

Note: Most slides in this presentation are adapted from, or taken (with permission) from slides by Professor Gregory Shakhnarovich for his TTIC 31020 course

What is Machine Learning?
The right question: what is learning?
- Tom Mitchell (Machine Learning, 1997): "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
- Gregory Shakhnarovich: make predictions and pay the price if the predictions are incorrect. The goal of learning is to reduce the price.
- How can you specify T, P, and E?

Formal Setup (Supervised)
- Input data space $\mathcal{X}$
- Output (label) space $\mathcal{Y}$
- Unknown function $f : \mathcal{X} \to \mathcal{Y}$
- A dataset $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ with $x_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$
- Finite $\mathcal{Y}$: classification
- Continuous $\mathcal{Y}$: regression

Regression
- We are given a set of $N$ observations $(x_i, y_i)$, $i = 1, \ldots, N$, with $y_i \in \mathbb{R}$
- Example: (possibly noisy) measurements of barometric pressure $x$ versus liquid boiling point $y$
- Does it make sense to use learning here?

Fitting a Function to Data
We will approach this in two steps:
- Choose a model class of functions
- Design a criterion to guide the selection of one function from the chosen class
Let us begin with one of the simplest model classes: linear functions.

Linear Fitting to Data
- We want to fit a linear function to $(X, Y) = \{(x_1, y_1), \ldots, (x_N, y_N)\}$
- Fitting criterion: least squares. Find the function that minimizes the sum of squared differences between the actual $y$'s in the training set and the function's predictions.
- We then use the fitted line as a predictor

Linear Functions
- General form: $f(x; \theta) = \theta_0 + \theta_1 x_1 + \cdots + \theta_d x_d$
- 1-D case: a line
- $\mathcal{X} = \mathbb{R}^2$: a plane
- General $d$-dimensional case: a hyperplane

Loss Functions
- Targets live in $\mathcal{Y}$. Binary classification: $\mathcal{Y} = \{-1, +1\}$. Univariate regression: $\mathcal{Y} \subseteq \mathbb{R}$
- A loss function $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ maps decisions to costs: $L(\hat{y}, y)$ is the penalty for predicting $\hat{y}$ when the correct answer is $y$
- Standard choice for classification: 0/1 loss
- Standard choice for regression: squared loss $L(\hat{y}, y) = (\hat{y} - y)^2$
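
In code these two standard losses are one-liners; a minimal Python sketch (the function names are illustrative, not from the slides):

```python
def zero_one_loss(y_hat, y):
    """0/1 loss for classification: penalty 1 for a wrong label, 0 otherwise."""
    return 0.0 if y_hat == y else 1.0

def squared_loss(y_hat, y):
    """Squared loss for regression: L(y_hat, y) = (y_hat - y)^2."""
    return (y_hat - y) ** 2
```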

Empirical Loss
- Consider a parametric function $f(x; \theta)$
- Example: a linear function, $f(x_i; \theta) = \theta_0 + \sum_{j=1}^{d} \theta_j x_{ij} = \theta^T x_i$, with the convention $x_{i0} \equiv 1$
- The empirical loss of the function $y = f(x; \theta)$ on a set $X$:
$$\hat{L}(\theta; X, y) = \frac{1}{N} \sum_{i=1}^{N} L(f(x_i; \theta), y_i)$$
- Least squares minimizes the empirical loss for the squared loss $L$
- What we actually care about is predicting labels for new examples. When does empirical loss minimization help us do that?

Loss: Empirical and Expected
- Basic assumption: example-label pairs $(x, y)$ are drawn from an unknown distribution $p(x, y)$
- Data are i.i.d.: the same (albeit unknown) distribution generates all $(x, y)$ in both the training and test data
- The empirical loss is measured on the training set:
$$\hat{L}(\theta; X, y) = \frac{1}{N} \sum_{i=1}^{N} L(f(x_i; \theta), y_i)$$
- The goal is to minimize the expected loss, also known as the risk:
$$R(\theta) = \mathbb{E}_{(x_0, y_0) \sim p(x, y)}[L(f(x_0; \theta), y_0)]$$

Empirical Risk Minimization
- Empirical loss: $\hat{L}(\theta; X, y) = \frac{1}{N} \sum_{i=1}^{N} L(f(x_i; \theta), y_i)$
- Risk: $R(\theta) = \mathbb{E}_{(x_0, y_0) \sim p(x, y)}[L(f(x_0; \theta), y_0)]$
- Empirical risk minimization: if the training set is representative of the underlying (unknown) distribution $p(x, y)$, the empirical loss is a proxy for the risk
- In essence: estimate $p(x, y)$ by the empirical distribution of the data
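
Empirical risk minimization is then just averaging a chosen loss over the training set; a small sketch under the same illustrative naming as above:

```python
import numpy as np

def empirical_loss(f, theta, X, y, loss):
    """Empirical loss of f(.; theta) on a dataset: (1/N) * sum_i L(f(x_i; theta), y_i)."""
    return float(np.mean([loss(f(xi, theta), yi) for xi, yi in zip(X, y)]))

# Example with a linear model and the squared loss defined earlier:
# linear = lambda x, theta: theta @ np.append(1.0, x)   # prepend x_{i0} = 1
# risk_proxy = empirical_loss(linear, theta, X, y, squared_loss)
```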

Learning via Empirical Loss Minimization
Learning is done in two steps:
- Select a restricted class $\mathcal{H}$ of hypotheses $f : \mathcal{X} \to \mathcal{Y}$. Example: linear functions parameterized by $\theta$, $f(x; \theta) = \theta^T x$
- Select a hypothesis $f \in \mathcal{H}$ based on the training set $D = (X, Y)$. Example: minimize the empirical squared loss, i.e. select $f(x; \theta^*)$ such that
$$\theta^* = \arg\min_\theta \sum_{i=1}^{N} (y_i - \theta^T x_i)^2$$
How do we find $\theta^* = [\theta_0, \theta_1, \ldots, \theta_d]$?

Least Squares: Estimation
- Necessary condition for a minimum of $\hat{L}$: the partial derivatives
$$\frac{\partial \hat{L}(\theta)}{\partial \theta_0}, \frac{\partial \hat{L}(\theta)}{\partial \theta_1}, \ldots, \frac{\partial \hat{L}(\theta)}{\partial \theta_d}$$
must all be set to zero
- This gives $d + 1$ linear equations in the $d + 1$ unknowns $\theta_0, \theta_1, \ldots, \theta_d$
- Let us switch to vector notation for convenience

Learning via Empirical Loss Minimization
First let us write down least squares in matrix form.
Predictions: $\hat{y} = X\theta$; errors: $y - X\theta$; empirical loss:
$$\hat{L}(\theta; X) = \frac{1}{N} (y - X\theta)^T (y - X\theta) = \frac{1}{N} (y^T - \theta^T X^T)(y - X\theta)$$
What next? Take the derivative of $\hat{L}(\theta)$ and set it to zero.

Derivative of Loss
$$\hat{L}(\theta) = \frac{1}{N} (y^T - \theta^T X^T)(y - X\theta)$$
Use the identities $\frac{\partial a^T b}{\partial a} = \frac{\partial b^T a}{\partial a} = b$ and $\frac{\partial a^T B a}{\partial a} = 2Ba$ (for symmetric $B$):
$$\frac{\partial \hat{L}(\theta)}{\partial \theta} = \frac{1}{N} \frac{\partial}{\partial \theta}\left[y^T y - \theta^T X^T y - y^T X\theta + \theta^T X^T X\theta\right] = \frac{1}{N}\left[0 - X^T y - (y^T X)^T + 2X^T X\theta\right] = -\frac{2}{N}\left(X^T y - X^T X\theta\right)$$

Least Squares Solution
$$\frac{\partial \hat{L}(\theta)}{\partial \theta} = -\frac{2}{N}\left(X^T y - X^T X\theta\right) = 0$$
$$\Rightarrow X^T y = X^T X\theta \;\Rightarrow\; \theta^* = (X^T X)^{-1} X^T y$$
$X^\dagger = (X^T X)^{-1} X^T$ is the Moore-Penrose pseudoinverse of $X$.
Linear regression in fact has a closed-form solution!
Prediction for a new input $x_0$: $\hat{y} = \theta^{*T} \begin{bmatrix} 1 \\ x_0 \end{bmatrix} = y^T (X^\dagger)^T \begin{bmatrix} 1 \\ x_0 \end{bmatrix}$
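
To make the closed-form solution concrete, here is a minimal NumPy sketch of least squares via the pseudoinverse (the synthetic data and names are illustrative, not from the slides):

```python
import numpy as np

def fit_least_squares(X, y):
    """theta* = (X^T X)^{-1} X^T y, computed via the Moore-Penrose pseudoinverse."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend the constant x_{i0} = 1
    return np.linalg.pinv(Xb) @ y                  # pinv also handles rank-deficient X

def predict(theta, X):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return Xb @ theta

# Usage on synthetic 1-D data: y = 2 + 3x + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 1))
y = 2.0 + 3.0 * X[:, 0] + 0.1 * rng.standard_normal(50)
theta_star = fit_least_squares(X, y)  # close to [2.0, 3.0]
```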

Polynomial Regression
- Transform $x \mapsto \phi(x)$
- For example, consider the 1-D case for simplicity:
$$f(x; \theta) = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_m x^m$$
- Here $\phi(x) = [1, x, x^2, \ldots, x^m]$
- No longer linear in $x$, but still linear in $\theta$: we are back to familiar linear regression!
- Generalized linear models: $f(x; \theta) = \theta_0 + \theta_1 \phi_1(x) + \theta_2 \phi_2(x) + \cdots + \theta_m \phi_m(x)$
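
Since the model is still linear in $\theta$, fitting a polynomial reuses the same closed-form solution on transformed features; a small sketch (assuming NumPy, names illustrative):

```python
import numpy as np

def poly_features(x, m):
    """Feature map phi(x) = [1, x, x^2, ..., x^m] for a 1-D input vector x."""
    return np.vander(x, N=m + 1, increasing=True)

def fit_poly(x, y, m):
    """Least squares in the transformed space: linear in theta, nonlinear in x."""
    return np.linalg.pinv(poly_features(x, m)) @ y

# Usage: theta = fit_poly(x, y, m=3); y_hat = poly_features(x_new, 3) @ theta
```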

A Short Primer on Regularization

Model Complexity and Overfitting
Consider data drawn from a 3rd-order model.

Avoiding Overfitting: Cross-Validation
- If a model overfits, i.e. it is too sensitive to the data, it will be unstable
- Idea: hold out part of the data, fit the model on the rest, and test on the held-out set
- $k$-fold cross-validation; the extreme case is leave-one-out cross-validation (a small code sketch follows the example below)
- What is the source of overfitting?

Cross Validation Example
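
As an illustration, a hedged sketch of $k$-fold cross-validation for choosing a polynomial degree, reusing fit_poly and poly_features from the earlier sketch (fold count and seed are arbitrary choices):

```python
import numpy as np

def k_fold_cv_mse(x, y, degrees, k=5, seed=0):
    """Estimate held-out squared error for each candidate degree via k-fold CV."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    scores = {}
    for m in degrees:
        errs = []
        for i in range(k):
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            theta = fit_poly(x[train], y[train], m)        # fit on the rest
            y_hat = poly_features(x[test], m) @ theta      # test on the held-out fold
            errs.append(np.mean((y_hat - y[test]) ** 2))
        scores[m] = float(np.mean(errs))
    return scores  # choose the degree with the smallest CV error
```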

Model Complexity
- Model complexity: the number of independent parameters to be fit ("degrees of freedom")
- A complex model is more sensitive to the data, and hence more likely to overfit
- A simple model is more rigid, and hence more likely to underfit
- Find the model with the right bias-variance balance

Penalizing Model Complexity
- Idea 1: restrict model complexity based on the amount of data
- Idea 2: directly penalize the number of parameters (as in the Akaike Information Criterion):
$$\text{minimize} \quad \sum_{i=1}^{N} L(f(x_i; \theta), y_i) + \#\text{params}$$
- Since the parameters might not be independent, we would like to penalize complexity in a more sophisticated way

Problems

Description Length
- Intuition: penalize not the number of parameters, but the number of bits needed to encode the parameters
- With a finite set of parameter values, the two are equivalent. With an infinite set, we can limit the effective number of degrees of freedom by restricting the values of the parameters.
- This gives regularized risk minimization:
$$\sum_{i=1}^{N} L(f(x_i; \theta), y_i) + \Omega(\theta)$$
- We can measure the size of $\theta$ in different ways: L1 norm, L2 norm, etc.
- Regularization is basically a way to implement Occam's Razor

Shrinkage Regression
Shrinkage methods penalize the L2 norm:
$$\theta^*_{\text{ridge}} = \arg\min_\theta \sum_{i=1}^{N} L(f(x_i; \theta), y_i) + \lambda \sum_{j=1}^{m} \theta_j^2$$
Equivalently, in likelihood form:
$$\theta^*_{\text{ridge}} = \arg\max_\theta \sum_{i=1}^{N} \log p(\text{data}_i; \theta) - \lambda \sum_{j=1}^{m} \theta_j^2$$
This is called ridge regression; for squared loss it has the closed-form solution $\hat{\theta}_{\text{ridge}} = (\lambda I + X^T X)^{-1} X^T y$!
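
A minimal sketch of the ridge closed form (note: in practice the bias $\theta_0$ is usually left unpenalized; this sketch penalizes all coefficients to match the formula above):

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Ridge regression: theta = (lambda * I + X^T X)^{-1} X^T y."""
    d = X.shape[1]
    # Solving the linear system is more stable than forming the inverse explicitly
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)
```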

LASSO Regression
LASSO penalizes the L1 norm:
$$\theta^*_{\text{lasso}} = \arg\min_\theta \sum_{i=1}^{N} L(f(x_i; \theta), y_i) + \lambda \sum_{j=1}^{m} |\theta_j|$$
Equivalently, in likelihood form:
$$\theta^*_{\text{lasso}} = \arg\max_\theta \sum_{i=1}^{N} \log p(\text{data}_i; \theta) - \lambda \sum_{j=1}^{m} |\theta_j|$$
This is called LASSO regression: there is no closed-form solution! The objective is still convex, but no longer smooth, so it is solved with iterative methods.
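
One standard iterative solver is proximal gradient descent (ISTA), in which the L1 penalty becomes a soft-thresholding step; a minimal sketch for the squared-loss case (step size and iteration count are illustrative choices, not from the slides):

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the L1 norm: shrink each coordinate toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fit_lasso_ista(X, y, lam, n_iters=1000):
    """Proximal gradient descent (ISTA) for 0.5 * ||X theta - y||^2 + lam * ||theta||_1."""
    theta = np.zeros(X.shape[1])
    L = np.linalg.norm(X, ord=2) ** 2          # Lipschitz constant of the gradient
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y)           # gradient of the smooth squared-loss part
        theta = soft_threshold(theta - grad / L, lam / L)
    return theta                               # many coordinates become exactly zero
```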

Effect of λ

The Principle of Maximum Likelihood
- Suppose we have $N$ data points $X = \{x_1, x_2, \ldots, x_N\}$ (or pairs $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$)
- Suppose we know the form of the probability distribution that describes the data, $p(x; \theta)$ (or $p(y \mid x; \theta)$), and we want to determine the parameter(s) $\theta$
- Pick $\theta$ so as to best explain your data. What does this mean?
- Suppose we had two candidate parameter values (or vectors), $\theta_1$ and $\theta_2$. Pretend that $\theta_1$ were the true value parameterizing $p$: what would be the probability of obtaining the dataset that you actually have? Call this $P_1$.
- If $P_1$ is very small, such a dataset would be very unlikely to occur, so perhaps $\theta_1$ was not a good guess

The Principle of Maximum Likelihood
- We want to pick $\theta_{ML}$, the value of $\theta$ that best explains the data we have
- The plausibility of the given data is measured by the likelihood function $p(X; \theta)$
- The maximum likelihood principle thus suggests we pick the $\theta$ that maximizes the likelihood function
- The procedure:
  1. Write down the log-likelihood, $\log p(X; \theta)$ (we'll see later why the log)
  2. To maximize, differentiate $\log p(X; \theta)$ with respect to $\theta$ and set the derivative to zero
  3. Solve for the $\theta$ that satisfies the equation; this is $\theta_{ML}$
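
As a worked instance of this recipe (an assumed example, not from the slides): for i.i.d. Gaussian data the procedure can be carried out in closed form, yielding the sample mean and the biased sample variance:

```python
import numpy as np

def gaussian_mle(x):
    """ML estimates for a univariate Gaussian N(mu, sigma^2).

    Differentiating the log-likelihood
        sum_i [ -0.5 * log(2*pi*sigma^2) - (x_i - mu)^2 / (2*sigma^2) ]
    with respect to mu and sigma^2 and setting to zero gives these closed forms.
    """
    mu = float(np.mean(x))
    sigma2 = float(np.mean((x - mu) ** 2))  # divides by N, not N - 1 (biased)
    return mu, sigma2
```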

The Principle of Maximum Likelihood
- As an aside: sometimes we have an initial guess for $\theta$ before seeing the data. We then use the data to refine our guess of $\theta$ via Bayes' theorem. This is called MAP (maximum a posteriori) estimation (we'll see an example).
- Advantages of ML estimation: a cookbook, "turn the crank" method; optimal for large sample sizes
- Disadvantages of ML estimation: not optimal for small sample sizes; can be computationally challenging (requires numerical methods)

Linear Classifiers
$$\hat{y} = h(x) = \text{sign}(\theta_0 + \theta^T x)$$
We need to find the direction $\theta$ and the location $\theta_0$.
We want to minimize the expected 0/1 loss for the classifier $h : \mathcal{X} \to \mathcal{Y}$:
$$L(h(x), y) = \begin{cases} 0 & \text{if } h(x) = y \\ 1 & \text{if } h(x) \neq y \end{cases}$$

Risk of a Classifier
The risk (expected loss) of a $C$-way classifier $h(x)$:
$$R(h) = \mathbb{E}_{x,y}[L(h(x), y)] = \int_x \sum_{c=1}^{C} L(h(x), c)\, p(x, y = c)\, dx = \int_x \left[\sum_{c=1}^{C} L(h(x), c)\, p(y = c \mid x)\right] p(x)\, dx$$
Clearly, it suffices to minimize the conditional risk:
$$R(h \mid x) = \sum_{c=1}^{C} L(h(x), c)\, p(y = c \mid x)$$

Conditional Risk of a Classifier
$$R(h \mid x) = \sum_{c=1}^{C} L(h(x), c)\, p(y = c \mid x) = 0 \cdot p(y = h(x) \mid x) + 1 \cdot \sum_{c \neq h(x)} p(y = c \mid x) = 1 - p(y = h(x) \mid x)$$
To minimize the conditional risk given $x$, the classifier must decide
$$h(x) = \arg\max_c\, p(y = c \mid x)$$
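
As a tiny illustration: given the posterior over classes at some $x$, the risk-minimizing decision under 0/1 loss is just an argmax (the posterior values below are made up):

```python
import numpy as np

# Posterior p(y = c | x) over C = 3 classes at a particular x (made-up values)
posterior = np.array([0.2, 0.5, 0.3])

h_x = int(np.argmax(posterior))       # Bayes-optimal decision: class 1
cond_risk = 1.0 - posterior[h_x]      # R(h|x) = 1 - p(y = h(x) | x) = 0.5
```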

Log-Odds Ratio
The optimal rule $h(x) = \arg\max_c p(y = c \mid x)$ is equivalent to:
$$h(x) = c \iff \frac{p(y = c \mid x)}{p(y = c' \mid x)} \geq 1 \;\; \forall c' \iff \log \frac{p(y = c \mid x)}{p(y = c' \mid x)} \geq 0 \;\; \forall c'$$
For the binary case:
$$h(x) = 1 \iff \log \frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} \geq 0$$

The Logistic Model
The unknown decision boundary can be modeled directly via the log-odds:
$$\log \frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \theta_0 + \theta^T x \quad (= 0 \text{ on the boundary})$$
Since $p(y = 0 \mid x) = 1 - p(y = 1 \mid x)$, exponentiating gives:
$$\frac{p(y = 1 \mid x)}{1 - p(y = 1 \mid x)} = \exp(\theta_0 + \theta^T x) \quad (= 1)$$
$$\Rightarrow \frac{1}{p(y = 1 \mid x)} = 1 + \exp(-\theta_0 - \theta^T x) \quad (= 2)$$
$$\Rightarrow p(y = 1 \mid x) = \frac{1}{1 + \exp(-\theta_0 - \theta^T x)} \quad \left(= \frac{1}{2}\right)$$

The Logistic Function
$$p(y = 1 \mid x) = \sigma(\theta_0 + \theta^T x) = \frac{1}{1 + \exp(-\theta_0 - \theta^T x)}$$
Properties? With the linear logistic model we get a linear decision boundary.

Likelihood under the Logistic Model
$$p(y_i \mid x_i; \theta) = \begin{cases} \sigma(\theta_0 + \theta^T x_i) & \text{if } y_i = 1 \\ 1 - \sigma(\theta_0 + \theta^T x_i) & \text{if } y_i = 0 \end{cases}$$
We can rewrite this as:
$$p(y_i \mid x_i; \theta) = \sigma(\theta_0 + \theta^T x_i)^{y_i} \left(1 - \sigma(\theta_0 + \theta^T x_i)\right)^{1 - y_i}$$
The log-likelihood of $\theta$:
$$\log p(y \mid X; \theta) = \sum_{i=1}^{N} \log p(y_i \mid x_i; \theta) = \sum_{i=1}^{N} y_i \log \sigma(\theta_0 + \theta^T x_i) + (1 - y_i) \log\left(1 - \sigma(\theta_0 + \theta^T x_i)\right)$$

The Maximum Likelihood Solution
$$\log p(y \mid X; \theta) = \sum_{i=1}^{N} y_i \log \sigma(\theta_0 + \theta^T x_i) + (1 - y_i) \log\left(1 - \sigma(\theta_0 + \theta^T x_i)\right)$$
Setting the derivatives to zero:
$$\frac{\partial \log p(y \mid X; \theta)}{\partial \theta_0} = \sum_{i=1}^{N} \left(y_i - \sigma(\theta_0 + \theta^T x_i)\right) = 0$$
$$\frac{\partial \log p(y \mid X; \theta)}{\partial \theta_j} = \sum_{i=1}^{N} \left(y_i - \sigma(\theta_0 + \theta^T x_i)\right) x_{i,j} = 0$$
We can treat $y_i - p(y_i \mid x_i) = y_i - \sigma(\theta_0 + \theta^T x_i)$ as the prediction error.

Finding Maxima
- There is no closed-form solution for the maximum likelihood in this model!
- But $\log p(y \mid X; \theta)$ is jointly concave in all components of $\theta$; equivalently, the error is convex
- Use gradient ascent on $\log p(y \mid X; \theta)$ (equivalently, gradient descent on the log loss)
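
A minimal gradient-ascent sketch for the logistic model (learning rate and iteration count are illustrative choices; labels are assumed to be in {0, 1}):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, y, lr=0.1, n_iters=2000):
    """Gradient ascent on the log-likelihood of the logistic model.

    The gradient is X^T (y - sigma(X theta)): inputs weighted by the
    prediction error, exactly as in the derivative formulas above.
    """
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend 1 for theta_0
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        err = y - sigmoid(Xb @ theta)              # y_i - p(y_i | x_i)
        theta += lr * Xb.T @ err / len(y)          # ascend the log-likelihood
    return theta
```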

Next time
- Feedforward networks
- Backpropagation