Lecture 2 Machine Learning Review

CMSC 35246: Deep Learning. Shubhendu Trivedi & Risi Kondor, University of Chicago. March 29, 2017

Things we will look at today
- Formal Setup for Supervised Learning
- Empirical Risk, Risk, Generalization
- Define and derive a linear model for Regression
- Revise Regularization
- Define and derive a linear model for Classification
- (Time permitting) Start with Feedforward Networks

Note: Most slides in this presentation are adapted from, or taken (with permission) from, slides by Professor Gregory Shakhnarovich for his TTIC course.

What is Machine Learning?
- The right question: what is learning?
- Tom Mitchell (Machine Learning, 1997): "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
- Gregory Shakhnarovich: make predictions and pay the price if the predictions are incorrect. The goal of learning is to reduce the price.
- How can you specify T, P, and E?

Formal Setup (Supervised)
- Input data space X
- Output (label) space Y
- Unknown function f : X → Y
- We have a dataset D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} with x_i ∈ X, y_i ∈ Y
- Finite Y ⇒ classification
- Continuous Y ⇒ regression

Regression
- We are given a set of N observations (x_i, y_i), i = 1, ..., N, with y_i ∈ R
- Example: measurements (possibly noisy) of barometric pressure x versus liquid boiling point y
- Does it make sense to use learning here?

Fitting a Function to Data
- We will approach this in two steps:
  - Choose a model class of functions
  - Design a criterion to guide the selection of one function from the chosen class
- Let us begin by considering one of the simplest model classes: linear functions

Linear Fitting to Data
- We want to fit a linear function to (X, Y) = {(x_1, y_1), ..., (x_N, y_N)}
- Fitting criterion: least squares. Find the function that minimizes the sum of squared differences between the predicted values and the actual y's in the training set
- We then use the fitted line as a predictor

Linear Functions
- General form: f(x; θ) = θ_0 + θ_1 x_1 + ... + θ_d x_d
- 1-D case: a line
- X ⊆ R^2: a plane
- In the general d-dimensional case: a hyperplane

Loss Functions
- Targets live in Y. Binary classification: Y = {-1, +1}; univariate regression: Y ⊆ R
- A loss function is a map L : Y × Y → R
- L maps decisions to costs: L(ŷ, y) is the penalty for predicting ŷ when the correct answer is y
- Standard choice for classification: the 0/1 loss
- Standard choice for regression: L(ŷ, y) = (ŷ - y)^2
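As a small illustration (my own sketch, not from the slides), the two standard losses can be written directly in Python, assuming labels in {-1, +1} for classification and real-valued targets for regression:

```python
def zero_one_loss(y_hat, y):
    """0/1 loss: 1 if the predicted label differs from the true label, else 0."""
    return 0.0 if y_hat == y else 1.0

def squared_loss(y_hat, y):
    """Squared loss for regression: (y_hat - y)^2."""
    return (y_hat - y) ** 2

print(zero_one_loss(+1, -1))   # 1.0 (misclassified)
print(squared_loss(2.5, 3.0))  # 0.25
```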

Empirical Loss
- Consider a parametric function f(x; θ)
- Example: a linear function, f(x_i; θ) = θ_0 + Σ_{j=1}^d θ_j x_{ij} = θ^T x_i (with the convention x_{i0} ≡ 1)
- The empirical loss of the function y = f(x; θ) on a set X:
  L(θ, X, y) = (1/N) Σ_{i=1}^N L(f(x_i; θ), y_i)
- Least squares minimizes the empirical loss for the squared loss L
- We care about: predicting labels for new examples
- When does empirical loss minimization help us do that?
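To make the definition concrete, here is a minimal NumPy sketch (an illustration with made-up data and parameters, not part of the slides) that evaluates the empirical squared loss of a linear model:

```python
import numpy as np

def linear_predict(X, theta):
    """f(x; theta) = theta^T x, where X already contains a leading column of 1s (x_{i0} = 1)."""
    return X @ theta

def empirical_loss(theta, X, y):
    """(1/N) * sum_i (f(x_i; theta) - y_i)^2, the empirical squared loss."""
    residuals = linear_predict(X, theta) - y
    return np.mean(residuals ** 2)

# Toy data: N = 4 points in d = 1 dimension, with a bias column prepended
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.1, 1.2, 1.9, 3.1])
theta = np.array([0.0, 1.0])   # an arbitrary guess
print(empirical_loss(theta, X, y))
```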

Loss: Empirical and Expected
- Basic assumption: example and label pairs (x, y) are drawn from an unknown distribution p(x, y)
- Data are i.i.d.: the same (albeit unknown) distribution generates all (x, y) in both the training and test data
- The empirical loss is measured on the training set:
  L(θ, X, y) = (1/N) Σ_{i=1}^N L(f(x_i; θ), y_i)
- The goal is to minimize the expected loss, also known as the risk:
  R(θ) = E_{(x_0, y_0) ~ p(x, y)}[L(f(x_0; θ), y_0)]

Empirical Risk Minimization
- Empirical loss: L(θ, X, y) = (1/N) Σ_{i=1}^N L(f(x_i; θ), y_i)
- Risk: R(θ) = E_{(x_0, y_0) ~ p(x, y)}[L(f(x_0; θ), y_0)]
- Empirical risk minimization: if the training set is representative of the underlying (unknown) distribution p(x, y), the empirical loss is a proxy for the risk
- In essence: estimate p(x, y) by the empirical distribution of the data
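The following small simulation (my own illustration, with a made-up distribution p(x, y)) shows the distinction: the empirical loss is computed on a small training sample, while the risk is approximated by averaging the loss over a very large sample from the same distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw (x, y) pairs from a made-up distribution: y = 2x + Gaussian noise."""
    x = rng.uniform(-1, 1, size=n)
    y = 2.0 * x + rng.normal(scale=0.5, size=n)
    return x, y

def avg_squared_loss(theta, x, y):
    """Average squared loss of the predictor f(x) = theta * x on the given sample."""
    return np.mean((theta * x - y) ** 2)

theta = 1.8                        # some fixed hypothesis
x_tr, y_tr = sample(20)            # small training set
x_big, y_big = sample(200_000)     # large sample approximates the expectation (risk)

print("empirical loss (N=20):", avg_squared_loss(theta, x_tr, y_tr))
print("risk (Monte Carlo)   :", avg_squared_loss(theta, x_big, y_big))
```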

Learning via Empirical Loss Minimization
- Learning is done in two steps:
  - Select a restricted class H of hypotheses f : X → Y. Example: linear functions parameterized by θ, f(x; θ) = θ^T x
  - Select a hypothesis f ∈ H based on the training set D = (X, Y). Example: minimize the empirical squared loss, i.e. select f(x; θ*) such that
    θ* = arg min_θ Σ_{i=1}^N (y_i - θ^T x_i)^2
- How do we find θ = [θ_0, θ_1, ..., θ_d]?

Least Squares: Estimation
- A necessary condition for minimizing L: the partial derivatives ∂L(θ)/∂θ_0, ∂L(θ)/∂θ_1, ..., ∂L(θ)/∂θ_d must all be set to zero
- This gives d + 1 linear equations in the d + 1 unknowns θ_0, θ_1, ..., θ_d
- Let us switch to vector notation for convenience

Learning via Empirical Loss Minimization
- First, let us write least squares in matrix form:
  - Predictions: ŷ = Xθ
  - Errors: y - Xθ
  - Empirical loss: L(θ, X) = (1/N)(y - Xθ)^T (y - Xθ) = (1/N)(y^T - θ^T X^T)(y - Xθ)
- What next? Take the derivative of L(θ) and set it to zero

Derivative of Loss
- L(θ) = (1/N)(y^T - θ^T X^T)(y - Xθ)
- Use the identities ∂(a^T b)/∂a = ∂(b^T a)/∂a = b and ∂(a^T B a)/∂a = 2Ba (for symmetric B)
- ∂L(θ)/∂θ = (1/N) ∂/∂θ [y^T y - θ^T X^T y - y^T Xθ + θ^T X^T Xθ]
            = (1/N) [0 - X^T y - (y^T X)^T + 2 X^T Xθ]
            = -(2/N) (X^T y - X^T Xθ)
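As a sanity check on the derivation (an illustration, not from the slides), the analytic gradient -(2/N)(X^T y - X^T Xθ) can be compared against finite differences on random data:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # bias column + features
y = rng.normal(size=N)
theta = rng.normal(size=d + 1)

def loss(theta):
    r = y - X @ theta
    return (r @ r) / N

# Analytic gradient from the derivation: dL/dtheta = -(2/N) (X^T y - X^T X theta)
grad_analytic = -(2.0 / N) * (X.T @ y - X.T @ X @ theta)

# Central finite-difference check, one coordinate at a time
eps = 1e-6
grad_numeric = np.array([
    (loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
    for e in np.eye(d + 1)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # expect True
```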

Least Squares Solution
- Setting ∂L(θ)/∂θ = -(2/N)(X^T y - X^T Xθ) = 0
- gives X^T y = X^T Xθ, so θ* = (X^T X)^{-1} X^T y
- X† = (X^T X)^{-1} X^T is the Moore-Penrose pseudoinverse of X (when X^T X is invertible)
- Linear regression in fact has a closed-form solution!
- Prediction for a new input x_0: ŷ = θ*^T [1; x_0] = y^T (X†)^T [1; x_0]
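A minimal NumPy sketch of the closed-form solution (my own illustration with synthetic data; in practice a least-squares solver is preferred to forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 100, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # bias column + features
true_theta = np.array([0.5, 2.0, -1.0])
y = X @ true_theta + 0.1 * rng.normal(size=N)

# Closed form theta* = (X^T X)^{-1} X^T y, here via the pseudoinverse
theta_closed = np.linalg.pinv(X) @ y

# Numerically preferable: solve the least-squares problem directly
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

x_new = np.array([1.0, 0.3, -0.7])   # prediction for a new input [1; x_0]
print(theta_closed, theta_lstsq, theta_closed @ x_new)
```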

Polynomial Regression
- Transform x → φ(x)
- For example, consider 1-D for simplicity: f(x; θ) = θ_0 + θ_1 x + θ_2 x^2 + ... + θ_m x^m
- Here φ(x) = [1, x, x^2, ..., x^m]
- No longer linear in x, but still linear in θ!
- Back to familiar linear regression!
- Generalized linear models: f(x; θ) = θ_0 + θ_1 φ_1(x) + θ_2 φ_2(x) + ... + θ_m φ_m(x)
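A short sketch (illustrative; the data and the degree m = 3 are made up) of fitting a polynomial by expanding the features to φ(x) and reusing the same least-squares machinery:

```python
import numpy as np

def poly_features(x, m):
    """phi(x) = [1, x, x^2, ..., x^m] for a 1-D input array x."""
    return np.vander(x, N=m + 1, increasing=True)

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=30)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.1 * rng.normal(size=30)   # made-up cubic data

Phi = poly_features(x, m=3)                       # still linear in theta
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # ordinary least squares on phi(x)
print(theta)                                      # should be roughly [1, -2, 0, 0.5]
```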

A Short Primer on Regularization

Model Complexity and Overfitting
- Consider data drawn from a 3rd-order model (figure on the slide)

Avoiding Overfitting: Cross-Validation
- If the model overfits, i.e. it is too sensitive to the data, it will be unstable
- Idea: hold out part of the data, fit the model on the rest, and test on the held-out set
- k-fold cross-validation; the extreme case is leave-one-out cross-validation
- What is the source of overfitting?
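A minimal k-fold cross-validation sketch (an illustration with made-up data, using NumPy's polynomial fitting as the model class) that estimates the held-out error of polynomial fits of different degrees:

```python
import numpy as np

def kfold_mse(x, y, degree, k=5, seed=0):
    """Average held-out squared error of a degree-`degree` polynomial fit, via k-fold CV."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train], y[train], deg=degree)   # fit on the rest
        preds = np.polyval(coeffs, x[test])                   # evaluate on the held-out fold
        errors.append(np.mean((preds - y[test]) ** 2))
    return np.mean(errors)

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 40)
y = 1 - 2 * x + 0.5 * x**3 + 0.2 * rng.normal(size=40)
for deg in (1, 3, 9):
    print(deg, kfold_mse(x, y, deg))   # overly flexible fits tend to do worse held out
```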

Cross-Validation Example (figure on the slide)

Model Complexity
- Model complexity is the number of independent parameters to be fit ("degrees of freedom")
- Complex model ⇒ more sensitive to the data ⇒ more likely to overfit
- Simple model ⇒ more rigid ⇒ more likely to underfit
- Find the model with the right bias-variance balance

Penalizing Model Complexity
- Idea 1: restrict model complexity based on the amount of data
- Idea 2: directly penalize the number of parameters (as in the Akaike Information Criterion):
  minimize Σ_{i=1}^N L(f(x_i; θ), y_i) + #params
- Since the parameters might not be independent, we would like to penalize complexity in a more sophisticated way

Problems

Description Length
- Intuition: we should not penalize the number of parameters, but the number of bits needed to encode the parameters
- With a finite set of parameter values, these are equivalent. With an infinite set, we can limit the effective number of degrees of freedom by restricting the values of the parameters.
- Then we arrive at regularized risk minimization:
  minimize Σ_{i=1}^N L(f(x_i; θ), y_i) + Ω(θ)
- We can measure the "size" of θ in different ways: the L1 norm, the L2 norm, etc.
- Regularization is basically a way to implement Occam's razor

Shrinkage Regression
- Shrinkage methods penalize the L2 norm:
  θ_ridge = arg min_θ Σ_{i=1}^N L(f(x_i; θ), y_i) + λ Σ_{j=1}^m θ_j^2
- If we work with the likelihood instead:
  θ_ridge = arg max_θ Σ_{i=1}^N log p(data_i; θ) - λ Σ_{j=1}^m θ_j^2
- This is called ridge regression; for the squared loss it has the closed-form solution θ̂_ridge = (λI + X^T X)^{-1} X^T y!
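A small NumPy sketch of the ridge closed form (my own illustration; note that this version penalizes every coefficient, whereas in practice the bias term is often left unpenalized):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge solution (lam * I + X^T X)^{-1} X^T y (all coefficients penalized)."""
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)

rng = np.random.default_rng(5)
N, d = 30, 5
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=N)
for lam in (0.0, 1.0, 100.0):
    print(lam, np.round(ridge_closed_form(X, y, lam), 3))   # larger lambda shrinks theta toward 0
```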

LASSO Regression
- LASSO penalizes the L1 norm:
  θ_lasso = arg min_θ Σ_{i=1}^N L(f(x_i; θ), y_i) + λ Σ_{j=1}^m |θ_j|
- If we work with the likelihood instead:
  θ_lasso = arg max_θ Σ_{i=1}^N log p(data_i; θ) - λ Σ_{j=1}^m |θ_j|
- LASSO regression has no closed-form solution!
- The objective is still convex, but no longer smooth; the penalized form corresponds, via Lagrange multipliers, to a constrained problem, and in practice it is solved with iterative methods
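Since there is no closed form, one typically relies on an iterative solver. A sketch (assuming scikit-learn is available; the alpha values and data are made up) contrasting the sparse LASSO solution with the ridge solution:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(6)
N, d = 50, 10
X = rng.normal(size=(N, d))
true_theta = np.zeros(d)
true_theta[:3] = [2.0, -1.0, 0.5]             # only 3 of 10 features matter
y = X @ true_theta + 0.1 * rng.normal(size=N)

lasso = Lasso(alpha=0.1).fit(X, y)            # alpha plays the role of lambda
ridge = Ridge(alpha=0.1).fit(X, y)
print("lasso:", np.round(lasso.coef_, 2))     # many coefficients driven exactly to 0
print("ridge:", np.round(ridge.coef_, 2))     # coefficients shrunk but typically nonzero
```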

Effect of λ (figure on the slide)

The Principle of Maximum Likelihood
- Suppose we have N data points X = {x_1, x_2, ..., x_N} (or {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)})
- Suppose we know the form of the probability distribution that describes the data, p(x; θ) (or p(y | x; θ))
- Suppose we want to determine the parameter(s) θ
- Pick θ so as to explain your data best
- What does this mean? Suppose we had two parameter values (or vectors), θ_1 and θ_2.
- Now suppose you were to pretend that θ_1 really was the true value parameterizing p. What would be the probability of obtaining the dataset that you have? Call this P_1
- If P_1 is very small, it means that such a dataset is very unlikely to occur, so perhaps θ_1 was not a good guess

The Principle of Maximum Likelihood
- We want to pick θ_ML, i.e. the value of θ that best explains the data you have
- The plausibility of the given data is measured by the likelihood function p(X; θ)
- The maximum likelihood principle thus suggests we pick the θ that maximizes the likelihood function
- The procedure:
  - Write down the log-likelihood function log p(X; θ) (we'll see later why the log)
  - We want to maximize, so differentiate log p(X; θ) with respect to θ and set the derivative to zero
  - Solve for the θ that satisfies the equation; this is θ_ML
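A toy run of this recipe (my own illustration, with made-up Gaussian data and the variance assumed known): the log-likelihood is evaluated over a grid of candidate means, and its maximizer agrees with the closed-form MLE, the sample mean.

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(loc=3.0, scale=1.0, size=200)   # made-up data, true mean 3.0

def log_likelihood(mu, x, sigma=1.0):
    """log p(x; mu) for i.i.d. Gaussian data with known sigma (additive constants dropped)."""
    return -0.5 * np.sum((x - mu) ** 2) / sigma**2

mus = np.linspace(0.0, 6.0, 601)
lls = np.array([log_likelihood(mu, data) for mu in mus])
mu_ml = mus[np.argmax(lls)]
print(mu_ml, data.mean())   # the grid maximizer matches the sample mean up to grid resolution
```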

The Principle of Maximum Likelihood
- As an aside: sometimes we have an initial guess for θ BEFORE seeing the data
- We then use the data to refine our guess of θ using Bayes' theorem
- This is called MAP (maximum a posteriori) estimation (we'll see an example)
- Advantages of ML estimation:
  - A "cookbook", turn-the-crank method
  - Optimal for large sample sizes
- Disadvantages of ML estimation:
  - Not optimal for small sample sizes
  - Can be computationally challenging (numerical methods)

Linear Classifiers
- ŷ = h(x) = sign(θ_0 + θ^T x)
- We need to find the direction θ and the location θ_0
- We want to minimize the expected 0/1 loss for a classifier h : X → Y:
  L(h(x), y) = 0 if h(x) = y, and 1 if h(x) ≠ y

Risk of a Classifier
- The risk (expected loss) of a C-way classifier h(x):
  R(h) = E_{x,y}[L(h(x), y)]
       = ∫_x Σ_{c=1}^C L(h(x), c) p(x, y = c) dx
       = ∫_x [ Σ_{c=1}^C L(h(x), c) p(y = c | x) ] p(x) dx
- Clearly, it suffices to minimize the conditional risk:
  R(h | x) = Σ_{c=1}^C L(h(x), c) p(y = c | x)

Conditional Risk of a Classifier
- R(h | x) = Σ_{c=1}^C L(h(x), c) p(y = c | x)
           = 0 · p(y = h(x) | x) + Σ_{c ≠ h(x)} 1 · p(y = c | x)
           = Σ_{c ≠ h(x)} p(y = c | x) = 1 - p(y = h(x) | x)
- To minimize the conditional risk given x, the classifier must decide
  h(x) = arg max_c p(y = c | x)
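A tiny numeric illustration of this rule (the class posteriors below are made up, not from the slides): under 0/1 loss, the conditional risk of predicting class c is 1 - p(y = c | x), so the Bayes decision is the most probable class.

```python
import numpy as np

# Made-up class posteriors p(y = c | x) for a single input x, with C = 3 classes
posterior = np.array([0.2, 0.5, 0.3])

cond_risk = 1.0 - posterior         # conditional risk of predicting each class under 0/1 loss
h_x = np.argmax(posterior)          # Bayes decision: the most probable class

print(cond_risk)                    # [0.8, 0.5, 0.7]
print(h_x, cond_risk[h_x])          # class 1, with minimal conditional risk 0.5
```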

Log Odds Ratio
- The optimal rule h(x) = arg max_c p(y = c | x) is equivalent to:
  h(x) = c ⇔ p(y = c | x) / p(y = c' | x) ≥ 1 for all c' ≠ c
        ⇔ log [ p(y = c | x) / p(y = c' | x) ] ≥ 0 for all c' ≠ c
- For the binary case:
  h(x) = 1 ⇔ log [ p(y = 1 | x) / p(y = 0 | x) ] ≥ 0

The Logistic Model
- The unknown decision boundary can be modeled directly via the log-odds:
  log [ p(y = 1 | x) / p(y = 0 | x) ] = θ_0 + θ^T x, with decision boundary θ_0 + θ^T x = 0
- Since p(y = 0 | x) = 1 - p(y = 1 | x), exponentiating (and evaluating on the boundary) gives:
  p(y = 1 | x) / (1 - p(y = 1 | x)) = exp(θ_0 + θ^T x) = 1
  ⇒ 1 / p(y = 1 | x) = 1 + exp(-θ_0 - θ^T x) = 2
  ⇒ p(y = 1 | x) = 1 / (1 + exp(-θ_0 - θ^T x)) = 1/2

The Logistic Function
- p(y = 1 | x) = 1 / (1 + exp(-θ_0 - θ^T x)) = σ(θ_0 + θ^T x), where σ is the logistic (sigmoid) function. Properties?
- With a linear logistic model we get a linear decision boundary
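A quick sketch of the sigmoid (my own illustration; the 1-D parameters are made up) showing that the predicted probability crosses 1/2 exactly where θ_0 + θ^T x = 0:

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

theta0, theta = -1.0, np.array([2.0])    # made-up 1-D model
for x in (-1.0, 0.5, 2.0):
    z = theta0 + theta @ np.array([x])
    print(x, sigmoid(z))                 # equals 0.5 exactly on the boundary x = 0.5
```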

Likelihood under the Logistic Model
- p(y_i | x_i; θ) = σ(θ_0 + θ^T x_i) if y_i = 1, and 1 - σ(θ_0 + θ^T x_i) if y_i = 0
- We can rewrite this as:
  p(y_i | x_i; θ) = σ(θ_0 + θ^T x_i)^{y_i} (1 - σ(θ_0 + θ^T x_i))^{1 - y_i}
- The log-likelihood of θ:
  log p(y | X; θ) = Σ_{i=1}^N log p(y_i | x_i; θ)
                  = Σ_{i=1}^N [ y_i log σ(θ_0 + θ^T x_i) + (1 - y_i) log(1 - σ(θ_0 + θ^T x_i)) ]

The Maximum Likelihood Solution
- log p(y | X; θ) = Σ_{i=1}^N [ y_i log σ(θ_0 + θ^T x_i) + (1 - y_i) log(1 - σ(θ_0 + θ^T x_i)) ]
- Setting the derivatives to zero:
  ∂ log p(y | X; θ) / ∂θ_0 = Σ_{i=1}^N (y_i - σ(θ_0 + θ^T x_i)) = 0
  ∂ log p(y | X; θ) / ∂θ_j = Σ_{i=1}^N (y_i - σ(θ_0 + θ^T x_i)) x_{i,j} = 0
- We can treat y_i - p(y_i | x_i) = y_i - σ(θ_0 + θ^T x_i) as the prediction error

Finding Maxima
- There is no closed-form solution for the maximum likelihood estimate in this model!
- But log p(y | X; θ) is jointly concave in all components of θ
- Or, equivalently, the error (the negative log-likelihood, i.e. the log loss) is convex
- Use gradient ascent on log p(y | X; θ), or equivalently gradient descent on the log loss
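A minimal sketch of batch gradient ascent for logistic regression (my own illustration with synthetic data; the learning rate and iteration count are arbitrary), using the gradient Σ_i (y_i - σ(θ^T x_i)) x_i derived above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(8)
N, d = 200, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # bias column + features
true_theta = np.array([-0.5, 2.0, -1.5])
y = (rng.uniform(size=N) < sigmoid(X @ true_theta)).astype(float)

theta = np.zeros(d + 1)
lr = 0.1
for _ in range(2000):
    error = y - sigmoid(X @ theta)    # y_i - sigma(theta^T x_i), the "prediction error"
    grad = X.T @ error / N            # gradient of the (average) log-likelihood
    theta += lr * grad                # ascent step (equivalently, descent on the log loss)

print(np.round(theta, 2))             # should land roughly near true_theta, up to sampling noise
```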

Next time
- Feedforward Networks
- Backpropagation


Case Study 1: Estimating Click Probabilities. Kakade Announcements: Project Proposals: due this Friday! Case Study 1: Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 4, 017 1 Announcements:

More information

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression CSC2515 Winter 2015 Introduction to Machine Learning Lecture 2: Linear regression All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/csc2515_winter15.html

More information

MLCC 2017 Regularization Networks I: Linear Models

MLCC 2017 Regularization Networks I: Linear Models MLCC 2017 Regularization Networks I: Linear Models Lorenzo Rosasco UNIGE-MIT-IIT June 27, 2017 About this class We introduce a class of learning algorithms based on Tikhonov regularization We study computational

More information

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features

More information

Linear Regression (continued)

Linear Regression (continued) Linear Regression (continued) Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 1 / 39 Outline 1 Administration 2 Review of last lecture 3 Linear regression

More information

Probabilistic Graphical Models & Applications

Probabilistic Graphical Models & Applications Probabilistic Graphical Models & Applications Learning of Graphical Models Bjoern Andres and Bernt Schiele Max Planck Institute for Informatics The slides of today s lecture are authored by and shown with

More information

Generative v. Discriminative classifiers Intuition

Generative v. Discriminative classifiers Intuition Logistic Regression (Continued) Generative v. Discriminative Decision rees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 31 st, 2007 2005-2007 Carlos Guestrin 1 Generative

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning Tobias Scheffer, Niels Landwehr Remember: Normal Distribution Distribution over x. Density function with parameters

More information

ECE 5984: Introduction to Machine Learning

ECE 5984: Introduction to Machine Learning ECE 5984: Introduction to Machine Learning Topics: Classification: Logistic Regression NB & LR connections Readings: Barber 17.4 Dhruv Batra Virginia Tech Administrativia HW2 Due: Friday 3/6, 3/15, 11:55pm

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V

More information

Introduction to Statistical modeling: handout for Math 489/583

Introduction to Statistical modeling: handout for Math 489/583 Introduction to Statistical modeling: handout for Math 489/583 Statistical modeling occurs when we are trying to model some data using statistical tools. From the start, we recognize that no model is perfect

More information

Linear Models for Regression. Sargur Srihari

Linear Models for Regression. Sargur Srihari Linear Models for Regression Sargur srihari@cedar.buffalo.edu 1 Topics in Linear Regression What is regression? Polynomial Curve Fitting with Scalar input Linear Basis Function Models Maximum Likelihood

More information

CPSC 340: Machine Learning and Data Mining

CPSC 340: Machine Learning and Data Mining CPSC 340: Machine Learning and Data Mining Linear Classifiers: predictions Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1 Admin Assignment 4: Due Friday of next

More information

Classification. Chapter Introduction. 6.2 The Bayes classifier

Classification. Chapter Introduction. 6.2 The Bayes classifier Chapter 6 Classification 6.1 Introduction Often encountered in applications is the situation where the response variable Y takes values in a finite set of labels. For example, the response Y could encode

More information

Machine Learning Tom M. Mitchell Machine Learning Department Carnegie Mellon University. September 20, 2012

Machine Learning Tom M. Mitchell Machine Learning Department Carnegie Mellon University. September 20, 2012 Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University September 20, 2012 Today: Logistic regression Generative/Discriminative classifiers Readings: (see class website)

More information

Machine Learning 4771

Machine Learning 4771 Machine Learning 4771 Instructor: Tony Jebara Topic 3 Additive Models and Linear Regression Sinusoids and Radial Basis Functions Classification Logistic Regression Gradient Descent Polynomial Basis Functions

More information

Support vector machines Lecture 4

Support vector machines Lecture 4 Support vector machines Lecture 4 David Sontag New York University Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin Q: What does the Perceptron mistake bound tell us? Theorem: The

More information

SCMA292 Mathematical Modeling : Machine Learning. Krikamol Muandet. Department of Mathematics Faculty of Science, Mahidol University.

SCMA292 Mathematical Modeling : Machine Learning. Krikamol Muandet. Department of Mathematics Faculty of Science, Mahidol University. SCMA292 Mathematical Modeling : Machine Learning Krikamol Muandet Department of Mathematics Faculty of Science, Mahidol University February 9, 2016 Outline Quick Recap of Least Square Ridge Regression

More information