Lecture 2 Machine Learning Review

CMSC 35246: Deep Learning. Shubhendu Trivedi & Risi Kondor, University of Chicago. March 29, 2017

Things we will look at today
- Formal Setup for Supervised Learning
- Empirical Risk, Risk, Generalization
- Define and derive a linear model for Regression
- Revise Regularization
- Define and derive a linear model for Classification
- (Time permitting) Start with Feedforward Networks

Note: Most slides in this presentation are adapted from, or taken (with permission) from, slides by Professor Gregory Shakhnarovich for his TTIC course.

What is Machine Learning?
- The right question: what is learning?
- Tom Mitchell (Machine Learning, 1997): "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
- Gregory Shakhnarovich: make predictions and pay the price if the predictions are incorrect. The goal of learning is to reduce the price.
- How can you specify T, P, and E?

Formal Setup (Supervised)
- Input data space X
- Output (label) space Y
- Unknown function f : X → Y
- We have a dataset D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} with x_i ∈ X, y_i ∈ Y
- Finite Y ⇒ classification
- Continuous Y ⇒ regression

Regression
- We are given a set of N observations (x_i, y_i), i = 1, ..., N, with y_i ∈ R
- Example: measurements (possibly noisy) of barometric pressure x versus liquid boiling point y
- Does it make sense to use learning here?

Fitting a Function to Data
- We will approach this in two steps:
  - Choose a model class of functions
  - Design a criterion to guide the selection of one function from the chosen class
- Let us begin by considering one of the simplest model classes: linear functions

Linear Fitting to Data
- We want to fit a linear function to (X, Y) = {(x_1, y_1), ..., (x_N, y_N)}
- Fitting criterion: least squares. Find the function that minimizes the sum of squared differences between the predicted values and the actual y's in the training set
- We then use the fitted line as a predictor

Linear Functions
- General form: f(x; θ) = θ_0 + θ_1 x_1 + ... + θ_d x_d
- 1-D case: a line
- X ⊆ R^2: a plane
- In the general d-dimensional case: a hyperplane

Loss Functions
- Targets live in Y. Binary classification: Y = {-1, +1}; univariate regression: Y ⊆ R
- A loss function is a map L : Y × Y → R
- L maps decisions to costs: L(ŷ, y) is the penalty for predicting ŷ when the correct answer is y
- Standard choice for classification: the 0/1 loss
- Standard choice for regression: L(ŷ, y) = (ŷ - y)^2
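As a small illustration (my own sketch, not from the slides), the two standard losses can be written directly in Python, assuming labels in {-1, +1} for classification and real-valued targets for regression:

```python
def zero_one_loss(y_hat, y):
    """0/1 loss: 1 if the predicted label differs from the true label, else 0."""
    return 0.0 if y_hat == y else 1.0

def squared_loss(y_hat, y):
    """Squared loss for regression: (y_hat - y)^2."""
    return (y_hat - y) ** 2

print(zero_one_loss(+1, -1))   # 1.0 (misclassified)
print(squared_loss(2.5, 3.0))  # 0.25
```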

Empirical Loss
- Consider a parametric function f(x; θ)
- Example: a linear function, f(x_i; θ) = θ_0 + Σ_{j=1}^d θ_j x_{ij} = θ^T x_i (with the convention x_{i0} ≡ 1)
- The empirical loss of the function y = f(x; θ) on a set X:
  L(θ, X, y) = (1/N) Σ_{i=1}^N L(f(x_i; θ), y_i)
- Least squares minimizes the empirical loss for the squared loss L
- We care about: predicting labels for new examples
- When does empirical loss minimization help us do that?
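To make the definition concrete, here is a minimal NumPy sketch (an illustration with made-up data and parameters, not part of the slides) that evaluates the empirical squared loss of a linear model:

```python
import numpy as np

def linear_predict(X, theta):
    """f(x; theta) = theta^T x, where X already contains a leading column of 1s (x_{i0} = 1)."""
    return X @ theta

def empirical_loss(theta, X, y):
    """(1/N) * sum_i (f(x_i; theta) - y_i)^2, the empirical squared loss."""
    residuals = linear_predict(X, theta) - y
    return np.mean(residuals ** 2)

# Toy data: N = 4 points in d = 1 dimension, with a bias column prepended
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.1, 1.2, 1.9, 3.1])
theta = np.array([0.0, 1.0])   # an arbitrary guess
print(empirical_loss(theta, X, y))
```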

Loss: Empirical and Expected
- Basic assumption: example and label pairs (x, y) are drawn from an unknown distribution p(x, y)
- Data are i.i.d.: the same (albeit unknown) distribution generates all (x, y) in both the training and test data
- The empirical loss is measured on the training set:
  L(θ, X, y) = (1/N) Σ_{i=1}^N L(f(x_i; θ), y_i)
- The goal is to minimize the expected loss, also known as the risk:
  R(θ) = E_{(x_0, y_0) ~ p(x, y)}[L(f(x_0; θ), y_0)]

Empirical Risk Minimization
- Empirical loss: L(θ, X, y) = (1/N) Σ_{i=1}^N L(f(x_i; θ), y_i)
- Risk: R(θ) = E_{(x_0, y_0) ~ p(x, y)}[L(f(x_0; θ), y_0)]
- Empirical risk minimization: if the training set is representative of the underlying (unknown) distribution p(x, y), the empirical loss is a proxy for the risk
- In essence: estimate p(x, y) by the empirical distribution of the data
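The following small simulation (my own illustration, with a made-up distribution p(x, y)) shows the distinction: the empirical loss is computed on a small training sample, while the risk is approximated by averaging the loss over a very large sample from the same distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw (x, y) pairs from a made-up distribution: y = 2x + Gaussian noise."""
    x = rng.uniform(-1, 1, size=n)
    y = 2.0 * x + rng.normal(scale=0.5, size=n)
    return x, y

def avg_squared_loss(theta, x, y):
    """Average squared loss of the predictor f(x) = theta * x on the given sample."""
    return np.mean((theta * x - y) ** 2)

theta = 1.8                        # some fixed hypothesis
x_tr, y_tr = sample(20)            # small training set
x_big, y_big = sample(200_000)     # large sample approximates the expectation (risk)

print("empirical loss (N=20):", avg_squared_loss(theta, x_tr, y_tr))
print("risk (Monte Carlo)   :", avg_squared_loss(theta, x_big, y_big))
```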

Learning via Empirical Loss Minimization
- Learning is done in two steps:
  - Select a restricted class H of hypotheses f : X → Y. Example: linear functions parameterized by θ, f(x; θ) = θ^T x
  - Select a hypothesis f ∈ H based on the training set D = (X, Y). Example: minimize the empirical squared loss, i.e. select f(x; θ*) such that
    θ* = arg min_θ Σ_{i=1}^N (y_i - θ^T x_i)^2
- How do we find θ = [θ_0, θ_1, ..., θ_d]?

Least Squares: Estimation
- A necessary condition for minimizing L: the partial derivatives ∂L(θ)/∂θ_0, ∂L(θ)/∂θ_1, ..., ∂L(θ)/∂θ_d must all be set to zero
- This gives d + 1 linear equations in the d + 1 unknowns θ_0, θ_1, ..., θ_d
- Let us switch to vector notation for convenience

Learning via Empirical Loss Minimization
- First, let us write least squares in matrix form:
  - Predictions: ŷ = Xθ
  - Errors: y - Xθ
  - Empirical loss: L(θ, X) = (1/N)(y - Xθ)^T (y - Xθ) = (1/N)(y^T - θ^T X^T)(y - Xθ)
- What next? Take the derivative of L(θ) and set it to zero

Derivative of Loss
- L(θ) = (1/N)(y^T - θ^T X^T)(y - Xθ)
- Use the identities ∂(a^T b)/∂a = ∂(b^T a)/∂a = b and ∂(a^T B a)/∂a = 2Ba (for symmetric B)
- ∂L(θ)/∂θ = (1/N) ∂/∂θ [y^T y - θ^T X^T y - y^T Xθ + θ^T X^T Xθ]
            = (1/N) [0 - X^T y - (y^T X)^T + 2 X^T Xθ]
            = -(2/N) (X^T y - X^T Xθ)
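As a sanity check on the derivation (an illustration, not from the slides), the analytic gradient -(2/N)(X^T y - X^T Xθ) can be compared against finite differences on random data:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # bias column + features
y = rng.normal(size=N)
theta = rng.normal(size=d + 1)

def loss(theta):
    r = y - X @ theta
    return (r @ r) / N

# Analytic gradient from the derivation: dL/dtheta = -(2/N) (X^T y - X^T X theta)
grad_analytic = -(2.0 / N) * (X.T @ y - X.T @ X @ theta)

# Central finite-difference check, one coordinate at a time
eps = 1e-6
grad_numeric = np.array([
    (loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
    for e in np.eye(d + 1)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # expect True
```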

Least Squares Solution
- Setting ∂L(θ)/∂θ = -(2/N)(X^T y - X^T Xθ) = 0
- gives X^T y = X^T Xθ, so θ* = (X^T X)^{-1} X^T y
- X† = (X^T X)^{-1} X^T is the Moore-Penrose pseudoinverse of X (when X^T X is invertible)
- Linear regression in fact has a closed-form solution!
- Prediction for a new input x_0: ŷ = θ*^T [1; x_0] = y^T (X†)^T [1; x_0]
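A minimal NumPy sketch of the closed-form solution (my own illustration with synthetic data; in practice a least-squares solver is preferred to forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 100, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # bias column + features
true_theta = np.array([0.5, 2.0, -1.0])
y = X @ true_theta + 0.1 * rng.normal(size=N)

# Closed form theta* = (X^T X)^{-1} X^T y, here via the pseudoinverse
theta_closed = np.linalg.pinv(X) @ y

# Numerically preferable: solve the least-squares problem directly
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

x_new = np.array([1.0, 0.3, -0.7])   # prediction for a new input [1; x_0]
print(theta_closed, theta_lstsq, theta_closed @ x_new)
```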

Polynomial Regression
- Transform x → φ(x)
- For example, consider 1-D for simplicity: f(x; θ) = θ_0 + θ_1 x + θ_2 x^2 + ... + θ_m x^m
- Here φ(x) = [1, x, x^2, ..., x^m]
- No longer linear in x, but still linear in θ!
- Back to familiar linear regression!
- Generalized linear models: f(x; θ) = θ_0 + θ_1 φ_1(x) + θ_2 φ_2(x) + ... + θ_m φ_m(x)
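A short sketch (illustrative; the data and the degree m = 3 are made up) of fitting a polynomial by expanding the features to φ(x) and reusing the same least-squares machinery:

```python
import numpy as np

def poly_features(x, m):
    """phi(x) = [1, x, x^2, ..., x^m] for a 1-D input array x."""
    return np.vander(x, N=m + 1, increasing=True)

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=30)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.1 * rng.normal(size=30)   # made-up cubic data

Phi = poly_features(x, m=3)                       # still linear in theta
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # ordinary least squares on phi(x)
print(theta)                                      # should be roughly [1, -2, 0, 0.5]
```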

A Short Primer on Regularization

Model Complexity and Overfitting
- Consider data drawn from a 3rd-order model (figure on the slide)

Avoiding Overfitting: Cross-Validation
- If the model overfits, i.e. it is too sensitive to the data, it will be unstable
- Idea: hold out part of the data, fit the model on the rest, and test on the held-out set
- k-fold cross-validation; the extreme case is leave-one-out cross-validation
- What is the source of overfitting?
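A minimal k-fold cross-validation sketch (an illustration with made-up data, using NumPy's polynomial fitting as the model class) that estimates the held-out error of polynomial fits of different degrees:

```python
import numpy as np

def kfold_mse(x, y, degree, k=5, seed=0):
    """Average held-out squared error of a degree-`degree` polynomial fit, via k-fold CV."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train], y[train], deg=degree)   # fit on the rest
        preds = np.polyval(coeffs, x[test])                   # evaluate on the held-out fold
        errors.append(np.mean((preds - y[test]) ** 2))
    return np.mean(errors)

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 40)
y = 1 - 2 * x + 0.5 * x**3 + 0.2 * rng.normal(size=40)
for deg in (1, 3, 9):
    print(deg, kfold_mse(x, y, deg))   # overly flexible fits tend to do worse held out
```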

Cross-Validation Example (figure on the slide)

Model Complexity
- Model complexity is the number of independent parameters to be fit ("degrees of freedom")
- Complex model ⇒ more sensitive to the data ⇒ more likely to overfit
- Simple model ⇒ more rigid ⇒ more likely to underfit
- Find the model with the right bias-variance balance

Penalizing Model Complexity
- Idea 1: restrict model complexity based on the amount of data
- Idea 2: directly penalize the number of parameters (as in the Akaike Information Criterion):
  minimize Σ_{i=1}^N L(f(x_i; θ), y_i) + #params
- Since the parameters might not be independent, we would like to penalize complexity in a more sophisticated way

Problems

Description Length
- Intuition: we should not penalize the number of parameters, but the number of bits needed to encode the parameters
- With a finite set of parameter values, these are equivalent. With an infinite set, we can limit the effective number of degrees of freedom by restricting the values of the parameters.
- Then we arrive at regularized risk minimization:
  minimize Σ_{i=1}^N L(f(x_i; θ), y_i) + Ω(θ)
- We can measure the "size" of θ in different ways: the L1 norm, the L2 norm, etc.
- Regularization is basically a way to implement Occam's razor

Shrinkage Regression
- Shrinkage methods penalize the L2 norm:
  θ_ridge = arg min_θ Σ_{i=1}^N L(f(x_i; θ), y_i) + λ Σ_{j=1}^m θ_j^2
- If we work with the likelihood instead:
  θ_ridge = arg max_θ Σ_{i=1}^N log p(data_i; θ) - λ Σ_{j=1}^m θ_j^2
- This is called ridge regression; for the squared loss it has the closed-form solution θ̂_ridge = (λI + X^T X)^{-1} X^T y!
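A small NumPy sketch of the ridge closed form (my own illustration; note that this version penalizes every coefficient, whereas in practice the bias term is often left unpenalized):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge solution (lam * I + X^T X)^{-1} X^T y (all coefficients penalized)."""
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)

rng = np.random.default_rng(5)
N, d = 30, 5
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=N)
for lam in (0.0, 1.0, 100.0):
    print(lam, np.round(ridge_closed_form(X, y, lam), 3))   # larger lambda shrinks theta toward 0
```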

LASSO Regression
- LASSO penalizes the L1 norm:
  θ_lasso = arg min_θ Σ_{i=1}^N L(f(x_i; θ), y_i) + λ Σ_{j=1}^m |θ_j|
- If we work with the likelihood instead:
  θ_lasso = arg max_θ Σ_{i=1}^N log p(data_i; θ) - λ Σ_{j=1}^m |θ_j|
- LASSO regression has no closed-form solution!
- The objective is still convex, but no longer smooth; the penalized form corresponds, via Lagrange multipliers, to a constrained problem, and in practice it is solved with iterative methods
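Since there is no closed form, one typically relies on an iterative solver. A sketch (assuming scikit-learn is available; the alpha values and data are made up) contrasting the sparse LASSO solution with the ridge solution:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(6)
N, d = 50, 10
X = rng.normal(size=(N, d))
true_theta = np.zeros(d)
true_theta[:3] = [2.0, -1.0, 0.5]             # only 3 of 10 features matter
y = X @ true_theta + 0.1 * rng.normal(size=N)

lasso = Lasso(alpha=0.1).fit(X, y)            # alpha plays the role of lambda
ridge = Ridge(alpha=0.1).fit(X, y)
print("lasso:", np.round(lasso.coef_, 2))     # many coefficients driven exactly to 0
print("ridge:", np.round(ridge.coef_, 2))     # coefficients shrunk but typically nonzero
```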

Effect of λ (figure on the slide)

The Principle of Maximum Likelihood
- Suppose we have N data points X = {x_1, x_2, ..., x_N} (or {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)})
- Suppose we know the form of the probability distribution that describes the data, p(x; θ) (or p(y | x; θ))
- Suppose we want to determine the parameter(s) θ
- Pick θ so as to explain your data best
- What does this mean? Suppose we had two parameter values (or vectors), θ_1 and θ_2.
- Now suppose you were to pretend that θ_1 really was the true value parameterizing p. What would be the probability of obtaining the dataset that you have? Call this P_1
- If P_1 is very small, it means that such a dataset is very unlikely to occur, so perhaps θ_1 was not a good guess

The Principle of Maximum Likelihood
- We want to pick θ_ML, i.e. the value of θ that best explains the data you have
- The plausibility of the given data is measured by the likelihood function p(X; θ)
- The maximum likelihood principle thus suggests we pick the θ that maximizes the likelihood function
- The procedure:
  - Write down the log-likelihood function log p(X; θ) (we'll see later why the log)
  - We want to maximize, so differentiate log p(X; θ) with respect to θ and set the derivative to zero
  - Solve for the θ that satisfies the equation; this is θ_ML
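A toy run of this recipe (my own illustration, with made-up Gaussian data and the variance assumed known): the log-likelihood is evaluated over a grid of candidate means, and its maximizer agrees with the closed-form MLE, the sample mean.

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(loc=3.0, scale=1.0, size=200)   # made-up data, true mean 3.0

def log_likelihood(mu, x, sigma=1.0):
    """log p(x; mu) for i.i.d. Gaussian data with known sigma (additive constants dropped)."""
    return -0.5 * np.sum((x - mu) ** 2) / sigma**2

mus = np.linspace(0.0, 6.0, 601)
lls = np.array([log_likelihood(mu, data) for mu in mus])
mu_ml = mus[np.argmax(lls)]
print(mu_ml, data.mean())   # the grid maximizer matches the sample mean up to grid resolution
```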

The Principle of Maximum Likelihood
- As an aside: sometimes we have an initial guess for θ BEFORE seeing the data
- We then use the data to refine our guess of θ using Bayes' theorem
- This is called MAP (maximum a posteriori) estimation (we'll see an example)
- Advantages of ML estimation:
  - A "cookbook", turn-the-crank method
  - Optimal for large sample sizes
- Disadvantages of ML estimation:
  - Not optimal for small sample sizes
  - Can be computationally challenging (numerical methods)

Linear Classifiers
- ŷ = h(x) = sign(θ_0 + θ^T x)
- We need to find the direction θ and the location θ_0
- We want to minimize the expected 0/1 loss for a classifier h : X → Y:
  L(h(x), y) = 0 if h(x) = y, and 1 if h(x) ≠ y

Risk of a Classifier
- The risk (expected loss) of a C-way classifier h(x):
  R(h) = E_{x,y}[L(h(x), y)]
       = ∫_x Σ_{c=1}^C L(h(x), c) p(x, y = c) dx
       = ∫_x [ Σ_{c=1}^C L(h(x), c) p(y = c | x) ] p(x) dx
- Clearly, it suffices to minimize the conditional risk:
  R(h | x) = Σ_{c=1}^C L(h(x), c) p(y = c | x)

Conditional Risk of a Classifier
- R(h | x) = Σ_{c=1}^C L(h(x), c) p(y = c | x)
           = 0 · p(y = h(x) | x) + Σ_{c ≠ h(x)} 1 · p(y = c | x)
           = Σ_{c ≠ h(x)} p(y = c | x) = 1 - p(y = h(x) | x)
- To minimize the conditional risk given x, the classifier must decide
  h(x) = arg max_c p(y = c | x)
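A tiny numeric illustration of this rule (the class posteriors below are made up, not from the slides): under 0/1 loss, the conditional risk of predicting class c is 1 - p(y = c | x), so the Bayes decision is the most probable class.

```python
import numpy as np

# Made-up class posteriors p(y = c | x) for a single input x, with C = 3 classes
posterior = np.array([0.2, 0.5, 0.3])

cond_risk = 1.0 - posterior         # conditional risk of predicting each class under 0/1 loss
h_x = np.argmax(posterior)          # Bayes decision: the most probable class

print(cond_risk)                    # [0.8, 0.5, 0.7]
print(h_x, cond_risk[h_x])          # class 1, with minimal conditional risk 0.5
```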

Log Odds Ratio
- The optimal rule h(x) = arg max_c p(y = c | x) is equivalent to:
  h(x) = c ⇔ p(y = c | x) / p(y = c' | x) ≥ 1 for all c' ≠ c
        ⇔ log [ p(y = c | x) / p(y = c' | x) ] ≥ 0 for all c' ≠ c
- For the binary case:
  h(x) = 1 ⇔ log [ p(y = 1 | x) / p(y = 0 | x) ] ≥ 0

The Logistic Model
- The unknown decision boundary can be modeled directly via the log-odds:
  log [ p(y = 1 | x) / p(y = 0 | x) ] = θ_0 + θ^T x, with decision boundary θ_0 + θ^T x = 0
- Since p(y = 0 | x) = 1 - p(y = 1 | x), exponentiating (and evaluating on the boundary) gives:
  p(y = 1 | x) / (1 - p(y = 1 | x)) = exp(θ_0 + θ^T x) = 1
  ⇒ 1 / p(y = 1 | x) = 1 + exp(-θ_0 - θ^T x) = 2
  ⇒ p(y = 1 | x) = 1 / (1 + exp(-θ_0 - θ^T x)) = 1/2

The Logistic Function
- p(y = 1 | x) = 1 / (1 + exp(-θ_0 - θ^T x)) = σ(θ_0 + θ^T x), where σ is the logistic (sigmoid) function. Properties?
- With a linear logistic model we get a linear decision boundary
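A quick sketch of the sigmoid (my own illustration; the 1-D parameters are made up) showing that the predicted probability crosses 1/2 exactly where θ_0 + θ^T x = 0:

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

theta0, theta = -1.0, np.array([2.0])    # made-up 1-D model
for x in (-1.0, 0.5, 2.0):
    z = theta0 + theta @ np.array([x])
    print(x, sigmoid(z))                 # equals 0.5 exactly on the boundary x = 0.5
```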

Likelihood under the Logistic Model
- p(y_i | x_i; θ) = σ(θ_0 + θ^T x_i) if y_i = 1, and 1 - σ(θ_0 + θ^T x_i) if y_i = 0
- We can rewrite this as:
  p(y_i | x_i; θ) = σ(θ_0 + θ^T x_i)^{y_i} (1 - σ(θ_0 + θ^T x_i))^{1 - y_i}
- The log-likelihood of θ:
  log p(y | X; θ) = Σ_{i=1}^N log p(y_i | x_i; θ)
                  = Σ_{i=1}^N [ y_i log σ(θ_0 + θ^T x_i) + (1 - y_i) log(1 - σ(θ_0 + θ^T x_i)) ]

The Maximum Likelihood Solution
- log p(y | X; θ) = Σ_{i=1}^N [ y_i log σ(θ_0 + θ^T x_i) + (1 - y_i) log(1 - σ(θ_0 + θ^T x_i)) ]
- Setting the derivatives to zero:
  ∂ log p(y | X; θ) / ∂θ_0 = Σ_{i=1}^N (y_i - σ(θ_0 + θ^T x_i)) = 0
  ∂ log p(y | X; θ) / ∂θ_j = Σ_{i=1}^N (y_i - σ(θ_0 + θ^T x_i)) x_{i,j} = 0
- We can treat y_i - p(y_i | x_i) = y_i - σ(θ_0 + θ^T x_i) as the prediction error

Finding Maxima
- There is no closed-form solution for the maximum likelihood estimate in this model!
- But log p(y | X; θ) is jointly concave in all components of θ
- Or, equivalently, the error (the negative log-likelihood, i.e. the log loss) is convex
- Use gradient ascent on log p(y | X; θ), or equivalently gradient descent on the log loss
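A minimal sketch of batch gradient ascent for logistic regression (my own illustration with synthetic data; the learning rate and iteration count are arbitrary), using the gradient Σ_i (y_i - σ(θ^T x_i)) x_i derived above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(8)
N, d = 200, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # bias column + features
true_theta = np.array([-0.5, 2.0, -1.5])
y = (rng.uniform(size=N) < sigmoid(X @ true_theta)).astype(float)

theta = np.zeros(d + 1)
lr = 0.1
for _ in range(2000):
    error = y - sigmoid(X @ theta)    # y_i - sigma(theta^T x_i), the "prediction error"
    grad = X.T @ error / N            # gradient of the (average) log-likelihood
    theta += lr * grad                # ascent step (equivalently, descent on the log loss)

print(np.round(theta, 2))             # should land roughly near true_theta, up to sampling noise
```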

Next time
- Feedforward Networks
- Backpropagation


Case Study 1: Estimating Click Probabilities. Kakade Announcements: Project Proposals: due this Friday! Case Study 1: Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 4, 017 1 Announcements:

More information

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression CSC2515 Winter 2015 Introduction to Machine Learning Lecture 2: Linear regression All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/csc2515_winter15.html

More information

MLCC 2017 Regularization Networks I: Linear Models

MLCC 2017 Regularization Networks I: Linear Models MLCC 2017 Regularization Networks I: Linear Models Lorenzo Rosasco UNIGE-MIT-IIT June 27, 2017 About this class We introduce a class of learning algorithms based on Tikhonov regularization We study computational

More information

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features

More information

Linear Regression (continued)

Linear Regression (continued) Linear Regression (continued) Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 1 / 39 Outline 1 Administration 2 Review of last lecture 3 Linear regression

More information

Probabilistic Graphical Models & Applications

Probabilistic Graphical Models & Applications Probabilistic Graphical Models & Applications Learning of Graphical Models Bjoern Andres and Bernt Schiele Max Planck Institute for Informatics The slides of today s lecture are authored by and shown with

More information

Generative v. Discriminative classifiers Intuition

Generative v. Discriminative classifiers Intuition Logistic Regression (Continued) Generative v. Discriminative Decision rees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 31 st, 2007 2005-2007 Carlos Guestrin 1 Generative

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning Tobias Scheffer, Niels Landwehr Remember: Normal Distribution Distribution over x. Density function with parameters

More information

ECE 5984: Introduction to Machine Learning

ECE 5984: Introduction to Machine Learning ECE 5984: Introduction to Machine Learning Topics: Classification: Logistic Regression NB & LR connections Readings: Barber 17.4 Dhruv Batra Virginia Tech Administrativia HW2 Due: Friday 3/6, 3/15, 11:55pm

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V

More information

Introduction to Statistical modeling: handout for Math 489/583

Introduction to Statistical modeling: handout for Math 489/583 Introduction to Statistical modeling: handout for Math 489/583 Statistical modeling occurs when we are trying to model some data using statistical tools. From the start, we recognize that no model is perfect

More information

Linear Models for Regression. Sargur Srihari

Linear Models for Regression. Sargur Srihari Linear Models for Regression Sargur srihari@cedar.buffalo.edu 1 Topics in Linear Regression What is regression? Polynomial Curve Fitting with Scalar input Linear Basis Function Models Maximum Likelihood

More information

CPSC 340: Machine Learning and Data Mining

CPSC 340: Machine Learning and Data Mining CPSC 340: Machine Learning and Data Mining Linear Classifiers: predictions Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1 Admin Assignment 4: Due Friday of next

More information

Classification. Chapter Introduction. 6.2 The Bayes classifier

Classification. Chapter Introduction. 6.2 The Bayes classifier Chapter 6 Classification 6.1 Introduction Often encountered in applications is the situation where the response variable Y takes values in a finite set of labels. For example, the response Y could encode

More information

Machine Learning Tom M. Mitchell Machine Learning Department Carnegie Mellon University. September 20, 2012

Machine Learning Tom M. Mitchell Machine Learning Department Carnegie Mellon University. September 20, 2012 Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University September 20, 2012 Today: Logistic regression Generative/Discriminative classifiers Readings: (see class website)

More information

Machine Learning 4771

Machine Learning 4771 Machine Learning 4771 Instructor: Tony Jebara Topic 3 Additive Models and Linear Regression Sinusoids and Radial Basis Functions Classification Logistic Regression Gradient Descent Polynomial Basis Functions

More information

Support vector machines Lecture 4

Support vector machines Lecture 4 Support vector machines Lecture 4 David Sontag New York University Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin Q: What does the Perceptron mistake bound tell us? Theorem: The

More information

SCMA292 Mathematical Modeling : Machine Learning. Krikamol Muandet. Department of Mathematics Faculty of Science, Mahidol University.

SCMA292 Mathematical Modeling : Machine Learning. Krikamol Muandet. Department of Mathematics Faculty of Science, Mahidol University. SCMA292 Mathematical Modeling : Machine Learning Krikamol Muandet Department of Mathematics Faculty of Science, Mahidol University February 9, 2016 Outline Quick Recap of Least Square Ridge Regression

More information