Lecture 2 Machine Learning Review
CMSC 35246: Deep Learning
Shubhendu Trivedi & Risi Kondor
University of Chicago, March 29, 2017
Things we will look at today
- Formal setup for supervised learning
- Empirical risk, risk, generalization
- Define and derive a linear model for regression
- Revise regularization
- Define and derive a linear model for classification
- (Time permitting) Start with feedforward networks
Note: Most slides in this presentation are adapted from, or taken (with permission) from, slides by Professor Gregory Shakhnarovich for his TTIC course
What is Machine Learning?
Right question: what is learning?
- Tom Mitchell (Machine Learning, 1997): "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E"
- Gregory Shakhnarovich: make predictions and pay the price if the predictions are incorrect. The goal of learning is to reduce the price.
How can you specify T, P, and E?
Formal Setup (Supervised)
- Input data space X
- Output (label) space Y
- Unknown function f : X → Y
- We have a dataset D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} with x_i ∈ X, y_i ∈ Y
- Finite Y ⟹ classification
- Continuous Y ⟹ regression
Regression
- We are given a set of N observations (x_i, y_i), i = 1, ..., N, with y_i ∈ ℝ
- Example: measurements (possibly noisy) of barometric pressure x versus liquid boiling point y
- Does it make sense to use learning here?
Fitting Function to Data
We will approach this in two steps:
- Choose a model class of functions
- Design a criterion to guide the selection of one function from the chosen class
Let us begin by considering one of the simplest model classes: linear functions
Linear Fitting to Data
- We want to fit a linear function to (X, Y) = {(x_1, y_1), ..., (x_N, y_N)}
- Fitting criterion: least squares. Find the function that minimizes the sum of squared distances between the predicted values and the actual y's in the training set
- We then use the fitted line as a predictor
Linear Functions
General form: f(x; θ) = θ_0 + θ_1 x_1 + ... + θ_d x_d
- 1-D case: a line
- x ∈ ℝ²: a plane
- General d-dimensional case: a hyperplane
Loss Functions
- Targets are in Y. Binary classification: Y = {−1, +1}. Univariate regression: Y ⊆ ℝ
- A loss function L : Y × Y → ℝ
- L maps decisions to costs: L(ŷ, y) is the penalty for predicting ŷ when the correct answer is y
- Standard choice for classification: 0/1 loss
- Standard choice for regression: L(ŷ, y) = (ŷ − y)²
Empirical Loss
- Consider a parametric function f(x; θ)
- Example: a linear function f(x_i; θ) = θ_0 + ∑_{j=1}^d θ_j x_{ij} = θ^T x_i (Note: x_{i0} ≡ 1)
- The empirical loss of the function y = f(x; θ) on a set X:
  L(θ, X, y) = (1/N) ∑_{i=1}^N L(f(x_i; θ), y_i)
- Least squares minimizes the empirical loss for the squared loss L
- We care about predicting labels for new examples. When does empirical loss minimization help us do that?
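As a concrete illustration of the definition above (a minimal sketch with made-up toy data, not from the slides), the empirical squared loss of a linear model is just the mean of the squared residuals:

```python
import numpy as np

def empirical_loss(theta, X, y):
    """Mean squared loss (1/N) * sum_i (f(x_i; theta) - y_i)^2
    for the linear model f(x; theta) = theta^T x.
    X is N x (d+1) with a leading column of ones (x_{i0} = 1)."""
    residuals = X @ theta - y
    return np.mean(residuals ** 2)

# Toy data from y = 2 + 3x with no noise, so the true theta gives zero loss.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])   # prepend the bias feature x_{i0} = 1
y = 2.0 + 3.0 * x

print(empirical_loss(np.array([2.0, 3.0]), X, y))  # 0.0 for the true parameters
print(empirical_loss(np.array([0.0, 0.0]), X, y))  # larger for a bad guess
```

The choice of a linear f here matches the slide's example; any other parametric f would plug into the same formula.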
Loss: Empirical and Expected
- Basic assumption: example and label pairs (x, y) are drawn from an unknown distribution p(x, y)
- Data are i.i.d.: the same (albeit unknown) distribution for all (x, y) in both training and test data
- The empirical loss is measured on the training set:
  L(θ, X, y) = (1/N) ∑_{i=1}^N L(f(x_i; θ), y_i)
- The goal is to minimize the expected loss, also known as the risk:
  R(θ) = E_{(x_0, y_0) ~ p(x, y)}[L(f(x_0; θ), y_0)]
Empirical Risk Minimization
- Empirical loss: L(θ, X, y) = (1/N) ∑_{i=1}^N L(f(x_i; θ), y_i)
- Risk: R(θ) = E_{(x_0, y_0) ~ p(x, y)}[L(f(x_0; θ), y_0)]
- Empirical risk minimization: if the training set is representative of the underlying (unknown) distribution p(x, y), the empirical loss is a proxy for the risk
- In essence: estimate p(x, y) by the empirical distribution of the data
Learning via Empirical Loss Minimization
Learning is done in two steps:
- Select a restricted class H of hypotheses f : X → Y. Example: linear functions parameterized by θ, f(x; θ) = θ^T x
- Select a hypothesis f ∈ H based on the training set D = (X, Y). Example: minimize the empirical squared loss, i.e., select f(x; θ*) such that
  θ* = argmin_θ ∑_{i=1}^N (y_i − θ^T x_i)²
How do we find θ* = [θ_0, θ_1, ..., θ_d]?
Least Squares: Estimation
- Necessary condition to minimize L: the partial derivatives
  ∂L(θ)/∂θ_0, ∂L(θ)/∂θ_1, ..., ∂L(θ)/∂θ_d
  must all be set to zero
- This gives d + 1 linear equations in the d + 1 unknowns θ_0, θ_1, ..., θ_d
- Let us switch to vector notation for convenience
Learning via Empirical Loss Minimization
First let us write down least squares in matrix form:
- Predictions: ŷ = Xθ
- Errors: y − Xθ
- Empirical loss: L(θ, X) = (1/N)(y − Xθ)^T (y − Xθ) = (1/N)(y^T − θ^T X^T)(y − Xθ)
What next? Take the derivative of L(θ) and set it to zero
Derivative of Loss
  L(θ) = (1/N)(y^T − θ^T X^T)(y − Xθ)
Use the identities ∂(a^T b)/∂a = ∂(b^T a)/∂a = b and ∂(a^T B a)/∂a = 2Ba (for symmetric B):
  ∂L(θ)/∂θ = (1/N) ∂/∂θ [ y^T y − θ^T X^T y − y^T Xθ + θ^T X^T Xθ ]
           = (1/N) [ 0 − X^T y − (y^T X)^T + 2 X^T Xθ ]
           = −(2/N)(X^T y − X^T Xθ)
Least Squares Solution
  ∂L(θ)/∂θ = −(2/N)(X^T y − X^T Xθ) = 0
  X^T y = X^T Xθ  ⟹  θ* = (X^T X)^{-1} X^T y
X^† = (X^T X)^{-1} X^T is the Moore–Penrose pseudoinverse of X
Linear regression in fact has a closed form solution!
Prediction for a new input x_0: ŷ = θ*^T [1, x_0]^T = (X^† y)^T [1, x_0]^T
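The closed form can be checked numerically; this is a minimal sketch on synthetic, noise-free data (the data and parameter values are made up for illustration). Solving the normal equations X^T Xθ = X^T y with a linear solver is numerically preferable to forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a known linear model y = 1 + 2*x1 - 3*x2 (noise-free,
# so the recovered parameters are exact up to floating point).
N, d = 50, 2
X_raw = rng.normal(size=(N, d))
X = np.column_stack([np.ones(N), X_raw])   # x_{i0} = 1 absorbs theta_0
theta_true = np.array([1.0, 2.0, -3.0])
y = X @ theta_true

# theta* = (X^T X)^{-1} X^T y, via the normal equations
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)   # ~ [1, 2, -3]

# Prediction for a new input x0: yhat = theta*^T [1, x0]^T
x0 = np.array([0.5, -0.5])
y_hat = theta_hat @ np.concatenate([[1.0], x0])
```

`np.linalg.lstsq` or `np.linalg.pinv` compute the same solution via the pseudoinverse and are more robust when X^T X is ill-conditioned.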
Polynomial Regression
- Transform x → φ(x). For example, consider the 1-D case for simplicity:
  f(x; θ) = θ_0 + θ_1 x + θ_2 x² + ... + θ_m x^m
- Here φ(x) = [1, x, x², ..., x^m]
- No longer linear in x, but still linear in θ!
- Back to familiar linear regression!
- Generalized linear models: f(x; θ) = θ_0 + θ_1 φ_1(x) + θ_2 φ_2(x) + ... + θ_m φ_m(x)
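Because the model stays linear in θ, the same least squares machinery applies after the feature transform; a minimal sketch (the cubic below is made up for illustration):

```python
import numpy as np

def poly_features(x, m):
    """phi(x) = [1, x, x^2, ..., x^m] for a 1-D input array x."""
    return np.column_stack([x ** j for j in range(m + 1)])

# Noise-free samples from a cubic, f(x) = 1 - 2x + 0.5x^3.
x = np.linspace(-2, 2, 25)
y = 1 - 2 * x + 0.5 * x ** 3

Phi = poly_features(x, m=3)                      # N x (m+1) design matrix
theta = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)  # ordinary least squares on phi(x)
print(theta)   # ~ [1, -2, 0, 0.5]
```

Any other basis φ_1, ..., φ_m (e.g. radial basis functions) slots into `poly_features`'s place unchanged, which is the point of the generalized linear model view.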
A Short Primer on Regularization
Model Complexity and Overfitting
Consider data drawn from a 3rd order model:
Avoiding Overfitting: Cross Validation
- If the model overfits, i.e. it is too sensitive to the data, it will be unstable
- Idea: hold out part of the data, fit the model on the rest, and test on the held-out set
- k-fold cross validation. Extreme case: leave-one-out cross validation
- What is the source of overfitting?
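The hold-out idea above can be sketched as a small k-fold helper (an illustrative sketch, not from the slides; the `fit`/`predict` callables and the toy data are assumptions):

```python
import numpy as np

def kfold_cv_mse(X, y, k, fit, predict):
    """Average held-out squared error over k folds.
    `fit(X, y)` returns a model; `predict(model, X)` returns predictions."""
    idx = np.arange(len(y))
    folds = np.array_split(idx, k)
    losses = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        losses.append(np.mean((predict(model, X[test]) - y[test]) ** 2))
    return np.mean(losses)

# Least squares as the model being validated.
fit = lambda X, y: np.linalg.solve(X.T @ X, X.T @ y)
predict = lambda theta, X: X @ theta

x = np.linspace(0, 1, 20)
X = np.column_stack([np.ones_like(x), x])
y = 2 + 3 * x                 # noise-free line, so held-out error is ~0
cv_error = kfold_cv_mse(X, y, k=5, fit=fit, predict=predict)
print(cv_error)
```

Setting k = N in the same helper gives leave-one-out cross validation; in practice one would also shuffle the indices before splitting.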
Cross Validation Example
Model Complexity
- Model complexity is the number of independent parameters to be fit ("degrees of freedom")
- Complex model ⟹ more sensitive to the data ⟹ more likely to overfit
- Simple model ⟹ more rigid ⟹ more likely to underfit
- Find the model with the right bias-variance balance
Penalizing Model Complexity
- Idea 1: restrict model complexity based on the amount of data
- Idea 2: directly penalize by the number of parameters (the Akaike Information Criterion):
  minimize ∑_{i=1}^N L(f(x_i; θ), y_i) + #params
- Since the parameters might not be independent, we would like to penalize complexity in a more sophisticated way
Problems
Description Length
- Intuition: we should penalize not the number of parameters, but the number of bits needed to encode the parameters
- With a finite set of parameter values these are equivalent. With an infinite set, we can limit the effective number of degrees of freedom by restricting the values of the parameters
- Then we have regularized risk minimization:
  ∑_{i=1}^N L(f(x_i; θ), y_i) + Ω(θ)
- We can measure size in different ways: L1 norm, L2 norm, etc.
- Regularization is basically a way to implement Occam's razor
Shrinkage Regression
Shrinkage methods: we penalize the L2 norm
  θ*_ridge = argmin_θ ∑_{i=1}^N L(f(x_i; θ), y_i) + λ ∑_{j=1}^m θ_j²
If we use the likelihood:
  θ*_ridge = argmax_θ ∑_{i=1}^N log p(data_i; θ) − λ ∑_{j=1}^m θ_j²
This is called ridge regression; for the squared loss it has the closed form solution
  θ̂_ridge = (λI + X^T X)^{-1} X^T y!
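The ridge closed form is a one-line change from ordinary least squares; a minimal sketch on made-up data (the bias term is omitted here so that the penalty matches the slide's sum over j = 1..m; in practice θ_0 is usually left unpenalized):

```python
import numpy as np

def ridge(X, y, lam):
    """theta_ridge = (lambda*I + X^T X)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)

theta_ols = ridge(X, y, lam=0.0)    # lambda = 0 recovers ordinary least squares
theta_reg = ridge(X, y, lam=10.0)   # larger lambda shrinks theta toward zero
print(np.linalg.norm(theta_reg), "<", np.linalg.norm(theta_ols))
```

Note also that λI + X^T X is invertible for any λ > 0, so ridge remains well-posed even when X^T X itself is singular.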
LASSO Regression
LASSO: we penalize the L1 norm
  θ*_lasso = argmin_θ ∑_{i=1}^N L(f(x_i; θ), y_i) + λ ∑_{j=1}^m |θ_j|
If we use the likelihood:
  θ*_lasso = argmax_θ ∑_{i=1}^N log p(data_i; θ) − λ ∑_{j=1}^m |θ_j|
This is called LASSO regression: no closed form solution!
Still convex, but no longer smooth. Solve using Lagrange multipliers!
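Since there is no closed form, the LASSO objective is minimized iteratively in practice. One standard approach (proximal gradient descent with soft-thresholding, a.k.a. ISTA — an illustration of my own choosing, not the method named on the slide) handles the non-smooth L1 term exactly and produces the exact zeros LASSO is known for:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t*||.||_1: shrink each coordinate toward 0 by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, iters=5000):
    """Minimize ||y - X theta||^2 + lam * ||theta||_1 by proximal gradient."""
    # Step size 1/L, where L = 2*sigma_max(X)^2 is the gradient's Lipschitz constant.
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = 2 * X.T @ (X @ theta - y)          # gradient of the smooth part
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 5))
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0])      # sparse true parameters

theta = lasso_ista(X, y, lam=5.0)
print(theta)   # large entries survive; the rest are driven to (near) zero
```

Coordinate descent is the other common solver for this objective (it is what scikit-learn's `Lasso` uses).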
Effect of λ
The Principle of Maximum Likelihood
- Suppose we have N data points X = {x_1, x_2, ..., x_N} (or {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)})
- Suppose we know the probability distribution function that describes the data, p(x; θ) (or p(y | x; θ))
- Suppose we want to determine the parameter(s) θ
- Pick θ so as to explain your data best. What does this mean?
- Suppose we had two parameter values (or vectors) θ_1 and θ_2. Now suppose you were to pretend that θ_1 was really the true value parameterizing p. What would be the probability of getting the dataset that you have? Call this P_1
- If P_1 is very small, such a dataset is very unlikely to occur, so perhaps θ_1 was not a good guess
The Principle of Maximum Likelihood
- We want to pick θ_ML, i.e. the value of θ that best explains the data you have
- The plausibility of the given data is measured by the likelihood function p(X; θ)
- The maximum likelihood principle thus suggests we pick the θ that maximizes the likelihood function
- The procedure:
  1. Write the log likelihood function log p(X; θ) (we'll see later why log)
  2. We want to maximize, so differentiate log p(X; θ) with respect to θ and set to zero
  3. Solve for the θ that satisfies the equation. This is θ_ML
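A concrete instance of this recipe (an illustration with made-up data, not from the slides): for i.i.d. Gaussian samples with known variance, differentiating the log likelihood and setting it to zero gives θ_ML equal to the sample mean, which we can confirm numerically by scanning the log likelihood:

```python
import numpy as np

def gaussian_log_likelihood(theta, x, sigma=1.0):
    """log p(X; theta) for i.i.d. samples x_i ~ N(theta, sigma^2)."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - (x - theta) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(3)
x = rng.normal(loc=4.0, scale=1.0, size=200)

# Scan candidate values of theta; the maximizer should sit at the sample mean.
grid = np.linspace(2.0, 6.0, 4001)
lls = np.array([gaussian_log_likelihood(t, x) for t in grid])
theta_ml = grid[np.argmax(lls)]
print(theta_ml, x.mean())   # agree to within the grid spacing
```

The grid search stands in for step 2 of the procedure; carrying out the derivative analytically gives θ_ML = (1/N) ∑_i x_i directly.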
The Principle of Maximum Likelihood
- As an aside: sometimes we have an initial guess for θ BEFORE seeing the data. We then use the data to refine our guess of θ using Bayes' theorem. This is called MAP (maximum a posteriori) estimation (we'll see an example)
- Advantages of ML estimation:
  - A cookbook, "turn the crank" method
  - Optimal for large data sizes
- Disadvantages of ML estimation:
  - Not optimal for small sample sizes
  - Can be computationally challenging (numerical methods)
Linear Classifiers
  ŷ = h(x) = sign(θ_0 + θ^T x)
- We need to find the (direction) θ and the (location) θ_0
- We want to minimize the expected 0/1 loss for the classifier h : X → Y:
  L(h(x), y) = 0 if h(x) = y, and 1 if h(x) ≠ y
Risk of a Classifier
The risk (expected loss) of a C-way classifier h(x):
  R(h) = E_{x,y}[L(h(x), y)]
       = ∫_x ∑_{c=1}^C L(h(x), c) p(x, y = c) dx
       = ∫_x [ ∑_{c=1}^C L(h(x), c) p(y = c | x) ] p(x) dx
Clearly, it suffices to minimize the conditional risk:
  R(h | x) = ∑_{c=1}^C L(h(x), c) p(y = c | x)
Conditional Risk of a Classifier
  R(h | x) = ∑_{c=1}^C L(h(x), c) p(y = c | x)
           = 0 · p(y = h(x) | x) + ∑_{c ≠ h(x)} 1 · p(y = c | x)
           = ∑_{c ≠ h(x)} p(y = c | x) = 1 − p(y = h(x) | x)
To minimize the conditional risk given x, the classifier must decide
  h(x) = argmax_c p(y = c | x)
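The argmax rule can be verified directly on a toy posterior (the numbers below are made up): under 0/1 loss, the conditional risk of deciding class h is 1 − p(y = h | x), so the most probable class is optimal:

```python
import numpy as np

# A made-up posterior p(y = c | x) over C = 3 classes at some fixed x.
posterior = np.array([0.2, 0.5, 0.3])

# Conditional risk of deciding class h under 0/1 loss:
# sum over c != h of p(y = c | x), i.e. 1 - p(y = h | x).
cond_risk = 1.0 - posterior
print(cond_risk)             # [0.8, 0.5, 0.7]
print(np.argmin(cond_risk))  # 1, which is exactly argmax of the posterior
```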
Log Odds Ratio
The optimal rule h(x) = argmax_c p(y = c | x) is equivalent to:
  h(x) = c  ⟺  p(y = c | x) / p(y = c' | x) ≥ 1 for all c'
  h(x) = c  ⟺  log [ p(y = c | x) / p(y = c' | x) ] ≥ 0 for all c'
For the binary case:
  h(x) = 1  ⟺  log [ p(y = 1 | x) / p(y = 0 | x) ] ≥ 0
The Logistic Model
The unknown decision boundary can be modeled directly through the log odds:
  log [ p(y = 1 | x) / p(y = 0 | x) ] = θ_0 + θ^T x,  with decision boundary θ_0 + θ^T x = 0
Since p(y = 0 | x) = 1 − p(y = 1 | x), exponentiating, we have:
  p(y = 1 | x) / (1 − p(y = 1 | x)) = exp(θ_0 + θ^T x)
  ⟹ 1 / p(y = 1 | x) = 1 + exp(−θ_0 − θ^T x)
  ⟹ p(y = 1 | x) = 1 / (1 + exp(−θ_0 − θ^T x))
The Logistic Function
  p(y = 1 | x) = σ(θ_0 + θ^T x) = 1 / (1 + exp(−θ_0 − θ^T x))
Properties?
With the linear logistic model we get a linear decision boundary
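A quick numerical sketch of the logistic function and the linear boundary it induces (the parameter values are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    """The logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

theta0, theta = -1.0, np.array([2.0, -1.0])   # illustrative parameters

def p_y1(x):
    return sigmoid(theta0 + theta @ x)

# Properties: values in (0, 1); sigma(0) = 1/2; sigma(-z) = 1 - sigma(z).
print(sigmoid(0.0))                   # 0.5
# The decision boundary p(y=1|x) = 1/2 is the hyperplane theta0 + theta^T x = 0:
x_on_boundary = np.array([0.5, 0.0])  # theta0 + theta^T x = -1 + 1 = 0
print(p_y1(x_on_boundary))            # 0.5
```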
Likelihood under the Logistic Model
  p(y_i | x_i; θ) = σ(θ_0 + θ^T x_i) if y_i = 1;  1 − σ(θ_0 + θ^T x_i) if y_i = 0
We can rewrite this as:
  p(y_i | x_i; θ) = σ(θ_0 + θ^T x_i)^{y_i} (1 − σ(θ_0 + θ^T x_i))^{1 − y_i}
The log-likelihood of θ:
  log p(y | X; θ) = ∑_{i=1}^N log p(y_i | x_i; θ)
                  = ∑_{i=1}^N [ y_i log σ(θ_0 + θ^T x_i) + (1 − y_i) log(1 − σ(θ_0 + θ^T x_i)) ]
155 The Maximum Likelihood Solution log p(y | X; θ) = Σ_{i=1}^{N} [ y_i log σ(θ_0 + θ^T x_i) + (1 − y_i) log(1 − σ(θ_0 + θ^T x_i)) ]
156 The Maximum Likelihood Solution log p(y | X; θ) = Σ_{i=1}^{N} [ y_i log σ(θ_0 + θ^T x_i) + (1 − y_i) log(1 − σ(θ_0 + θ^T x_i)) ] Setting derivatives to zero:
157 The Maximum Likelihood Solution Setting derivatives to zero: ∂ log p(y | X; θ) / ∂θ_0 = Σ_{i=1}^{N} (y_i − σ(θ_0 + θ^T x_i)) = 0 and ∂ log p(y | X; θ) / ∂θ_j = Σ_{i=1}^{N} (y_i − σ(θ_0 + θ^T x_i)) x_{i,j} = 0
158 The Maximum Likelihood Solution Setting derivatives to zero: ∂ log p(y | X; θ) / ∂θ_0 = Σ_{i=1}^{N} (y_i − σ(θ_0 + θ^T x_i)) = 0 and ∂ log p(y | X; θ) / ∂θ_j = Σ_{i=1}^{N} (y_i − σ(θ_0 + θ^T x_i)) x_{i,j} = 0 Can treat y_i − p(y = 1 | x_i) = y_i − σ(θ_0 + θ^T x_i) as the prediction error
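A minimal sketch of the gradient expressions above, on made-up data; note that the per-example error y_i − σ(θ_0 + θ^T x_i) appears in both components:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(theta0, theta, X, y):
    """Gradient of the log-likelihood w.r.t. (theta0, theta)."""
    err = y - sigmoid(theta0 + X @ theta)  # per-example prediction error
    return err.sum(), X.T @ err            # d/dtheta0 and d/dtheta_j

# Illustrative data and parameters; at the MLE both components would be zero.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
g0, g = gradient(-1.5, np.array([1.0]), X, y)
print(g0, g)   # g0 is ~0 here because this toy dataset is symmetric
```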
159 Finding Maxima No closed form solution for the maximum likelihood estimate in this model!
160 Finding Maxima No closed form solution for the maximum likelihood estimate in this model! But log p(y | X; θ) is jointly concave in all components of θ
161 Finding Maxima No closed form solution for the maximum likelihood estimate in this model! But log p(y | X; θ) is jointly concave in all components of θ Or, equivalently, the error is convex
162 Finding Maxima No closed form solution for the maximum likelihood estimate in this model! But log p(y | X; θ) is jointly concave in all components of θ Or, equivalently, the error is convex Gradient descent/ascent (descent on − log p(y | X; θ), the log loss)
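Since there is no closed form, the parameters are fit iteratively. A minimal batch gradient-ascent sketch on toy data (the dataset, step size, and iteration count are arbitrary illustrative choices, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Batch gradient ascent on the log-likelihood (equivalently, descent on the
# log loss) for a 1-D toy problem.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta0, theta, lr = 0.0, np.zeros(1), 0.1

for _ in range(5000):
    err = y - sigmoid(theta0 + X @ theta)  # y_i - p(y = 1 | x_i)
    theta0 += lr * err.sum()               # ascend along d/dtheta0
    theta += lr * (X.T @ err)              # ascend along d/dtheta_j

# The fitted model should separate the two classes (boundary near x = 1.5).
preds = (sigmoid(theta0 + X @ theta) >= 0.5).astype(int)
print(preds.tolist())   # [0, 0, 1, 1]
```

On linearly separable data like this toy set the likelihood has no finite maximizer (θ grows without bound), so in practice one adds regularization or stops early; the fixed iteration count here plays that role.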
163 Next time Feedforward Networks
164 Next time Feedforward Networks Backpropagation