Recap from previous lecture


Learning is using past experience to improve future performance.

Different types of learning: supervised, unsupervised, reinforcement, active, online, ...

For a machine, experience is encoded in data.

Supervised learning: example

Goal: teach a machine to distinguish handwritten 4's from handwritten 9's.

Desired outcome: a rule (function) for deciding whether a given pattern of black-and-white pixels is a 4 or a 9:
$$x \in \{0,1\}^{100} \;\mapsto\; y \in \{-1, +1\}$$

How to get there? Assuming that the training data are representative of a stable probabilistic relationship between the input ($x$) and the output ($y$), do what you would do if you actually knew this relationship and could find an optimal predictor.

Figure taken from G. Shakhnarovich.

Statistical decision theory: general framework

Given:
- $X$: input (or feature)
- $Y$: output (or label)
- $P_{XY}$: joint probability distribution

Goal: find a function $f$ that predicts $Y$ given the value of $X$.

Evaluating predictors: a loss function $\ell(Y, f(X))$ = loss (or penalty) for predicting the true $Y$ with $\hat{Y} = f(X)$.

Expected prediction error of a candidate predictor $f$:
$$E(f) = \mathbb{E}[\ell(Y, f(X))] = \int \ell(y, f(x))\, P_{XY}(dx, dy)$$

Optimal predictor:
$$f^* = \arg\min_f E(f), \qquad E^* \triangleq E(f^*)$$
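As a concrete illustration (not from the slides): whenever we can sample from $P_{XY}$, the expected prediction error of any candidate $f$ can be approximated by Monte Carlo. The joint distribution and the two candidate predictors below are made-up examples.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pxy(n):
    """Draw n samples from an assumed joint distribution P_XY (illustrative only)."""
    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(scale=0.5, size=n)   # Y = 2X + noise
    return x, y

def squared_loss(y, yhat):
    return (y - yhat) ** 2

def expected_error(f, n=100_000):
    """Monte Carlo approximation of E(f) = E[l(Y, f(X))]."""
    x, y = sample_pxy(n)
    return squared_loss(y, f(x)).mean()

print(expected_error(lambda x: 2.0 * x))   # near the noise variance 0.25
print(expected_error(lambda x: 0.0 * x))   # much larger
```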

Statistical decision theory: classification

Given:
- $X \in \mathbb{R}^p$: feature
- $Y \in \{0, 1\}$: label
- $P_{XY}$: joint probability distribution

Goal: find a function $f : \mathbb{R}^p \to \{0, 1\}$ that accurately predicts the label $Y$ given the feature $X$.

Evaluating predictors: a loss function $\ell(Y, f(X))$; for example, the 0-1 loss
$$\ell(Y, f(X)) = \mathbf{1}\{Y \neq f(X)\}$$

Expected prediction error of a candidate predictor $f$:
$$E(f) = \mathbb{P}[Y \neq f(X)] = \int \mathbf{1}\{y \neq f(x)\}\, P_{XY}(dx, dy)$$

Minimizing expected classification error

How to pick $f$ to minimize $E(f)$? Conditioning:
$$E(f) = \int \mathbf{1}\{y \neq f(x)\}\, P_{XY}(dx, dy) = \int \left( \int \mathbf{1}\{y \neq f(x)\}\, P_{Y|X}(dy \mid x) \right) P_X(dx)$$
$$= 1 - \int \left( \int \mathbf{1}\{y = f(x)\}\, P_{Y|X}(dy \mid x) \right) P_X(dx)$$
$$= 1 - \int \Big( \mathbb{P}[Y = 0 \mid X = x]\, \mathbf{1}\{f(x) = 0\} + \mathbb{P}[Y = 1 \mid X = x]\, \mathbf{1}\{f(x) = 1\} \Big)\, P_X(dx)$$

Maximize the inner integral for each fixed $x$ by setting
$$f^*(x) = \arg\max_{y \in \{0,1\}} \mathbb{P}[Y = y \mid X = x]$$

This is known as the Bayes classifier: classify to the most probable label given the observation $X = x$. $E(f^*)$ is known as the Bayes rate.

The Bayes classifier

Classify to the most probable label:
$$f^*(x) = \arg\max_{y \in \{0,1\}} \mathbb{P}[Y = y \mid X = x]$$

Define the regression function $\eta(x) = \mathbb{P}[Y = 1 \mid X = x]$. This leads to the following:

Bayes classifier:
$$f^*(x) = \begin{cases} 1, & \text{if } \eta(x) \ge 1/2 \\ 0, & \text{if } \eta(x) < 1/2 \end{cases}
\qquad
E^* = \int \frac{1 - |1 - 2\eta(x)|}{2}\, P_X(dx)$$

(Figure: plot of $\eta(x)$ against $x$ with the threshold $1/2$ marked.)
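To make the Bayes classifier concrete, here is a small sketch under an assumed toy model of my own (equal class priors, unit-variance Gaussian class conditionals at $\pm 1$), not a model from the lecture: $\eta(x)$ is computed in closed form, thresholded at $1/2$, and the Bayes rate is estimated by Monte Carlo.

```python
import numpy as np

def gauss_pdf(x, mu):
    """Unit-variance Gaussian density (component of the assumed toy model)."""
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)

def eta(x):
    """eta(x) = P[Y=1 | X=x] under the toy model:
    P[Y=1] = P[Y=0] = 1/2, X|Y=0 ~ N(-1,1), X|Y=1 ~ N(+1,1)."""
    p1 = gauss_pdf(x, +1.0)
    p0 = gauss_pdf(x, -1.0)
    return p1 / (p0 + p1)

def bayes_classifier(x):
    """f*(x) = 1 if eta(x) >= 1/2, else 0."""
    return (eta(x) >= 0.5).astype(int)

# Monte Carlo estimate of the Bayes rate E* = P[Y != f*(X)].
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200_000)
x = rng.normal(loc=2.0 * y - 1.0, scale=1.0)
print((bayes_classifier(x) != y).mean())   # about 0.16 for this toy model
```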

Statistical decision theory: regression

Given:
- $X \in \mathbb{R}^p$: input variable
- $Y \in \mathbb{R}$: output variable
- $P_{XY}$: joint probability distribution

Goal: find a function $f : \mathbb{R}^p \to \mathbb{R}$ that predicts $Y$ given the value of $X$.

Evaluating predictors: a loss function $\ell(Y, f(X))$; for example, the squared error loss
$$\ell(Y, f(X)) = (Y - f(X))^2$$

Expected prediction error of a candidate predictor $f$:
$$E(f) = \mathbb{E}\big[(Y - f(X))^2\big] = \int (y - f(x))^2\, P_{XY}(dx, dy)$$

Minimizing mean squared error

How to pick $f$ to minimize $E(f)$? Conditioning:
$$E(f) = \int (y - f(x))^2\, P_{XY}(dx, dy) = \int \underbrace{\left( \int (y - f(x))^2\, P_{Y|X}(dy \mid x) \right)}_{\mathbb{E}[(Y - f(X))^2 \mid X = x]} P_X(dx)$$

Minimize the inner integral for each fixed $x$ by setting
$$f^*(x) = \arg\min_c \mathbb{E}[(Y - c)^2 \mid X = x]$$

The solution is the regression function:
$$f^*(x) = \mathbb{E}[Y \mid X = x]$$
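A quick numerical sanity check (my own toy example, not from the slides): for a fixed $x$, sweep the constant $c$ over a grid and verify that the conditional mean minimizes $\mathbb{E}[(Y - c)^2 \mid X = x]$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy model: Y | X = x  ~  N(sin(x), 0.3^2); condition on a fixed x0.
x0 = 1.2
y = rng.normal(loc=np.sin(x0), scale=0.3, size=200_000)

grid = np.linspace(-1.0, 2.0, 601)
cond_mse = [(np.mean((y - c) ** 2), c) for c in grid]
best_mse, best_c = min(cond_mse)
print(best_c, np.sin(x0))   # the minimizing c is close to E[Y | X = x0] = sin(x0)
```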

Proof that regression function is optimal

For any predictor $f$, write
$$E(f) = \mathbb{E}[(Y - f(X))^2] = \mathbb{E}[(Y - f^*(X) + f^*(X) - f(X))^2]$$
$$= \mathbb{E}[(Y - f^*(X))^2] + 2\, \mathbb{E}[(Y - f^*(X))(f^*(X) - f(X))] + \mathbb{E}[(f(X) - f^*(X))^2]$$

We now show that the middle term is zero:
$$\mathbb{E}[(Y - f^*(X))(f^*(X) - f(X))]
= \mathbb{E}\Big[ \mathbb{E}\big[(Y - f^*(X))(f^*(X) - f(X)) \mid X\big] \Big]$$
$$= \mathbb{E}\Big[ \mathbb{E}[Y - f^*(X) \mid X]\, (f^*(X) - f(X)) \Big]
= \mathbb{E}\Big[ \big(\underbrace{\mathbb{E}[Y \mid X]}_{f^*(X)} - f^*(X)\big)\, (f^*(X) - f(X)) \Big] = 0$$

Therefore, for any $f$ we have
$$E(f) = E(f^*) + \mathbb{E}[(f(X) - f^*(X))^2] \ge E(f^*),$$
and the term $\mathbb{E}[(f(X) - f^*(X))^2] = 0$ if and only if $f = f^*$.

Restricting the predictor

Sometimes, even if we know the distribution $P_{XY}$, computing the optimal regression function $f^*$ may not be worth it. In such cases, we may want to settle for the best predictor in some restricted class. This is one way to introduce inductive bias.

Example: best linear predictor. Suppose we are only interested in predictors of the form
$$f(x) = \beta^T x = \sum_{i=1}^p \beta(i)\, x(i),$$
where $\beta \in \mathbb{R}^p$ is some fixed vector of weights. Then
$$E(f) = E(\beta) = \mathbb{E}[(Y - \beta^T X)^2] = \mathbb{E}[Y^2] - 2\beta^T \mathbb{E}[YX] + \mathbb{E}[(\beta^T X)^2].$$
Optimize by differentiating with respect to $\beta$ to get the optimal linear predictor:
$$\beta^* = \big(\mathbb{E}[XX^T]\big)^{-1}\, \mathbb{E}[XY]$$
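Since the population moments $\mathbb{E}[XX^T]$ and $\mathbb{E}[XY]$ are rarely available in closed form, the sketch below replaces them with sample averages; the data-generating process is an assumption of mine, used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed data-generating process (illustrative): Y = <beta_true, X> + noise.
n, p = 50_000, 3
beta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, p))
Y = X @ beta_true + rng.normal(scale=0.1, size=n)

# beta* = (E[X X^T])^{-1} E[X Y], with expectations replaced by sample averages.
Exx = (X.T @ X) / n
Exy = (X.T @ Y) / n
beta_star = np.linalg.solve(Exx, Exy)
print(beta_star)   # close to beta_true
```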

Decomposition of the expected prediction error

What happens when we use restricted predictors? Let $\mathcal{F}$ be a restricted class of candidate predictors (e.g., linear or quadratic ones); define the best prediction error over $\mathcal{F}$:
$$E_{\mathcal{F}} = \min_{f \in \mathcal{F}} E(f).$$
Then for any $f \in \mathcal{F}$ we have
$$E(f) = E(f) - E^* + E^* = \underbrace{E(f) - E_{\mathcal{F}}}_{=T_1} + \underbrace{E_{\mathcal{F}} - E^*}_{=T_2} + E^*$$
where:
- $T_1$ can be set to zero by letting $f$ achieve the minimum $\min_{f \in \mathcal{F}} E(f)$
- $T_2$ is large or small depending on how well the predictors in $\mathcal{F}$ can do

Statistical learning: unknown $P_{XY}$

To compute the best predictor $f^*$, you need to know the underlying distribution $P_{XY}$! What should we do when we don't know $P_{XY}$?

Empirical risk minimization: if we can draw a large number $n$ of independent samples $(X_1, Y_1), \ldots, (X_n, Y_n)$ from $P_{XY}$, then we can find the $f$ that minimizes the empirical prediction error
$$\hat{E}(f) = \frac{1}{n} \sum_{i=1}^n \ell(Y_i, f(X_i))$$

Why does it work? Without getting into details, the reason is the Law of Large Numbers: for any $f$,
$$\hat{E}(f) \approx E(f) \quad \text{with high probability}$$
It can then be proved that if $\hat{f}$ minimizes the empirical prediction error, then
$$E(\hat{f}) \approx E(f^*) \quad \text{with high probability},$$
where $f^*$ is the true optimal predictor (for $P_{XY}$).
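To make empirical risk minimization concrete, here is a sketch that minimizes the empirical 0-1 loss over a small finite class of one-dimensional threshold classifiers; both the class and the data are illustrations of mine, not the lecture's example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed training data: one-dimensional X, labels Y in {0, 1}.
n = 1000
X = rng.normal(size=n)
Y = (X + rng.normal(scale=0.5, size=n) > 0).astype(int)

def empirical_error(f, X, Y):
    """hat{E}(f) = (1/n) * sum_i 1{Y_i != f(X_i)} for the 0-1 loss."""
    return np.mean(f(X) != Y)

# A small finite class F of threshold classifiers f_t(x) = 1{x >= t}.
thresholds = np.linspace(-2, 2, 81)
errors = [empirical_error(lambda x, t=t: (x >= t).astype(int), X, Y)
          for t in thresholds]
t_hat = thresholds[int(np.argmin(errors))]
print(t_hat, min(errors))   # the chosen threshold should be near 0
```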

Example: least-squares linear regression

Given training data $(X_1, Y_1), \ldots, (X_n, Y_n)$, find $\hat{\beta} \in \mathbb{R}^p$ to minimize
$$\hat{E}(\beta) = \frac{1}{n} \sum_{i=1}^n (Y_i - \beta^T X_i)^2$$

Least-squares linear regression: define the $n \times p$ design matrix $\mathbf{X} = [X_{ij}]$, where $X_{ij}$ is the $j$th component of $X_i$, and the response vector $\mathbf{Y} = [Y_1\; Y_2\; \ldots\; Y_n]^T$. Then
$$\hat{E}(\beta) = \frac{1}{n} (\mathbf{Y} - \mathbf{X}\beta)^T (\mathbf{Y} - \mathbf{X}\beta)$$
Differentiate to get the normal equations:
$$\mathbf{X}^T(\mathbf{Y} - \mathbf{X}\beta) = 0$$
If $\mathbf{X}^T\mathbf{X}$ is invertible, solve for the optimum:
$$\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$$

Figure taken from Hastie, Tibshirani and Friedman, 2nd ed.
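The closed-form solution translates directly into a few lines of numpy. The normal-equations version below mirrors the formula; `np.linalg.lstsq` is shown as the numerically preferable alternative. The synthetic data are my own example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data (illustrative): n samples, p features.
n, p = 200, 4
X = rng.normal(size=(n, p))            # n x p design matrix
beta_true = rng.normal(size=p)
Y = X @ beta_true + rng.normal(scale=0.2, size=n)

# Normal equations: X^T (Y - X beta) = 0  =>  beta_hat = (X^T X)^{-1} X^T Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Equivalent, more numerically stable route:
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))   # True
```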

Classification using linear regression

If we have training data $(X_1, Y_1), \ldots, (X_n, Y_n)$, where $X_i \in \mathbb{R}^p$ are features and $Y_i \in \{0, 1\}$ are binary labels, we can attempt to find the best linear classifier.

Classification via regression: find the least-squares linear regressor
$$\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$$
and use it to build the classifier
$$\hat{Y}(x) = \begin{cases} 1, & \text{if } \hat{\beta}^T x \ge 1/2 \\ 0, & \text{if } \hat{\beta}^T x < 1/2 \end{cases}$$

Figure taken from Hastie, Tibshirani and Friedman, 2nd ed.

This will work well if the two classes are linearly separable (or almost so).
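A minimal sketch of the classify-via-regression recipe on roughly linearly separable synthetic data of my own making; I also add an intercept column, which the slides leave implicit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two roughly linearly separable classes in R^2 (illustrative data).
n = 500
Y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 2)) + 3.0 * np.column_stack([Y, Y])
Xb = np.column_stack([np.ones(n), X])            # add an intercept column

beta_hat = np.linalg.solve(Xb.T @ Xb, Xb.T @ Y)  # least-squares fit to 0/1 labels

def classify(x_new):
    """hat{Y}(x) = 1 if beta_hat^T x >= 1/2, else 0."""
    xb = np.column_stack([np.ones(len(x_new)), x_new])
    return (xb @ beta_hat >= 0.5).astype(int)

print(np.mean(classify(X) == Y))   # training accuracy; high for separable data
```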

Nearest-neighbor methods

Nearest-neighbor methods use the training samples closest to the test point $x$ to form $\hat{Y}$. These methods work well if the regression function is continuous.

$k$-nearest-neighbor regression:
$$\hat{Y}(x) = \frac{1}{k} \sum_{X_i \in N_k(x)} Y_i$$

$k$-NN classification:
$$\hat{Y}(x) = \begin{cases} 1, & \text{if } \sum_{X_i \in N_k(x)} Y_i \ge k/2 \\ 0, & \text{if } \sum_{X_i \in N_k(x)} Y_i < k/2 \end{cases}$$

$N_k(x)$: the set of $k$ nearest neighbors of $x$ in the training sample.

Figure taken from Hastie, Tibshirani and Friedman, 2nd ed.
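A direct numpy transcription of the two rules above, using brute-force distance computation (fine for small samples); the toy data at the bottom are my own illustration.

```python
import numpy as np

def knn_regress(x, X_train, Y_train, k):
    """hat{Y}(x) = (1/k) * sum of Y_i over the k nearest neighbors of x."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(dists)[:k]              # indices of N_k(x)
    return Y_train[nn].mean()

def knn_classify(x, X_train, Y_train, k):
    """Majority vote: 1 if the neighbor labels sum to at least k/2, else 0."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(dists)[:k]
    return int(Y_train[nn].sum() >= k / 2)

# Toy usage
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
Y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
print(knn_classify(np.array([0.5, 0.5]), X_train, Y_train, k=15))   # likely 1
```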

Issues and pitfalls

Curse of dimensionality: local methods (such as $k$-nearest neighbors) require fairly dense training samples, which are hard to obtain when the problem dimension is very high.

Bias-variance trade-off: using more complex models may lead to overfitting; using models that are overly simple may lead to underfitting. In both cases, we will have poor generalization.

The Curse of Dimensionality

For nearest-neighbor methods, we need to capture a fixed fraction of the data (say, 10%) to form local averages. In high dimension, the data will be spread very thinly, and we will need to use almost the entire sample, which would conflict with the local nature of the method.

Figure taken from Hastie, Tibshirani and Friedman, 2nd ed.

Bias and variance

Suppose the true model is $Y = f^*(X) + \epsilon$, where $\epsilon$ is noise, which is zero-mean ($\mathbb{E}\,\epsilon = 0$) and independent of $X$. Let's see how well a predictor $\hat{f}$ learned from a training sample $T = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ does on an independent test sample $(X, Y)$:
$$\mathbb{E}_T\big[(Y - \hat{f}(X))^2\big] = \mathbb{E}_T\big[(Y - \mathbb{E}_T\hat{f}(X) + \mathbb{E}_T\hat{f}(X) - \hat{f}(X))^2\big]$$
$$= \mathbb{E}_T\big[(Y - \mathbb{E}_T\hat{f}(X))^2\big] + \underbrace{2\, \mathbb{E}_T\big[(Y - \mathbb{E}_T\hat{f}(X))(\mathbb{E}_T\hat{f}(X) - \hat{f}(X))\big]}_{=0} + \mathbb{E}_T\big[(\hat{f}(X) - \mathbb{E}_T\hat{f}(X))^2\big]$$
$$= \big[Y - \mathbb{E}_T\hat{f}(X)\big]^2 + \mathbb{E}_T\big[(\hat{f}(X) - \mathbb{E}_T\hat{f}(X))^2\big]$$
$$= O(\epsilon) + \underbrace{\big[f^*(X) - \mathbb{E}_T\hat{f}(X)\big]^2}_{=\text{Bias}^2} + \underbrace{\mathbb{E}_T\big[(\hat{f}(X) - \mathbb{E}_T\hat{f}(X))^2\big]}_{=\text{Variance}}$$

Bias: how well $\hat{f}$ approximates the true regression function $f^*$.
Variance: how much $\hat{f}(X)$ wiggles around its mean.
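The decomposition can be checked by simulation: fix a test point, redraw the training sample $T$ many times, refit, and average. The rough sketch below uses a toy model of my own, with $k$-NN regression standing in for $\hat{f}$ at two values of $k$; it is an illustration, not the lecture's experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
f_star = np.sin                      # assumed true regression function
sigma = 0.3                          # noise standard deviation
x_test = 1.0                         # fixed test point

def fit_knn_predict(k, n=50):
    """Draw a fresh training set T, fit k-NN regression, predict at x_test."""
    X = rng.uniform(-3, 3, size=n)
    Y = f_star(X) + rng.normal(scale=sigma, size=n)
    nn = np.argsort(np.abs(X - x_test))[:k]
    return Y[nn].mean()

for k in (1, 25):
    preds = np.array([fit_knn_predict(k) for _ in range(2000)])
    bias2 = (f_star(x_test) - preds.mean()) ** 2
    var = preds.var()
    print(f"k={k:2d}  bias^2={bias2:.4f}  variance={var:.4f}")
# Typically: k=1 has small bias but large variance; k=25 the reverse.
```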

Bias-variance trade-off

There is a trade-off between the bias and the variance:
- Complex predictors will generally have low bias and high variance
- Simple predictors will generally have high bias and low variance

Figure taken from Hastie, Tibshirani and Friedman, 2nd ed.

Regularization

One way to mitigate the bias-variance trade-off is to explicitly penalize predictors that are too complex. Instead of finding the minimizer of the empirical prediction error
$$\hat{E}(f) = \frac{1}{n} \sum_{i=1}^n \ell(Y_i, f(X_i)),$$
minimize the complexity-regularized empirical prediction error
$$\hat{E}_{\mathrm{cr}}(f) = \hat{E}(f) + \lambda J(f) = \frac{1}{n} \sum_{i=1}^n \ell(Y_i, f(X_i)) + \lambda J(f),$$
where:
- The regularizer $J(f)$ is high for complicated predictors $f$ and low for simple predictors $f$
- The regularization constant $\lambda > 0$ controls the balance between the data-fit term $\hat{E}(f)$ and the complexity term $J(f)$

Examples of regularization

Roughness penalty: $\mathcal{F}$ is the class of twice-differentiable functions, and
$$J(f) = \int f''(x)^2\, dx$$
This penalizes functions that wiggle around too much. The complexity-regularized solution is the smoothing spline.

Subset selection: $\mathcal{F}$ is the class of linear regression functions $f(x) = \beta^T x$, and
$$J(\beta) = \|\beta\|_0 = \text{the number of nonzero components of } \beta$$
This is effective, but may be computationally expensive.

$\ell_1$-penalty: $\mathcal{F}$ is again the class of linear regression functions, and
$$J(\beta) = \|\beta\|_1 = \sum_{i=1}^p |\beta(i)|$$
This often serves as a reasonably tractable relaxation of the subset-selection penalty. This approach is known as the LASSO, and it is very popular right now.
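For the $\ell_1$ case, one standard way to minimize the penalized objective $\frac{1}{n}\|\mathbf{Y} - \mathbf{X}\beta\|^2 + \lambda\|\beta\|_1$ is proximal gradient descent with soft-thresholding (ISTA). This is a sketch of my own, using a made-up sparse problem; it is not the algorithm discussed in the lecture.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (componentwise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_proximal_gradient(X, Y, lam, n_iters=5000):
    """Minimize (1/n)||Y - X beta||^2 + lam * ||beta||_1 via ISTA."""
    n, p = X.shape
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2 / n)   # 1 / Lipschitz constant
    beta = np.zeros(p)
    for _ in range(n_iters):
        grad = -2.0 / n * X.T @ (Y - X @ beta)
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Toy usage: sparse beta_true, many irrelevant features.
rng = np.random.default_rng(0)
n, p = 200, 20
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
X = rng.normal(size=(n, p))
Y = X @ beta_true + rng.normal(scale=0.5, size=n)
print(np.round(lasso_proximal_gradient(X, Y, lam=0.2), 2))  # mostly zeros
```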

Next lecture: a sneak preview

Next time, we will focus on linear methods for classification:
- Logistic regression
- Separating hyperplanes and the Perceptron
