Linear Regression (continued)


1 Linear Regression (continued). Professor Ameet Talwalkar, CS260 Machine Learning Algorithms, February 6, 2017.

2 Outline: 1 Administration, 2 Review of last lecture, 3 Linear regression, 4 Nonlinear basis functions.

3 Announcements: HW2 will be returned in section on Friday. HW3 is due in class next Monday. The midterm is next Wednesday (we will review in more detail next class).

4 Outline: 1 Administration, 2 Review of last lecture (Perceptron; Linear regression introduction), 3 Linear regression, 4 Nonlinear basis functions.

5 Perceptron. Main idea: consider a linear model for binary classification, w^T x, used to distinguish between two classes {-1, +1}. The goal is to minimize the number of errors on the training dataset, ε = Σ_n I[y_n ≠ sign(w^T x_n)].

6-8 Hard, but easy if we have only one training example. How can we change w such that y_n = sign(w^T x_n)? Two cases: if y_n = sign(w^T x_n), do nothing; if y_n ≠ sign(w^T x_n), set w_new ← w_old + y_n x_n. This is guaranteed to make progress, i.e., to get us closer to y_n (w^T x_n) > 0. What does the update do? y_n [(w + y_n x_n)^T x_n] = y_n w^T x_n + y_n^2 x_n^T x_n. We are adding a positive number, so it is possible that y_n (w_new^T x_n) > 0.

9-10 Perceptron algorithm: iteratively solving one case at a time. REPEAT: pick a data point x_n (can be a fixed order over the training instances); make a prediction y = sign(w^T x_n) using the current w; if y = y_n, do nothing, else w ← w + y_n x_n. UNTIL converged. Properties: this is an online algorithm; if the training data is linearly separable, the algorithm stops in a finite number of steps (we proved this); the parameter vector is always a linear combination of training instances (requires initialization w^(0) = 0). A sketch of this loop appears below.
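
The following is a minimal NumPy sketch of the perceptron loop above; the toy dataset and the fixed pass order are illustrative assumptions, not from the slides.

```python
import numpy as np

def perceptron(X, y, max_passes=100):
    """Perceptron: cycle through examples, update w on each mistake.

    X: (N, D) feature matrix, y: (N,) labels in {-1, +1}.
    w is initialized to zero, so it stays a linear combination of
    training instances, as noted on the slide.
    """
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(max_passes):
        mistakes = 0
        for n in range(N):                    # fixed order over training instances
            if np.sign(w @ X[n]) != y[n]:     # misclassified (sign(0) counts as a mistake since y is +/-1)
                w = w + y[n] * X[n]           # w <- w + y_n x_n
                mistakes += 1
        if mistakes == 0:                     # converged: a full pass with no mistakes
            break
    return w

# toy linearly separable data (illustrative)
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(w, np.sign(X @ w))
```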

11-13 Regression: predicting a continuous outcome variable, e.g., predicting shoe size from height, weight, and gender, or predicting song year from audio features. Key difference from classification: we can measure the closeness of predictions and labels. Predicting shoe size: better to be off by one size than by 5 sizes. Predicting song year: better to be off by one year than by 20 years. As opposed to the 0-1 classification error, we will focus on the squared difference, i.e., (ŷ - y)^2.

14 1D example: predicting the sale price of a house. Sale price ≈ price per sqft × square footage + fixed expense.

15-16 Minimize squared errors. Our model: sale price = price per sqft × square footage + fixed expense + unexplainable stuff. Training data: a table of four example houses with columns sqft, sale price, prediction, error, and squared error, plus a Total row (most entries were garbled in transcription; one row has zero error). Aim: adjust price per sqft and fixed expense such that the sum of squared errors, i.e., the unexplainable stuff, is minimized.

17-21 Linear regression setup. Input: x ∈ R^D (covariates, predictors, features, etc). Output: y ∈ R (responses, targets, outcomes, outputs, etc). Model: f : x → y, with f(x) = w_0 + Σ_d w_d x_d = w_0 + w^T x. Here w = [w_1 w_2 ... w_D]^T is the vector of weights (parameters), and w_0 is called the bias; we also sometimes call the full vector [w_0 w_1 w_2 ... w_D]^T the parameters. Training data: D = {(x_n, y_n), n = 1, 2, ..., N}. Least Mean Squares (LMS) objective: minimize the squared difference on the training data (the residual sum of squares), RSS(w) = Σ_n [y_n - f(x_n)]^2 = Σ_n [y_n - (w_0 + Σ_d w_d x_nd)]^2. 1D solution: identify stationary points by taking derivatives with respect to the parameters and setting them to zero, yielding the normal equations; a small numerical example follows.
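
As a concrete instance of the 1D normal equations, here is a short NumPy sketch with made-up (sqft, price) numbers: setting the derivatives of RSS with respect to w_0 and w_1 to zero gives w_1 = Σ_n (x_n - x̄)(y_n - ȳ) / Σ_n (x_n - x̄)^2 and w_0 = ȳ - w_1 x̄.

```python
import numpy as np

# made-up 1D data: square footage -> sale price in thousands (illustrative only)
x = np.array([1000.0, 2000.0, 4000.0, 1500.0])
y = np.array([720.0, 800.0, 2600.0, 350.0])

x_bar, y_bar = x.mean(), y.mean()
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # price per sqft
w0 = y_bar - w1 * x_bar                                            # fixed expense
rss = np.sum((y - (w0 + w1 * x)) ** 2)
print(w0, w1, rss)
```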

22-23 Probabilistic interpretation. Noisy observation model: Y = w_0 + w_1 X + η, where η ~ N(0, σ^2) is a Gaussian random variable. Likelihood of one training sample (x_n, y_n): p(y_n | x_n) = N(w_0 + w_1 x_n, σ^2) = (1 / (sqrt(2π) σ)) exp(-[y_n - (w_0 + w_1 x_n)]^2 / (2σ^2)).

24-25 Maximum likelihood estimation. Maximizing over w_0 and w_1: max log P(D) ⇔ min Σ_n [y_n - (w_0 + w_1 x_n)]^2, which is exactly RSS(w)! Maximizing over s = σ^2: ∂ log P(D)/∂s = (1/2) { (1/s^2) Σ_n [y_n - (w_0 + w_1 x_n)]^2 - N (1/s) } = 0, which gives σ*^2 = s* = (1/N) Σ_n [y_n - (w_0 + w_1 x_n)]^2.
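
A brief sketch (on synthetic data, an assumption for illustration) of the equivalence above: the maximum-likelihood estimates of (w_0, w_1) are the least-squares estimates, and the MLE of the noise variance is the mean squared residual.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 1.5 + 0.8 * x + rng.normal(0, 0.5, size=200)   # true w0 = 1.5, w1 = 0.8, sigma = 0.5

# least-squares (= maximum-likelihood) estimates of w0 and w1
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

# MLE of the noise variance: mean squared residual
sigma2_hat = np.mean((y - (w0 + w1 * x)) ** 2)
print(w0, w1, sigma2_hat)    # should be close to 1.5, 0.8, 0.25
```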

26-27 How does this probabilistic interpretation help us? It gives a solid footing to our intuition: minimizing RSS(w) is a sensible thing to do under reasonable modeling assumptions. Estimating σ tells us how much noise there could be in our predictions; for example, it allows us to place confidence intervals around our predictions.

28 Outline: 1 Administration, 2 Review of last lecture, 3 Linear regression (Multivariate solution in matrix form; Computational and numerical optimization; Ridge regression), 4 Nonlinear basis functions.

29-33 LMS when x is D-dimensional: RSS(w) in matrix form. RSS(w) = Σ_n [y_n - (w_0 + Σ_d w_d x_nd)]^2 = Σ_n [y_n - w^T x_n]^2, where we have redefined some variables (by augmenting): x ← [1 x_1 x_2 ... x_D]^T, w ← [w_0 w_1 w_2 ... w_D]^T. This leads to RSS(w) = Σ_n (y_n - w^T x_n)(y_n - x_n^T w) = Σ_n (w^T x_n x_n^T w - 2 y_n x_n^T w) + const = w^T (Σ_n x_n x_n^T) w - 2 (Σ_n y_n x_n^T) w + const.

34-40 Matrix multiplication via inner products: each entry of the output matrix is the result of an inner product between a row of the first input matrix and a column of the second (illustrated entry by entry on these slides).

41-48 Matrix multiplication via outer products: the output matrix is the sum of outer products between the corresponding columns and rows of the input matrices (built up one outer product at a time on these slides).
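
A short NumPy check of the two views of matrix multiplication described on these slides (the matrices are arbitrary examples): entry (i, j) of AB is the inner product of row i of A with column j of B, and AB also equals the sum of outer products of the columns of A with the corresponding rows of B.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # 3 x 2
B = np.array([[7.0, 8.0], [9.0, 10.0]])              # 2 x 2

# entry-wise view: C[i, j] = <row i of A, column j of B>
C_inner = np.array([[A[i] @ B[:, j] for j in range(B.shape[1])]
                    for i in range(A.shape[0])])

# sum-of-outer-products view: C = sum_k (column k of A)(row k of B)
C_outer = sum(np.outer(A[:, k], B[k]) for k in range(A.shape[1]))

assert np.allclose(C_inner, A @ B) and np.allclose(C_outer, A @ B)
print(A @ B)
```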

49-52 RSS(w) in new notation. From the previous slide: RSS(w) = w^T (Σ_n x_n x_n^T) w - 2 (Σ_n y_n x_n^T) w + const. Design matrix and target vector: X = [x_1^T; x_2^T; ...; x_N^T] ∈ R^{N×(D+1)}, y = [y_1; y_2; ...; y_N]. Compact expression: RSS(w) = ||X w - y||_2^2 = w^T X^T X w - 2 (X^T y)^T w + const.

53-56 Solution in matrix form. Compact expression: RSS(w) = ||X w - y||_2^2 = w^T X^T X w - 2 (X^T y)^T w + const. Gradients of linear and quadratic functions: ∇_x (b^T x) = b, and ∇_x (x^T A x) = 2 A x for symmetric A. Normal equation: ∇_w RSS(w) ∝ X^T X w - X^T y = 0. This leads to the least-mean-squares (LMS) solution, w_LMS = (X^T X)^{-1} X^T y.
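
A minimal sketch of the closed-form LMS solution on synthetic data (an illustrative assumption); note that solving the normal equations with np.linalg.solve is preferable to forming an explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 100, 3
X = rng.normal(size=(N, D))
w_true = np.array([0.5, -1.0, 2.0, 0.3])              # [w0, w1, w2, w3]
y = w_true[0] + X @ w_true[1:] + 0.1 * rng.normal(size=N)

X_aug = np.hstack([np.ones((N, 1)), X])               # augment with a column of ones for the bias
# normal equations: (X^T X) w = X^T y
w_lms = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
print(w_lms)                                          # close to w_true
```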

57 Mini-summary. Linear regression models the output as a linear combination of the features: f : x → y, with f(x) = w_0 + Σ_d w_d x_d = w_0 + w^T x. If we minimize the residual sum of squares as our learning objective, we get a closed-form solution for the parameters. Probabilistic interpretation: maximum likelihood estimation, assuming the residual is Gaussian distributed.

58-60 Computational complexity. What is the bottleneck of computing the solution w = (X^T X)^{-1} X^T y? The matrix product X^T X ∈ R^{(D+1)×(D+1)} and inverting the matrix X^T X. How many operations do we need? O(N D^2) for the matrix multiplication, and O(D^3) (e.g., using Gauss-Jordan elimination) or roughly O(D^2.37) (recent theoretical advances) for the matrix inversion. This is impractical for very large D or N.

61-65 Alternative method: an example of using numerical optimization, (batch) gradient descent. Initialize w to w^(0) (e.g., randomly); set t = 0; choose η > 0. Loop until convergence: 1 compute the gradient ∇RSS(w) = X^T X w^(t) - X^T y; 2 update the parameters, w^(t+1) = w^(t) - η ∇RSS(w); 3 t ← t + 1. What is the complexity of each iteration? (A sketch of this loop follows.)
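
A sketch of batch gradient descent on RSS; the learning rate, stopping rule, and data are illustrative choices, and the factor of 2 in the gradient is absorbed into η as on the slide. Each iteration costs O(ND) to form X^T (X w - y).

```python
import numpy as np

def lms_gradient_descent(X_aug, y, eta=1e-3, max_iters=10000, tol=1e-8):
    """Batch gradient descent for RSS(w) = ||X_aug w - y||^2.

    X_aug: (N, D+1) augmented design matrix, y: (N,) targets.
    """
    w = np.zeros(X_aug.shape[1])                # w^(0)
    for t in range(max_iters):
        grad = X_aug.T @ (X_aug @ w - y)        # X^T X w - X^T y
        w_new = w - eta * grad                  # w^(t+1) = w^(t) - eta * gradient
        if np.linalg.norm(w_new - w) < tol:     # simple convergence check
            return w_new
        w = w_new
    return w

rng = np.random.default_rng(2)
X_aug = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 2))])
y = X_aug @ np.array([1.0, 2.0, -0.5]) + 0.1 * rng.normal(size=50)
print(lms_gradient_descent(X_aug, y))           # matches the closed-form solution
```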

66-69 Why would this work? If gradient descent converges, it will converge to the same solution as using matrix inversion, because RSS(w) is a convex function of its parameters w. Hessian of RSS: since RSS(w) = w^T X^T X w - 2 (X^T y)^T w + const, we have ∂^2 RSS(w) / (∂w ∂w^T) = 2 X^T X. X^T X is positive semidefinite, because for any v, v^T X^T X v = ||X v||_2^2 ≥ 0.

70-75 Stochastic gradient descent (Widrow-Hoff rule): update the parameters using one example at a time. Initialize w to some w^(0); set t = 0; choose η > 0. Loop until convergence: 1 randomly choose a training sample x_t; 2 compute its contribution to the gradient, g_t = (x_t^T w^(t) - y_t) x_t; 3 update the parameters, w^(t+1) = w^(t) - η g_t; 4 t ← t + 1. How does the complexity per iteration compare with gradient descent? O(ND) for gradient descent versus O(D) for SGD.
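
A matching SGD sketch (constant step size and sampling with replacement are illustrative simplifications): each update touches a single example, so the per-iteration cost is O(D).

```python
import numpy as np

def lms_sgd(X_aug, y, eta=1e-2, num_iters=20000, seed=0):
    """Stochastic gradient descent (Widrow-Hoff / LMS rule) for least squares."""
    rng = np.random.default_rng(seed)
    N, D1 = X_aug.shape
    w = np.zeros(D1)
    for t in range(num_iters):
        n = rng.integers(N)                       # randomly choose one training sample
        g = (X_aug[n] @ w - y[n]) * X_aug[n]      # g_t = (x_t^T w - y_t) x_t
        w = w - eta * g                           # w^(t+1) = w^(t) - eta * g_t
    return w

rng = np.random.default_rng(3)
X_aug = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
y = X_aug @ np.array([1.0, 2.0, -0.5]) + 0.1 * rng.normal(size=200)
print(lms_sgd(X_aug, y))                          # hovers near the least-squares solution
```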

76 Mini-summary. Batch gradient descent computes the exact gradient. Stochastic gradient descent approximates the gradient with a single data point; its expectation equals the true gradient. A mini-batch variant trades off accuracy of the gradient estimate against computational cost. Similar ideas extend to other ML optimization problems; for large-scale problems, stochastic gradient descent often works well.

77-79 What if X^T X is not invertible? Why might this happen? Answer 1: N < D; intuitively, there is not enough data to estimate all the parameters. Answer 2: the columns of X are not linearly independent, e.g., some features are perfectly correlated. In this case the solution is not unique.

80-82 Ridge regression. Intuition: what does a non-invertible X^T X mean? Consider the SVD of this matrix: X^T X = U diag(λ_1, λ_2, ..., λ_r, 0, ..., 0) U^T, where λ_1 ≥ λ_2 ≥ ... ≥ λ_r > 0 and r < D. Fix the problem by ensuring all singular values are non-zero: X^T X + λI = U diag(λ_1 + λ, λ_2 + λ, ..., λ) U^T, where λ > 0 and I is the identity matrix.

83-85 Regularized least squares (ridge regression). Solution: w = (X^T X + λI)^{-1} X^T y. This is equivalent to adding an extra term to RSS(w), i.e., minimizing { w^T X^T X w - 2 (X^T y)^T w } + λ ||w||_2^2, where the first part is RSS(w) and the last term is the regularization. Benefits: numerically more stable (an invertible matrix), and it helps prevent overfitting (more on this later).
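
A small ridge sketch on synthetic, deliberately rank-deficient data (one feature is an exact multiple of another, so X^T X alone is singular): adding λI makes the system invertible and shrinks the weights, splitting the signal between the two correlated columns.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 100
x1 = rng.normal(size=N)
X = np.column_stack([np.ones(N), x1, 2.0 * x1])     # third column perfectly correlated with the second
y = 1.0 + 3.0 * x1 + 0.1 * rng.normal(size=N)

lam = 1e-2
A = X.T @ X + lam * np.eye(X.shape[1])              # X^T X alone is singular here
w_ridge = np.linalg.solve(A, X.T @ y)
print(w_ridge)
```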

86 How to choose λ? λ is referred to as a hyperparameter, in contrast to the parameter vector w. Use a validation set or cross-validation to find a good choice of λ.

87 Outline: 1 Administration, 2 Review of last lecture, 3 Linear regression, 4 Nonlinear basis functions.

88-89 Is a linear modeling assumption always a good idea? The slides show an example of nonlinear classification and an example of nonlinear regression (a plot of targets t versus input x with a clearly nonlinear trend).

90-91 Nonlinear basis for classification. Transform the input/feature: φ(x) : x ∈ R^2 → z = x_1 x_2. The transformed training data is linearly separable!

92-94 Another example. How to transform the input/feature? φ(x) : x ∈ R^2 → z = [x_1^2, x_1 x_2, x_2^2]^T ∈ R^3. The transformed training data is linearly separable.

95-96 General nonlinear basis functions. We can use a nonlinear mapping φ(x) : x ∈ R^D → z ∈ R^M, where M is the dimensionality of the new feature/input z (or φ(x)); M could be greater than, less than, or equal to D. With the new features, we can apply our learning techniques to minimize errors on the transformed training data: for linear methods, the prediction is based on w^T φ(x); other methods (nearest neighbors, decision trees, etc) apply as well.

97-98 Regression with a nonlinear basis. Residual sum of squares: Σ_n [w^T φ(x_n) - y_n]^2, where w ∈ R^M has the same dimensionality as the transformed features φ(x). The LMS solution can be formulated with the new design matrix Φ = [φ(x_1)^T; φ(x_2)^T; ...; φ(x_N)^T] ∈ R^{N×M}, giving w_lms = (Φ^T Φ)^{-1} Φ^T y.
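
A sketch of LMS with a polynomial basis, anticipating the example on the next slide: build Φ by stacking φ(x_n)^T and reuse the closed-form solution (the degree and the noisy-sine data are illustrative assumptions).

```python
import numpy as np

def poly_design_matrix(x, M):
    """Phi with rows phi(x_n)^T = [1, x_n, x_n^2, ..., x_n^M]."""
    return np.vander(x, M + 1, increasing=True)

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, size=30)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=30)   # noisy sine, as in the slides' example

M = 3
Phi = poly_design_matrix(x, M)                           # N x (M+1)
w_lms = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)          # (Phi^T Phi)^{-1} Phi^T y
print(w_lms)
```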

99-100 Example with regression: polynomial basis functions φ(x) = [1, x, x^2, ..., x^M]^T, so that f(x) = w_0 + Σ_{m=1}^{M} w_m x^m. Fitting samples from a sine function: with M = 0 or M = 1 the model underfits, as f(x) is too simple (see the plots of t versus x for M = 0 and M = 1).

101-102 Adding higher-order terms: plots for M = 3 and M = 9, where M = 9 overfits. More complex features lead to better results on the training data, but potentially worse results on new data, e.g., test data!

103 Overfitting: the parameters for higher-order polynomials are very large. (The slide shows a table of the fitted coefficients w_0, ..., w_9 for M = 0, 1, 3, and 9; the numeric values were lost in transcription, but the coefficient magnitudes blow up for M = 9.)

104 Overfitting can be quite disastrous. Fitting the housing price data with large M: the predicted price goes to zero (and is ultimately negative) if you buy a big enough house! This is called poor generalization, or overfitting.

105 Detecting overfitting. Plot model complexity versus the objective function: x axis, model complexity (e.g., M); y axis, error (e.g., RSS, RMS (square root of RSS), or 0-1 loss). The slide plots E_RMS on the training and test sets as a function of M. As a model increases in complexity, the training error keeps improving, while the test error may first improve but eventually will deteriorate. (A small sketch of this diagnostic follows.)
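
A sketch of this diagnostic (the data, split, and degrees are illustrative): fit polynomials of increasing degree and compare training and test RMS error.

```python
import numpy as np

rng = np.random.default_rng(6)
x_train = rng.uniform(0, 1, size=10)
y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.normal(size=10)
x_test = rng.uniform(0, 1, size=100)
y_test = np.sin(2 * np.pi * x_test) + 0.2 * rng.normal(size=100)

def rms(Phi, y, w):
    return np.sqrt(np.mean((Phi @ w - y) ** 2))

for M in [0, 1, 3, 9]:
    Phi_tr = np.vander(x_train, M + 1, increasing=True)
    Phi_te = np.vander(x_test, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi_tr, y_train, rcond=None)   # least-squares fit of degree M
    print(M, rms(Phi_tr, y_train, w), rms(Phi_te, y_test, w))
    # training error typically keeps dropping with M; test error eventually gets worse
```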


More information

Machine Learning Practice Page 2 of 2 10/28/13

Machine Learning Practice Page 2 of 2 10/28/13 Machine Learning 10-701 Practice Page 2 of 2 10/28/13 1. True or False Please give an explanation for your answer, this is worth 1 pt/question. (a) (2 points) No classifier can do better than a naive Bayes

More information

Recap from previous lecture

Recap from previous lecture Recap from previous lecture Learning is using past experience to improve future performance. Different types of learning: supervised unsupervised reinforcement active online... For a machine, experience

More information

Midterm: CS 6375 Spring 2015 Solutions

Midterm: CS 6375 Spring 2015 Solutions Midterm: CS 6375 Spring 2015 Solutions The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for an

More information

Logistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu

Logistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu Logistic Regression Review 10-601 Fall 2012 Recitation September 25, 2012 TA: Selen Uguroglu!1 Outline Decision Theory Logistic regression Goal Loss function Inference Gradient Descent!2 Training Data

More information

Final Exam, Machine Learning, Spring 2009

Final Exam, Machine Learning, Spring 2009 Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

More information

ECS171: Machine Learning

ECS171: Machine Learning ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f

More information

Data Mining Techniques. Lecture 2: Regression

Data Mining Techniques. Lecture 2: Regression Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 2: Regression Jan-Willem van de Meent (credit: Yijun Zhao, Marc Toussaint, Bishop) Administrativa Instructor Jan-Willem van de Meent Email:

More information

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression

More information

CSC 411: Lecture 02: Linear Regression

CSC 411: Lecture 02: Linear Regression CSC 411: Lecture 02: Linear Regression Richard Zemel, Raquel Urtasun and Sanja Fidler University of Toronto (Most plots in this lecture are from Bishop s book) Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression

More information

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms Neural networks Comments Assignment 3 code released implement classification algorithms use kernels for census dataset Thought questions 3 due this week Mini-project: hopefully you have started 2 Example:

More information

1 Review of Winnow Algorithm

1 Review of Winnow Algorithm COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture # 17 Scribe: Xingyuan Fang, Ethan April 9th, 2013 1 Review of Winnow Algorithm We have studied Winnow algorithm in Algorithm 1. Algorithm

More information