Linear Regression. Introduction to Machine Learning. Matt Gormley, Lecture 4, September 19, 2016. Readings: Bishop 3.1; Murphy 7.


1 School of Computer Science. Introduction to Machine Learning. Linear Regression. Readings: Bishop 3.1; Murphy 7. Matt Gormley, Lecture 4, September 19, 2016.

2 Reminders: Homework 1 due 9/26/16; Project Proposal due 10/3/16 (start early!)

3 Outline
Linear Regression: simple example; model; learning (Gradient Descent, SGD, Closed Form)
Advanced Topics: geometric and probabilistic interpretation of LMS; L2 regularization; L1 regularization; features; locally-weighted linear regression; robust regression

4 Outline
Linear Regression: simple example; model; learning (aka. Least Squares): Gradient Descent, SGD (aka. Least Mean Squares (LMS)), Closed Form (aka. Normal Equations)
Advanced Topics: geometric and probabilistic interpretation of LMS; L2 regularization (aka. Ridge Regression); L1 regularization (aka. LASSO); features (aka. non-linear basis functions); locally-weighted linear regression; robust regression

5 Linear regression in 1D
Our goal is to estimate w from training data of (x_i, y_i) pairs, for the model y = wx + ε.
Optimization goal: minimize squared error (least squares): argmin_w Σ_i (y_i - w x_i)^2
Why least squares? It minimizes the squared distance between measurements and the predicted line (see HW); it has a nice probabilistic interpretation; the math is pretty.
Slide courtesy of William Cohen

6 Solving Linear Regression in 1D
To optimize in closed form, we just take the derivative w.r.t. w and set it to 0:
d/dw Σ_i (y_i - w x_i)^2 = -2 Σ_i x_i (y_i - w x_i)
Setting -2 Σ_i x_i (y_i - w x_i) = 0 gives Σ_i x_i y_i - w Σ_i x_i^2 = 0, so
w = (Σ_i x_i y_i) / (Σ_i x_i^2)
Slide courtesy of William Cohen
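To make the closed-form slope concrete, here is a minimal NumPy sketch; the synthetic data, seed, and noise level are illustrative choices, not from the slides:

```python
import numpy as np

# Minimal sketch (not from the slides): closed-form slope for 1D linear
# regression through the origin, w = (sum_i x_i*y_i) / (sum_i x_i^2).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=100)
y = 2.0 * x + rng.normal(0.0, 1.0, size=100)  # true w = 2, noise std = 1

w_hat = np.sum(x * y) / np.sum(x ** 2)
print(w_hat)  # should land close to 2
```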

7 Linear regression in 1D
Given an input x we would like to compute an output y. In linear regression we assume that y (what we are trying to predict) and x (the observed values) are related by the equation y = wx + ε, where w is a parameter and ε represents measurement error or other noise.
Slide courtesy of William Cohen

8 Regression example Generated: w=2 Recovered: w=2.03 Noise: std=1 Slide courtesy of William Cohen

9 Regression example Generated: w=2 Recovered: w=2.05 Noise: std=2 Slide courtesy of William Cohen

10 Regression example Generated: w=2 Recovered: w=2.08 Noise: std=4 Slide courtesy of William Cohen

11 Bias term
So far we assumed that the line passes through the origin. What if the line does not? No problem, simply change the model to y = w_0 + w_1 x + ε.
We can use least squares to determine w_0, w_1:
w_0 = (1/n) Σ_i (y_i - w_1 x_i),   w_1 = Σ_i x_i (y_i - w_0) / Σ_i x_i^2
Slide courtesy of William Cohen

12 Linear Regression
Data: Inputs are continuous vectors of length K. Outputs are continuous scalars. D = {(x^(i), y^(i))}_{i=1}^N, where x ∈ R^K and y ∈ R.
What are some example problems of this form?

13 Linear Regression
Data: Inputs are continuous vectors of length K. Outputs are continuous scalars. D = {(x^(i), y^(i))}_{i=1}^N, where x ∈ R^K and y ∈ R.
Prediction: Output is a linear function of the inputs:
ŷ = h_θ(x) = θ_1 x_1 + θ_2 x_2 + ... + θ_K x_K, i.e. ŷ = h_θ(x) = θ^T x (we assume x_1 is 1).
Learning: find the parameters that minimize some objective function: θ* = argmin_θ J(θ).

14 Least Squares
Learning: find the parameters that minimize some objective function: θ* = argmin_θ J(θ).
We minimize the sum of the squares: J(θ) = (1/2) Σ_{i=1}^N (θ^T x^(i) - y^(i))^2
Why? 1. It reduces the distance between the true measurements and the predicted hyperplane (a line in 1D). 2. It has a nice probabilistic interpretation.

15 Least Squares
Learning: find the parameters that minimize some objective function: θ* = argmin_θ J(θ).
We minimize the sum of the squares: J(θ) = (1/2) Σ_{i=1}^N (θ^T x^(i) - y^(i))^2
Why? 1. It reduces the distance between the true measurements and the predicted hyperplane (a line in 1D). 2. It has a nice probabilistic interpretation.
(Aside: this is a very general optimization setup. We could solve it in lots of ways. Today, we'll consider three ways.)

16 Least Squares. Learning: three approaches to solving θ* = argmin_θ J(θ). Approach 1: Gradient Descent (take larger, more certain steps opposite the gradient). Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient). Approach 3: Closed Form (set derivatives equal to zero and solve for parameters).

17 Learning as optimization
The main idea today. Make two assumptions: the classifier is linear, like naïve Bayes, i.e. roughly of the form f_w(x) = sign(x · w); and we want to find the classifier w that fits the data best.
Formalize this as an optimization problem: pick a loss function J(w, D) and find argmin_w J(w, D). OR: pick a "goodness" function, often Pr(D|w), and find argmax_w of that function.
Slide courtesy of William Cohen

18 Learning as optimization: warmup
Find: the optimal (MLE) parameter θ of a binomial.
Dataset: D = {x_1, ..., x_n}, each x_i is 0 or 1, and k of them are 1.
P(D|θ) = const · Π_i θ^{x_i} (1-θ)^{1-x_i} = θ^k (1-θ)^{n-k}
d/dθ P(D|θ) = d/dθ [θ^k (1-θ)^{n-k}] = (d/dθ θ^k)(1-θ)^{n-k} + θ^k d/dθ (1-θ)^{n-k}
  = k θ^{k-1} (1-θ)^{n-k} + θ^k (n-k)(1-θ)^{n-k-1} (-1)
  = θ^{k-1} (1-θ)^{n-k-1} [k(1-θ) - θ(n-k)]
Slide courtesy of William Cohen

19 Learning as optimization: warmup
Goal: find the best parameter θ of a binomial.
Dataset: D = {x_1, ..., x_n}, each x_i is 0 or 1, and k of them are 1.
Setting the derivative to zero gives θ = 0, θ = 1, or k(1-θ) - θ(n-k) = 0, i.e. k - kθ - nθ + kθ = 0, so nθ = k and θ = k/n.
Slide courtesy of William Cohen
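As a quick numerical sanity check of this warm-up (the dataset and seed below are made up for illustration), the closed-form answer k/n matches a brute-force search over the log-likelihood:

```python
import numpy as np

# Sanity check: the MLE of a binomial/Bernoulli parameter is theta = k/n.
# Compare the closed-form answer to a brute-force search over theta.
rng = np.random.default_rng(1)
D = rng.binomial(1, 0.3, size=50)      # 50 coin flips with true theta = 0.3
k, n = int(D.sum()), len(D)

thetas = np.linspace(1e-3, 1.0 - 1e-3, 999)
log_lik = k * np.log(thetas) + (n - k) * np.log(1.0 - thetas)
print(k / n, thetas[np.argmax(log_lik)])  # the two estimates should agree
```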

20 Learning as optimization with gradient ascent
Goal: learn the parameter w of your model. Dataset: D = {(x_1, y_1), ..., (x_n, y_n)}. Use your model to define Pr(D|w) = ... Set w to maximize the likelihood.
Usually we use numeric methods to find the optimum, i.e., gradient ascent: repeatedly take a small step in the direction of the gradient. The difference between ascent and descent is only the sign; I'm going to be sloppy here about that.
Slide courtesy of William Cohen

21 Gradient ascent. To find argmax_x f(x) (or argmin, flipping the sign): start with x_0; for t = 1, ...: x_{t+1} = x_t + λ ∇f(x_t), where λ is small. Slide courtesy of William Cohen

22 Gradient descent. (For a likelihood we do ascent; for a loss we do descent.) Slide courtesy of William Cohen

23 Pros and cons of gradient descent Simple and often quite effective on ML tasks Often very scalable Only applies to smooth functions (differentiable) Might find a local minimum, rather than a global one Slide courtesy of William Cohen 23

24 Pros and cons of gradient descent. There is only one local optimum if the function is convex. Slide courtesy of William Cohen

25 Gradient Descent
Algorithm 1: Gradient Descent
  1: procedure GD(D, θ^(0))
  2:   θ ← θ^(0)
  3:   while not converged do
  4:     θ ← θ - γ ∇_θ J(θ)
  5:   return θ
In order to apply GD to Linear Regression all we need is the gradient of the objective function (i.e. the vector of partial derivatives): ∇_θ J(θ) = [dJ(θ)/dθ_1, dJ(θ)/dθ_2, ..., dJ(θ)/dθ_K]^T

26 Gradient Descent
Algorithm 1: Gradient Descent (as on the previous slide).
There are many possible ways to detect convergence. For example, we could check whether the L2 norm of the gradient is below some small tolerance: ||∇_θ J(θ)||_2 ≤ ε. Alternatively we could check that the reduction in the objective function from one iteration to the next is small.
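To make Algorithm 1 concrete for linear regression, here is a NumPy sketch; the step size, tolerance, and iteration cap are illustrative choices, and the batch gradient X^T(Xθ - y) is the sum-over-examples formula derived later in the lecture:

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, tol=1e-6, max_iter=10000):
    """Batch gradient descent for least-squares linear regression (a sketch).

    Assumes X already contains a leading column of ones for the bias term.
    `lr`, `tol`, and `max_iter` are illustrative, not values from the slides.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        grad = X.T @ (X @ theta - y)    # gradient of J(theta) = 1/2 * sum of squared errors
        if np.linalg.norm(grad) < tol:  # convergence test: L2 norm of the gradient is small
            break
        theta = theta - lr * grad       # step opposite the gradient
    return theta
```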

27 Stochastic Gradient Descent (SGD)
Algorithm 2: Stochastic Gradient Descent (SGD)
  1: procedure SGD(D, θ^(0))
  2:   θ ← θ^(0)
  3:   while not converged do
  4:     for i ∈ shuffle({1, 2, ..., N}) do
  5:       θ ← θ - γ ∇_θ J^(i)(θ)
  6:   return θ
Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm. We need a per-example objective: let J(θ) = Σ_{i=1}^N J^(i)(θ), where J^(i)(θ) = (1/2)(θ^T x^(i) - y^(i))^2.

28 Stochastic Gradient Descent (SGD)
Algorithm 2: Stochastic Gradient Descent (SGD), written per coordinate
  1: procedure SGD(D, θ^(0))
  2:   θ ← θ^(0)
  3:   while not converged do
  4:     for i ∈ shuffle({1, 2, ..., N}) do
  5:       for k ∈ {1, 2, ..., K} do
  6:         θ_k ← θ_k - γ (d/dθ_k) J^(i)(θ)
  7:   return θ
Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm. We need a per-example objective: let J(θ) = Σ_{i=1}^N J^(i)(θ), where J^(i)(θ) = (1/2)(θ^T x^(i) - y^(i))^2.

29 Stochastic Gradient Descent (SGD)
(Same Algorithm 2 as on the previous slide.)
Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm. We need a per-example objective: let J(θ) = Σ_{i=1}^N J^(i)(θ), where J^(i)(θ) = (1/2)(θ^T x^(i) - y^(i))^2. Let's start by calculating this partial derivative for the Linear Regression objective function.

30 Partial Derivatives for Linear Reg.
Let J(θ) = Σ_{i=1}^N J^(i)(θ), where J^(i)(θ) = (1/2)(θ^T x^(i) - y^(i))^2.
d/dθ_k J^(i)(θ) = d/dθ_k [(1/2)(θ^T x^(i) - y^(i))^2]
  = (1/2) d/dθ_k (θ^T x^(i) - y^(i))^2
  = (θ^T x^(i) - y^(i)) d/dθ_k (θ^T x^(i) - y^(i))
  = (θ^T x^(i) - y^(i)) d/dθ_k (Σ_{k'=1}^K θ_{k'} x_{k'}^(i) - y^(i))
  = (θ^T x^(i) - y^(i)) x_k^(i)

31 Partial Derivatives for Linear Reg. (The same derivation, repeated.) With J^(i)(θ) = (1/2)(θ^T x^(i) - y^(i))^2, we get d/dθ_k J^(i)(θ) = (θ^T x^(i) - y^(i)) x_k^(i).

32 Partial Derivatives for Linear Reg.
Let J(θ) = Σ_{i=1}^N J^(i)(θ), where J^(i)(θ) = (1/2)(θ^T x^(i) - y^(i))^2.
d/dθ_k J^(i)(θ) = (θ^T x^(i) - y^(i)) x_k^(i)   [used by SGD (aka. LMS)]
d/dθ_k J(θ) = Σ_{i=1}^N d/dθ_k J^(i)(θ) = Σ_{i=1}^N (θ^T x^(i) - y^(i)) x_k^(i)   [used by Gradient Descent]

33 Least Mean Squares (LMS)
Algorithm 3: Least Mean Squares (LMS)
  1: procedure LMS(D, θ^(0))
  2:   θ ← θ^(0)
  3:   while not converged do
  4:     for i ∈ shuffle({1, 2, ..., N}) do
  5:       for k ∈ {1, 2, ..., K} do
  6:         θ_k ← θ_k - γ (θ^T x^(i) - y^(i)) x_k^(i)
  7:   return θ
Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm.
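A sketch of Algorithm 3 in NumPy is below; it updates the whole parameter vector at once rather than coordinate by coordinate, and the learning rate, epoch count, and shuffling seed are illustrative choices:

```python
import numpy as np

def lms(X, y, lr=0.01, n_epochs=50, seed=0):
    """Least Mean Squares: SGD on the per-example squared error (a sketch).

    Follows Algorithm 3, but updates all K coordinates in one vectorized
    step; learning rate, epoch count, and seed are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    N, K = X.shape
    theta = np.zeros(K)
    for _ in range(n_epochs):
        for i in rng.permutation(N):         # i in shuffle({1, ..., N})
            err = X[i] @ theta - y[i]        # (theta^T x^(i) - y^(i))
            theta = theta - lr * err * X[i]  # update every coordinate k at once
    return theta
```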

34 Least Squares. Learning: three approaches to solving θ* = argmin_θ J(θ). Approach 1: Gradient Descent (take larger, more certain steps opposite the gradient). Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient). Approach 3: Closed Form (set derivatives equal to zero and solve for parameters).

35 Background: Matrix Derivatives
For f : R^{m×n} → R, define the matrix derivative ∇_A f(A) as the m×n matrix whose (i,j) entry is ∂f/∂A_{ij}.
Trace: tr A = Σ_{i=1}^m A_{ii} for square A; tr a = a for a scalar a; tr ABC = tr CAB = tr BCA.
Some facts about matrix derivatives (without proof):
∇_A tr AB = B^T
∇_A tr ABA^T C = CAB + C^T AB^T
∇_A |A| = |A| (A^{-1})^T
Eric Xing, CMU

36 The normal equations
Write the cost function in matrix form:
J(θ) = (1/2) Σ_{i=1}^n (x_i^T θ - y_i)^2 = (1/2)(Xθ - y)^T (Xθ - y) = (1/2)(θ^T X^T Xθ - θ^T X^T y - y^T Xθ + y^T y)
where X is the matrix whose rows are the training inputs x_1^T, ..., x_n^T and y = [y_1, ..., y_n]^T.
To minimize J(θ), take the derivative and set it to zero:
∇_θ J = (1/2) ∇_θ tr(θ^T X^T Xθ - θ^T X^T y - y^T Xθ + y^T y) = (1/2)(X^T Xθ + X^T Xθ - 2 X^T y) = X^T Xθ - X^T y = 0,
which gives the normal equations X^T Xθ = X^T y, and hence θ* = (X^T X)^{-1} X^T y.
Eric Xing, CMU
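A minimal sketch of the closed-form solution follows; using a linear solve rather than an explicit matrix inverse is a standard numerical choice on my part, not something prescribed by the slide:

```python
import numpy as np

def normal_equations(X, y):
    """Closed-form least squares, theta* = (X^T X)^{-1} X^T y (a sketch).

    np.linalg.solve is used instead of an explicit inverse for numerical
    stability; np.linalg.lstsq would additionally handle a rank-deficient X.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)
```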

37 Comments on the normal equations
- In most situations of practical interest, the number of data points N is larger than the dimensionality k of the input space and the matrix X is of full column rank. If this condition holds, then it is easy to verify that X^T X is necessarily invertible.
- The assumption that X^T X is invertible implies that it is positive definite, thus the critical point we have found is a minimum.
- What if X has less than full column rank? → regularization (later).
Eric Xing, CMU

38 Direct and Iterative methods
- Direct methods: we can achieve the solution in a single step by solving the normal equations. Using Gaussian elimination or QR decomposition, we converge in a finite number of steps. This can be infeasible when data are streaming in in real time, or are of very large amount.
- Iterative methods: stochastic or steepest gradient descent. These converge only in a limiting sense, but are more attractive in large practical problems. Caution is needed when deciding the learning rate α.
Eric Xing, CMU

39 Convergence rate
- Theorem: the steepest descent algorithm converges to the minimum of the cost characterized by the normal equations, provided the step size α is sufficiently small.
- A formal analysis of LMS needs more math; in practice, one can use a small α, or gradually decrease α.
Eric Xing, CMU

40 Convergence Curves
- For the batch method, the training MSE is initially large due to uninformed initialization.
- In the online update, the N updates per epoch reduce the MSE to a much smaller value.
Eric Xing, CMU

41 Least Squares. Learning: three approaches to solving θ* = argmin_θ J(θ).
Approach 1: Gradient Descent (take larger, more certain steps opposite the gradient). Pros: conceptually simple, guaranteed convergence. Cons: batch, often slow to converge.
Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient). Pros: memory efficient, fast convergence, less prone to local optima. Cons: convergence in practice requires tuning and fancier variants.
Approach 3: Closed Form (set derivatives equal to zero and solve for parameters). Pros: one-shot algorithm! Cons: does not scale to large datasets (the matrix inverse is the bottleneck).

42 Geometric Interpretation of LMS
The predictions on the training data are: ŷ = Xθ* = X(X^T X)^{-1} X^T y.
Note that ŷ - y = (X(X^T X)^{-1} X^T - I) y, and
X^T (ŷ - y) = X^T (X(X^T X)^{-1} X^T - I) y = (X^T X(X^T X)^{-1} X^T - X^T) y = 0.
Thus ŷ is the orthogonal projection of y onto the space spanned by the columns of X (here y = [y_1, ..., y_n]^T and X has rows x_1^T, ..., x_n^T).
Eric Xing, CMU
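The orthogonality claim is easy to verify numerically; the synthetic data and seed below are purely illustrative:

```python
import numpy as np

# Numerical check of the geometric claim: the least-squares residual
# y - y_hat is orthogonal to every column of X.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(0.0, 0.5, size=30)

theta_star = np.linalg.solve(X.T @ X, X.T @ y)
residual = y - X @ theta_star
print(X.T @ residual)  # approximately the zero vector
```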

43 Probabilistic Interpretation of LMS
Let us assume that the target variable and the inputs are related by the equation y_i = θ^T x_i + ε_i, where ε is an error term of unmodeled effects or random noise.
Now assume that ε follows a Gaussian N(0, σ). Then we have:
p(y_i | x_i; θ) = (1/(√(2π) σ)) exp(-(y_i - θ^T x_i)^2 / (2σ^2))
By the independence assumption:
L(θ) = Π_{i=1}^n p(y_i | x_i; θ) = (1/(√(2π) σ))^n exp(-Σ_{i=1}^n (y_i - θ^T x_i)^2 / (2σ^2))
Eric Xing, CMU

44 Probabilistic Interpretation of LMS
Hence the log-likelihood is:
ℓ(θ) = n log(1/(√(2π) σ)) - (1/(2σ^2)) Σ_{i=1}^n (y_i - θ^T x_i)^2
Do you recognize the last term? Yes, it is J(θ) = (1/2) Σ_{i=1}^n (x_i^T θ - y_i)^2 (up to the 1/σ^2 factor).
Thus under the independence assumption, LMS is equivalent to MLE of θ!
Eric Xing, CMU

45 Non-Linear basis functions
So far we only used the observed values x_1, x_2, ... However, linear regression can be applied in the same way to functions of these values. E.g., to add a term w·x_1 x_2, add a new variable z = x_1 x_2 so that each example becomes: x_1, x_2, ..., z.
As long as these functions can be directly computed from the observed values, the model is still linear in the parameters and the problem remains a multivariate linear regression problem:
y = w_0 + w_1 x_1 + ... + w_k x_k + ε
Slide courtesy of William Cohen

46 Non-linear basis functions
What type of functions can we use? A few common examples:
- Polynomial: φ_j(x) = x^j for j = 0, ..., n
- Gaussian: φ_j(x) = exp(-(x - µ_j)^2 / (2σ_j^2))
- Sigmoid: φ_j(x) = 1 / (1 + exp(-s_j x))
- Logs: φ_j(x) = log(x + 1)
Any function of the input values can be used. The solution for the parameters of the regression remains the same.
Slide courtesy of William Cohen

47 General linear regression problem
Using our new notation for the basis functions, linear regression can be written as
y = Σ_{j=0}^n w_j φ_j(x)
where φ_j(x) can be either x_j for multivariate regression or one of the non-linear basis functions we defined, and φ_0(x) = 1 for the intercept term.
Slide courtesy of William Cohen
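As one concrete illustration of a basis expansion (the helper name and the use of np.vander are my own choices, not from the slides), polynomial features can be built and then fed to the same least-squares solver as before:

```python
import numpy as np

def polynomial_features(x, degree):
    """Map scalar inputs x to the basis [1, x, x^2, ..., x^degree] (a sketch).

    After this transform the model y = sum_j w_j * phi_j(x) is still linear
    in the weights w, so the same least-squares machinery applies unchanged.
    """
    x = np.asarray(x, dtype=float).reshape(-1)
    return np.vander(x, N=degree + 1, increasing=True)

# Illustrative use: Phi = polynomial_features(x, 3)
#                   w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
```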

48 An example: polynomial basis vectors on a small dataset From Bishop Ch 1 Slide courtesy of William Cohen

49 0th Order Polynomial (n=10). Slide courtesy of William Cohen

50 1st Order Polynomial. Slide courtesy of William Cohen

51 3rd Order Polynomial. Slide courtesy of William Cohen

52 9th Order Polynomial. Slide courtesy of William Cohen

53 Over-fitting. Root-Mean-Square (RMS) Error: E_RMS = √(2 E(w*) / N). Slide courtesy of William Cohen

54 Polynomial Coefficients. Slide courtesy of William Cohen

55 Regularization
Penalize large coefficient values:
J_{X,y}(w) = (1/2) Σ_i (y_i - Σ_j w_j φ_j(x_i))^2 + (λ/2) ||w||^2
Slide courtesy of William Cohen

56 Regularization. (Figure: the 9th-order polynomial fit with a regularization penalty added; from Bishop Ch. 1.) Slide courtesy of William Cohen

57 Polynomial Coefficients. (Table from Bishop Ch. 1: coefficient values for no regularization, λ = exp(-18), and huge λ; the coefficients shrink as λ grows.) Slide courtesy of William Cohen

58 Over-Regularization. Slide courtesy of William Cohen

59 Example: Stock Prices
Suppose we wish to predict Google's stock price at time t+1. What features should we use? (Putting all computational concerns aside.)
- Stock prices of all other stocks at times t, t-1, t-2, ..., t-k
- Mentions of Google with positive / negative sentiment words in all newspapers and social media outlets
Do we believe that all of these features are going to be useful?

60 Ridge Regression
Adds an L2 regularizer to Linear Regression:
J_RR(θ) = J(θ) + λ ||θ||_2^2 = (1/2) Σ_{i=1}^N (θ^T x^(i) - y^(i))^2 + λ Σ_{k=1}^K θ_k^2
Bayesian interpretation: MAP estimation with a Gaussian prior on the parameters,
θ_MAP = argmax_θ Σ_{i=1}^N log p_θ(y^(i) | x^(i)) + log p(θ) = argmin_θ J_RR(θ), where p(θ) ∼ N(0, (1/λ) I).
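A minimal sketch of the closed-form ridge solution; note the convention caveat in the docstring, since the slide writes the penalty as λ||θ||² while the formula below corresponds to (λ/2)||θ||²:

```python
import numpy as np

def ridge_regression(X, y, lam):
    """Closed-form ridge regression, theta = (X^T X + lam*I)^{-1} X^T y (a sketch).

    This minimizes (1/2)||X theta - y||^2 + (lam/2)*||theta||_2^2; to match the
    slide's convention J(theta) + lam*||theta||_2^2 exactly, pass 2*lam.
    Whether to exempt the bias coordinate from the penalty is a modeling
    choice not addressed here.
    """
    K = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)
```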

61 LASSO
Adds an L1 regularizer to Linear Regression:
J_LASSO(θ) = J(θ) + λ ||θ||_1 = (1/2) Σ_{i=1}^N (θ^T x^(i) - y^(i))^2 + λ Σ_{k=1}^K |θ_k|
Bayesian interpretation: MAP estimation with a Laplace prior on the parameters,
θ_MAP = argmax_θ Σ_{i=1}^N log p_θ(y^(i) | x^(i)) + log p(θ) = argmin_θ J_LASSO(θ), where p(θ) ∼ Laplace(0, f(λ)).

62 Ridge Regression vs Lasso
(Figure: level sets of constant J(β) intersecting the constraint regions in the (β_1, β_2) plane: βs with constant L2 norm for Ridge Regression, βs with constant L1 norm for Lasso.)
Lasso (L1 penalty) results in sparse solutions: vectors with more zero coordinates. Good for high-dimensional problems, since we don't have to store all coordinates!
Eric Xing, CMU

63 Optimization for LASSO
Can we apply SGD to the LASSO learning problem?
θ* = argmin_θ J_LASSO(θ), where J_LASSO(θ) = J(θ) + λ ||θ||_1 = (1/2) Σ_{i=1}^N (θ^T x^(i) - y^(i))^2 + λ Σ_{k=1}^K |θ_k|

64 Optimization for LASSO
Consider the absolute value function: r(θ) = Σ_{k=1}^K |θ_k|.
The L1 penalty is subdifferentiable (i.e. not differentiable at 0).
What does a small step opposite the gradient accomplish when far from zero? When close to zero?

65 Optimization for LASSO
The L1 penalty is subdifferentiable (i.e. not differentiable at 0). An array of optimization algorithms exist to handle this issue:
- Coordinate Descent
- Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) (Andrew & Gao, 2007) and provably convergent variants
- Block Coordinate Descent (Tseng & Yun, 2009)
- Sparse Reconstruction by Separable Approximation (SpaRSA) (Wright et al., 2009)
- Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) (Beck & Teboulle, 2009)
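The slide only names these algorithms; as one concrete illustration (not one of the cited methods), here is a sketch of proximal gradient descent (ISTA), which takes a gradient step on the smooth squared-error term and handles the non-differentiable L1 term with elementwise soft-thresholding:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t*||.||_1: elementwise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=1000):
    """Proximal gradient descent (ISTA) for
    J_LASSO(theta) = 1/2 ||X theta - y||^2 + lam * ||theta||_1  (a sketch).

    The step size 1/L, with L the largest eigenvalue of X^T X, is a standard
    but illustrative choice, as is the fixed iteration count.
    """
    L = np.linalg.eigvalsh(X.T @ X).max()
    step = 1.0 / L
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y)                         # gradient of the smooth part
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta
```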

66 Data Set Size: 9th Order Polynomial. Slide courtesy of William Cohen

67 Locally weighted linear regression
The algorithm: instead of minimizing J(θ) = (1/2) Σ_{i=1}^n (x_i^T θ - y_i)^2, we now fit θ to minimize J(θ) = (1/2) Σ_{i=1}^n w_i (x_i^T θ - y_i)^2.
Where do the w_i's come from? w_i = exp(-(x_i - x)^2 / (2τ^2)), where x is the query point for which we'd like to know its corresponding y.
Essentially we put higher weights on (errors on) training examples that are close to the query point than on those that are further away from the query.
Eric Xing, CMU
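A sketch of the prediction procedure at a single query point, with inputs stored row-wise in X; the Gaussian kernel and bandwidth τ follow the weights above, while the function name and default values are illustrative:

```python
import numpy as np

def locally_weighted_prediction(X, y, x_query, tau=1.0):
    """Locally weighted linear regression at a single query point (a sketch).

    Weights w_i = exp(-||x_i - x_query||^2 / (2 tau^2)) emphasize training
    points near the query; theta solves the weighted normal equations
    X^T W X theta = X^T W y and must be refit for every new query point.
    """
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2.0 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return float(x_query @ theta)
```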

68 Parametric vs. non-parametric
- Locally weighted linear regression is the second example we have run into of a non-parametric algorithm (what was the first?).
- The (unweighted) linear regression algorithm that we saw earlier is known as a parametric learning algorithm, because it has a fixed, finite number of parameters (the θ), which are fit to the data; once we've fit the θ and stored them away, we no longer need to keep the training data around to make future predictions.
- In contrast, to make predictions using locally weighted linear regression, we need to keep the entire training set around. The term "non-parametric" (roughly) refers to the fact that the amount of stuff we need to keep in order to represent the hypothesis grows linearly with the size of the training set.
Eric Xing, CMU

69 Robust Regression
(Figure:) The best fit from a quadratic regression, but this other fit is probably better. How can we do this?
Eric Xing, CMU

70 LOESS-based Robust Regression
Remember what we do in "locally weighted linear regression"? → we "score" each point for its importance.
Now we score each point according to its "fitness".
(Courtesy of Andrew Moore)
Eric Xing, CMU

71 Robust regression
For k = 1 to R:
- Let (x_k, y_k) be the kth datapoint.
- Let y_k^est be the predicted value of y_k.
- Let w_k be a weight for data point k that is large if the data point fits well and small if it fits badly: w_k = φ((y_k - y_k^est)^2).
Then redo the regression using the weighted data points. Repeat the whole thing until converged!
Eric Xing, CMU
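A sketch of this iterative reweighting loop; the Gaussian choice of the fitness function φ, the number of rounds, and the scale are assumptions for illustration, since the slide leaves φ abstract:

```python
import numpy as np

def robust_regression(X, y, n_rounds=10, scale=1.0):
    """LOESS-style robust regression by iterative reweighting (a sketch).

    Each round fits weighted least squares, then down-weights points with
    large residuals. The Gaussian "fitness" weight below is one plausible
    choice for the phi in the slide; `n_rounds` and `scale` are illustrative.
    """
    w = np.ones(len(y))
    theta = np.zeros(X.shape[1])
    for _ in range(n_rounds):
        W = np.diag(w)
        theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # weighted least squares
        residuals = y - X @ theta
        w = np.exp(-residuals ** 2 / (2.0 * scale ** 2))   # poor fit -> small weight
    return theta
```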

72 Robust regression: probabilistic interpretation
What regular regression does: assume y_k was originally generated using the following recipe:
y_k = θ^T x_k + N(0, σ^2)
The computational task is to find the Maximum Likelihood estimate of θ.
Eric Xing, CMU

73 Robust regression: probabilistic interpretation
What LOESS robust regression does: assume y_k was originally generated using the following recipe: with probability p, y_k = θ^T x_k + N(0, σ^2), but otherwise y_k ∼ N(µ, σ_huge^2).
The computational task is to find the Maximum Likelihood estimates of θ, p, µ and σ_huge.
The algorithm you saw with iterative reweighting/refitting does this computation for us. Later you will find that it is an instance of the famous E.M. algorithm.
Eric Xing, CMU

74 Summary
- Linear regression predicts its output as a linear function of its inputs.
- Learning optimizes an objective function (equivalently, a likelihood or the mean squared error) using standard techniques (gradient descent, SGD, closed form).
- Regularization enables shrinkage and model selection.
Eric Xing, CMU
