Machine Learning. Lecture 2: Linear regression. Feng Li.

1 Machine Learning, Lecture 2: Linear Regression
Feng Li, https://funglee.github.io
School of Computer Science and Technology, Shandong University
Fall 2017

2 Supervised Learning
Regression: predict a continuous value.
Classification: predict a discrete value (the class).
[Table: living area (feet$^2$) and price (in thousands of dollars)]
[Figure: training set -> learning algorithm -> h; x (living area of house) -> h -> predicted y (predicted price of house)]
Features: input variables, $x$. Target: output variable, $y$.
Training example: $(x^{(i)}, y^{(i)})$, $i = 1, 2, 3, \dots, m$.
Hypothesis: $h : \mathcal{X} \to \mathcal{Y}$.

3 Linear Regression
[Figure: housing prices; price (in thousands of dollars) versus square feet]
Linear hypothesis: $h(x) = \theta_1 x + \theta_0$.
$\theta_i$ ($i = 0, 1$ for 2D cases): parameters to estimate.
How do we choose the $\theta_i$'s?

4 Linear Regression (Contd.)
Input: training set $(x^{(i)}, y^{(i)}) \in \mathbb{R}^2$ ($i = 1, \dots, m$).
Goal: model the relationship between $x$ and $y$ so that we can predict the target corresponding to a new feature value.

5 Linear Regression (Contd.)
The relationship between $x$ and $y$ is modeled as a linear function.
A linear function in the 2D plane is a straight line.
Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$, where $\theta_0$ and $\theta_1$ are parameters.

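6 Linear Regression (Contd.)
Given data $x \in \mathbb{R}^n$, we have $\theta \in \mathbb{R}^{n+1}$. Thus
\[ h_\theta(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x, \quad \text{where } x_0 = 1 \]
What is the best choice of $\theta$?
\[ \min_\theta J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \]
where $J(\theta)$ is the so-called cost function.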
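
The hypothesis and cost function above can be sketched in a few lines of plain Python. This is only an illustration: the function names `predict` and `cost` and the toy data (generated from y = 1 + 2x) are assumptions, not part of the lecture.

```python
def predict(theta, x):
    """h_theta(x) = theta^T x, with the intercept feature x_0 = 1 prepended."""
    features = [1.0] + list(x)
    return sum(t * f for t, f in zip(theta, features))


def cost(theta, xs, ys):
    """J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2."""
    return 0.5 * sum((predict(theta, x) - y) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data generated from y = 1 + 2x (not from the slides).
xs = [[0.0], [1.0], [2.0]]
ys = [1.0, 3.0, 5.0]

perfect_fit = cost([1.0, 2.0], xs, ys)   # every residual is zero
shifted_fit = cost([0.0, 2.0], xs, ys)   # every residual is -1
print(perfect_fit, shifted_fit)
```

With the intercept off by one, each of the three residuals is -1, so the cost is 0.5 * 3 = 1.5, while the true parameters give a cost of 0.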

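7 Gradient Descent (GD) Algorithm
If the multi-variable function $J(\theta)$ is differentiable in a neighborhood of a point $\theta$, then $J(\theta)$ decreases fastest if one goes from $\theta$ in the direction of the negative gradient of $J$ at $\theta$.
Find a local minimum of a differentiable function using gradient descent:
\[ \theta_j \leftarrow \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}, \quad \forall j \tag{1} \]
where $\alpha$ is the so-called learning rate.
Calculating the gradient:
\[ \frac{\partial J(\theta)}{\partial \theta_j} = \frac{\partial}{\partial \theta_j} \frac{1}{2} \sum_{i=1}^{m} \left( \theta^T x^{(i)} - y^{(i)} \right)^2 = \sum_{i=1}^{m} \left( \theta^T x^{(i)} - y^{(i)} \right) x_j^{(i)} \]
Algorithm: iteratively update $\theta$ according to (1) until convergence is achieved.
Note: $\theta$ is usually initialized randomly.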
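
A minimal sketch of batch gradient descent with update rule (1), in plain Python. The function name, the default step size, the fixed iteration count (instead of a convergence test), the zero initialization (the slide suggests random initialization), and the toy data are all assumptions chosen to keep the example short and deterministic.

```python
def batch_gradient_descent(xs, ys, alpha=0.05, iterations=2000):
    """Minimize J(theta) = 1/2 * sum_i (theta^T x^(i) - y^(i))^2 with update rule (1)."""
    rows = [[1.0] + list(x) for x in xs]      # prepend x_0 = 1 to every sample
    theta = [0.0] * len(rows[0])              # assumption: start from zero
    for _ in range(iterations):
        # Residuals theta^T x^(i) - y^(i) over the ENTIRE training set.
        residuals = [
            sum(t * f for t, f in zip(theta, row)) - y
            for row, y in zip(rows, ys)
        ]
        # dJ/dtheta_j = sum_i residual_i * x_j^(i)
        gradient = [
            sum(r * row[j] for r, row in zip(residuals, rows))
            for j in range(len(theta))
        ]
        # Simultaneous update of every theta_j.
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
    return theta


# Hypothetical noiseless data from y = 1 + 2x.
xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [1.0, 3.0, 5.0, 7.0]
theta = batch_gradient_descent(xs, ys)
print(theta)
```

For this data the step size 0.05 is small enough for the iteration to converge; a much larger value would make the updates diverge.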

8 Gradient Descent Algorithm
[Figure: an illustration of the gradient descent algorithm]
The objective function decreases fastest along the negative gradient.

9 Remarks
Another commonly used form:
\[ \min_\theta J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \]
What's the difference?
The factor $m$ is introduced to scale the objective function.
This may be useful in implementations.
In the gradient descent algorithm, we can scale the step size instead.
Gradient ascent algorithm:
Maximize the differentiable function $J(\theta)$.
The gradient represents the direction along which $J$ increases fastest.
Therefore, we have
\[ \theta_j \leftarrow \theta_j + \alpha \frac{\partial J(\theta)}{\partial \theta_j} \]
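
The remark about scaling the step size can be made explicit with a short worked equation. The symbol $J_m$ for the scaled objective is introduced here only for illustration.

```latex
J_m(\theta) = \frac{1}{m} J(\theta)
\;\Longrightarrow\;
\frac{\partial J_m(\theta)}{\partial \theta_j}
  = \frac{1}{m} \sum_{i=1}^{m} \left( \theta^T x^{(i)} - y^{(i)} \right) x_j^{(i)}
\;\Longrightarrow\;
\theta_j - \alpha \frac{\partial J_m(\theta)}{\partial \theta_j}
  = \theta_j - \frac{\alpha}{m} \frac{\partial J(\theta)}{\partial \theta_j}
```

So gradient descent on the scaled objective with step size $\alpha$ produces the same iterates as gradient descent on the unscaled objective with step size $\alpha / m$.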

10 Convergence under Different Step Sizes
[Figure: objective function J versus iterations, one curve per step size]

11 Stochastic Gradient Descent (SGD)
What if the training set is huge?
In the batch gradient descent algorithm above, we have to run through the entire training set in each iteration, which induces a considerable computational cost.
Stochastic gradient descent (SGD), also known as incremental gradient descent, is a stochastic approximation of the gradient descent optimization method.
In each iteration, the parameters are updated according to the gradient of the error with respect to one training sample only.
Algorithm:
Randomly shuffle the samples in the training set.
For $i = 1, 2, \dots, m$:
\[ \theta_j \leftarrow \theta_j - \alpha \left( \theta^T x^{(i)} - y^{(i)} \right) x_j^{(i)}, \quad \forall j \]
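
The SGD procedure can be sketched as follows in plain Python. The function name, the number of passes over the data (`epochs`), the fixed random seed, the zero initialization, and the toy data are assumptions made for a reproducible illustration.

```python
import random


def stochastic_gradient_descent(xs, ys, alpha=0.05, epochs=500, seed=0):
    """SGD for least squares: update theta from one training sample at a time."""
    rng = random.Random(seed)                 # assumption: fixed seed for reproducibility
    rows = [[1.0] + list(x) for x in xs]      # prepend x_0 = 1
    theta = [0.0] * len(rows[0])
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)                    # randomly shuffle the samples
        for i in order:
            residual = sum(t * f for t, f in zip(theta, rows[i])) - ys[i]
            # theta_j <- theta_j - alpha * (theta^T x^(i) - y^(i)) * x_j^(i)
            theta = [t - alpha * residual * f for t, f in zip(theta, rows[i])]
    return theta


# Hypothetical noiseless data from y = 1 + 2x.
xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [1.0, 3.0, 5.0, 7.0]
theta = stochastic_gradient_descent(xs, ys)
print(theta)
```

Because this toy data is noiseless, every per-sample gradient vanishes at the true parameters and SGD settles there; with noisy data the iterates would typically keep fluctuating around the minimum.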

12 More About SGD
The objective does not always decrease in each iteration.
Usually, SGD brings $\theta$ close to the minimum much faster than batch GD.
SGD may never converge to the minimum, and oscillation may occur.
A variant: mini-batch gradient descent, which picks a small group of samples and averages their gradients; this may accelerate and smooth the convergence.
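
A minimal sketch of the mini-batch variant, assuming the gradient is averaged over each batch. The function name, batch size, epoch count, seed, and toy data are illustrative assumptions.

```python
import random


def minibatch_gradient_descent(xs, ys, alpha=0.05, batch_size=2, epochs=2000, seed=0):
    """Mini-batch gradient descent: average the gradient over a small group of samples."""
    rng = random.Random(seed)                 # assumption: fixed seed for reproducibility
    rows = [[1.0] + list(x) for x in xs]      # prepend x_0 = 1
    theta = [0.0] * len(rows[0])
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            gradient = [0.0] * len(theta)
            for i in batch:
                residual = sum(t * f for t, f in zip(theta, rows[i])) - ys[i]
                for j, feature in enumerate(rows[i]):
                    gradient[j] += residual * feature
            # Step along the averaged gradient of the batch.
            theta = [t - alpha * g / len(batch) for t, g in zip(theta, gradient)]
    return theta


# Hypothetical noiseless data from y = 1 + 2x.
xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [1.0, 3.0, 5.0, 7.0]
theta = minibatch_gradient_descent(xs, ys)
print(theta)
```

Setting `batch_size=1` recovers the single-sample SGD update, and setting it to the training-set size gives batch gradient descent on the averaged objective.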

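13 Matrix Derivatives
A function $f : \mathbb{R}^{m \times n} \to \mathbb{R}$.
The derivative of $f$ with respect to $A$ is defined as
\[ \nabla_A f(A) = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix} \]
For an $n \times n$ matrix, its trace is defined as $\mathrm{tr} A = \sum_{i=1}^{n} A_{ii}$.
$\mathrm{tr} ABCD = \mathrm{tr} DABC = \mathrm{tr} CDAB = \mathrm{tr} BCDA$
$\mathrm{tr} A = \mathrm{tr} A^T$, $\mathrm{tr}(A + B) = \mathrm{tr} A + \mathrm{tr} B$, $\mathrm{tr}\, aA = a\, \mathrm{tr} A$
$\nabla_A \mathrm{tr} AB = B^T$, $\nabla_{A^T} f(A) = (\nabla_A f(A))^T$
$\nabla_A \mathrm{tr} ABA^T C = CAB + C^T A B^T$, $\nabla_A |A| = |A| (A^{-1})^T$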

14 Matrix Derivatives (Contd.)
$\nabla_A \mathrm{tr} AB = B^T$ and $\nabla_{A^T} f(A) = (\nabla_A f(A))^T$
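
The identity $\nabla_A \mathrm{tr} AB = B^T$ can be checked numerically with central finite differences. The sketch below uses plain nested lists; the helper names and the example matrices are assumptions chosen only for illustration.

```python
def trace_of_product(A, B):
    """tr(AB) for A (m x n) and B (n x m), both given as nested lists."""
    return sum(A[i][k] * B[k][i] for i in range(len(A)) for k in range(len(B)))


def numerical_gradient(f, A, eps=1e-6):
    """Approximate the matrix of partial derivatives df/dA_ij by central differences."""
    grad = [[0.0] * len(A[0]) for _ in A]
    for i in range(len(A)):
        for j in range(len(A[0])):
            plus = [row[:] for row in A]
            minus = [row[:] for row in A]
            plus[i][j] += eps
            minus[i][j] -= eps
            grad[i][j] = (f(plus) - f(minus)) / (2 * eps)
    return grad


# Hypothetical example matrices.
A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]       # 2 x 3
B = [[1.0, -1.0], [0.5, 2.0], [3.0, 0.0]]    # 3 x 2

grad = numerical_gradient(lambda M: trace_of_product(M, B), A)
B_transpose = [[B[k][i] for k in range(len(B))] for i in range(len(B[0]))]

max_error = max(
    abs(grad[i][j] - B_transpose[i][j])
    for i in range(len(A))
    for j in range(len(A[0]))
)
print(max_error)
```

Since $\mathrm{tr} AB$ is linear in $A$, the finite-difference estimate matches $B^T$ up to floating-point rounding.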

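15 Revisiting Least Squares
Assume
\[ X = \begin{bmatrix} (x^{(1)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}, \quad Y = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{bmatrix} \]
Therefore, we have
\[ X\theta - Y = \begin{bmatrix} (x^{(1)})^T \theta - y^{(1)} \\ \vdots \\ (x^{(m)})^T \theta - y^{(m)} \end{bmatrix} = \begin{bmatrix} h_\theta(x^{(1)}) - y^{(1)} \\ \vdots \\ h_\theta(x^{(m)}) - y^{(m)} \end{bmatrix} \]
\[ J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{2} (X\theta - Y)^T (X\theta - Y) \]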

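16 Revisiting Least Squares (Contd.)
Minimize $J(\theta) = \frac{1}{2} (Y - X\theta)^T (Y - X\theta)$.
Calculate its derivative with respect to $\theta$:
\begin{align*}
\nabla_\theta J(\theta) &= \nabla_\theta \frac{1}{2} (Y - X\theta)^T (Y - X\theta) \\
&= \frac{1}{2} \nabla_\theta (Y^T - \theta^T X^T)(Y - X\theta) \\
&= \frac{1}{2} \nabla_\theta \mathrm{tr}\left( Y^T Y - Y^T X\theta - \theta^T X^T Y + \theta^T X^T X \theta \right) \\
&= \frac{1}{2} \nabla_\theta \mathrm{tr}\left( \theta^T X^T X \theta \right) - X^T Y \\
&= \frac{1}{2} \left( X^T X \theta + X^T X \theta \right) - X^T Y \\
&= X^T X \theta - X^T Y
\end{align*}
Tip: $\nabla_{A^T} \mathrm{tr} ABA^T C = B^T A^T C^T + B A^T C$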

17 Revisiting Least Squares (Contd.)
Setting $\nabla_\theta J(\theta) = 0$ gives the normal equations $X^T X \theta = X^T Y$.
Theorem: The matrix $A^T A$ is invertible if and only if the columns of $A$ are linearly independent. In this case, there exists only one least-squares solution
\[ \theta = (X^T X)^{-1} X^T Y \]
Prove the above theorem in the problem set.
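
A minimal sketch of the closed-form solution in plain Python. Rather than forming the inverse explicitly, it solves the linear system $X^T X \theta = X^T Y$ by Gaussian elimination, which yields the same $\theta$ when $X^T X$ is invertible. The function names, the singularity threshold, and the toy data are assumptions.

```python
def solve_linear_system(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]   # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:               # assumption: ad hoc threshold
            raise ValueError("matrix is singular (linearly dependent columns)")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back substitution
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z


def normal_equation(xs, ys):
    """Least-squares theta from X^T X theta = X^T Y, with x_0 = 1 prepended."""
    rows = [[1.0] + list(x) for x in xs]
    n = len(rows[0])
    XtX = [[sum(row[a] * row[b] for row in rows) for b in range(n)] for a in range(n)]
    XtY = [sum(row[a] * y for row, y in zip(rows, ys)) for a in range(n)]
    return solve_linear_system(XtX, XtY)


# Hypothetical noiseless data from y = 1 + 2x.
theta = normal_equation([[0.0], [1.0], [2.0], [3.0]], [1.0, 3.0, 5.0, 7.0])
print(theta)
```

If the columns of $X$ are linearly dependent (for example, when every sample has the same feature value), $X^T X$ is singular and the sketch raises a `ValueError`, in line with the theorem.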

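18 Probabilistic Interpretation
The target variables and the inputs are related via
\[ y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)} \]
The $\epsilon^{(i)}$'s denote the errors and are independently and identically distributed (i.i.d.) according to a Gaussian distribution $\mathcal{N}(0, \sigma^2)$.
The density of $\epsilon^{(i)}$ is given by
\[ p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(\epsilon^{(i)})^2}{2\sigma^2} \right) \]
Equivalently,
\[ p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right) \]
The distribution of $y^{(i)}$ given $x^{(i)}$, parameterized by $\theta$:
\[ y^{(i)} \mid x^{(i)}; \theta \sim \mathcal{N}(\theta^T x^{(i)}, \sigma^2) \]
Since $Y = X\theta + \epsilon$, what is the distribution of $Y$ given $X$ and $\theta$?
The probability of the data is given by $p(Y \mid X; \theta)$.
Likelihood function: $L(\theta) = L(\theta; X, Y) = p(Y \mid X; \theta)$.
Since the $\epsilon^{(i)}$'s are i.i.d.,
\[ L(\theta) = \prod_{i} p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i} \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right) \]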

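19 Probabilistic Interpretation (Contd.)
Maximizing the likelihood $L(\theta)$: choose $\theta$ so that the observed data are as probable as possible.
Since $L(\theta)$ is complicated, we maximize an increasing function of $L(\theta)$ instead:
\begin{align*}
\ell(\theta) &= \log L(\theta) \\
&= \log \prod_{i}^{m} \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right) \\
&= \sum_{i}^{m} \log \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right) \\
&= m \log \frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i}^{m} (y^{(i)} - \theta^T x^{(i)})^2
\end{align*}
Evidently, maximizing $L(\theta)$ (and thus $\ell(\theta)$) is equivalent to minimizing
\[ \frac{1}{2} \sum_{i}^{m} (y^{(i)} - \theta^T x^{(i)})^2 \]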
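
The last step of the derivation can be checked numerically: summing the log of each Gaussian density should agree with $m \log \frac{1}{\sqrt{2\pi}\sigma} - J(\theta)/\sigma^2$, where $J(\theta)$ is the least-squares cost. The function names, the value of $\sigma$, and the toy data below are assumptions for illustration.

```python
import math


def residuals(theta, xs, ys):
    """y^(i) - theta^T x^(i) for every sample, with x_0 = 1 prepended."""
    return [
        y - sum(t * f for t, f in zip(theta, [1.0] + list(x)))
        for x, y in zip(xs, ys)
    ]


def log_likelihood_direct(theta, xs, ys, sigma):
    """log L(theta), computed as the sum of the log Gaussian densities."""
    total = 0.0
    for r in residuals(theta, xs, ys):
        density = math.exp(-r ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
        total += math.log(density)
    return total


def log_likelihood_closed_form(theta, xs, ys, sigma):
    """m * log(1 / (sqrt(2 pi) sigma)) - J(theta) / sigma^2."""
    m = len(xs)
    J = 0.5 * sum(r ** 2 for r in residuals(theta, xs, ys))
    return m * math.log(1.0 / (math.sqrt(2 * math.pi) * sigma)) - J / sigma ** 2


# Hypothetical noisy data scattered around y = 1 + 2x.
xs = [[0.0], [1.0], [2.0]]
ys = [1.1, 2.9, 5.2]
sigma = 0.5

good = [1.0, 2.0]   # small least-squares cost
bad = [0.0, 2.0]    # larger least-squares cost
print(log_likelihood_direct(good, xs, ys, sigma), log_likelihood_closed_form(good, xs, ys, sigma))
print(log_likelihood_direct(bad, xs, ys, sigma), log_likelihood_closed_form(bad, xs, ys, sigma))
```

The two computations agree up to rounding, and the parameter vector with the smaller squared error has the larger log-likelihood.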

20 Locally Weighted Linear Regression (LWR)
Underfitting and overfitting.
Example: $y = \sum_{j=0}^{1} \theta_j x^j$, $y = \sum_{j=0}^{2} \theta_j x^j$, and $y = \sum_{j=0}^{5} \theta_j x^j$.
Choosing features is important for guaranteeing the performance of learning algorithms.
For linear regression, what can we do?
LWR: fit $\theta$ to minimize $\sum_i \delta_i (y^{(i)} - \theta^T x^{(i)})^2$, and then output $\theta^T x$.
The $\delta_i$'s are non-negative weights:
\[ \delta_i = \exp\left( -\frac{(x^{(i)} - x)^2}{2\tau^2} \right) \]
If $|x^{(i)} - x|$ is small, $\delta_i$ is close to 1; if it is large, $\delta_i$ is close to 0.
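
A minimal sketch of LWR for a single scalar feature plus intercept, in plain Python. For each query point it computes the weights $\delta_i$ and solves the 2 x 2 weighted normal equations in closed form. The function name, the default bandwidth `tau`, and the toy data are assumptions; the lecture does not prescribe a solver.

```python
import math


def lwr_predict(query, xs, ys, tau=0.5):
    """Locally weighted prediction theta_0 + theta_1 * query for scalar inputs."""
    # delta_i = exp(-(x^(i) - x)^2 / (2 tau^2)), recomputed for every query point.
    weights = [math.exp(-((x - query) ** 2) / (2 * tau ** 2)) for x in xs]
    s_w = sum(weights)
    s_x = sum(w * x for w, x in zip(weights, xs))
    s_xx = sum(w * x * x for w, x in zip(weights, xs))
    s_y = sum(w * y for w, y in zip(weights, ys))
    s_xy = sum(w * x * y for w, x, y in zip(weights, xs, ys))
    # Weighted normal equations: [[s_w, s_x], [s_x, s_xx]] [theta_0, theta_1]^T = [s_y, s_xy]^T
    det = s_w * s_xx - s_x ** 2
    theta_0 = (s_xx * s_y - s_x * s_xy) / det
    theta_1 = (s_w * s_xy - s_x * s_y) / det
    return theta_0 + theta_1 * query


# Hypothetical data from the nonlinear function y = x^2 on a grid over [0, 4].
xs = [0.5 * k for k in range(9)]
ys = [x ** 2 for x in xs]

local_prediction = lwr_predict(2.0, xs, ys, tau=0.5)
global_prediction = lwr_predict(2.0, xs, ys, tau=1e6)   # near-uniform weights: ordinary least squares
print(local_prediction, global_prediction)
```

With a very large `tau` all weights are close to 1 and the fit reduces to ordinary linear regression, which underfits the curve; a small `tau` emphasizes nearby samples and tracks the local shape more closely. Note that the fit is redone for every query point.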

21 Some Remarks
LWR is a non-parametric algorithm: we need to keep the entire training set around. In contrast, (unweighted) linear regression is a parametric learning algorithm, as it has a fixed, finite number of parameters (i.e., the $\theta$'s).
In a parametric algorithm, once the parameters are determined, there is no need to keep the training set around; in a non-parametric algorithm, however, we have to rerun the learning algorithm for each new input.

22 Thanks! Q & A
