Big Data Analytics. Lucas Rego Drumond


1 Big Data Analytics. Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), Institute of Computer Science, University of Hildesheim, Germany. Predictive Models

2 Outline: 0. Review, 1. Learning Algorithms, 2. Does more data help?

3 0. Review. Outline: 0. Review, 1. Learning Algorithms, 2. Does more data help?

4 0. Review: Prediction Problem, Formally
Let $X$ be any set (the predictor space) and $Y$ be any set (the target space).
Task: Given some training data $D^{\text{train}} \subseteq X \times Y$ and a loss function $\ell: Y \times Y \to \mathbb{R}$ that measures how bad a prediction $\hat{y}$ is if the true value is $y$, compute a prediction function $\hat{y}: X \to Y$ such that the empirical risk is minimal:
$$\text{risk}(\hat{y}, D^{\text{test}}) := \frac{1}{|D^{\text{test}}|} \sum_{(x,y) \in D^{\text{test}}} \ell(y, \hat{y}(x))$$
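As a quick illustration (not from the slides), a minimal NumPy sketch of the empirical risk for a squared loss could look like this; all names and the toy data are our own:

```python
import numpy as np

def empirical_risk(predict, data, loss):
    """Average loss of a prediction function over a dataset of (x, y) pairs."""
    return np.mean([loss(y, predict(x)) for x, y in data])

# Toy usage with squared loss and a fixed linear predictor:
squared_loss = lambda y, y_pred: (y_pred - y) ** 2
beta = np.array([0.5, -0.2])
data = [(np.array([1.0, 2.0]), 0.3), (np.array([0.0, 1.0]), -0.1)]
print(empirical_risk(lambda x: x @ beta, data, squared_loss))
```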

5 0. Review: Regularization
Now estimating the parameters $\beta$ is done by solving the following optimization task:
$$\arg\min_{\beta} \sum_{(x,y) \in D^{\text{train}}} \ell(y, \hat{y}(x; \beta)) + \lambda R(\beta)$$
When solving a prediction task we need to define the following components:
- A prediction function $\hat{y}(x; \beta)$
- A loss function $\ell(y, \hat{y}(x; \beta))$
- A regularization function $R(\beta)$
- A learning algorithm to solve the optimization task above.

6 Outline: 0. Review, 1. Learning Algorithms, 2. Does more data help?

7 Learning Algorithms
Learning a model means estimating the parameters $\hat{\beta}$ that minimize the loss function on the training data:
$$\hat{\beta} := \arg\min_{\beta} \sum_{(x,y) \in D^{\text{train}}} \ell(y, \hat{y}(x; \beta)) + \lambda R(\beta)$$
for a fixed $\lambda \in \mathbb{R}^+_0$.
Today we will see four different approaches to this:
- Computing the closed form solution
- Gradient Descent
- Stochastic Gradient Descent
- Newton's Method

8 Closed Form Solution
Let $f$ be a continuously differentiable convex function; its minimum is attained at the point $x^* \in \operatorname{dom} f$ where $\nabla f(x^*) = 0$.
Thus, finding the closed form solution for the learning problem
$$\hat{\beta} := \arg\min_{\beta} \sum_{(x,y) \in D^{\text{train}}} \ell(y, \hat{y}(x; \beta)) + \lambda R(\beta)$$
can be done by solving the following equation for $\beta$:
$$\nabla_{\beta} \sum_{(x,y) \in D^{\text{train}}} \ell(y, \hat{y}(x; \beta)) + \lambda \nabla_{\beta} R(\beta) = 0$$

9 Practical Example: Household Spending
If we have data about $m$ instances, each with $n$ features, we can represent it as:
$$X_{m,n} = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,n} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,n} \\ \vdots & & & \vdots \\ x_{m,1} & x_{m,2} & \cdots & x_{m,n} \end{pmatrix} \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}$$
Now let us assume we use a linear model $\hat{y}(x_i)$ as a prediction function. Example:
$$\hat{y}(x_i) = \beta^T x_i = \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,3} + \beta_4 x_{i,4}$$

10 Closed Form Solution: Ridge Regression
Let us take the following example:
- Loss function: $\ell(y, \hat{y}(x; \beta)) = (\hat{y}(x; \beta) - y)^2$
- Regularization: $R(\beta) = \|\beta\|_2^2$
- Prediction function: $\hat{y}(x; \beta) = x^T \beta$
This is often called Ridge Regression:
$$\hat{\beta} := \arg\min_{\beta} \sum_{(x,y) \in D^{\text{train}}} (x^T \beta - y)^2 + \lambda \|\beta\|_2^2$$
which can be rewritten as:
$$\hat{\beta} := \arg\min_{\beta} \|X\beta - y\|_2^2 + \lambda \|\beta\|_2^2$$

11 Closed Form Solution: Ridge Regression
$$\hat{\beta} := \arg\min_{\beta} \|X\beta - y\|_2^2 + \lambda \|\beta\|_2^2$$
The closed form solution is computed as:
$$\nabla_{\beta} \left( \|X\beta - y\|_2^2 + \lambda \|\beta\|_2^2 \right) = 0$$
$$2X^T (X\hat{\beta} - y) + 2\lambda \hat{\beta} = 0$$
$$X^T X \hat{\beta} - X^T y + \lambda \hat{\beta} = 0$$
$$X^T X \hat{\beta} + \lambda \hat{\beta} = X^T y$$
$$\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y$$

12 Closed Form Solution: Ridge Regression
The optimal parameters for Ridge Regression are given by
$$\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y$$
or alternatively by solving the following system of equations:
$$(X^T X + \lambda I)\,\hat{\beta} = X^T y$$
However:
- Computing the closed form solution is not trivial (or even feasible) for other problem settings.
- For cases other than Ridge Regression we need to find the solution numerically.
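To make the two variants above concrete, here is a minimal NumPy sketch (our own, not from the slides); solving the linear system is preferred over forming the explicit inverse, and the function name and synthetic data are illustrative:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lambda I) beta = X^T y rather than inverting the matrix."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Toy usage on synthetic data:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
beta_true = np.arange(1.0, 6.0)
y = X @ beta_true + 0.1 * rng.normal(size=100)
print(ridge_closed_form(X, y, lam=0.1))  # close to [1, 2, 3, 4, 5]
```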

13 Logistic Regression
Let us look at another example:
- Loss function: $\ell(y, \hat{y}(x; \beta)) = -y \log \hat{y}(x; \beta) - (1 - y) \log(1 - \hat{y}(x; \beta))$
- Regularization: $R(\beta) = \|\beta\|_2^2$
- Prediction function: $\hat{y}(x; \beta) = \operatorname{logistic}(x^T \beta) = \frac{1}{1 + e^{-x^T \beta}}$
This is a classification approach called Logistic Regression:
$$\hat{\beta} := \arg\min_{\beta} \sum_{(x,y) \in D^{\text{train}}} -y \log \hat{y}(x; \beta) - (1 - y) \log(1 - \hat{y}(x; \beta)) + \lambda \|\beta\|_2^2$$
A closed form solution will not work!
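A small sketch of the prediction function and loss above, assuming binary labels $y \in \{0, 1\}$; the clipping constant is our own addition to avoid $\log(0)$ and is not in the slides:

```python
import numpy as np

def logistic(z):
    """The logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, y_pred, eps=1e-12):
    """Loss -y log(y_hat) - (1 - y) log(1 - y_hat) for a binary label y."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # numerical guard, not in the slides
    return -y * np.log(y_pred) - (1 - y) * np.log(1 - y_pred)

x, beta = np.array([0.5, -1.2]), np.array([2.0, 0.3])
print(log_loss(1, logistic(x @ beta)))
```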

14 Descent Methods
Choose an initial point $\beta^{(0)} \in \mathbb{R}^n$. The next point is generated using a step size $\mu$ and a direction $\Delta\beta$ such that $f(\beta^{(t)} + \mu \Delta\beta^{(t)}) < f(\beta^{(t)})$.
1: procedure DescentMethod(f)
2:   get initial point $\beta^{(0)}$
3:   $t \leftarrow 0$
4:   repeat
5:     get update direction $\Delta\beta^{(t)}$
6:     get step size $\mu$
7:     $\beta^{(t+1)} \leftarrow \beta^{(t)} + \mu \Delta\beta^{(t)}$
8:     $t \leftarrow t + 1$
9:   until convergence
10:  return $\beta$, $f(\beta)$
11: end procedure

15 Gradient Descent
The gradient $\nabla f(\beta)$ of a function $f: \mathbb{R}^n \to \mathbb{R}$ at $\beta$ points in the direction in which the function grows maximally at $\beta$. Gradient Descent is a descent algorithm that searches in the opposite direction of the gradient:
$$\Delta\beta = -\nabla f(\beta)$$

16 Gradient Descent
1: procedure GradientDescent(f, step size $\mu$, stopping criterion $\epsilon$)
2:   get initial point $\beta$
3:   repeat
4:     $\beta \leftarrow \beta - \mu \nabla f(\beta)$
5:   until $\|\nabla f(\beta)\| < \epsilon$
6:   return $\beta$, $f(\beta)$
7: end procedure
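A direct NumPy translation of this procedure could look like the sketch below; the default step size and the iteration cap are illustrative choices of our own:

```python
import numpy as np

def gradient_descent(grad_f, beta0, mu=0.1, eps=1e-6, max_iter=10_000):
    """Repeat beta <- beta - mu * grad f(beta) until the gradient norm drops below eps."""
    beta = np.asarray(beta0, dtype=float).copy()
    for _ in range(max_iter):
        g = grad_f(beta)
        if np.linalg.norm(g) < eps:
            break
        beta = beta - mu * g
    return beta

# Minimizing f(beta) = ||beta - 3||^2, whose gradient is 2 (beta - 3):
print(gradient_descent(lambda b: 2 * (b - 3.0), np.zeros(2)))  # approx [3, 3]
```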

17 Gradient Descent: Computing the Gradients
$$f(\beta) = \sum_{(x,y) \in D^{\text{train}}} \ell(y, \hat{y}(x; \beta)) + \lambda R(\beta)$$
Based on the chain rule, we can write:
$$\frac{\partial f(\beta)}{\partial \beta} = \sum_{(x,y) \in D^{\text{train}}} \frac{\partial \ell(y, \hat{y})}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial \beta} + \lambda \frac{\partial R(\beta)}{\partial \beta}$$

18 Logistic Regression: Computing the Gradients
$$\frac{\partial f(\beta)}{\partial \beta} = \sum_{(x,y) \in D^{\text{train}}} \frac{\partial \ell(y, \hat{y})}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial \beta} + \lambda \frac{\partial R(\beta)}{\partial \beta}$$
For Logistic Regression:
- Loss function: $\ell(y, \hat{y}(x; \beta)) = -y \log \hat{y}(x; \beta) - (1 - y) \log(1 - \hat{y}(x; \beta))$
- Regularization: $R(\beta) = \|\beta\|_2^2 = \beta^T \beta$
- Prediction function: $\hat{y}(x; \beta) = \operatorname{logistic}(x^T \beta) = \frac{1}{1 + e^{-x^T \beta}}$
The partial derivatives are:
$$\frac{\partial \hat{y}}{\partial \beta} = \operatorname{logistic}(x^T \beta)\,(1 - \operatorname{logistic}(x^T \beta))\,x = \hat{y}(1 - \hat{y})\,x$$
$$\frac{\partial R(\beta)}{\partial \beta} = 2\beta$$

19 Logistic Regression: Computing the Gradients
$$\frac{\partial f(\beta)}{\partial \beta} = \sum_{(x,y) \in D^{\text{train}}} \frac{\partial \ell(y, \hat{y})}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial \beta} + \lambda \frac{\partial R(\beta)}{\partial \beta}$$
For Logistic Regression, with $\ell(y, \hat{y}) = -y \log \hat{y} - (1 - y) \log(1 - \hat{y})$:
$$\frac{\partial \ell(y, \hat{y})}{\partial \hat{y}} = -y \frac{1}{\hat{y}} + (1 - y) \frac{1}{1 - \hat{y}} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}$$

20 Logistic Regression: Computing the Gradients
$$\frac{\partial f(\beta)}{\partial \beta} = \sum_{(x,y) \in D^{\text{train}}} \frac{\partial \ell(y, \hat{y})}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial \beta} + \lambda \frac{\partial R(\beta)}{\partial \beta}$$
Substituting the results from the previous slides:
$$\frac{\partial f(\beta)}{\partial \beta} = \sum_{(x,y) \in D^{\text{train}}} \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}\, \hat{y}(1 - \hat{y})\,x + 2\lambda\beta = \sum_{(x,y) \in D^{\text{train}}} (\hat{y} - y)\,x + 2\lambda\beta$$
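In vectorized form over a design matrix $X$, the gradient above takes a few lines; this sketch is our own, reusing the `logistic` helper from earlier. Plugged into the `gradient_descent` sketch above, it gives full-batch training for regularized logistic regression:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_logreg(beta, X, y, lam):
    """Sum over instances of (y_hat - y) x, plus the regularization term 2 lambda beta."""
    y_pred = logistic(X @ beta)              # y_hat for every training instance
    return X.T @ (y_pred - y) + 2 * lam * beta
```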

21 Gradient Descent: Considerations
- Stopping criterion: $\|\nabla f(\beta)\|_2 \le \epsilon$
- Simple and straightforward
- Usually slow convergence
- Works well only for convex problems; otherwise it gets stuck in local minima
- Rarely used in practice

22 Newton's Step
Let $f: \mathbb{R}^n \to \mathbb{R}$ be a twice differentiable convex function. Newton's step uses the inverse of the Hessian matrix $\nabla^2 f(\beta)^{-1}$ and the gradient $\nabla f(\beta)$:
$$\Delta\beta_{\text{Newton}} = -\nabla^2 f(\beta)^{-1} \nabla f(\beta)$$
In practice the Hessian is never inverted. The step is computed by solving the following system of equations:
$$\nabla^2 f(\beta)\, \Delta\beta_{\text{Newton}} = -\nabla f(\beta)$$

23 Newton's Method
Newton's method can then be written without the inverse of the Hessian as follows. Repeat until convergence:
1. Solve $\nabla^2 f(\beta)\, \Delta\beta = -\nabla f(\beta)$ for $\Delta\beta$
2. Get step size $\mu$ (line search)
3. Update $\beta$: $\beta \leftarrow \beta + \mu \Delta\beta$

24 Newton's Method
1: procedure NewtonsMethod(f)
2:   get initial point $\beta$
3:   repeat
4:     $\Delta\beta \leftarrow$ solve $\nabla^2 f(\beta)\, \Delta\beta = -\nabla f(\beta)$
5:     get step size $\mu$
6:     $\beta \leftarrow \beta + \mu \Delta\beta$
7:   until convergence
8:   return $\beta$, $f(\beta)$
9: end procedure

25 Logistic Regression: Computing the Newton Step
$$\nabla f(\beta) = \sum_{(x,y) \in D^{\text{train}}} (\hat{y} - y)\,x + 2\lambda\beta$$
The Hessian $\nabla^2 f(\beta)$ is a matrix where each cell is given by:
$$\frac{\partial^2 f(\beta)}{\partial \beta_i \partial \beta_j} = \sum_{(x,y) \in D^{\text{train}}} x_i x_j\, \hat{y}(1 - \hat{y}) \qquad (i \neq j)$$
$$\frac{\partial^2 f(\beta)}{\partial \beta_i^2} = \sum_{(x,y) \in D^{\text{train}}} x_i^2\, \hat{y}(1 - \hat{y}) + 2\lambda$$
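Putting gradient and Hessian together: in matrix form the Hessian is $X^T \operatorname{diag}(\hat{y}(1-\hat{y}))\, X + 2\lambda I$, which suggests the following sketch of Newton's method for logistic regression. The code is our own and uses a fixed full step ($\mu = 1$) instead of a line search:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logreg(X, y, lam, n_iter=20):
    """Each iteration solves H dbeta = -grad and updates beta <- beta + dbeta."""
    m, n = X.shape
    beta = np.zeros(n)
    for _ in range(n_iter):
        p = logistic(X @ beta)                         # y_hat per instance
        grad = X.T @ (p - y) + 2 * lam * beta
        w = p * (1 - p)                                # y_hat (1 - y_hat) weights
        H = (X * w[:, None]).T @ X + 2 * lam * np.eye(n)
        beta += np.linalg.solve(H, -grad)              # full Newton step (mu = 1)
    return beta
```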

26 Stochastic Gradient Descent
If we can rewrite the objective function as a big sum
$$f(\beta) = \sum_{i=1}^{m} f_i(\beta), \qquad f_i(\beta) = \ell(y_i, \hat{y}(x_i; \beta)) + \frac{\lambda}{m} R(\beta)$$
we can define the following update rule:
- Pick a random instance $i \sim \text{Uniform}(1, m)$
- Update $\beta$: $\beta \leftarrow \beta - \mu\, \nabla_{\beta} f_i(\beta)$

27 Stochastic Gradient Descent (SGD)
1: procedure StochasticGradientDescent(f, $\mu$)
2:   get initial point $\beta$
3:   repeat
4:     for $i \leftarrow 1, \ldots, m$ (in a random order) do
5:       $\beta \leftarrow \beta - \mu \nabla f_i(\beta)$
6:     end for
7:   until convergence
8:   return $\beta$, $f(\beta)$
9: end procedure

28 Logistic Regression: SGD Update
$$f_i(\beta) = -y_i \log \hat{y}(x_i; \beta) - (1 - y_i) \log(1 - \hat{y}(x_i; \beta)) + \frac{\lambda}{m} \|\beta\|_2^2$$
The gradient for the update rule is given by:
$$\nabla f_i(\beta) = (\hat{y}(x_i; \beta) - y_i)\,x_i + 2\frac{\lambda}{m}\beta$$
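A sketch of the SGD procedure from slide 27 specialized to this update; the step size and epoch count are illustrative defaults of our own:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logreg(X, y, lam, mu=0.1, n_epochs=50, seed=0):
    """Visit instances in random order and apply beta <- beta - mu * grad f_i(beta)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    beta = np.zeros(n)
    for _ in range(n_epochs):
        for i in rng.permutation(m):                  # random order each epoch
            grad_i = (logistic(X[i] @ beta) - y[i]) * X[i] + 2 * (lam / m) * beta
            beta -= mu * grad_i
    return beta
```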

29 Accelerating SGD: AdaGrad
We have $f(\beta) = \sum_{i=1}^{m} f_i(\beta)$.
Update rule:
- Pick a random instance $i \sim \text{Uniform}(1, m)$
- Compute the gradient $\nabla_{\beta} f_i(\beta)$
- Update the gradient history: $h \leftarrow h + \nabla_{\beta} f_i(\beta) \odot \nabla_{\beta} f_i(\beta)$
- The step size for parameter $\beta_j$ is $\frac{\mu}{\sqrt{h_j}}$
- Update: $\beta \leftarrow \beta - \frac{\mu}{\sqrt{h}} \odot \nabla_{\beta} f_i(\beta)$
Here $\odot$ denotes the elementwise product.

30 SGD with AdaGrad
1: procedure AdaGradSGD(f, $\mu$)
2:   get initial point $\beta$
3:   $h \leftarrow 0$
4:   repeat
5:     for $i \leftarrow 1, \ldots, m$ do
6:       $h \leftarrow h + \nabla_{\beta} f_i(\beta) \odot \nabla_{\beta} f_i(\beta)$
7:       $\beta \leftarrow \beta - \frac{\mu}{\sqrt{h}} \odot \nabla f_i(\beta)$
8:     end for
9:   until convergence
10:  return $\beta$, $f(\beta)$
11: end procedure
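The same loop as the SGD sketch, with AdaGrad's per-parameter step sizes; the small `eps` in the denominator is a common safeguard we add ourselves (the slides divide by $\sqrt{h}$ directly):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def adagrad_logreg(X, y, lam, mu=0.1, n_epochs=50, eps=1e-8, seed=0):
    """SGD where parameter j gets step size mu / sqrt(h_j), h the squared-gradient history."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    beta, h = np.zeros(n), np.zeros(n)
    for _ in range(n_epochs):
        for i in rng.permutation(m):
            g = (logistic(X[i] @ beta) - y[i]) * X[i] + 2 * (lam / m) * beta
            h += g * g                                # elementwise product, as on slide 29
            beta -= mu / np.sqrt(h + eps) * g         # eps avoids division by zero
    return beta
```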

31 Real World Dataset: Body Fat Prediction
We want to estimate the percentage of body fat based on various attributes:
- Age (years)
- Weight (lbs)
- Height (inches)
- Neck circumference (cm)
- Chest circumference (cm)
- Abdomen 2 circumference (cm)
- Hip circumference (cm)
- Thigh circumference (cm)
- Knee circumference (cm)
- ...

32 Real World Dataset: Body Fat Prediction
The data is represented as:
$$X_{m,n} = \begin{pmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,n} \\ 1 & x_{2,1} & x_{2,2} & \cdots & x_{2,n} \\ \vdots & & & & \vdots \\ 1 & x_{m,1} & x_{m,2} & \cdots & x_{m,n} \end{pmatrix} \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}$$
with $m = 252$, $n = 14$.
We can model the percentage of body fat $y$ as a linear combination of the body measurements with parameters $\beta$:
$$\hat{y}_i = \beta^T x_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_n x_{i,n}$$
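As a sketch, prepending the column of ones (so that $\beta_0$ acts as the intercept) is one line in NumPy; the feature values below are made up for illustration:

```python
import numpy as np

X_raw = np.array([[23.0, 154.25, 67.75],   # age, weight, height (illustrative values)
                  [22.0, 173.25, 72.25]])
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])  # prepend bias column of ones
print(X)
```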

33 SGD vs GD: Body Fat Dataset
[Figure: MSE over iterations for SGD and GD on the Body Fat dataset.]

34 Year Prediction Data Set
A least squares problem: prediction of the release year of a song from audio features.
- 90 features
- Experiments done on a subset of 1000 instances of the data

35 GD Step Size: Year Prediction
[Figure: MSE over iterations for Gradient Descent with different step sizes on the Year Prediction data.]

36 SGD Step Size: Year Prediction
[Figure: MSE over iterations for SGD with different step sizes on the Year Prediction data.]

37 AdaGrad Step Size: Year Prediction
[Figure: MSE over iterations for AdaGrad with different step sizes on the Year Prediction data.]

38 AdaGrad vs SGD vs GD: Year Prediction
[Figure: MSE over iterations comparing AdaGrad, GD, and SGD on the Year Prediction data.]

39 2. Does more data help? Outline: 0. Review, 1. Learning Algorithms, 2. Does more data help?
