CS60021: Scalable Data Mining. Large Scale Machine Learning

Slide 1: CS60021: Scalable Data Mining. Large Scale Machine Learning. Sourangshu Bhattacharya. Slides based on J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets.

Slide 2: Example: Spam filtering. Instance space x ∈ X (|X| = n data points); binary or real-valued feature vector x of word occurrences; d features (words + other things, d ≈ 100,000). Class y ∈ Y; y: Spam (+1), Ham (-1). Goal: estimate a function f(x) so that y = f(x).

Slide 3: We would like to do prediction: estimate a function f(x) so that y = f(x), where y can be a real number (regression), categorical (classification), or a complex object (ranking of items, a parse tree, etc.). The data is labeled: we have many pairs {(x, y)}, where x is a vector of binary, categorical, or real-valued features and y is a class ({+1, -1}) or a real number.

Slide 4: Task: given data (X, Y), build a model f() to predict Y based on X. Strategy: estimate y = f(x) on the training data (X, Y) and hope that the same f(x) also works to predict the unknown Y of the test data. This hope is called generalization. Overfitting: f(x) predicts the training Y well but is unable to predict the unseen Y. We want to build a model that generalizes well to unseen data. But Jure, how can we do well on data we have never seen before?!

Slide 5: Idea: pretend we do not know the data/labels we actually do know. Build the model f(x) on the training data and see how well f(x) does on the test data; if it does well, then apply it also to the unknown X. Refinement: cross-validation. Splitting into a single training/validation set is brutal; instead, split the data (X, Y) into 10 folds (buckets), take out 1 fold for validation, train on the remaining 9, repeat this 10 times, and report the average performance.
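
As an illustration (not from the original slides), here is a minimal sketch of 10-fold cross-validation using only NumPy; the `train` and `evaluate` callables are hypothetical placeholders for whatever model is being fit.

```python
import numpy as np

def k_fold_cv(X, Y, train, evaluate, k=10, seed=0):
    """Split (X, Y) into k folds, hold out one fold for validation each time,
    train on the remaining k-1 folds, and return the average score."""
    n = len(Y)
    idx = np.random.RandomState(seed).permutation(n)
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train(X[train_idx], Y[train_idx])                 # fit on 9 folds
        scores.append(evaluate(model, X[val_idx], Y[val_idx]))    # validate on 1 fold
    return np.mean(scores)
```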

Slide 6: Binary classification:
f(x) = +1 if w^(1) x^(1) + w^(2) x^(2) + ... + w^(d) x^(d) ≥ θ, and -1 otherwise.
Input: vectors x_j and labels y_j; the vectors x_j are real-valued with ‖x‖_2 = 1. Goal: find the vector w = (w^(1), w^(2), ..., w^(d)), where each w^(i) is a real number. Note: the threshold can be absorbed by mapping x → [x, 1] and w → [w, -θ]. The decision boundary is linear.

Slide 7: Want to estimate w and b:
min_{w,b} (1/2) w·w + C Σ_{i=1}^n ξ_i
s.t. ∀i: y_i (x_i·w + b) ≥ 1 - ξ_i
Standard way: use a solver! A solver is software for finding solutions to common optimization problems; here, a quadratic solver that minimizes a quadratic function subject to linear constraints. Problem: solvers are inefficient for big data!

Slide 8: Want to estimate w, b! Alternative approach: minimize f(w, b) directly, where
f(w, b) = (1/2) Σ_{j=1}^d (w^(j))² + C Σ_{i=1}^n max{0, 1 - y_i (w·x_i + b)}
This is the unconstrained form of
min_{w,b} (1/2) w·w + C Σ_{i=1}^n ξ_i  s.t. ∀i: y_i (x_i·w + b) ≥ 1 - ξ_i.
Side note: how do we minimize a convex function g(z)? Use gradient descent for min_z g(z): iterate z_{t+1} ← z_t - η ∇g(z_t).
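
For concreteness (not from the original slides), a minimal NumPy sketch of evaluating this hinge-loss objective; here `X` is assumed to be the n×d data matrix and `y` the vector of ±1 labels.

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    """f(w, b) = 0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i (w . x_i + b))."""
    margins = y * (X @ w + b)                  # y_i (w . x_i + b) for all i
    hinge = np.maximum(0.0, 1.0 - margins)     # per-example hinge loss
    return 0.5 * np.dot(w, w) + C * hinge.sum()
```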

Slide 9: Find the minimum or maximum of an objective function given a set of constraints:
arg min_x f_0(x)
s.t. f_i(x) ≤ 0, i = 1, ..., k
h_j(x) = 0, j = 1, ..., l

Slide 10: Examples.
Linear classification: arg min_{w,ξ} ‖w‖² + C Σ_{i=1}^n ξ_i  s.t. 1 - y_i x_i^T w ≤ ξ_i, ξ_i ≥ 0.
Maximum likelihood: arg max_θ Σ_{i=1}^n log p_θ(x_i).
K-means: arg min_{μ_1,...,μ_k} J(μ) = Σ_{j=1}^k Σ_{i∈C_j} ‖x_i - μ_j‖².

Slide 11: Local (non-global) minima and maxima. [Figure from Duchi (UC Berkeley), Convex Optimization for Machine Learning.]

Slide 12: A function f : R^n → R is convex if for all x, y ∈ dom f and any α ∈ [0, 1],
f(αx + (1-α)y) ≤ αf(x) + (1-α)f(y).
A set C ⊆ R^n is convex if for all x, y ∈ C and any α ∈ [0, 1],
αx + (1-α)y ∈ C.
[Figure: the chord αf(x) + (1-α)f(y) lies above the function between x and y.]

Slide 13: Examples of convex losses.
SVM (hinge) loss: f(w) = [1 - y_i x_i^T w]_+, where [z]_+ = max{0, z}.
Binary logistic loss: f(w) = log(1 + exp(-y_i x_i^T w)).

Slide 14: A convex optimization problem:
minimize_x f_0(x)  (convex function)
s.t. f_i(x) ≤ 0  (convex sets)
h_j(x) = 0  (affine)

Slide 15: (figure only)

Slide 16: Gradient descent: the simplest algorithm in the world (almost). Goal: minimize_x f(x). Just iterate
x_{t+1} = x_t - η_t ∇f(x_t),
where η_t is the step size.
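
A minimal sketch (not from the slides) of this iteration in NumPy, assuming a callable `grad` that returns ∇f(x) and a fixed step size:

```python
import numpy as np

def gradient_descent(grad, x0, eta=0.1, iters=100):
    """Iterate x_{t+1} = x_t - eta * grad(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - eta * grad(x)
    return x

# Example: minimize f(x) = ||x - 3||^2, whose gradient is 2(x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3.0), x0=[0.0], eta=0.1)
```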

Slide 17: [Figure: at the point (x_t, f(x_t)), the linear approximation f(x_t) + ∇f(x_t)^T (x - x_t); the gradient step of size η moves downhill along this slope.]

Slide 18: (figure only)

Slide 19: Second-order methods (Newton's method). Idea: use a second-order approximation to the function:
f(x + Δx) ≈ f(x) + ∇f(x)^T Δx + (1/2) Δx^T ∇²f(x) Δx.
Choose Δx to minimize the approximation:
Δx = -[∇²f(x)]^{-1} ∇f(x)   (inverse Hessian times gradient).
This is a descent direction:
∇f(x)^T Δx = -∇f(x)^T [∇²f(x)]^{-1} ∇f(x) < 0.
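
A minimal sketch (not from the slides) of one such step in NumPy, assuming callables `grad` and `hess` for ∇f and ∇²f:

```python
import numpy as np

def newton_step(x, grad, hess):
    """Return x + dx with dx = -[hess(x)]^{-1} grad(x)."""
    dx = -np.linalg.solve(hess(x), grad(x))   # solve the linear system rather than inverting
    return x + dx
```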

Slide 20: [Figure: f̂ is the 2nd-order approximation around (x, f(x)); f is the true function; the step moves to (x + Δx, f(x + Δx)).]

Slide 21: Lots of non-differentiable convex functions are used in machine learning. The subgradient set, or subdifferential set, ∂f(x) of f at x is
∂f(x) = { g : f(y) ≥ f(x) + g^T (y - x) for all y }.
[Figure: a linear lower bound f(x) + g^T (y - x) touching f at (x, f(x)).]

Slide 22: Subgradient descent: really, the simplest algorithm in the world. Goal: minimize_x f(x). Just iterate
x_{t+1} = x_t - η_t g_t,
where η_t is a step size and g_t ∈ ∂f(x_t).
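
A minimal sketch (not from the slides): subgradient descent on the non-differentiable function f(x) = |x - 2|, using sign(x - 2) as the subgradient (any value in [-1, 1] is valid at the kink) and a decreasing step size of my choosing.

```python
import numpy as np

def subgradient_descent(subgrad, x0, iters=200):
    """Iterate x_{t+1} = x_t - eta_t * g_t with g_t a subgradient of f at x_t."""
    x = float(x0)
    for t in range(1, iters + 1):
        eta_t = 1.0 / np.sqrt(t)          # decreasing step size
        x = x - eta_t * subgrad(x)
    return x

# f(x) = |x - 2|; sign(x - 2) is a valid subgradient (0 at the kink).
x_min = subgradient_descent(lambda x: np.sign(x - 2.0), x0=10.0)
```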

Slide 23: Goal of machine learning: minimize the expected loss given samples. This is stochastic optimization. Assume the loss function is convex.

Slide 24: Batch methods process all examples together in each step; the entire training set is examined at each step. Very slow when n is very large.

Slide 25: Stochastic methods optimize one example at a time. Choose examples randomly (or reorder them and choose in order). Learning is representative of the example distribution.

Slide 26: Equivalent to online learning (the weight vector w changes with every example). Convergence is guaranteed for convex functions (to a local minimum, which for convex functions is also global).

Slide 27: Gradient descent for the SVM objective. Iterate until convergence: for j = 1, ..., d:
Evaluate ∇f^(j) = ∂f(w, b)/∂w^(j) = w^(j) + C Σ_{i=1}^n ∂L(x_i, y_i)/∂w^(j)
Update w^(j) ← w^(j) - η ∇f^(j)
Problem: computing ∇f^(j) takes O(n) time! Here n is the size of the training dataset, η the learning rate parameter, and C the regularization parameter.
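
A minimal sketch (not from the slides) of this full-batch update for the hinge-loss SVM in NumPy; the bias b is omitted for brevity, and each step touches all n examples, which is the O(n) cost noted above.

```python
import numpy as np

def svm_batch_gd(X, y, C=1.0, eta=0.01, iters=100):
    """Full-batch gradient descent on f(w) = 0.5*||w||^2 + C*sum_i max(0, 1 - y_i w.x_i)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        margins = y * (X @ w)
        active = margins < 1                                   # examples violating the margin
        hinge_grad = -(y[active, None] * X[active]).sum(axis=0)  # subgradient of the hinge term
        w = w - eta * (w + C * hinge_grad)
    return w
```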

Slide 28: Stochastic gradient descent. Instead of evaluating the gradient over all examples, evaluate it for each individual training example:
∇f^(j)(x_i) = w^(j) + C ∂L(x_i, y_i)/∂w^(j)
Stochastic gradient descent: iterate until convergence: for i = 1, ..., n: for j = 1, ..., d: compute ∇f^(j)(x_i) and update w^(j) ← w^(j) - η ∇f^(j)(x_i).
We just had: ∇f^(j) = w^(j) + C Σ_{i=1}^n ∂L(x_i, y_i)/∂w^(j). Notice: no summation over i anymore.
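
The corresponding per-example update, again as a minimal NumPy sketch (not from the slides), with the bias term omitted for brevity:

```python
import numpy as np

def svm_sgd(X, y, C=1.0, eta=0.01, epochs=5, seed=0):
    """SGD on the hinge-loss SVM: one example per update."""
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.RandomState(seed)
    for _ in range(epochs):
        for i in rng.permutation(n):               # visit examples in random order
            grad = w.copy()                        # gradient of 0.5*||w||^2
            if y[i] * (X[i] @ w) < 1:              # hinge term is active for this example
                grad -= C * y[i] * X[i]
            w = w - eta * grad
    return w
```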

Slide 29: Convergence is very sensitive to the learning rate η (oscillations near the solution arise from the probabilistic nature of sampling). The learning rate might need to decrease with time to ensure the algorithm converges eventually. Basically, SGD is good for machine learning with large data sets!

Slide 30: Stochastic: 1 example per iteration. Batch: all the examples! Sample Average Approximation (SAA): sample m examples at each step and perform SGD on them. This allows for parallelization, but the choice of m is based on heuristics.
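
A minimal mini-batch variant of the previous sketch (not from the slides), where `m` is the number of sampled examples per step:

```python
import numpy as np

def svm_minibatch_sgd(X, y, C=1.0, eta=0.01, m=32, epochs=5, seed=0):
    """Mini-batch SGD: average the per-example hinge subgradients over m sampled examples."""
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.RandomState(seed)
    for _ in range(epochs):
        for _ in range(n // m):
            batch = rng.choice(n, size=m, replace=False)
            margins = y[batch] * (X[batch] @ w)
            active = margins < 1
            hinge_grad = -(y[batch][active, None] * X[batch][active]).sum(axis=0)
            w = w - eta * (w + (C / m) * hinge_grad)
    return w
```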

Slide 31: Example by Leon Bottou: Reuters RCV1 document corpus. Predict the category of a document (one vs. the rest classification). n = 781,000 training examples (documents), 23,000 test examples, d = 50,000 features (one feature per word; stop-words and low-frequency words removed).

Slide 32: Questions: (1) Is SGD successful at minimizing f(w, b)? (2) How quickly does SGD find the minimum of f(w, b)? (3) What is the error on a test set?
[Table: training time, value of f(w, b), and test error for a standard SVM, a fast SVM solver, and SGD-SVM.]
Findings: (1) SGD-SVM is successful at minimizing the value of f(w, b); (2) SGD-SVM is super fast; (3) SGD-SVM test set error is comparable.

Slide 33: SGD-SVM vs. conventional SVM. Optimization quality: f(w, b) - f(w_opt, b_opt). For optimizing f(w, b) to within reasonable quality, SGD-SVM is super fast.

Slide 34: SGD on the full dataset vs. conjugate gradient (CG) on a sample of n training examples. Theory says: gradient descent converges in time linear in k, while conjugate gradient converges in √k, where k is the condition number. Bottom line: doing a simple (but fast) SGD update many times is better than doing a complicated (but slow) CG update a few times.

Slide 35: Need to choose the learning rate η and t_0:
w_{t+1} ← w_t - (η / (t + t_0)) (w_t + C ∂L(x_i, y_i)/∂w)
Leon suggests: choose t_0 so that the expected initial updates are comparable with the expected size of the weights. To choose η: select a small subsample, try various rates η (e.g., 10, 1, 0.1, 0.01, ...), pick the one that most reduces the cost, and use that η for the next 100k iterations on the full dataset.
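
A minimal sketch (not from the slides) of this rate-selection heuristic, reusing the hypothetical `svm_sgd` and `svm_objective` helpers from the earlier sketches:

```python
import numpy as np

def pick_learning_rate(X, y, C=1.0, subsample=10000, seed=0):
    """Try several rates on a small subsample and keep the one that most reduces the cost."""
    rng = np.random.RandomState(seed)
    idx = rng.choice(len(y), size=min(subsample, len(y)), replace=False)
    Xs, ys = X[idx], y[idx]
    best_eta, best_cost = None, np.inf
    for eta in (10.0, 1.0, 0.1, 0.01, 0.001):
        w = svm_sgd(Xs, ys, C=C, eta=eta, epochs=1, seed=seed)   # short run on the subsample
        cost = svm_objective(w, 0.0, Xs, ys, C)
        if cost < best_cost:
            best_eta, best_cost = eta, cost
    return best_eta
```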

Slide 36: Sparse linear SVM: the feature vector x_i is sparse (contains many zeros). Do not represent x_i as [0,0,0,1,0,0,0,0,5,0,0,0,0,0,0,...]; instead represent x_i as a sparse vector of (index, value) pairs, e.g., x_i = [(4,1), (9,5), ...]. Can we do the SGD update
w ← w - η (w + C ∂L(x_i, y_i)/∂w)
more efficiently? Approximate it in 2 steps:
(1) w ← w - η C ∂L(x_i, y_i)/∂w   (cheap: x_i is sparse, so only a few coordinates j of w are updated)
(2) w ← w (1 - η)   (expensive: w is not sparse, so all coordinates need to be updated)

Slide 37: Solution 1: represent the vector w as the product of a scalar s and a vector v, w = s·v. Then the two-step update procedure becomes:
(1) v ← v - η C ∂L(x_i, y_i)/∂w
(2) s ← s (1 - η)
so the expensive shrinking step touches only the scalar. Solution 2: perform only step (1) for each training example, and perform step (2) with lower frequency and a higher η.
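
One way to implement the w = s·v idea (a sketch, not the slides' exact pseudocode): here the shrink is applied first and the sparse correction is divided by s, which keeps w = s·v exact; `v` is assumed to be a NumPy array and `x_sparse` a list of (index, value) pairs.

```python
def sparse_svm_sgd_step(s, v, x_sparse, y_i, C, eta):
    """One hinge-loss SGD step with w represented as s * v."""
    # margin y_i * (w . x_i), using only the nonzero coordinates of x_i
    margin = y_i * s * sum(v[j] * val for j, val in x_sparse)
    # shrink w by (1 - eta) by updating only the scalar
    s = s * (1.0 - eta)
    # hinge-loss part touches only the nonzero coordinates of x_i
    if margin < 1:
        for j, val in x_sparse:
            v[j] += eta * C * y_i * val / s     # divide by s so that w = s * v stays exact
    return s, v
```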

Slide 38: Stopping criteria: how many iterations of SGD?
Early stopping with cross-validation: create a validation set, monitor the cost function on the validation set, and stop when the loss stops decreasing.
Early stopping: extract two disjoint subsamples A and B of the training data; train on A and stop by validating on B; the number of epochs is an estimate of k; then train for k epochs on the full dataset.
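
A minimal early-stopping loop (not from the slides), assuming hypothetical `train_one_epoch` and `validation_loss` helpers for subsamples A and B:

```python
def early_stopping_epochs(train_one_epoch, validation_loss, model, patience=2, max_epochs=100):
    """Train epoch by epoch on subsample A; stop when the loss on subsample B stops decreasing.
    Returns the number of epochs k to use for a final run on the full dataset."""
    best_loss, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        model = train_one_epoch(model)            # one pass over subsample A
        loss = validation_loss(model)             # evaluate on subsample B
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:            # loss stopped decreasing
                break
    return best_epoch
```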

Slide 39: Reference setup. Given a dataset D = {(x_i, y_i)}. Loss function: L(θ, D) = Σ_{i=1}^n l(θ; x_i, y_i). For linear models: l(θ; x_i, y_i) = l(y_i, θ^T φ(x_i)). Assumption: D is drawn IID from some distribution P. Problem: min_θ L(θ, D).

Slide 40: SGD algorithm. Input: D. Output: θ̄.
Initialize θ_1.
For t = 1, ..., T: θ_{t+1} = θ_t - η_t ∇_θ l(y_t, θ_t^T φ(x_t)).
Return the weighted average θ̄ = (Σ_{t=1}^T η_t θ_t) / (Σ_{t=1}^T η_t).
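
A minimal sketch (not from the slides) of this averaged SGD, assuming a hypothetical `grad_loss(theta, x, y)` returning ∇_θ l(y, θ^T φ(x)) and a step size η_t = 1/√t of my choosing:

```python
import numpy as np

def averaged_sgd(grad_loss, data, theta0, T):
    """Run T SGD steps and return the eta-weighted average of the iterates."""
    theta = np.asarray(theta0, dtype=float)
    weighted_sum = np.zeros_like(theta)
    eta_sum = 0.0
    for t in range(1, T + 1):
        x, y = data[(t - 1) % len(data)]           # stream through the (x, y) samples
        eta_t = 1.0 / np.sqrt(t)
        theta = theta - eta_t * grad_loss(theta, x, y)
        weighted_sum += eta_t * theta
        eta_sum += eta_t
    return weighted_sum / eta_sum
```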

Slide 41: Convergence. Expected loss: s(θ) = E_P[l(y, θ^T φ(x))]. Optimal expected loss: s* = s(θ*) = min_θ s(θ). Convergence:
E[s(θ̄)] - s* ≤ (R² + L² Σ_{t=1}^T η_t²) / (2 Σ_{t=1}^T η_t)
where R = ‖θ_1 - θ*‖ and L = max ‖∇_θ l(y, θ^T φ(x))‖.
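
As a quick consequence (not stated on the slide): with the constant step size η_t = R / (L√T) for all t, we get Σ_{t=1}^T η_t² = R²/L² and Σ_{t=1}^T η_t = R√T/L, so the bound becomes
E[s(θ̄)] - s* ≤ (R² + L²·R²/L²) / (2R√T/L) = 2R²L / (2R√T) = RL/√T,
i.e., the expected suboptimality of the averaged iterate decays at the rate O(1/√T).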

Slide 42: Proof sketch. Define r_t = ‖θ_t - θ*‖ and g_t = ∇_θ l(y_t, θ_t^T φ(x_t)). Then
r_{t+1}² = r_t² + η_t² ‖g_t‖² - 2 η_t (θ_t - θ*)^T g_t.
Taking the expectation w.r.t. P and θ, and using s* - s(θ_t) ≥ g_t^T (θ* - θ_t), we get
E[r_{t+1}²] ≤ E[r_t²] + η_t² L² + 2 η_t (s* - E[s(θ_t)]).
Taking the sum over t = 1, ..., T:
E[r_{T+1}²] - r_1² ≤ L² Σ_{t=1}^T η_t² + 2 Σ_{t=1}^T η_t (s* - E[s(θ_t)]).

Slide 43: Using the convexity of s and the definition of θ̄ as the η-weighted average of the iterates:
(Σ_{t=1}^T η_t) E[s(θ̄)] ≤ E[Σ_{t=1}^T η_t s(θ_t)].
Substituting this into the expression from the previous slide and using E[r_{T+1}²] ≥ 0 and r_1² = R²:
2 Σ_{t=1}^T η_t (E[s(θ̄)] - s*) ≤ 2 Σ_{t=1}^T η_t (E[s(θ_t)] - s*) ≤ R² + L² Σ_{t=1}^T η_t².
Rearranging the terms proves the result.
