CS60021: Scalable Data Mining. Large Scale Machine Learning

Slide 1: CS60021: Scalable Data Mining. Large Scale Machine Learning. Sourangshu Bhattacharya. Slides based on J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets.

Slide 2: Example: Spam filtering. Instance space x ∈ X (|X| = n data points); binary or real-valued feature vector x of word occurrences; d features (words + other things, d ≈ 100,000). Class y ∈ Y; y: Spam (+1), Ham (-1). Goal: estimate a function f(x) so that y = f(x).

Slide 3: We would like to do prediction: estimate a function f(x) so that y = f(x), where y can be a real number (regression), categorical (classification), or a complex object (ranking of items, a parse tree, etc.). The data is labeled: we have many pairs {(x, y)}, where x is a vector of binary, categorical, or real-valued features and y is a class ({+1, -1}) or a real number.

Slide 4: Task: given data (X, Y), build a model f() to predict Y based on X. Strategy: estimate y = f(x) on the training data (X, Y) and hope that the same f(x) also works to predict the unknown Y of the test data. This hope is called generalization. Overfitting: f(x) predicts the training Y well but is unable to predict the unseen Y. We want to build a model that generalizes well to unseen data. But Jure, how can we do well on data we have never seen before?!

Slide 5: Idea: pretend we do not know the data/labels we actually do know. Build the model f(x) on the training data and see how well f(x) does on the test data; if it does well, then apply it also to the unknown X. Refinement: cross-validation. Splitting into a single training/validation set is brutal; instead, split the data (X, Y) into 10 folds (buckets), take out 1 fold for validation, train on the remaining 9, repeat this 10 times, and report the average performance.
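
As an illustration (not from the original slides), here is a minimal sketch of 10-fold cross-validation using only NumPy; the `train` and `evaluate` callables are hypothetical placeholders for whatever model is being fit.

```python
import numpy as np

def k_fold_cv(X, Y, train, evaluate, k=10, seed=0):
    """Split (X, Y) into k folds, hold out one fold for validation each time,
    train on the remaining k-1 folds, and return the average score."""
    n = len(Y)
    idx = np.random.RandomState(seed).permutation(n)
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train(X[train_idx], Y[train_idx])                 # fit on 9 folds
        scores.append(evaluate(model, X[val_idx], Y[val_idx]))    # validate on 1 fold
    return np.mean(scores)
```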

Slide 6: Binary classification:
f(x) = +1 if w^(1) x^(1) + w^(2) x^(2) + ... + w^(d) x^(d) ≥ θ, and -1 otherwise.
Input: vectors x_j and labels y_j; the vectors x_j are real-valued with ‖x‖_2 = 1. Goal: find the vector w = (w^(1), w^(2), ..., w^(d)), where each w^(i) is a real number. Note: the threshold can be absorbed by mapping x → [x, 1] and w → [w, -θ]. The decision boundary is linear.

Slide 7: Want to estimate w and b:
min_{w,b} (1/2) w·w + C Σ_{i=1}^n ξ_i
s.t. ∀i: y_i (x_i·w + b) ≥ 1 - ξ_i
Standard way: use a solver! A solver is software for finding solutions to common optimization problems; here, a quadratic solver that minimizes a quadratic function subject to linear constraints. Problem: solvers are inefficient for big data!

Slide 8: Want to estimate w, b! Alternative approach: minimize f(w, b) directly, where
f(w, b) = (1/2) Σ_{j=1}^d (w^(j))² + C Σ_{i=1}^n max{0, 1 - y_i (w·x_i + b)}
This is the unconstrained form of
min_{w,b} (1/2) w·w + C Σ_{i=1}^n ξ_i  s.t. ∀i: y_i (x_i·w + b) ≥ 1 - ξ_i.
Side note: how do we minimize a convex function g(z)? Use gradient descent for min_z g(z): iterate z_{t+1} ← z_t - η ∇g(z_t).
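
For concreteness (not from the original slides), a minimal NumPy sketch of evaluating this hinge-loss objective; here `X` is assumed to be the n×d data matrix and `y` the vector of ±1 labels.

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    """f(w, b) = 0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i (w . x_i + b))."""
    margins = y * (X @ w + b)                  # y_i (w . x_i + b) for all i
    hinge = np.maximum(0.0, 1.0 - margins)     # per-example hinge loss
    return 0.5 * np.dot(w, w) + C * hinge.sum()
```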

Slide 9: Find the minimum or maximum of an objective function given a set of constraints:
arg min_x f_0(x)
s.t. f_i(x) ≤ 0, i = 1, ..., k
h_j(x) = 0, j = 1, ..., l

Slide 10: Examples.
Linear classification: arg min_{w,ξ} ‖w‖² + C Σ_{i=1}^n ξ_i  s.t. 1 - y_i x_i^T w ≤ ξ_i, ξ_i ≥ 0.
Maximum likelihood: arg max_θ Σ_{i=1}^n log p_θ(x_i).
K-means: arg min_{μ_1,...,μ_k} J(μ) = Σ_{j=1}^k Σ_{i∈C_j} ‖x_i - μ_j‖².

Slide 11: Local (non-global) minima and maxima. [Figure from Duchi (UC Berkeley), Convex Optimization for Machine Learning.]

Slide 12: A function f : R^n → R is convex if for all x, y ∈ dom f and any α ∈ [0, 1],
f(αx + (1-α)y) ≤ αf(x) + (1-α)f(y).
A set C ⊆ R^n is convex if for all x, y ∈ C and any α ∈ [0, 1],
αx + (1-α)y ∈ C.
[Figure: the chord αf(x) + (1-α)f(y) lies above the function between x and y.]

Slide 13: Examples of convex losses.
SVM (hinge) loss: f(w) = [1 - y_i x_i^T w]_+, where [z]_+ = max{0, z}.
Binary logistic loss: f(w) = log(1 + exp(-y_i x_i^T w)).

Slide 14: A convex optimization problem:
minimize_x f_0(x)  (convex function)
s.t. f_i(x) ≤ 0  (convex sets)
h_j(x) = 0  (affine)

Slide 15: (figure only)

Slide 16: Gradient descent: the simplest algorithm in the world (almost). Goal: minimize_x f(x). Just iterate
x_{t+1} = x_t - η_t ∇f(x_t),
where η_t is the step size.
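
A minimal sketch (not from the slides) of this iteration in NumPy, assuming a callable `grad` that returns ∇f(x) and a fixed step size:

```python
import numpy as np

def gradient_descent(grad, x0, eta=0.1, iters=100):
    """Iterate x_{t+1} = x_t - eta * grad(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - eta * grad(x)
    return x

# Example: minimize f(x) = ||x - 3||^2, whose gradient is 2(x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3.0), x0=[0.0], eta=0.1)
```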

Slide 17: [Figure: at the point (x_t, f(x_t)), the linear approximation f(x_t) + ∇f(x_t)^T (x - x_t); the gradient step of size η moves downhill along this slope.]

Slide 18: (figure only)

Slide 19: Second-order methods (Newton's method). Idea: use a second-order approximation to the function:
f(x + Δx) ≈ f(x) + ∇f(x)^T Δx + (1/2) Δx^T ∇²f(x) Δx.
Choose Δx to minimize the approximation:
Δx = -[∇²f(x)]^{-1} ∇f(x)   (inverse Hessian times gradient).
This is a descent direction:
∇f(x)^T Δx = -∇f(x)^T [∇²f(x)]^{-1} ∇f(x) < 0.
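
A minimal sketch (not from the slides) of one such step in NumPy, assuming callables `grad` and `hess` for ∇f and ∇²f:

```python
import numpy as np

def newton_step(x, grad, hess):
    """Return x + dx with dx = -[hess(x)]^{-1} grad(x)."""
    dx = -np.linalg.solve(hess(x), grad(x))   # solve the linear system rather than inverting
    return x + dx
```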

Slide 20: [Figure: f̂ is the 2nd-order approximation around (x, f(x)); f is the true function; the step moves to (x + Δx, f(x + Δx)).]

Slide 21: Lots of non-differentiable convex functions are used in machine learning. The subgradient set, or subdifferential set, ∂f(x) of f at x is
∂f(x) = { g : f(y) ≥ f(x) + g^T (y - x) for all y }.
[Figure: a linear lower bound f(x) + g^T (y - x) touching f at (x, f(x)).]

Slide 22: Subgradient descent: really, the simplest algorithm in the world. Goal: minimize_x f(x). Just iterate
x_{t+1} = x_t - η_t g_t,
where η_t is a step size and g_t ∈ ∂f(x_t).
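
A minimal sketch (not from the slides): subgradient descent on the non-differentiable function f(x) = |x - 2|, using sign(x - 2) as the subgradient (any value in [-1, 1] is valid at the kink) and a decreasing step size of my choosing.

```python
import numpy as np

def subgradient_descent(subgrad, x0, iters=200):
    """Iterate x_{t+1} = x_t - eta_t * g_t with g_t a subgradient of f at x_t."""
    x = float(x0)
    for t in range(1, iters + 1):
        eta_t = 1.0 / np.sqrt(t)          # decreasing step size
        x = x - eta_t * subgrad(x)
    return x

# f(x) = |x - 2|; sign(x - 2) is a valid subgradient (0 at the kink).
x_min = subgradient_descent(lambda x: np.sign(x - 2.0), x0=10.0)
```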

Slide 23: Goal of machine learning: minimize the expected loss given samples. This is stochastic optimization. Assume the loss function is convex.

Slide 24: Batch methods process all examples together in each step; the entire training set is examined at each step. Very slow when n is very large.

Slide 25: Stochastic methods optimize one example at a time. Choose examples randomly (or reorder them and choose in order). Learning is representative of the example distribution.

Slide 26: Equivalent to online learning (the weight vector w changes with every example). Convergence is guaranteed for convex functions (to a local minimum, which for convex functions is also global).

Slide 27: Gradient descent for the SVM objective. Iterate until convergence: for j = 1, ..., d:
Evaluate ∇f^(j) = ∂f(w, b)/∂w^(j) = w^(j) + C Σ_{i=1}^n ∂L(x_i, y_i)/∂w^(j)
Update w^(j) ← w^(j) - η ∇f^(j)
Problem: computing ∇f^(j) takes O(n) time! Here n is the size of the training dataset, η the learning rate parameter, and C the regularization parameter.
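
A minimal sketch (not from the slides) of this full-batch update for the hinge-loss SVM in NumPy; the bias b is omitted for brevity, and each step touches all n examples, which is the O(n) cost noted above.

```python
import numpy as np

def svm_batch_gd(X, y, C=1.0, eta=0.01, iters=100):
    """Full-batch gradient descent on f(w) = 0.5*||w||^2 + C*sum_i max(0, 1 - y_i w.x_i)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        margins = y * (X @ w)
        active = margins < 1                                   # examples violating the margin
        hinge_grad = -(y[active, None] * X[active]).sum(axis=0)  # subgradient of the hinge term
        w = w - eta * (w + C * hinge_grad)
    return w
```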

Slide 28: Stochastic gradient descent. Instead of evaluating the gradient over all examples, evaluate it for each individual training example:
∇f^(j)(x_i) = w^(j) + C ∂L(x_i, y_i)/∂w^(j)
Stochastic gradient descent: iterate until convergence: for i = 1, ..., n: for j = 1, ..., d: compute ∇f^(j)(x_i) and update w^(j) ← w^(j) - η ∇f^(j)(x_i).
We just had: ∇f^(j) = w^(j) + C Σ_{i=1}^n ∂L(x_i, y_i)/∂w^(j). Notice: no summation over i anymore.
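
The corresponding per-example update, again as a minimal NumPy sketch (not from the slides), with the bias term omitted for brevity:

```python
import numpy as np

def svm_sgd(X, y, C=1.0, eta=0.01, epochs=5, seed=0):
    """SGD on the hinge-loss SVM: one example per update."""
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.RandomState(seed)
    for _ in range(epochs):
        for i in rng.permutation(n):               # visit examples in random order
            grad = w.copy()                        # gradient of 0.5*||w||^2
            if y[i] * (X[i] @ w) < 1:              # hinge term is active for this example
                grad -= C * y[i] * X[i]
            w = w - eta * grad
    return w
```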

Slide 29: Convergence is very sensitive to the learning rate η (oscillations near the solution arise from the probabilistic nature of sampling). The learning rate might need to decrease with time to ensure the algorithm converges eventually. Basically, SGD is good for machine learning with large data sets!

Slide 30: Stochastic: 1 example per iteration. Batch: all the examples! Sample Average Approximation (SAA): sample m examples at each step and perform SGD on them. This allows for parallelization, but the choice of m is based on heuristics.
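
A minimal mini-batch variant of the previous sketch (not from the slides), where `m` is the number of sampled examples per step:

```python
import numpy as np

def svm_minibatch_sgd(X, y, C=1.0, eta=0.01, m=32, epochs=5, seed=0):
    """Mini-batch SGD: average the per-example hinge subgradients over m sampled examples."""
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.RandomState(seed)
    for _ in range(epochs):
        for _ in range(n // m):
            batch = rng.choice(n, size=m, replace=False)
            margins = y[batch] * (X[batch] @ w)
            active = margins < 1
            hinge_grad = -(y[batch][active, None] * X[batch][active]).sum(axis=0)
            w = w - eta * (w + (C / m) * hinge_grad)
    return w
```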

Slide 31: Example by Leon Bottou: Reuters RCV1 document corpus. Predict the category of a document (one vs. the rest classification). n = 781,000 training examples (documents), 23,000 test examples, d = 50,000 features (one feature per word; stop-words and low-frequency words removed).

Slide 32: Questions: (1) Is SGD successful at minimizing f(w, b)? (2) How quickly does SGD find the minimum of f(w, b)? (3) What is the error on a test set?
[Table: training time, value of f(w, b), and test error for a standard SVM, a fast SVM solver, and SGD-SVM.]
Findings: (1) SGD-SVM is successful at minimizing the value of f(w, b); (2) SGD-SVM is super fast; (3) SGD-SVM test set error is comparable.

Slide 33: SGD-SVM vs. conventional SVM. Optimization quality: f(w, b) - f(w_opt, b_opt). For optimizing f(w, b) to within reasonable quality, SGD-SVM is super fast.

Slide 34: SGD on the full dataset vs. conjugate gradient (CG) on a sample of n training examples. Theory says: gradient descent converges in time linear in k, while conjugate gradient converges in √k, where k is the condition number. Bottom line: doing a simple (but fast) SGD update many times is better than doing a complicated (but slow) CG update a few times.

Slide 35: Need to choose the learning rate η and t_0:
w_{t+1} ← w_t - (η / (t + t_0)) (w_t + C ∂L(x_i, y_i)/∂w)
Leon suggests: choose t_0 so that the expected initial updates are comparable with the expected size of the weights. To choose η: select a small subsample, try various rates η (e.g., 10, 1, 0.1, 0.01, ...), pick the one that most reduces the cost, and use that η for the next 100k iterations on the full dataset.
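
A minimal sketch (not from the slides) of this rate-selection heuristic, reusing the hypothetical `svm_sgd` and `svm_objective` helpers from the earlier sketches:

```python
import numpy as np

def pick_learning_rate(X, y, C=1.0, subsample=10000, seed=0):
    """Try several rates on a small subsample and keep the one that most reduces the cost."""
    rng = np.random.RandomState(seed)
    idx = rng.choice(len(y), size=min(subsample, len(y)), replace=False)
    Xs, ys = X[idx], y[idx]
    best_eta, best_cost = None, np.inf
    for eta in (10.0, 1.0, 0.1, 0.01, 0.001):
        w = svm_sgd(Xs, ys, C=C, eta=eta, epochs=1, seed=seed)   # short run on the subsample
        cost = svm_objective(w, 0.0, Xs, ys, C)
        if cost < best_cost:
            best_eta, best_cost = eta, cost
    return best_eta
```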

Slide 36: Sparse linear SVM: the feature vector x_i is sparse (contains many zeros). Do not represent x_i as [0,0,0,1,0,0,0,0,5,0,0,0,0,0,0,...]; instead represent x_i as a sparse vector of (index, value) pairs, e.g., x_i = [(4,1), (9,5), ...]. Can we do the SGD update
w ← w - η (w + C ∂L(x_i, y_i)/∂w)
more efficiently? Approximate it in 2 steps:
(1) w ← w - η C ∂L(x_i, y_i)/∂w   (cheap: x_i is sparse, so only a few coordinates j of w are updated)
(2) w ← w (1 - η)   (expensive: w is not sparse, so all coordinates need to be updated)

Slide 37: Solution 1: represent the vector w as the product of a scalar s and a vector v, w = s·v. Then the two-step update procedure becomes:
(1) v ← v - η C ∂L(x_i, y_i)/∂w
(2) s ← s (1 - η)
so the expensive shrinking step touches only the scalar. Solution 2: perform only step (1) for each training example, and perform step (2) with lower frequency and a higher η.
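
One way to implement the w = s·v idea (a sketch, not the slides' exact pseudocode): here the shrink is applied first and the sparse correction is divided by s, which keeps w = s·v exact; `v` is assumed to be a NumPy array and `x_sparse` a list of (index, value) pairs.

```python
def sparse_svm_sgd_step(s, v, x_sparse, y_i, C, eta):
    """One hinge-loss SGD step with w represented as s * v."""
    # margin y_i * (w . x_i), using only the nonzero coordinates of x_i
    margin = y_i * s * sum(v[j] * val for j, val in x_sparse)
    # shrink w by (1 - eta) by updating only the scalar
    s = s * (1.0 - eta)
    # hinge-loss part touches only the nonzero coordinates of x_i
    if margin < 1:
        for j, val in x_sparse:
            v[j] += eta * C * y_i * val / s     # divide by s so that w = s * v stays exact
    return s, v
```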

Slide 38: Stopping criteria: how many iterations of SGD?
Early stopping with cross-validation: create a validation set, monitor the cost function on the validation set, and stop when the loss stops decreasing.
Early stopping: extract two disjoint subsamples A and B of the training data; train on A and stop by validating on B; the number of epochs is an estimate of k; then train for k epochs on the full dataset.
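
A minimal early-stopping loop (not from the slides), assuming hypothetical `train_one_epoch` and `validation_loss` helpers for subsamples A and B:

```python
def early_stopping_epochs(train_one_epoch, validation_loss, model, patience=2, max_epochs=100):
    """Train epoch by epoch on subsample A; stop when the loss on subsample B stops decreasing.
    Returns the number of epochs k to use for a final run on the full dataset."""
    best_loss, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        model = train_one_epoch(model)            # one pass over subsample A
        loss = validation_loss(model)             # evaluate on subsample B
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:            # loss stopped decreasing
                break
    return best_epoch
```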

Slide 39: Reference setup. Given a dataset D = {(x_i, y_i)}. Loss function: L(θ, D) = Σ_{i=1}^n l(θ; x_i, y_i). For linear models: l(θ; x_i, y_i) = l(y_i, θ^T φ(x_i)). Assumption: D is drawn IID from some distribution P. Problem: min_θ L(θ, D).

Slide 40: SGD algorithm. Input: D. Output: θ̄.
Initialize θ_1.
For t = 1, ..., T: θ_{t+1} = θ_t - η_t ∇_θ l(y_t, θ_t^T φ(x_t)).
Return the weighted average θ̄ = (Σ_{t=1}^T η_t θ_t) / (Σ_{t=1}^T η_t).
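
A minimal sketch (not from the slides) of this averaged SGD, assuming a hypothetical `grad_loss(theta, x, y)` returning ∇_θ l(y, θ^T φ(x)) and a step size η_t = 1/√t of my choosing:

```python
import numpy as np

def averaged_sgd(grad_loss, data, theta0, T):
    """Run T SGD steps and return the eta-weighted average of the iterates."""
    theta = np.asarray(theta0, dtype=float)
    weighted_sum = np.zeros_like(theta)
    eta_sum = 0.0
    for t in range(1, T + 1):
        x, y = data[(t - 1) % len(data)]           # stream through the (x, y) samples
        eta_t = 1.0 / np.sqrt(t)
        theta = theta - eta_t * grad_loss(theta, x, y)
        weighted_sum += eta_t * theta
        eta_sum += eta_t
    return weighted_sum / eta_sum
```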

Slide 41: Convergence. Expected loss: s(θ) = E_P[l(y, θ^T φ(x))]. Optimal expected loss: s* = s(θ*) = min_θ s(θ). Convergence:
E[s(θ̄)] - s* ≤ (R² + L² Σ_{t=1}^T η_t²) / (2 Σ_{t=1}^T η_t)
where R = ‖θ_1 - θ*‖ and L = max ‖∇_θ l(y, θ^T φ(x))‖.
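
As a quick consequence (not stated on the slide): with the constant step size η_t = R / (L√T) for all t, we get Σ_{t=1}^T η_t² = R²/L² and Σ_{t=1}^T η_t = R√T/L, so the bound becomes
E[s(θ̄)] - s* ≤ (R² + L²·R²/L²) / (2R√T/L) = 2R²L / (2R√T) = RL/√T,
i.e., the expected suboptimality of the averaged iterate decays at the rate O(1/√T).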

Slide 42: Proof sketch. Define r_t = ‖θ_t - θ*‖ and g_t = ∇_θ l(y_t, θ_t^T φ(x_t)). Then
r_{t+1}² = r_t² + η_t² ‖g_t‖² - 2 η_t (θ_t - θ*)^T g_t.
Taking the expectation w.r.t. P and θ, and using s* - s(θ_t) ≥ g_t^T (θ* - θ_t), we get
E[r_{t+1}²] ≤ E[r_t²] + η_t² L² + 2 η_t (s* - E[s(θ_t)]).
Taking the sum over t = 1, ..., T:
E[r_{T+1}²] - r_1² ≤ L² Σ_{t=1}^T η_t² + 2 Σ_{t=1}^T η_t (s* - E[s(θ_t)]).

Slide 43: Using the convexity of s and the definition of θ̄ as the η-weighted average of the iterates:
(Σ_{t=1}^T η_t) E[s(θ̄)] ≤ E[Σ_{t=1}^T η_t s(θ_t)].
Substituting this into the expression from the previous slide and using E[r_{T+1}²] ≥ 0 and r_1² = R²:
2 Σ_{t=1}^T η_t (E[s(θ̄)] - s*) ≤ 2 Σ_{t=1}^T η_t (E[s(θ_t)] - s*) ≤ R² + L² Σ_{t=1}^T η_t².
Rearranging the terms proves the result.
