Announcements Kevin Jamieson


Announcements
- Project proposal due next week: Tuesday 10/24
- Still looking for people to work on the deep learning Phytolith project; join the #phytolith slack channel

Gradient Descent
Machine Learning CSE546, Kevin Jamieson, University of Washington
October 16, 2016

Machine Learning Problems
Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.
Learning a model's parameters means minimizing $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.
$g$ is a subgradient of $f$ at $x$ if $f(y) \geq f(x) + g^T (y - x)$ for all $y$.
$f$ convex: $f(\lambda x + (1-\lambda) y) \leq \lambda f(x) + (1-\lambda) f(y)$ for all $x, y$ and $\lambda \in [0,1]$; equivalently, for differentiable $f$, $f(y) \geq f(x) + \nabla f(x)^T (y - x)$ for all $x, y$.

Machine Learning Problems
Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.
Learning a model's parameters means minimizing $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.
Logistic loss: $\ell_i(w) = \log(1 + \exp(-y_i x_i^T w))$
Squared error loss: $\ell_i(w) = (y_i - x_i^T w)^2$
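As a concrete illustration, here is a minimal numpy sketch of these two losses and their gradients for a single example (the function names are mine, not from the slides):

```python
import numpy as np

# Illustrative sketch, not from the slides.

def logistic_loss(w, x, y):
    """Logistic loss log(1 + exp(-y * x^T w)) for one example, y in {-1, +1}."""
    return np.log1p(np.exp(-y * (x @ w)))

def logistic_loss_grad(w, x, y):
    """Gradient of the logistic loss with respect to w."""
    return -y * x / (1.0 + np.exp(y * (x @ w)))

def squared_loss(w, x, y):
    """Squared error loss (y - x^T w)^2 for one example."""
    return (y - x @ w) ** 2

def squared_loss_grad(w, x, y):
    """Gradient of the squared error loss with respect to w."""
    return -2.0 * (y - x @ w) * x
```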

Taylor Series Approximation
Taylor series in one dimension: $f(x + \delta) = f(x) + f'(x)\,\delta + \tfrac{1}{2} f''(x)\,\delta^2 + \ldots$
Gradient descent:

Taylor Series Approximation
Taylor series in $d$ dimensions: $f(x + v) = f(x) + \nabla f(x)^T v + \tfrac{1}{2} v^T \nabla^2 f(x)\, v + \ldots$
Gradient descent:
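For reference, the gradient descent update, with step size $\eta$ (the step-size symbol is my notation):

$$w_{t+1} = w_t - \eta\, \nabla f(w_t),$$

i.e., move along $v = -\eta \nabla f(w_t)$, the direction that decreases the first-order Taylor model $f(w_t) + \nabla f(w_t)^T v$ fastest for a fixed step length.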

General case
In general, how many iterations does Newton's method need to achieve $f(w_t) - f(w^*) \leq \epsilon$?
So why are ML problems overwhelmingly solved by gradient methods?
Hint: $v_t$ is the solution to $\nabla^2 f(w_t)\, v_t = -\nabla f(w_t)$.
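A minimal sketch of one Newton step, assuming callables `grad_f` and `hess_f` are available (names are illustrative); the dense linear solve is what makes each iteration cost roughly $O(d^3)$, which is the point of the hint:

```python
import numpy as np

# Illustrative sketch, not from the slides.

def newton_step(w, grad_f, hess_f):
    """One Newton step: solve H(w) v = -grad f(w), then move to w + v."""
    g = grad_f(w)                 # gradient, shape (d,)
    H = hess_f(w)                 # Hessian, shape (d, d)
    v = np.linalg.solve(H, -g)    # O(d^3) for a dense Hessian
    return w + v
```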

General Convex case
Goal: $f(w_t) - f(w^*) \leq \epsilon$. Clean convergence proofs: Bubeck.
Newton's method: $t \approx \log(\log(1/\epsilon))$ iterations.
Gradient descent, under different assumptions:
- $f$ is smooth and strongly convex: $aI \preceq \nabla^2 f(w) \preceq bI$
- $f$ is smooth: $\nabla^2 f(w) \preceq bI$
- $f$ is potentially non-differentiable: $\|\nabla f(w)\|_2 \leq c$
References: Nocedal and Wright; Bubeck.
Other methods: BFGS, heavy-ball, BCD, SVRG, ADAM, Adagrad, ...
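For reference, the standard gradient descent iteration counts under these assumptions (up to constants and the distance from the initial point to $w^*$; see, e.g., Bubeck's monograph) are:

$$\text{smooth and strongly convex: } t \approx \frac{b}{a}\log(1/\epsilon), \qquad \text{smooth: } t \approx \frac{b}{\epsilon}, \qquad \text{non-differentiable: } t \approx \frac{c^2}{\epsilon^2}.$$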

Revisiting Logistic Regression
Machine Learning CSE546, Kevin Jamieson, University of Washington
October 16, 2016

Loss function: Conditional Likelihood
Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$.
Model: $P(Y = y \mid x, w) = \dfrac{1}{1 + \exp(-y\, w^T x)}$
$\hat{w}_{\mathrm{MLE}} = \arg\max_w \prod_{i=1}^n P(y_i \mid x_i, w) = \arg\min_w \sum_{i=1}^n \log(1 + \exp(-y_i x_i^T w)) =: \arg\min_w f(w)$
$\nabla f(w) = $
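The gradient left blank above works out to the standard expression

$$\nabla f(w) = \sum_{i=1}^{n} \frac{-y_i\, x_i}{1 + \exp(y_i\, x_i^T w)}.$$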

Online Learning
Machine Learning CSE546, Kevin Jamieson, University of Washington
October 18, 2016

Going to the moon
Guidance computer predicts trajectories around the moon and back with:
- noisy sensors
- imperfect models
- little computational power
- big risk of failure

Going to the moon (Apollo 13)
Guidance computer predicts trajectories around the moon and back with:
- noisy sensors
- imperfect models
- little computational power
- big risk of failure
Why is Tom Hanks flying erratically? Because they didn't have the power to turn on the Kalman Filter!

State Estimation
- Predict the current state given the past state and the current control input $u_n$: $\tilde{w}_n = f(w_{n-1}) + g(u_n)$
- Given the current context $x_n$, compare your prediction to the noisy measurement $y_n$: $\ell_n(\tilde{w}_n) = (y_n - h(x_n, \tilde{w}_n))^2$
- Update the current state to include the measurement: $w_n = \tilde{w}_n - K_n \nabla_w \ell_n(w)\big|_{w = \tilde{w}_n}$
The Kalman filter does optimal least squares state estimation if $f, g, h$ are linear!

Recursive Least Squares (RLS)
Least squares = special case of the Kalman filter: no dynamics, no control.
$\tilde{w}_n = f(w_{n-1}) + g(u_n)$
$\ell_n(\tilde{w}_n) = (y_n - h(x_n, \tilde{w}_n))^2$
$w_n = \tilde{w}_n - K_n \nabla_w \ell_n(w)\big|_{w = \tilde{w}_n}$

Recursive Least Squares (RLS)
Least squares = special case of the Kalman filter: no dynamics, no control.
$\tilde{w}_n = f(w_{n-1}) + g(u_n) = w_{n-1}$
$\ell_n(\tilde{w}_n) = (y_n - h(x_n, \tilde{w}_n))^2 = (y_n - x_n^T \tilde{w}_n)^2 = (y_n - x_n^T w_{n-1})^2$
$w_n = \tilde{w}_n - K_n \nabla_w \ell_n(w)\big|_{w = \tilde{w}_n} = w_{n-1} + 2(y_n - x_n^T w_{n-1})\, K_n x_n$
Ideally: $w_n = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2$

Recursive Least Squares (RLS)
Sherman-Morrison: $(A + u v^T)^{-1} = A^{-1} - \dfrac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}$
$w_n = \left(\sum_{i=1}^n x_i x_i^T\right)^{-1} \sum_{i=1}^n x_i y_i$
Ideally: $w_n = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2$

Recursive Least Squares (RLS)
$w_n = \left(\sum_{i=1}^n x_i x_i^T\right)^{-1} \sum_{i=1}^n x_i y_i$
Great, what's the time complexity of this?
It is 2017, not the '60s: is limited computation still really a problem?
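A minimal numpy sketch of recursive least squares that maintains $P_n \approx \left(\sum_{i \le n} x_i x_i^T\right)^{-1}$ via the Sherman-Morrison identity, so each update costs $O(d^2)$ rather than re-solving the full problem; the names and the large-identity initialization (a ridge-like prior) are illustrative choices, not from the slides:

```python
import numpy as np

# Illustrative sketch, not from the slides.

def rls_update(w, P, x, y):
    """One RLS update after observing (x, y).

    w : current weight estimate, shape (d,)
    P : inverse of sum_i x_i x_i^T (plus a small ridge term from the init), shape (d, d)
    """
    Px = P @ x
    # Sherman-Morrison rank-one update of the inverse: O(d^2)
    P_new = P - np.outer(Px, Px) / (1.0 + x @ Px)
    # Gain times innovation: pulls w toward fitting the new observation
    w_new = w + (P_new @ x) * (y - x @ w)
    return w_new, P_new

# Usage sketch on synthetic streaming data
d = 5
w, P = np.zeros(d), 1e6 * np.eye(d)
rng = np.random.default_rng(0)
for _ in range(1000):
    x = rng.normal(size=d)
    y = x @ np.arange(1.0, d + 1) + 0.1 * rng.normal()
    w, P = rls_update(w, P, x, y)
```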

Digital Signal Processing: the original Big Data
Wifi and cell phones are constantly solving least squares to invert out multipath.
Low power devices, high data rates: gigabytes of data per second.

Incremental Gradient Descent
$(x_t, y_t)$ arrives: $w_{t+1} = w_t - \eta \left[ \nabla_w (y_t - x_t^T w)^2 \big|_{w = w_t} \right]$
Note: no matrix multiply.
We know RLS is exact. How much worse is this?
In general, a convex $\ell_t(w)$ arrives: $\ell(\cdot)$ is convex $\iff \ell(y) \geq \ell(x) + \nabla \ell(x)^T (y - x)$ for all $x, y$.

Incremental Gradient Descent
$\|w_{t+1} - w^*\|_2^2 = \|w_t - \eta\, \nabla \ell_t(w_t) - w^*\|_2^2$
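Expanding the square gives the standard next step of the analysis:

$$\|w_{t+1} - w^*\|_2^2 = \|w_t - w^*\|_2^2 - 2\eta\, \nabla \ell_t(w_t)^T (w_t - w^*) + \eta^2\, \|\nabla \ell_t(w_t)\|_2^2,$$

and by convexity $\nabla \ell_t(w_t)^T (w_t - w^*) \geq \ell_t(w_t) - \ell_t(w^*)$, which is what lets the per-step losses be summed and bounded.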

Incremental Gradient Descent

Stochastic Gradient Descent
Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.
Learning a model's parameters means minimizing $\frac{1}{n}\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.

Stochastic Gradient Descent
Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.
Learning a model's parameters means minimizing $\frac{1}{n}\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.
Gradient descent: $w_{t+1} = w_t - \eta\, \nabla_w \left( \frac{1}{n}\sum_{i=1}^n \ell_i(w) \right)\Big|_{w = w_t}$

Stochastic Gradient Descent
Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.
Learning a model's parameters means minimizing $\frac{1}{n}\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.
Gradient descent: $w_{t+1} = w_t - \eta\, \nabla_w \left( \frac{1}{n}\sum_{i=1}^n \ell_i(w) \right)\Big|_{w = w_t}$
Stochastic gradient descent: $w_{t+1} = w_t - \eta\, \nabla_w \ell_{I_t}(w)\big|_{w = w_t}$, with $I_t$ drawn uniformly at random from $\{1, \ldots, n\}$, so that $\mathbb{E}[\nabla \ell_{I_t}(w)] = \frac{1}{n}\sum_{i=1}^n \nabla \ell_i(w)$.

Stochastic Gradient Descent
Gradient descent: $w_{t+1} = w_t - \eta\, \nabla_w \left( \frac{1}{n}\sum_{i=1}^n \ell_i(w) \right)\Big|_{w = w_t}$
Stochastic gradient descent: $w_{t+1} = w_t - \eta\, \nabla_w \ell_{I_t}(w)\big|_{w = w_t}$, with $I_t$ drawn uniformly at random from $\{1, \ldots, n\}$.

Stochastic Gradient Ascent for Logistic Regression (Sham Kakade 2016)
- Logistic loss as a stochastic function:
- Batch gradient ascent updates:
- Stochastic gradient ascent updates:
- Online setting:
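A minimal sketch of the stochastic update applied to the logistic loss from the earlier slides (equivalently, stochastic gradient ascent on the conditional log-likelihood); the constant step size and names are illustrative choices, not from the slides:

```python
import numpy as np

# Illustrative sketch, not from the slides.

def sgd_logistic(X, y, eta=0.1, n_steps=10000, seed=0):
    """SGD on (1/n) sum_i log(1 + exp(-y_i x_i^T w)), labels y_i in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        i = rng.integers(n)                                 # I_t uniform over {0, ..., n-1}
        grad_i = -y[i] * X[i] / (1.0 + np.exp(y[i] * (X[i] @ w)))  # gradient of one loss term
        w = w - eta * grad_i                                # single-example update
    return w
```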

Stochastic Gradient Descent: A Learning Perspective
Machine Learning CSE546, Kevin Jamieson, University of Washington
October 16, 2016

Learning Problems as Expectations (Sham Kakade 2016)
Minimizing loss on training data:
- Given a dataset sampled iid from some distribution $p(x)$ on features
- Loss function, e.g., hinge loss, logistic loss, ...
- We often minimize the loss on the training data.
However, we should really minimize the expected loss on all data.
So we are approximating the integral (the expectation) by the average over the training data.
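Written out, with $\ell(w; x)$ denoting the per-example loss (notation mine, not from the slide), the two objectives are

$$\hat{L}_n(w) = \frac{1}{n}\sum_{i=1}^{n} \ell(w; x_i) \qquad \text{versus} \qquad L(w) = \mathbb{E}_{x \sim p}\big[\ell(w; x)\big] = \int \ell(w; x)\, p(x)\, dx.$$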

Gradient Descent in Terms of Expectations (Sham Kakade 2016)
- True objective function:
- Taking the gradient:
- True gradient descent rule:
How do we estimate the expected gradient?

SGD: Stochastic Gradient Descent True gradient: Sample based approximation: What if we estimate gradient with just one sample??? Unbiased estimate of gradient Very noisy! Called stochastic gradient descent Among many other names VERY useful in practice!!! Sham Kakade 2016 25