Lecture 1: Supervised Learning

Tuo Zhao, Schools of ISYE and CSE, Georgia Tech

(Supervised) Regression Analysis

ISYE6740/CSE6740/CS7641: Computational Data Analysis / Machine Learning

Let's start by talking about a few examples of supervised learning problems. Suppose we have a dataset giving the living areas and prices of 47 houses from Portland, Oregon:

Living area (feet²)   Price ($1000s)
2104                  400
1600                  330
2400                  369
1416                  232
3000                  540

(Figure: scatter plot of housing prices, price in $1000s against living area in square feet.)

Given data like this, how can we learn to predict the prices of other houses in Portland, as a function of the size of their living areas?

Given $x_1, \dots, x_n \in \mathbb{R}^d$, $y_1, \dots, y_n \in \mathbb{R}$, and $f: \mathbb{R}^d \to \mathbb{R}$,
$$y_i = f(x_i) + \epsilon_i \quad \text{for } i = 1, \dots, n,$$
where the $\epsilon_i$'s are i.i.d. with $\mathbb{E}\epsilon_i = 0$ and $\mathbb{E}\epsilon_i^2 = \sigma^2 < \infty$.

Simple linear function: $f(x_i) = x_i^\top \theta$.

Why is it called supervised learning?

Why Supervised?

Play on Words?

Two unknown functions $f_0, f_1: \mathbb{R}^d \to \mathbb{R}$:
$$y_i = \mathbb{1}(z_i = 1)\, f_1(x_i) + \mathbb{1}(z_i = 0)\, f_0(x_i) + \epsilon_i,$$
where $i = 1, \dots, n$, and the $z_i$'s are i.i.d. with $P(z_i = 1) = \delta$ and $P(z_i = 0) = 1 - \delta$ for $\delta \in (0, 1)$.

$z_i$: latent variables. Supervised? Unsupervised?

Linear Regression

Linear Regression

Given $x_1, \dots, x_n \in \mathbb{R}^d$, $y_1, \dots, y_n \in \mathbb{R}$, and $\theta^* \in \mathbb{R}^d$,
$$y_i = x_i^\top \theta^* + \epsilon_i \quad \text{for } i = 1, \dots, n,$$
where the $\epsilon_i$'s are i.i.d. with $\mathbb{E}\epsilon_i = 0$ and $\mathbb{E}\epsilon_i^2 = \sigma^2 < \infty$.

Ordinary Least Squares (OLS) Regression:
$$\hat{\theta}^{\mathrm{OLS}} = \operatorname*{arg\,min}_{\theta} \frac{1}{2n} \sum_{i=1}^n (y_i - x_i^\top \theta)^2.$$

Least Absolute Deviation (LAD) Regression:
$$\hat{\theta}^{\mathrm{LAD}} = \operatorname*{arg\,min}_{\theta} \frac{1}{n} \sum_{i=1}^n |y_i - x_i^\top \theta|.$$

Robust Regression

Linear Regression: Matrix Notation

$X = [x_1, \dots, x_n]^\top \in \mathbb{R}^{n \times d}$, $y = [y_1, \dots, y_n]^\top \in \mathbb{R}^n$,
$$y = X\theta^* + \epsilon,$$
where $\mathbb{E}\epsilon = 0$ and $\mathbb{E}\epsilon\epsilon^\top = \sigma^2 I_n$.

Ordinary Least Squares Regression:
$$\hat{\theta}^{\mathrm{OLS}} = \operatorname*{arg\,min}_{\theta} \frac{1}{2n} \|y - X\theta\|_2^2.$$

Least Absolute Deviation Regression:
$$\hat{\theta}^{\mathrm{LAD}} = \operatorname*{arg\,min}_{\theta} \frac{1}{n} \|y - X\theta\|_1.$$

Least Squares Regression: Analytical Solution

Ordinary Least Squares Regression:
$$\hat{\theta}^{\mathrm{OLS}} = \operatorname*{arg\,min}_{\theta} \underbrace{\frac{1}{2n} \|y - X\theta\|_2^2}_{\mathcal{L}(\theta)}.$$

First-order optimality condition:
$$\nabla\mathcal{L}(\hat{\theta}) = \frac{1}{n} X^\top (X\hat{\theta} - y) = 0 \quad \Longleftrightarrow \quad X^\top X \hat{\theta} = X^\top y.$$

Analytical solution and unbiasedness:
$$\hat{\theta} = (X^\top X)^{-1} X^\top y = (X^\top X)^{-1} X^\top (X\theta^* + \epsilon) = \theta^* + (X^\top X)^{-1} X^\top \epsilon, \qquad \mathbb{E}_\epsilon \hat{\theta} = \theta^*.$$
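To make the normal equations concrete, here is a minimal NumPy sketch on synthetic data; the sizes $n$, $d$ and the noise level are arbitrary illustrative choices, not part of the lecture.

```python
import numpy as np

# Synthetic data (hypothetical sizes, for illustration only).
rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d)
y = X @ theta_star + 0.5 * rng.standard_normal(n)

# Solve the normal equations X^T X theta = X^T y directly;
# np.linalg.solve is preferred over forming the explicit inverse.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.linalg.norm(theta_hat - theta_star))  # small, since E[theta_hat] = theta_star
```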

Least Squares Regression: Convexity

Ordinary Least Squares Regression:
$$\hat{\theta}^{\mathrm{OLS}} = \operatorname*{arg\,min}_{\theta} \underbrace{\frac{1}{2n} \|y - X\theta\|_2^2}_{\mathcal{L}(\theta)}.$$

Second-order optimality condition:
$$\nabla^2\mathcal{L}(\hat{\theta}) = \frac{1}{n} X^\top X \succeq 0.$$

Convex function:
$$\mathcal{L}(\theta') \geq \mathcal{L}(\theta) + \nabla\mathcal{L}(\theta)^\top (\theta' - \theta).$$

Convex vs. Nonconvex Optimization

Stationary solutions: $\nabla\mathcal{L}(\theta) = 0$.
- Convex: every stationary solution is a global optimum. Easy but restrictive.
- Nonconvex: stationary solutions include local maxima, saddle points, local optima, and global optima. Difficult but flexible.

We may get stuck at a local optimum or saddle point in nonconvex optimization.

Maximum Likelihood Estimation

$X = [x_1, \dots, x_n]^\top \in \mathbb{R}^{n \times d}$, $y = [y_1, \dots, y_n]^\top \in \mathbb{R}^n$,
$$y = X\theta^* + \epsilon, \quad \text{where } \epsilon \sim N(0, \sigma^2 I_n).$$

Likelihood function:
$$\mathcal{L}(\theta) = (2\pi\sigma^2)^{-\frac{n}{2}} \exp\left(-\frac{1}{2\sigma^2} (y - X\theta)^\top (y - X\theta)\right).$$

Maximum log-likelihood estimation:
$$\hat{\theta}^{\mathrm{MLE}} = \operatorname*{arg\,max}_{\theta} \log \mathcal{L}(\theta) = \operatorname*{arg\,max}_{\theta} \left[-\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \|y - X\theta\|_2^2\right].$$

Maximum Likelihood Estimation

Maximum log-likelihood estimation:
$$\hat{\theta}^{\mathrm{MLE}} = \operatorname*{arg\,max}_{\theta} \left[-\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \|y - X\theta\|_2^2\right].$$

Treating $\sigma^2$ as an unknown constant,
$$\hat{\theta}^{\mathrm{MLE}} = \operatorname*{arg\,max}_{\theta} -\frac{1}{2n} \|y - X\theta\|_2^2 = \operatorname*{arg\,min}_{\theta} \frac{1}{2n} \|y - X\theta\|_2^2.$$

Probabilistic interpretation:
- Simple and illustrative.
- Restrictive and potentially misleading: remember the t-test? What if the model is wrong?

Computational Cost of OLS

Count the number of basic operations, e.g., addition, subtraction, multiplication, division:
- Matrix multiplication $X^\top X$: $O(nd^2)$
- Matrix inverse $(X^\top X)^{-1}$: $O(d^3)$
- Matrix-vector multiplication $X^\top y$: $O(nd)$
- Matrix-vector multiplication $[(X^\top X)^{-1}][X^\top y]$: $O(d^2)$

Overall computational cost: $O(nd^2)$, given $n \geq d$.

Scalability and Efficiency of OLS

- Simple closed-form solution, with overall computational cost $O(nd^2)$.
- Massive data: both $n$ and $d$ are large, so this is not very efficient or scalable.
- Are there better ways to improve the computation?

Optimization for Linear Regression

Vanilla Gradient Descent

$$\theta^{(k+1)} = \theta^{(k)} - \eta_k \nabla\mathcal{L}(\theta^{(k)}).$$

- $\eta_k > 0$ is the step size parameter (fixed or chosen by line search).
- Stop when the gradient is small: $\|\nabla\mathcal{L}(\theta^{(K)})\|_2 \leq \delta$.

(Figure: one gradient descent step on $f(\theta)$; at the minimizer $\hat{\theta}$, $\nabla f(\hat{\theta}) = 0$.)

Computational Cost of VGD

Gradient: $\nabla\mathcal{L}(\theta^{(k)}) = \frac{1}{n} X^\top (X\theta^{(k)} - y)$.
- Matrix-vector multiplication $X\theta^{(k)}$: $O(nd)$
- Vector subtraction $X\theta^{(k)} - y$: $O(n)$
- Matrix-vector multiplication $X^\top(X\theta^{(k)} - y)$: $O(nd)$

Overall computational cost per iteration: $O(nd)$.

Better than $O(nd^2)$, but how many iterations do we need?
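A minimal sketch of the $O(nd)$-per-iteration update, using the gradient and stopping rule from the slides above; the function name and default tolerances are illustrative only.

```python
import numpy as np

def gradient_descent_ols(X, y, eta, num_iters=1000, tol=1e-6):
    """Vanilla gradient descent for L(theta) = (1/2n) ||y - X theta||_2^2.
    Each iteration costs O(nd): two matrix-vector products with X."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(num_iters):
        grad = X.T @ (X @ theta - y) / n     # O(nd) per iteration
        if np.linalg.norm(grad) <= tol:      # stop when the gradient is small
            break
        theta -= eta * grad
    return theta
```

With a fixed step size, the standard worst-case choice is $\eta = 1/L$, where $L$ is the largest eigenvalue of $\frac{1}{n}X^\top X$ (see the smoothness discussion below).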

Rate of Convergence

What are good algorithms?
- Asymptotic convergence: $\theta^{(k)} \to \hat{\theta}$ as $k \to \infty$?
- Nonasymptotic rate of convergence: the optimization error after $k$ iterations.

Example (gap in objective value, sublinear convergence):
$$f(\theta^{(k)}) - f(\hat{\theta}) = O\left(L/k^2\right) \quad \text{vs.} \quad O\left(L/k\right),$$
where $L$ is some constant depending on the problem.

Example (gap in parameter, linear convergence):
$$\|\theta^{(k)} - \hat{\theta}\|_2^2 = O\left((1 - 1/\kappa)^k\right) \quad \text{vs.} \quad O\left((1 - 1/\sqrt{\kappa})^k\right),$$
where $\kappa$ is some constant depending on the problem.

Iteration Complexity of Gradient Descent

We need at most
$$K = O\left(\kappa \log\left(\frac{1}{\varepsilon}\right)\right)$$
iterations such that $\|\theta^{(K)} - \hat{\theta}\|_2^2 \leq \varepsilon$, where $\kappa$ is some constant depending on the problem.

What is $\kappa$? It is related to smoothness and convexity.

Strong Convexity

There exists a constant $\mu$ such that for any $\theta'$ and $\theta$, we have
$$\mathcal{L}(\theta') \geq \mathcal{L}(\theta) + \nabla\mathcal{L}(\theta)^\top (\theta' - \theta) + \frac{\mu}{2} \|\theta' - \theta\|_2^2.$$

Strong Smoothness

There exists a constant $L$ such that for any $\theta'$ and $\theta$, we have
$$\mathcal{L}(\theta') \leq \mathcal{L}(\theta) + \nabla\mathcal{L}(\theta)^\top (\theta' - \theta) + \frac{L}{2} \|\theta' - \theta\|_2^2.$$

Condition Number $\kappa = L/\mu$

$$f(\theta) = 0.9\,\theta_1^2 + 0.1\,\theta_2^2 \qquad \text{vs.} \qquad f(\theta) = 0.5\,\theta_1^2 + 0.5\,\theta_2^2$$

Vector Field Representation

$$f(\theta) = 0.9\,\theta_1^2 + 0.1\,\theta_2^2 \qquad \text{vs.} \qquad f(\theta) = 0.5\,\theta_1^2 + 0.5\,\theta_2^2$$

(Figure: negative gradient vector fields of the two quadratics.)

Understanding Regularity Conditions

Mean value theorem: there exists a constant $z \in [0, 1]$ such that for any $\theta'$ and $\theta$, we have
$$\mathcal{L}(\theta') - \mathcal{L}(\theta) - \nabla\mathcal{L}(\theta)^\top (\theta' - \theta) = \frac{1}{2} (\theta' - \theta)^\top \nabla^2\mathcal{L}(\bar{\theta}) (\theta' - \theta),$$
where $\bar{\theta}$ is a convex combination: $\bar{\theta} = z\theta' + (1 - z)\theta$.

Hessian matrix for OLS: $\nabla^2\mathcal{L}(\bar{\theta}) = \frac{1}{n} X^\top X$.

Control the remainder:
$$\underbrace{\Lambda_{\min}\left(\tfrac{1}{n} X^\top X\right)}_{\mu} \leq \frac{(\theta' - \theta)^\top \nabla^2\mathcal{L}(\bar{\theta}) (\theta' - \theta)}{\|\theta' - \theta\|_2^2} \leq \underbrace{\Lambda_{\max}\left(\tfrac{1}{n} X^\top X\right)}_{L}.$$
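For OLS the constants $\mu$, $L$, and the condition number $\kappa = L/\mu$ can be read off numerically from the eigenvalues of $\frac{1}{n}X^\top X$; a small sketch on synthetic data of arbitrary size:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.standard_normal((n, d))

# For OLS the Hessian is constant: (1/n) X^T X. Its extreme eigenvalues
# give the strong convexity constant mu and the smoothness constant L.
H = X.T @ X / n
eigvals = np.linalg.eigvalsh(H)   # sorted in ascending order
mu, L = eigvals[0], eigvals[-1]
kappa = L / mu                    # condition number
```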

Understanding Gradient Descent Algorithms

Iteratively minimize a quadratic approximation. At the $(k+1)$-th iteration, we consider
$$Q(\theta; \theta^{(k)}) = \mathcal{L}(\theta^{(k)}) + \nabla\mathcal{L}(\theta^{(k)})^\top (\theta - \theta^{(k)}) + \frac{L}{2} \|\theta - \theta^{(k)}\|_2^2.$$

We have $Q(\theta; \theta^{(k)}) \geq \mathcal{L}(\theta)$ and $Q(\theta^{(k)}; \theta^{(k)}) = \mathcal{L}(\theta^{(k)})$.

We take
$$\theta^{(k+1)} = \operatorname*{arg\,min}_{\theta} Q(\theta; \theta^{(k)}) = \theta^{(k)} - \frac{1}{L} \nabla\mathcal{L}(\theta^{(k)}).$$

Backtracking Line Search

The worst-case fixed step size: $\eta_k = 1/L$.

At the $(k+1)$-th iteration, we first try $\eta_k = \eta_{k-1}$, i.e.,
$$\theta^{(k+1)} = \theta^{(k)} - \eta_k \nabla\mathcal{L}(\theta^{(k)}),$$
and accept it if $Q_{\eta_k}(\theta^{(k+1)}; \theta^{(k)}) \geq \mathcal{L}(\theta^{(k+1)})$, where $Q_\eta$ is the quadratic approximation above with $L$ replaced by $1/\eta$.

Otherwise, we take $\eta_k = (1 - \delta)^m \eta_{k-1}$, where $\delta \in (0, 1)$ and $m$ is the smallest positive integer such that $Q_{\eta_k}(\theta^{(k+1)}; \theta^{(k)}) \geq \mathcal{L}(\theta^{(k+1)})$.
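A sketch of gradient descent with this backtracking rule for the OLS objective; the shrinkage factor and tolerances are illustrative defaults, not values from the slides.

```python
import numpy as np

def gd_backtracking_ols(X, y, eta0=1.0, delta=0.25, tol=1e-6, max_iter=1000):
    """Gradient descent for OLS with backtracking line search: shrink the
    step by (1 - delta) until L(theta_new) <= Q_eta(theta_new; theta)."""
    n, d = X.shape
    loss = lambda t: np.sum((y - X @ t) ** 2) / (2 * n)
    grad = lambda t: X.T @ (X @ t - y) / n

    theta, eta = np.zeros(d), eta0
    for _ in range(max_iter):
        g = grad(theta)
        if np.linalg.norm(g) <= tol:
            break
        while True:
            theta_new = theta - eta * g
            # For this step, Q_eta(theta_new; theta) = L(theta) - (eta / 2) ||g||^2.
            if loss(theta_new) <= loss(theta) - 0.5 * eta * (g @ g):
                break
            eta *= 1.0 - delta
        theta = theta_new
    return theta
```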

Backtracking Line Search

(Figure: backtracking illustrated with candidate step sizes $\eta_k = \eta_{k-1}$, $\eta_k = 0.75\,\eta_{k-1}$, and $\eta_k = 0.75^2\,\eta_{k-1}$.)

Tradeoff between Statistics and Computation

High Precision or Low Precision?

Can we tolerate a large $\varepsilon$?
- From a learning perspective, our interest is $\theta^*$, not $\hat{\theta}$.
- Error decomposition:
$$\|\theta^{(K)} - \theta^*\|_2 \leq \underbrace{\|\theta^{(K)} - \hat{\theta}\|_2}_{\text{Opt. Error}} + \underbrace{\|\hat{\theta} - \theta^*\|_2}_{\text{Stat. Error}}.$$
- High precision expects something like $\|\theta^{(K)} - \hat{\theta}\|_2 \leq 10^{-10}$.
- Does it make any difference?

High Precision or Low Precision?

The statistical error is undefeatable!

(Figure: error on a logarithmic scale vs. the number of iterations; the optimization error keeps decreasing while the statistical error stays flat.)

Tradeoff between Statistical and Optimization Errors

The statistical error is undefeatable! The statistical error of the optimal solution:
$$\mathbb{E}\|\hat{\theta} - \theta^*\|_2^2 = \mathbb{E}\|(X^\top X)^{-1} X^\top \epsilon\|_2^2 = \sigma^2 \operatorname{tr}\left[(X^\top X)^{-1}\right] = O\left(\frac{\sigma^2 d}{n}\right).$$

We only need $\|\theta^{(K)} - \hat{\theta}\|_2 \approx \|\hat{\theta} - \theta^*\|_2$.

Given $K = O\left(\kappa \log\left(\frac{n}{\sigma^2 d}\right)\right)$, we have
$$\mathbb{E}\|\theta^{(K)} - \theta^*\|_2^2 = O\left(\frac{\sigma^2 d}{n}\right).$$

Agnostic Learning

All models are wrong, but some are useful!

- Data generating process: $(X, Y) \sim \mathcal{D}$.
- The oracle model: $f_{\mathrm{oracle}}(X) = X^\top \theta_{\mathrm{oracle}}$, where
$$\theta_{\mathrm{oracle}} = \operatorname*{arg\,min}_{\theta} \mathbb{E}_{\mathcal{D}} (Y - X^\top \theta)^2.$$
- The estimated model: $\hat{f}(X) = X^\top \hat{\theta}$, where
$$\hat{\theta} = \operatorname*{arg\,min}_{\theta} \frac{1}{2n} \|y - X\theta\|_2^2, \quad \text{and} \quad (x_1, y_1), \dots, (x_n, y_n) \sim \mathcal{D}.$$
- At the $K$-th iteration: $f^{(K)}(X) = X^\top \theta^{(K)}$.

Agnostic Learning (See more details in CS-7545)

All models are wrong, but some are useful!

(Figure: the true model, the oracle model, and the estimated model, with the oracle and estimated models lying inside the class of all linear models.)

Agnostic Learning

- Approximation error: $\mathbb{E}_{\mathcal{D}} (Y - f_{\mathrm{oracle}}(X))^2$
- Estimation error: $\mathbb{E}_{\mathcal{D}} (\hat{f}(X) - f_{\mathrm{oracle}}(X))^2$
- Optimization error: $\mathbb{E}_{\mathcal{D}} (f^{(K)}(X) - \hat{f}(X))^2$

Decomposition of the statistical error:
$$\mathbb{E}_{\mathcal{D}} (Y - f^{(K)}(X))^2 \lesssim \mathbb{E}_{\mathcal{D}} (Y - f_{\mathrm{oracle}}(X))^2 + \mathbb{E}_{\mathcal{D}} (\hat{f}(X) - f_{\mathrm{oracle}}(X))^2 + \mathbb{E}_{\mathcal{D}} (f^{(K)}(X) - \hat{f}(X))^2.$$

How should we choose $\varepsilon$?

Scalable Computation of Linear Regression

Stochastic Approximation

What if $n$ is too large?

Empirical risk minimization: $\mathcal{L}(\theta) = \frac{1}{n} \sum_{i=1}^n \ell_i(\theta)$.

For least squares regression:
$$\ell_i(\theta) = \frac{1}{2} (y_i - x_i^\top \theta)^2 \quad \text{or} \quad \ell_i(\theta) = \frac{1}{2|\mathcal{M}_i|} \sum_{j \in \mathcal{M}_i} (y_j - x_j^\top \theta)^2.$$

Randomly sample $i$ from $1, \dots, n$ with equal probability:
$$\mathbb{E}_i \nabla\ell_i(\theta) = \nabla\mathcal{L}(\theta) \quad \text{and} \quad \mathbb{E} \|\nabla\ell_i(\theta) - \nabla\mathcal{L}(\theta)\|_2^2 \leq M^2.$$

Stochastic gradient (SG): replace $\nabla\mathcal{L}(\theta)$ with $\nabla\ell_i(\theta)$:
$$\theta^{(k+1)} = \theta^{(k)} - \eta_k \nabla\ell_{i_k}(\theta^{(k)}).$$
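A minimal SGD sketch for least squares, using the decreasing step size $\eta_k \approx 1/(k\mu)$ from the convergence result below and returning the averaged iterate; computing $\mu$ exactly from the data is only for illustration (in practice one would tune a step size schedule).

```python
import numpy as np

def sgd_ols(X, y, num_iters=50000, seed=0):
    """Stochastic gradient descent for OLS with step size eta_k = 1 / (mu * k)."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    mu = np.linalg.eigvalsh(X.T @ X / n)[0]    # strong convexity constant
    theta = np.zeros(d)
    theta_avg = np.zeros(d)
    for k in range(1, num_iters + 1):
        i = rng.integers(n)                     # uniform sample from {1, ..., n}
        grad_i = (X[i] @ theta - y[i]) * X[i]   # gradient of (1/2)(y_i - x_i^T theta)^2
        theta = theta - grad_i / (mu * k)
        theta_avg += (theta - theta_avg) / k    # running average of the iterates
    return theta_avg
```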

Why Stochastic Gradient?

(Figure: perturbed descent directions.)

Convergence of Stochastic Gradient Algorithms

How many iterations do we need?

A sequence of decreasing step size parameters: $\eta_k \asymp \frac{1}{k\mu}$.

Given a pre-specified error $\varepsilon$, we need
$$K = O\left(\frac{M^2 + L^2}{\mu^2 \varepsilon}\right)$$
iterations such that $\mathbb{E}\|\bar{\theta}^{(K)} - \hat{\theta}\|_2^2 \leq \varepsilon$, where $\bar{\theta}^{(K)} = \frac{1}{K} \sum_{k=1}^{K} \theta^{(k)}$.

When $\mu^2 \varepsilon n \gtrsim M^2 + L^2$, i.e., $n$ is super large,
$$O\left(\frac{d(M^2 + L^2)}{\mu^2 \varepsilon}\right) \quad \text{vs.} \quad \tilde{O}(\kappa n d) \quad \text{vs.} \quad O(nd^2).$$

Why a Decreasing Step Size?

Control variance + sufficient descent ⇒ convergence.

Intuition:
$$\theta^{(k+1)} = \theta^{(k)} - \underbrace{\eta_k \nabla\mathcal{L}(\theta^{(k)})}_{\text{Descent}} + \underbrace{\eta_k \left(\nabla\mathcal{L}(\theta^{(k)}) - \nabla\ell_i(\theta^{(k)})\right)}_{\text{Error}}.$$

- Not summable: sufficient exploration, $\sum_{k=1}^{\infty} \eta_k = \infty$.
- Square summable: diminishing variance, $\sum_{k=1}^{\infty} \eta_k^2 < \infty$.

Mini-batch Variance Reduction

Mini-batch SGD: if $|\mathcal{M}_i|$ increases, then $M^2$ decreases.
- Larger $|\mathcal{M}_i|$ means more computational cost per iteration.
- Smaller $M^2$ means fewer iterations.
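A sketch of the mini-batch variant: averaging the stochastic gradient over a batch lowers the variance bound $M^2$ roughly in proportion to $1/|\mathcal{M}_i|$, at $O(|\mathcal{M}_i|\, d)$ cost per iteration. The batch size and step size below are arbitrary illustrative choices, not values from the lecture.

```python
import numpy as np

def minibatch_sgd_ols(X, y, batch_size=32, eta=0.1, num_iters=2000, seed=0):
    """Mini-batch SGD for OLS: one averaged stochastic gradient per iteration."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    for _ in range(num_iters):
        idx = rng.choice(n, size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ theta - yb) / batch_size   # lower-variance estimate
        theta -= eta * grad
    return theta
```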

Variance Reduction by Control Variates

Stochastic Variance Reduced Gradient (SVRG) algorithm:
- At the $k$-th epoch: $\tilde{\theta} = \tilde{\theta}^{[k]}$, $\theta^{(0)} = \tilde{\theta}^{[k]}$.
- At the $t$-th iteration of the $k$-th epoch:
$$\theta^{(t+1)} = \theta^{(t)} - \eta_k \left(\nabla\ell_i(\theta^{(t)}) - \nabla\ell_i(\tilde{\theta}) + \nabla\mathcal{L}(\tilde{\theta})\right).$$
- After $m$ iterations of the $k$-th epoch: $\tilde{\theta}^{[k+1]} = \theta^{(m)}$.
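A sketch of this epoch structure for least squares; the step size and the inner-loop length $m$ are left as user choices (the theory below suggests $\eta \approx 1/L_{\max}$ and $m \approx \kappa_{\max}$), and the default $m = n$ is only a convenient placeholder.

```python
import numpy as np

def svrg_ols(X, y, eta, num_epochs=20, m=None, seed=0):
    """SVRG for OLS: full gradient at a snapshot, corrected stochastic steps."""
    n, d = X.shape
    m = n if m is None else m
    rng = np.random.default_rng(seed)
    theta_tilde = np.zeros(d)
    for _ in range(num_epochs):
        g_tilde = X.T @ (X @ theta_tilde - y) / n          # full gradient at snapshot
        theta = theta_tilde.copy()
        for _ in range(m):
            i = rng.integers(n)
            gi = (X[i] @ theta - y[i]) * X[i]              # grad of l_i at current iterate
            gi_tilde = (X[i] @ theta_tilde - y[i]) * X[i]  # grad of l_i at snapshot
            theta -= eta * (gi - gi_tilde + g_tilde)       # control-variate correction
        theta_tilde = theta                                # new snapshot after m steps
    return theta_tilde
```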

Strong Smoothness and Convexity: Regularity Conditions

(Strong smoothness) There exist constants $L_i$ such that for any $\theta'$ and $\theta$, we have
$$\ell_i(\theta') - \ell_i(\theta) - \nabla\ell_i(\theta)^\top (\theta' - \theta) \leq \frac{L_i}{2} \|\theta' - \theta\|_2^2.$$

(Strong convexity) There exists a constant $\mu$ such that for any $\theta'$ and $\theta$, we have
$$\mathcal{L}(\theta') - \mathcal{L}(\theta) - \nabla\mathcal{L}(\theta)^\top (\theta' - \theta) \geq \frac{\mu}{2} \|\theta' - \theta\|_2^2.$$

Condition numbers:
$$\kappa_{\max} = \frac{\max_i L_i}{\mu} \geq \kappa = \frac{L}{\mu}.$$

Why Does SVRG Work?

The strong smoothness implies:
$$\|\nabla\ell_i(\theta^{(k)}) - \nabla\ell_i(\tilde{\theta})\|_2 \leq L_{\max} \|\theta^{(k)} - \tilde{\theta}\|_2.$$

Bias correction:
$$\mathbb{E}\left[\nabla\ell_i(\theta^{(k)}) - \nabla\ell_i(\tilde{\theta})\right] = \nabla\mathcal{L}(\theta^{(k)}) - \nabla\mathcal{L}(\tilde{\theta}).$$

Variance reduction: as $\theta^{(k)} \to \hat{\theta}$ and $\tilde{\theta} \to \hat{\theta}$,
$$\mathbb{E}\left\|\nabla\ell_i(\theta^{(k)}) - \nabla\ell_i(\tilde{\theta}) + \nabla\mathcal{L}(\tilde{\theta}) - \nabla\mathcal{L}(\theta^{(k)})\right\|_2^2 \lesssim \mathbb{E}\left\|\nabla\ell_i(\theta^{(k)}) - \nabla\ell_i(\tilde{\theta})\right\|_2^2 \to 0.$$

Convergence of SVRG

How many iterations do we need?

Fixed step size parameter: $\eta_k \asymp \frac{1}{L_{\max}}$.

Given a pre-specified error $\varepsilon$ and $m \asymp \kappa_{\max}$, we need
$$K = O\left(\log\left(\frac{1}{\varepsilon}\right)\right)$$
epochs such that $\mathbb{E}\|\tilde{\theta}^{[K]} - \hat{\theta}\|_2^2 \leq \varepsilon$.

Total number of operations:
$$\tilde{O}\left(nd + d\kappa_{\max}\right) \quad \text{vs.} \quad O\left(\frac{dM^2}{\mu^2 \varepsilon} + \frac{d\kappa^2}{\varepsilon}\right) \quad \text{vs.} \quad \tilde{O}(nd\kappa).$$

Comparison of GD, SGD and SVRG

Summary

- The empirical performance highly depends on the implementation.
- Cyclic or shuffled order (not truly stochastic) is common in practice.
- Too many tuning parameters means my algorithm might only work in theory.
- Theoretical bounds can be very loose; the constants may matter a lot in practice.
- Good software engineers with B.S./M.S. degrees can earn much more than Ph.D.'s, if they know how to code efficient algorithms.

Classification Analysis

Classification vs. Regression

Logistic Regression

Given $x_1, \dots, x_n \in \mathbb{R}^d$ and $\theta^* \in \mathbb{R}^d$,
$$y_i \sim \mathrm{Bernoulli}\left(h(x_i^\top \theta^*)\right) \quad \text{for } i = 1, \dots, n,$$
where $h: (-\infty, \infty) \to [0, 1]$ is the logistic/sigmoid function
$$h(z) = \frac{1}{1 + \exp(-z)}.$$

Remark: $h(0) = 0.5$, $h(-\infty) = 0$, and $h(\infty) = 1$.
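A numerically stable way to evaluate $h(z)$ for array inputs (a standard trick, not from the slides): apply exp only to non-positive arguments, so large $|z|$ never overflows.

```python
import numpy as np

def sigmoid(z):
    """Logistic function h(z) = 1 / (1 + exp(-z)), computed stably."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])          # exp of negative values only: no overflow
    out[~pos] = ez / (1.0 + ez)
    return out

print(sigmoid(np.array([-1e3, 0.0, 1e3])))   # -> [0.  0.5 1. ]
```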

Logistic/Sigmoid Function

Logistic Regression: Maximum Likelihood Estimation

$$\begin{aligned}
\hat{\theta} &= \operatorname*{arg\,max}_{\theta} \mathcal{L}(\theta) \\
&= \operatorname*{arg\,max}_{\theta} \log \prod_{i=1}^n h(x_i^\top \theta)^{y_i} \left(1 - h(x_i^\top \theta)\right)^{1 - y_i} \\
&= \operatorname*{arg\,max}_{\theta} \sum_{i=1}^n \left[ y_i \log h(x_i^\top \theta) + (1 - y_i) \log\left(1 - h(x_i^\top \theta)\right) \right] \\
&= \operatorname*{arg\,max}_{\theta} \frac{1}{n} \sum_{i=1}^n \left[ y_i x_i^\top \theta - \log\left(1 + \exp(x_i^\top \theta)\right) \right] \\
&= \operatorname*{arg\,min}_{\theta} \frac{1}{n} \sum_{i=1}^n \left[ \log\left(1 + \exp(x_i^\top \theta)\right) - y_i x_i^\top \theta \right].
\end{aligned}$$
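As a hedged sketch (not code from the lecture), the final objective and its gradient with labels $y_i \in \{0, 1\}$; np.logaddexp is used for a stable $\log(1 + \exp(\cdot))$.

```python
import numpy as np

def logistic_loss(theta, X, y):
    """(1/n) sum_i [ log(1 + exp(x_i^T theta)) - y_i x_i^T theta ], y_i in {0, 1}."""
    z = X @ theta
    return np.mean(np.logaddexp(0.0, z) - y * z)

def logistic_grad(theta, X, y):
    """Gradient (1/n) X^T (h(X theta) - y), with h the sigmoid."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return X.T @ (h - y) / X.shape[0]
```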

Optimization for Logistic Regression

Is it a convex problem? Let $\mathcal{F}(\theta) = -\frac{1}{n}\log\mathcal{L}(\theta)$. Then
$$\nabla^2 \mathcal{F}(\theta) = \frac{1}{n} \sum_{i=1}^n h(x_i^\top \theta)\left(1 - h(x_i^\top \theta)\right) x_i x_i^\top \succeq 0.$$

- There is no closed-form solution.
- Gradient descent and stochastic gradient algorithms are applicable.

Prediction for Logistic Regression

Prediction: given a new $x$, is
$$\hat{P}(y = 1) = \frac{1}{1 + \exp(-\hat{\theta}^\top x)} \geq 0.5 \,?$$

Why a linear classifier?
$$\hat{P}(y = 1) \geq 0.5 \iff \hat{\theta}^\top x \geq 0 \quad \Longrightarrow \quad \hat{y} = \operatorname{sign}(\hat{\theta}^\top x).$$

Logistic Loss

Given $x_1, \dots, x_n \in \mathbb{R}^d$, $y_1, \dots, y_n \in \{-1, 1\}$, and $\theta^* \in \mathbb{R}^d$,
$$P(y_i = 1) = \frac{1}{1 + \exp(-x_i^\top \theta^*)} \quad \text{for } i = 1, \dots, n.$$

An alternative formulation:
$$\hat{\theta} = \operatorname*{arg\,min}_{\theta} \frac{1}{n} \sum_{i=1}^n \log\left(1 + \exp(-y_i x_i^\top \theta)\right).$$

We can also use the 0-1 loss:
$$\hat{\theta} = \operatorname*{arg\,min}_{\theta} \frac{1}{n} \sum_{i=1}^n \mathbb{1}\left(\operatorname{sign}(x_i^\top \theta) \neq y_i\right).$$

Loss Functions for Classification

Newton's Method

At the $k$-th iteration, we take
$$\theta^{(k+1)} = \theta^{(k)} - \eta_k \left[\nabla^2 \mathcal{F}(\theta^{(k)})\right]^{-1} \nabla\mathcal{F}(\theta^{(k)}),$$
where $\eta_k > 0$ is a step size parameter.

The second-order Taylor approximation:
$$\theta^{(k+0.5)} = \operatorname*{arg\,min}_{\theta}\; \mathcal{F}(\theta^{(k)}) + \nabla\mathcal{F}(\theta^{(k)})^\top (\theta - \theta^{(k)}) + \frac{1}{2} (\theta - \theta^{(k)})^\top \nabla^2\mathcal{F}(\theta^{(k)}) (\theta - \theta^{(k)}).$$

Backtracking line search:
$$\theta^{(k+1)} = \theta^{(k)} + \eta_k \left(\theta^{(k+0.5)} - \theta^{(k)}\right).$$
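A minimal Newton sketch for the logistic-regression objective $\mathcal{F}(\theta)$ above, with a fixed step size $\eta_k = 1$ rather than a backtracking line search; it solves the Newton linear system instead of forming an explicit inverse.

```python
import numpy as np

def newton_logistic(X, y, num_iters=25, tol=1e-10):
    """Newton's method for F(theta) = (1/n) sum_i [log(1 + exp(x_i^T theta)) - y_i x_i^T theta]."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(num_iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))
        grad = X.T @ (h - y) / n
        if np.linalg.norm(grad) <= tol:
            break
        W = h * (1.0 - h)                      # weights h(x_i^T theta)(1 - h(x_i^T theta))
        hess = (X * W[:, None]).T @ X / n      # (1/n) X^T diag(W) X
        theta -= np.linalg.solve(hess, grad)   # Newton step with eta_k = 1
    return theta
```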

Newton's Method

Newton's Method

Sublinear + quadratic convergence:
- Given $\|\theta^{(k)} - \hat{\theta}\|_2^2 \leq R \ll 1$, we have
$$\|\theta^{(k+1)} - \hat{\theta}\|_2^2 \leq (1 - \delta)\, \|\theta^{(k)} - \hat{\theta}\|_2^4.$$
- Given $\|\theta^{(k)} - \hat{\theta}\|_2^2 \geq R$, we have $\|\theta^{(k+1)} - \hat{\theta}\|_2^2 = O(1/k)$.

Iteration complexity (some parameters hidden for simplicity):
$$O\left(\log\left(\frac{1}{R}\right) + \log\log\left(\frac{1}{\epsilon}\right)\right).$$

Newton's Method

Advantages:
- More efficient for highly accurate solutions.
- Avoids extensively calculating log or exp functions (Taylor expansions combined with a lookup table).
- Fewer line search steps, due to quadratic convergence.
- Often more efficient than gradient descent.

Newton's Method

Newton's Method

Disadvantages:
- Computing inverse Hessian matrices is expensive!
- Storing inverse Hessian matrices is expensive!

Subsampled Newton:
$$\theta^{(k+1)} = \theta^{(k)} - \eta_k \left[H(\theta^{(k)})\right]^{-1} \nabla\mathcal{F}(\theta^{(k)}).$$

Quasi-Newton methods (DFP, BFGS, Broyden, SR1):
- Use differences of gradient vectors to approximate Hessian matrices.