Lecture 1: Supervised Learning

Tuo Zhao, Schools of ISYE and CSE, Georgia Tech

(Supervised) Regression Analysis

ISYE6740/CSE6740/CS7641: Computational Data Analysis / Machine Learning

Let's start by talking about a few examples of supervised learning problems. Suppose we have a dataset giving the living areas and prices of 47 houses from Portland, Oregon:

Living area (feet²)   Price ($1000s)
2104                  400
1600                  330
2400                  369
1416                  232
3000                  540

(Figure: scatter plot of housing prices, price in $1000s against living area in square feet.)

Given data like this, how can we learn to predict the prices of other houses in Portland, as a function of the size of their living areas?

Given $x_1, \dots, x_n \in \mathbb{R}^d$, $y_1, \dots, y_n \in \mathbb{R}$, and $f: \mathbb{R}^d \to \mathbb{R}$,
$$y_i = f(x_i) + \epsilon_i \quad \text{for } i = 1, \dots, n,$$
where the $\epsilon_i$'s are i.i.d. with $\mathbb{E}\epsilon_i = 0$ and $\mathbb{E}\epsilon_i^2 = \sigma^2 < \infty$.

Simple linear function: $f(x_i) = x_i^\top \theta$.

Why is it called supervised learning?

Why Supervised?

Play on Words?

Two unknown functions $f_0, f_1: \mathbb{R}^d \to \mathbb{R}$:
$$y_i = \mathbb{1}(z_i = 1)\, f_1(x_i) + \mathbb{1}(z_i = 0)\, f_0(x_i) + \epsilon_i,$$
where $i = 1, \dots, n$, and the $z_i$'s are i.i.d. with $P(z_i = 1) = \delta$ and $P(z_i = 0) = 1 - \delta$ for $\delta \in (0, 1)$.

$z_i$: latent variables. Supervised? Unsupervised?

Linear Regression

Linear Regression

Given $x_1, \dots, x_n \in \mathbb{R}^d$, $y_1, \dots, y_n \in \mathbb{R}$, and $\theta^* \in \mathbb{R}^d$,
$$y_i = x_i^\top \theta^* + \epsilon_i \quad \text{for } i = 1, \dots, n,$$
where the $\epsilon_i$'s are i.i.d. with $\mathbb{E}\epsilon_i = 0$ and $\mathbb{E}\epsilon_i^2 = \sigma^2 < \infty$.

Ordinary Least Squares (OLS) Regression:
$$\hat{\theta}^{\mathrm{OLS}} = \operatorname*{arg\,min}_{\theta} \frac{1}{2n} \sum_{i=1}^n (y_i - x_i^\top \theta)^2.$$

Least Absolute Deviation (LAD) Regression:
$$\hat{\theta}^{\mathrm{LAD}} = \operatorname*{arg\,min}_{\theta} \frac{1}{n} \sum_{i=1}^n |y_i - x_i^\top \theta|.$$

Robust Regression

Linear Regression: Matrix Notation

$X = [x_1, \dots, x_n]^\top \in \mathbb{R}^{n \times d}$, $y = [y_1, \dots, y_n]^\top \in \mathbb{R}^n$,
$$y = X\theta^* + \epsilon,$$
where $\mathbb{E}\epsilon = 0$ and $\mathbb{E}\epsilon\epsilon^\top = \sigma^2 I_n$.

Ordinary Least Squares Regression:
$$\hat{\theta}^{\mathrm{OLS}} = \operatorname*{arg\,min}_{\theta} \frac{1}{2n} \|y - X\theta\|_2^2.$$

Least Absolute Deviation Regression:
$$\hat{\theta}^{\mathrm{LAD}} = \operatorname*{arg\,min}_{\theta} \frac{1}{n} \|y - X\theta\|_1.$$

Least Squares Regression: Analytical Solution

Ordinary Least Squares Regression:
$$\hat{\theta}^{\mathrm{OLS}} = \operatorname*{arg\,min}_{\theta} \underbrace{\frac{1}{2n} \|y - X\theta\|_2^2}_{\mathcal{L}(\theta)}.$$

First-order optimality condition:
$$\nabla\mathcal{L}(\hat{\theta}) = \frac{1}{n} X^\top (X\hat{\theta} - y) = 0 \quad \Longleftrightarrow \quad X^\top X \hat{\theta} = X^\top y.$$

Analytical solution and unbiasedness:
$$\hat{\theta} = (X^\top X)^{-1} X^\top y = (X^\top X)^{-1} X^\top (X\theta^* + \epsilon) = \theta^* + (X^\top X)^{-1} X^\top \epsilon, \qquad \mathbb{E}_\epsilon \hat{\theta} = \theta^*.$$
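To make the normal equations concrete, here is a minimal NumPy sketch on synthetic data; the sizes $n$, $d$ and the noise level are arbitrary illustrative choices, not part of the lecture.

```python
import numpy as np

# Synthetic data (hypothetical sizes, for illustration only).
rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d)
y = X @ theta_star + 0.5 * rng.standard_normal(n)

# Solve the normal equations X^T X theta = X^T y directly;
# np.linalg.solve is preferred over forming the explicit inverse.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.linalg.norm(theta_hat - theta_star))  # small, since E[theta_hat] = theta_star
```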

Least Squares Regression: Convexity

Ordinary Least Squares Regression:
$$\hat{\theta}^{\mathrm{OLS}} = \operatorname*{arg\,min}_{\theta} \underbrace{\frac{1}{2n} \|y - X\theta\|_2^2}_{\mathcal{L}(\theta)}.$$

Second-order optimality condition:
$$\nabla^2\mathcal{L}(\hat{\theta}) = \frac{1}{n} X^\top X \succeq 0.$$

Convex function:
$$\mathcal{L}(\theta') \geq \mathcal{L}(\theta) + \nabla\mathcal{L}(\theta)^\top (\theta' - \theta).$$

Convex vs. Nonconvex Optimization

Stationary solutions: $\nabla\mathcal{L}(\theta) = 0$.
- Convex: every stationary solution is a global optimum. Easy but restrictive.
- Nonconvex: stationary solutions include local maxima, saddle points, local optima, and global optima. Difficult but flexible.

We may get stuck at a local optimum or saddle point in nonconvex optimization.

Maximum Likelihood Estimation

$X = [x_1, \dots, x_n]^\top \in \mathbb{R}^{n \times d}$, $y = [y_1, \dots, y_n]^\top \in \mathbb{R}^n$,
$$y = X\theta^* + \epsilon, \quad \text{where } \epsilon \sim N(0, \sigma^2 I_n).$$

Likelihood function:
$$\mathcal{L}(\theta) = (2\pi\sigma^2)^{-\frac{n}{2}} \exp\left(-\frac{1}{2\sigma^2} (y - X\theta)^\top (y - X\theta)\right).$$

Maximum log-likelihood estimation:
$$\hat{\theta}^{\mathrm{MLE}} = \operatorname*{arg\,max}_{\theta} \log \mathcal{L}(\theta) = \operatorname*{arg\,max}_{\theta} \left[-\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \|y - X\theta\|_2^2\right].$$

Maximum Likelihood Estimation

Maximum log-likelihood estimation:
$$\hat{\theta}^{\mathrm{MLE}} = \operatorname*{arg\,max}_{\theta} \left[-\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \|y - X\theta\|_2^2\right].$$

Treating $\sigma^2$ as an unknown constant,
$$\hat{\theta}^{\mathrm{MLE}} = \operatorname*{arg\,max}_{\theta} -\frac{1}{2n} \|y - X\theta\|_2^2 = \operatorname*{arg\,min}_{\theta} \frac{1}{2n} \|y - X\theta\|_2^2.$$

Probabilistic interpretation:
- Simple and illustrative.
- Restrictive and potentially misleading: remember the t-test? What if the model is wrong?

Computational Cost of OLS

Count the number of basic operations, e.g., addition, subtraction, multiplication, division:
- Matrix multiplication $X^\top X$: $O(nd^2)$
- Matrix inverse $(X^\top X)^{-1}$: $O(d^3)$
- Matrix-vector multiplication $X^\top y$: $O(nd)$
- Matrix-vector multiplication $[(X^\top X)^{-1}][X^\top y]$: $O(d^2)$

Overall computational cost: $O(nd^2)$, given $n \geq d$.

Scalability and Efficiency of OLS

- Simple closed-form solution, with overall computational cost $O(nd^2)$.
- Massive data: both $n$ and $d$ are large, so this is not very efficient or scalable.
- Are there better ways to improve the computation?

Optimization for Linear Regression

Vanilla Gradient Descent

$$\theta^{(k+1)} = \theta^{(k)} - \eta_k \nabla\mathcal{L}(\theta^{(k)}).$$

- $\eta_k > 0$ is the step size parameter (fixed or chosen by line search).
- Stop when the gradient is small: $\|\nabla\mathcal{L}(\theta^{(K)})\|_2 \leq \delta$.

(Figure: one gradient descent step on $f(\theta)$; at the minimizer $\hat{\theta}$, $\nabla f(\hat{\theta}) = 0$.)

Computational Cost of VGD

Gradient: $\nabla\mathcal{L}(\theta^{(k)}) = \frac{1}{n} X^\top (X\theta^{(k)} - y)$.
- Matrix-vector multiplication $X\theta^{(k)}$: $O(nd)$
- Vector subtraction $X\theta^{(k)} - y$: $O(n)$
- Matrix-vector multiplication $X^\top(X\theta^{(k)} - y)$: $O(nd)$

Overall computational cost per iteration: $O(nd)$.

Better than $O(nd^2)$, but how many iterations do we need?
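A minimal sketch of the $O(nd)$-per-iteration update, using the gradient and stopping rule from the slides above; the function name and default tolerances are illustrative only.

```python
import numpy as np

def gradient_descent_ols(X, y, eta, num_iters=1000, tol=1e-6):
    """Vanilla gradient descent for L(theta) = (1/2n) ||y - X theta||_2^2.
    Each iteration costs O(nd): two matrix-vector products with X."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(num_iters):
        grad = X.T @ (X @ theta - y) / n     # O(nd) per iteration
        if np.linalg.norm(grad) <= tol:      # stop when the gradient is small
            break
        theta -= eta * grad
    return theta
```

With a fixed step size, the standard worst-case choice is $\eta = 1/L$, where $L$ is the largest eigenvalue of $\frac{1}{n}X^\top X$ (see the smoothness discussion below).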

Rate of Convergence

What are good algorithms?
- Asymptotic convergence: $\theta^{(k)} \to \hat{\theta}$ as $k \to \infty$?
- Nonasymptotic rate of convergence: the optimization error after $k$ iterations.

Example (gap in objective value, sublinear convergence):
$$f(\theta^{(k)}) - f(\hat{\theta}) = O\left(L/k^2\right) \quad \text{vs.} \quad O\left(L/k\right),$$
where $L$ is some constant depending on the problem.

Example (gap in parameter, linear convergence):
$$\|\theta^{(k)} - \hat{\theta}\|_2^2 = O\left((1 - 1/\kappa)^k\right) \quad \text{vs.} \quad O\left((1 - 1/\sqrt{\kappa})^k\right),$$
where $\kappa$ is some constant depending on the problem.

Iteration Complexity of Gradient Descent

We need at most
$$K = O\left(\kappa \log\left(\frac{1}{\varepsilon}\right)\right)$$
iterations such that $\|\theta^{(K)} - \hat{\theta}\|_2^2 \leq \varepsilon$, where $\kappa$ is some constant depending on the problem.

What is $\kappa$? It is related to smoothness and convexity.

Strong Convexity

There exists a constant $\mu$ such that for any $\theta'$ and $\theta$, we have
$$\mathcal{L}(\theta') \geq \mathcal{L}(\theta) + \nabla\mathcal{L}(\theta)^\top (\theta' - \theta) + \frac{\mu}{2} \|\theta' - \theta\|_2^2.$$

Strong Smoothness

There exists a constant $L$ such that for any $\theta'$ and $\theta$, we have
$$\mathcal{L}(\theta') \leq \mathcal{L}(\theta) + \nabla\mathcal{L}(\theta)^\top (\theta' - \theta) + \frac{L}{2} \|\theta' - \theta\|_2^2.$$

Condition Number $\kappa = L/\mu$

$$f(\theta) = 0.9\,\theta_1^2 + 0.1\,\theta_2^2 \qquad \text{vs.} \qquad f(\theta) = 0.5\,\theta_1^2 + 0.5\,\theta_2^2$$

Vector Field Representation

$$f(\theta) = 0.9\,\theta_1^2 + 0.1\,\theta_2^2 \qquad \text{vs.} \qquad f(\theta) = 0.5\,\theta_1^2 + 0.5\,\theta_2^2$$

(Figure: negative gradient vector fields of the two quadratics.)

Understanding Regularity Conditions

Mean value theorem: there exists a constant $z \in [0, 1]$ such that for any $\theta'$ and $\theta$, we have
$$\mathcal{L}(\theta') - \mathcal{L}(\theta) - \nabla\mathcal{L}(\theta)^\top (\theta' - \theta) = \frac{1}{2} (\theta' - \theta)^\top \nabla^2\mathcal{L}(\bar{\theta}) (\theta' - \theta),$$
where $\bar{\theta}$ is a convex combination: $\bar{\theta} = z\theta' + (1 - z)\theta$.

Hessian matrix for OLS: $\nabla^2\mathcal{L}(\bar{\theta}) = \frac{1}{n} X^\top X$.

Control the remainder:
$$\underbrace{\Lambda_{\min}\left(\tfrac{1}{n} X^\top X\right)}_{\mu} \leq \frac{(\theta' - \theta)^\top \nabla^2\mathcal{L}(\bar{\theta}) (\theta' - \theta)}{\|\theta' - \theta\|_2^2} \leq \underbrace{\Lambda_{\max}\left(\tfrac{1}{n} X^\top X\right)}_{L}.$$
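For OLS the constants $\mu$, $L$, and the condition number $\kappa = L/\mu$ can be read off numerically from the eigenvalues of $\frac{1}{n}X^\top X$; a small sketch on synthetic data of arbitrary size:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.standard_normal((n, d))

# For OLS the Hessian is constant: (1/n) X^T X. Its extreme eigenvalues
# give the strong convexity constant mu and the smoothness constant L.
H = X.T @ X / n
eigvals = np.linalg.eigvalsh(H)   # sorted in ascending order
mu, L = eigvals[0], eigvals[-1]
kappa = L / mu                    # condition number
```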

Understanding Gradient Descent Algorithms

Iteratively minimize a quadratic approximation. At the $(k+1)$-th iteration, we consider
$$Q(\theta; \theta^{(k)}) = \mathcal{L}(\theta^{(k)}) + \nabla\mathcal{L}(\theta^{(k)})^\top (\theta - \theta^{(k)}) + \frac{L}{2} \|\theta - \theta^{(k)}\|_2^2.$$

We have $Q(\theta; \theta^{(k)}) \geq \mathcal{L}(\theta)$ and $Q(\theta^{(k)}; \theta^{(k)}) = \mathcal{L}(\theta^{(k)})$.

We take
$$\theta^{(k+1)} = \operatorname*{arg\,min}_{\theta} Q(\theta; \theta^{(k)}) = \theta^{(k)} - \frac{1}{L} \nabla\mathcal{L}(\theta^{(k)}).$$

Backtracking Line Search

The worst-case fixed step size: $\eta_k = 1/L$.

At the $(k+1)$-th iteration, we first try $\eta_k = \eta_{k-1}$, i.e.,
$$\theta^{(k+1)} = \theta^{(k)} - \eta_k \nabla\mathcal{L}(\theta^{(k)}),$$
and accept it if $Q_{\eta_k}(\theta^{(k+1)}; \theta^{(k)}) \geq \mathcal{L}(\theta^{(k+1)})$, where $Q_\eta$ is the quadratic approximation above with $L$ replaced by $1/\eta$.

Otherwise, we take $\eta_k = (1 - \delta)^m \eta_{k-1}$, where $\delta \in (0, 1)$ and $m$ is the smallest positive integer such that $Q_{\eta_k}(\theta^{(k+1)}; \theta^{(k)}) \geq \mathcal{L}(\theta^{(k+1)})$.
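A sketch of gradient descent with this backtracking rule for the OLS objective; the shrinkage factor and tolerances are illustrative defaults, not values from the slides.

```python
import numpy as np

def gd_backtracking_ols(X, y, eta0=1.0, delta=0.25, tol=1e-6, max_iter=1000):
    """Gradient descent for OLS with backtracking line search: shrink the
    step by (1 - delta) until L(theta_new) <= Q_eta(theta_new; theta)."""
    n, d = X.shape
    loss = lambda t: np.sum((y - X @ t) ** 2) / (2 * n)
    grad = lambda t: X.T @ (X @ t - y) / n

    theta, eta = np.zeros(d), eta0
    for _ in range(max_iter):
        g = grad(theta)
        if np.linalg.norm(g) <= tol:
            break
        while True:
            theta_new = theta - eta * g
            # For this step, Q_eta(theta_new; theta) = L(theta) - (eta / 2) ||g||^2.
            if loss(theta_new) <= loss(theta) - 0.5 * eta * (g @ g):
                break
            eta *= 1.0 - delta
        theta = theta_new
    return theta
```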

Backtracking Line Search

(Figure: backtracking illustrated with candidate step sizes $\eta_k = \eta_{k-1}$, $\eta_k = 0.75\,\eta_{k-1}$, and $\eta_k = 0.75^2\,\eta_{k-1}$.)

Tradeoff between Statistics and Computation

High Precision or Low Precision?

Can we tolerate a large $\varepsilon$?
- From a learning perspective, our interest is $\theta^*$, not $\hat{\theta}$.
- Error decomposition:
$$\|\theta^{(K)} - \theta^*\|_2 \leq \underbrace{\|\theta^{(K)} - \hat{\theta}\|_2}_{\text{Opt. Error}} + \underbrace{\|\hat{\theta} - \theta^*\|_2}_{\text{Stat. Error}}.$$
- High precision expects something like $\|\theta^{(K)} - \hat{\theta}\|_2 \leq 10^{-10}$.
- Does it make any difference?

High Precision or Low Precision?

The statistical error is undefeatable!

(Figure: error on a logarithmic scale vs. the number of iterations; the optimization error keeps decreasing while the statistical error stays flat.)

Tradeoff between Statistical and Optimization Errors

The statistical error is undefeatable! The statistical error of the optimal solution:
$$\mathbb{E}\|\hat{\theta} - \theta^*\|_2^2 = \mathbb{E}\|(X^\top X)^{-1} X^\top \epsilon\|_2^2 = \sigma^2 \operatorname{tr}\left[(X^\top X)^{-1}\right] = O\left(\frac{\sigma^2 d}{n}\right).$$

We only need $\|\theta^{(K)} - \hat{\theta}\|_2 \approx \|\hat{\theta} - \theta^*\|_2$.

Given $K = O\left(\kappa \log\left(\frac{n}{\sigma^2 d}\right)\right)$, we have
$$\mathbb{E}\|\theta^{(K)} - \theta^*\|_2^2 = O\left(\frac{\sigma^2 d}{n}\right).$$

Agnostic Learning

All models are wrong, but some are useful!

- Data generating process: $(X, Y) \sim \mathcal{D}$.
- The oracle model: $f_{\mathrm{oracle}}(X) = X^\top \theta_{\mathrm{oracle}}$, where
$$\theta_{\mathrm{oracle}} = \operatorname*{arg\,min}_{\theta} \mathbb{E}_{\mathcal{D}} (Y - X^\top \theta)^2.$$
- The estimated model: $\hat{f}(X) = X^\top \hat{\theta}$, where
$$\hat{\theta} = \operatorname*{arg\,min}_{\theta} \frac{1}{2n} \|y - X\theta\|_2^2, \quad \text{and} \quad (x_1, y_1), \dots, (x_n, y_n) \sim \mathcal{D}.$$
- At the $K$-th iteration: $f^{(K)}(X) = X^\top \theta^{(K)}$.

Agnostic Learning (See more details in CS-7545)

All models are wrong, but some are useful!

(Figure: the true model, the oracle model, and the estimated model, with the oracle and estimated models lying inside the class of all linear models.)

Agnostic Learning

- Approximation error: $\mathbb{E}_{\mathcal{D}} (Y - f_{\mathrm{oracle}}(X))^2$
- Estimation error: $\mathbb{E}_{\mathcal{D}} (\hat{f}(X) - f_{\mathrm{oracle}}(X))^2$
- Optimization error: $\mathbb{E}_{\mathcal{D}} (f^{(K)}(X) - \hat{f}(X))^2$

Decomposition of the statistical error:
$$\mathbb{E}_{\mathcal{D}} (Y - f^{(K)}(X))^2 \lesssim \mathbb{E}_{\mathcal{D}} (Y - f_{\mathrm{oracle}}(X))^2 + \mathbb{E}_{\mathcal{D}} (\hat{f}(X) - f_{\mathrm{oracle}}(X))^2 + \mathbb{E}_{\mathcal{D}} (f^{(K)}(X) - \hat{f}(X))^2.$$

How should we choose $\varepsilon$?

Scalable Computation of Linear Regression

Stochastic Approximation

What if $n$ is too large?

Empirical risk minimization: $\mathcal{L}(\theta) = \frac{1}{n} \sum_{i=1}^n \ell_i(\theta)$.

For least squares regression:
$$\ell_i(\theta) = \frac{1}{2} (y_i - x_i^\top \theta)^2 \quad \text{or} \quad \ell_i(\theta) = \frac{1}{2|\mathcal{M}_i|} \sum_{j \in \mathcal{M}_i} (y_j - x_j^\top \theta)^2.$$

Randomly sample $i$ from $1, \dots, n$ with equal probability:
$$\mathbb{E}_i \nabla\ell_i(\theta) = \nabla\mathcal{L}(\theta) \quad \text{and} \quad \mathbb{E} \|\nabla\ell_i(\theta) - \nabla\mathcal{L}(\theta)\|_2^2 \leq M^2.$$

Stochastic gradient (SG): replace $\nabla\mathcal{L}(\theta)$ with $\nabla\ell_i(\theta)$:
$$\theta^{(k+1)} = \theta^{(k)} - \eta_k \nabla\ell_{i_k}(\theta^{(k)}).$$
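A minimal SGD sketch for least squares, using the decreasing step size $\eta_k \approx 1/(k\mu)$ from the convergence result below and returning the averaged iterate; computing $\mu$ exactly from the data is only for illustration (in practice one would tune a step size schedule).

```python
import numpy as np

def sgd_ols(X, y, num_iters=50000, seed=0):
    """Stochastic gradient descent for OLS with step size eta_k = 1 / (mu * k)."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    mu = np.linalg.eigvalsh(X.T @ X / n)[0]    # strong convexity constant
    theta = np.zeros(d)
    theta_avg = np.zeros(d)
    for k in range(1, num_iters + 1):
        i = rng.integers(n)                     # uniform sample from {1, ..., n}
        grad_i = (X[i] @ theta - y[i]) * X[i]   # gradient of (1/2)(y_i - x_i^T theta)^2
        theta = theta - grad_i / (mu * k)
        theta_avg += (theta - theta_avg) / k    # running average of the iterates
    return theta_avg
```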

Why Stochastic Gradient?

(Figure: perturbed descent directions.)

Convergence of Stochastic Gradient Algorithms

How many iterations do we need?

A sequence of decreasing step size parameters: $\eta_k \asymp \frac{1}{k\mu}$.

Given a pre-specified error $\varepsilon$, we need
$$K = O\left(\frac{M^2 + L^2}{\mu^2 \varepsilon}\right)$$
iterations such that $\mathbb{E}\|\bar{\theta}^{(K)} - \hat{\theta}\|_2^2 \leq \varepsilon$, where $\bar{\theta}^{(K)} = \frac{1}{K} \sum_{k=1}^{K} \theta^{(k)}$.

When $\mu^2 \varepsilon n \gtrsim M^2 + L^2$, i.e., $n$ is super large,
$$O\left(\frac{d(M^2 + L^2)}{\mu^2 \varepsilon}\right) \quad \text{vs.} \quad \tilde{O}(\kappa n d) \quad \text{vs.} \quad O(nd^2).$$

Why a Decreasing Step Size?

Control variance + sufficient descent ⇒ convergence.

Intuition:
$$\theta^{(k+1)} = \theta^{(k)} - \underbrace{\eta_k \nabla\mathcal{L}(\theta^{(k)})}_{\text{Descent}} + \underbrace{\eta_k \left(\nabla\mathcal{L}(\theta^{(k)}) - \nabla\ell_i(\theta^{(k)})\right)}_{\text{Error}}.$$

- Not summable: sufficient exploration, $\sum_{k=1}^{\infty} \eta_k = \infty$.
- Square summable: diminishing variance, $\sum_{k=1}^{\infty} \eta_k^2 < \infty$.

Mini-batch Variance Reduction

Mini-batch SGD: if $|\mathcal{M}_i|$ increases, then $M^2$ decreases.
- Larger $|\mathcal{M}_i|$ means more computational cost per iteration.
- Smaller $M^2$ means fewer iterations.
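A sketch of the mini-batch variant: averaging the stochastic gradient over a batch lowers the variance bound $M^2$ roughly in proportion to $1/|\mathcal{M}_i|$, at $O(|\mathcal{M}_i|\, d)$ cost per iteration. The batch size and step size below are arbitrary illustrative choices, not values from the lecture.

```python
import numpy as np

def minibatch_sgd_ols(X, y, batch_size=32, eta=0.1, num_iters=2000, seed=0):
    """Mini-batch SGD for OLS: one averaged stochastic gradient per iteration."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    for _ in range(num_iters):
        idx = rng.choice(n, size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ theta - yb) / batch_size   # lower-variance estimate
        theta -= eta * grad
    return theta
```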

Variance Reduction by Control Variates

Stochastic Variance Reduced Gradient (SVRG) algorithm:
- At the $k$-th epoch: $\tilde{\theta} = \tilde{\theta}^{[k]}$, $\theta^{(0)} = \tilde{\theta}^{[k]}$.
- At the $t$-th iteration of the $k$-th epoch:
$$\theta^{(t+1)} = \theta^{(t)} - \eta_k \left(\nabla\ell_i(\theta^{(t)}) - \nabla\ell_i(\tilde{\theta}) + \nabla\mathcal{L}(\tilde{\theta})\right).$$
- After $m$ iterations of the $k$-th epoch: $\tilde{\theta}^{[k+1]} = \theta^{(m)}$.
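A sketch of this epoch structure for least squares; the step size and the inner-loop length $m$ are left as user choices (the theory below suggests $\eta \approx 1/L_{\max}$ and $m \approx \kappa_{\max}$), and the default $m = n$ is only a convenient placeholder.

```python
import numpy as np

def svrg_ols(X, y, eta, num_epochs=20, m=None, seed=0):
    """SVRG for OLS: full gradient at a snapshot, corrected stochastic steps."""
    n, d = X.shape
    m = n if m is None else m
    rng = np.random.default_rng(seed)
    theta_tilde = np.zeros(d)
    for _ in range(num_epochs):
        g_tilde = X.T @ (X @ theta_tilde - y) / n          # full gradient at snapshot
        theta = theta_tilde.copy()
        for _ in range(m):
            i = rng.integers(n)
            gi = (X[i] @ theta - y[i]) * X[i]              # grad of l_i at current iterate
            gi_tilde = (X[i] @ theta_tilde - y[i]) * X[i]  # grad of l_i at snapshot
            theta -= eta * (gi - gi_tilde + g_tilde)       # control-variate correction
        theta_tilde = theta                                # new snapshot after m steps
    return theta_tilde
```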

Strong Smoothness and Convexity: Regularity Conditions

(Strong smoothness) There exist constants $L_i$ such that for any $\theta'$ and $\theta$, we have
$$\ell_i(\theta') - \ell_i(\theta) - \nabla\ell_i(\theta)^\top (\theta' - \theta) \leq \frac{L_i}{2} \|\theta' - \theta\|_2^2.$$

(Strong convexity) There exists a constant $\mu$ such that for any $\theta'$ and $\theta$, we have
$$\mathcal{L}(\theta') - \mathcal{L}(\theta) - \nabla\mathcal{L}(\theta)^\top (\theta' - \theta) \geq \frac{\mu}{2} \|\theta' - \theta\|_2^2.$$

Condition numbers:
$$\kappa_{\max} = \frac{\max_i L_i}{\mu} \geq \kappa = \frac{L}{\mu}.$$

Why Does SVRG Work?

The strong smoothness implies:
$$\|\nabla\ell_i(\theta^{(k)}) - \nabla\ell_i(\tilde{\theta})\|_2 \leq L_{\max} \|\theta^{(k)} - \tilde{\theta}\|_2.$$

Bias correction:
$$\mathbb{E}\left[\nabla\ell_i(\theta^{(k)}) - \nabla\ell_i(\tilde{\theta})\right] = \nabla\mathcal{L}(\theta^{(k)}) - \nabla\mathcal{L}(\tilde{\theta}).$$

Variance reduction: as $\theta^{(k)} \to \hat{\theta}$ and $\tilde{\theta} \to \hat{\theta}$,
$$\mathbb{E}\left\|\nabla\ell_i(\theta^{(k)}) - \nabla\ell_i(\tilde{\theta}) + \nabla\mathcal{L}(\tilde{\theta}) - \nabla\mathcal{L}(\theta^{(k)})\right\|_2^2 \lesssim \mathbb{E}\left\|\nabla\ell_i(\theta^{(k)}) - \nabla\ell_i(\tilde{\theta})\right\|_2^2 \to 0.$$

Convergence of SVRG

How many iterations do we need?

Fixed step size parameter: $\eta_k \asymp \frac{1}{L_{\max}}$.

Given a pre-specified error $\varepsilon$ and $m \asymp \kappa_{\max}$, we need
$$K = O\left(\log\left(\frac{1}{\varepsilon}\right)\right)$$
epochs such that $\mathbb{E}\|\tilde{\theta}^{[K]} - \hat{\theta}\|_2^2 \leq \varepsilon$.

Total number of operations:
$$\tilde{O}\left(nd + d\kappa_{\max}\right) \quad \text{vs.} \quad O\left(\frac{dM^2}{\mu^2 \varepsilon} + \frac{d\kappa^2}{\varepsilon}\right) \quad \text{vs.} \quad \tilde{O}(nd\kappa).$$

Comparison of GD, SGD and SVRG

Summary

- The empirical performance highly depends on the implementation.
- Cyclic or shuffled order (not truly stochastic) is common in practice.
- Too many tuning parameters means my algorithm might only work in theory.
- Theoretical bounds can be very loose; the constants may matter a lot in practice.
- Good software engineers with B.S./M.S. degrees can earn much more than Ph.D.'s, if they know how to code efficient algorithms.

Classification Analysis

Classification vs. Regression

Logistic Regression

Given $x_1, \dots, x_n \in \mathbb{R}^d$ and $\theta^* \in \mathbb{R}^d$,
$$y_i \sim \mathrm{Bernoulli}\left(h(x_i^\top \theta^*)\right) \quad \text{for } i = 1, \dots, n,$$
where $h: (-\infty, \infty) \to [0, 1]$ is the logistic/sigmoid function
$$h(z) = \frac{1}{1 + \exp(-z)}.$$

Remark: $h(0) = 0.5$, $h(-\infty) = 0$, and $h(\infty) = 1$.
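A numerically stable way to evaluate $h(z)$ for array inputs (a standard trick, not from the slides): apply exp only to non-positive arguments, so large $|z|$ never overflows.

```python
import numpy as np

def sigmoid(z):
    """Logistic function h(z) = 1 / (1 + exp(-z)), computed stably."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])          # exp of negative values only: no overflow
    out[~pos] = ez / (1.0 + ez)
    return out

print(sigmoid(np.array([-1e3, 0.0, 1e3])))   # -> [0.  0.5 1. ]
```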

Logistic/Sigmoid Function

Logistic Regression: Maximum Likelihood Estimation

$$\begin{aligned}
\hat{\theta} &= \operatorname*{arg\,max}_{\theta} \mathcal{L}(\theta) \\
&= \operatorname*{arg\,max}_{\theta} \log \prod_{i=1}^n h(x_i^\top \theta)^{y_i} \left(1 - h(x_i^\top \theta)\right)^{1 - y_i} \\
&= \operatorname*{arg\,max}_{\theta} \sum_{i=1}^n \left[ y_i \log h(x_i^\top \theta) + (1 - y_i) \log\left(1 - h(x_i^\top \theta)\right) \right] \\
&= \operatorname*{arg\,max}_{\theta} \frac{1}{n} \sum_{i=1}^n \left[ y_i x_i^\top \theta - \log\left(1 + \exp(x_i^\top \theta)\right) \right] \\
&= \operatorname*{arg\,min}_{\theta} \frac{1}{n} \sum_{i=1}^n \left[ \log\left(1 + \exp(x_i^\top \theta)\right) - y_i x_i^\top \theta \right].
\end{aligned}$$
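As a hedged sketch (not code from the lecture), the final objective and its gradient with labels $y_i \in \{0, 1\}$; np.logaddexp is used for a stable $\log(1 + \exp(\cdot))$.

```python
import numpy as np

def logistic_loss(theta, X, y):
    """(1/n) sum_i [ log(1 + exp(x_i^T theta)) - y_i x_i^T theta ], y_i in {0, 1}."""
    z = X @ theta
    return np.mean(np.logaddexp(0.0, z) - y * z)

def logistic_grad(theta, X, y):
    """Gradient (1/n) X^T (h(X theta) - y), with h the sigmoid."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return X.T @ (h - y) / X.shape[0]
```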

Optimization for Logistic Regression

Is it a convex problem? Let $\mathcal{F}(\theta) = -\frac{1}{n}\log\mathcal{L}(\theta)$. Then
$$\nabla^2 \mathcal{F}(\theta) = \frac{1}{n} \sum_{i=1}^n h(x_i^\top \theta)\left(1 - h(x_i^\top \theta)\right) x_i x_i^\top \succeq 0.$$

- There is no closed-form solution.
- Gradient descent and stochastic gradient algorithms are applicable.

Prediction for Logistic Regression

Prediction: given a new $x$, is
$$\hat{P}(y = 1) = \frac{1}{1 + \exp(-\hat{\theta}^\top x)} \geq 0.5 \,?$$

Why a linear classifier?
$$\hat{P}(y = 1) \geq 0.5 \iff \hat{\theta}^\top x \geq 0 \quad \Longrightarrow \quad \hat{y} = \operatorname{sign}(\hat{\theta}^\top x).$$

Logistic Loss

Given $x_1, \dots, x_n \in \mathbb{R}^d$, $y_1, \dots, y_n \in \{-1, 1\}$, and $\theta^* \in \mathbb{R}^d$,
$$P(y_i = 1) = \frac{1}{1 + \exp(-x_i^\top \theta^*)} \quad \text{for } i = 1, \dots, n.$$

An alternative formulation:
$$\hat{\theta} = \operatorname*{arg\,min}_{\theta} \frac{1}{n} \sum_{i=1}^n \log\left(1 + \exp(-y_i x_i^\top \theta)\right).$$

We can also use the 0-1 loss:
$$\hat{\theta} = \operatorname*{arg\,min}_{\theta} \frac{1}{n} \sum_{i=1}^n \mathbb{1}\left(\operatorname{sign}(x_i^\top \theta) \neq y_i\right).$$

Loss Functions for Classification

Newton's Method

At the $k$-th iteration, we take
$$\theta^{(k+1)} = \theta^{(k)} - \eta_k \left[\nabla^2 \mathcal{F}(\theta^{(k)})\right]^{-1} \nabla\mathcal{F}(\theta^{(k)}),$$
where $\eta_k > 0$ is a step size parameter.

The second-order Taylor approximation:
$$\theta^{(k+0.5)} = \operatorname*{arg\,min}_{\theta}\; \mathcal{F}(\theta^{(k)}) + \nabla\mathcal{F}(\theta^{(k)})^\top (\theta - \theta^{(k)}) + \frac{1}{2} (\theta - \theta^{(k)})^\top \nabla^2\mathcal{F}(\theta^{(k)}) (\theta - \theta^{(k)}).$$

Backtracking line search:
$$\theta^{(k+1)} = \theta^{(k)} + \eta_k \left(\theta^{(k+0.5)} - \theta^{(k)}\right).$$
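A minimal Newton sketch for the logistic-regression objective $\mathcal{F}(\theta)$ above, with a fixed step size $\eta_k = 1$ rather than a backtracking line search; it solves the Newton linear system instead of forming an explicit inverse.

```python
import numpy as np

def newton_logistic(X, y, num_iters=25, tol=1e-10):
    """Newton's method for F(theta) = (1/n) sum_i [log(1 + exp(x_i^T theta)) - y_i x_i^T theta]."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(num_iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))
        grad = X.T @ (h - y) / n
        if np.linalg.norm(grad) <= tol:
            break
        W = h * (1.0 - h)                      # weights h(x_i^T theta)(1 - h(x_i^T theta))
        hess = (X * W[:, None]).T @ X / n      # (1/n) X^T diag(W) X
        theta -= np.linalg.solve(hess, grad)   # Newton step with eta_k = 1
    return theta
```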

Newton's Method

Newton's Method

Sublinear + quadratic convergence:
- Given $\|\theta^{(k)} - \hat{\theta}\|_2^2 \leq R \ll 1$, we have
$$\|\theta^{(k+1)} - \hat{\theta}\|_2^2 \leq (1 - \delta)\, \|\theta^{(k)} - \hat{\theta}\|_2^4.$$
- Given $\|\theta^{(k)} - \hat{\theta}\|_2^2 \geq R$, we have $\|\theta^{(k+1)} - \hat{\theta}\|_2^2 = O(1/k)$.

Iteration complexity (some parameters hidden for simplicity):
$$O\left(\log\left(\frac{1}{R}\right) + \log\log\left(\frac{1}{\epsilon}\right)\right).$$

Newton's Method

Advantages:
- More efficient for highly accurate solutions.
- Avoids extensively calculating log or exp functions (Taylor expansions combined with a lookup table).
- Fewer line search steps, due to quadratic convergence.
- Often more efficient than gradient descent.

Newton's Method

Newton's Method

Disadvantages:
- Computing inverse Hessian matrices is expensive!
- Storing inverse Hessian matrices is expensive!

Subsampled Newton:
$$\theta^{(k+1)} = \theta^{(k)} - \eta_k \left[H(\theta^{(k)})\right]^{-1} \nabla\mathcal{F}(\theta^{(k)}).$$

Quasi-Newton methods (DFP, BFGS, Broyden, SR1):
- Use differences of gradient vectors to approximate Hessian matrices.