Lecture 1: Supervised Learning
1 Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech
2 ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine Learning
(Supervised) Regression Analysis
Example: living areas and prices of 47 houses from Portland, Oregon.
[Table: living area (square feet) and price (in $1000); the data rows are not preserved in the transcription.]
[Figure: scatter plot of housing prices; x-axis: square feet, y-axis: price (in $1000).]
Given data like this, how can we learn to predict the prices of other houses in Portland, as a function of the size of their living areas?
Given $x_1, \dots, x_n \in \mathbb{R}^d$, $y_1, \dots, y_n \in \mathbb{R}$, and $f: \mathbb{R}^d \to \mathbb{R}$,
$$y_i = f(x_i) + \epsilon_i \quad \text{for } i = 1, \dots, n,$$
where the $\epsilon_i$'s are i.i.d. with $E\epsilon_i = 0$, $E\epsilon_i^2 = \sigma^2 < \infty$.
Simple linear function: $f(x_i) = x_i^\top \theta$.
Why is it called supervised learning?
3 Why Supervised?
4 Play on Words?
Two unknown functions $f_0^*, f_1^*: \mathbb{R}^d \to \mathbb{R}$?
$$y_i = 1(z_i = 1) \cdot f_1^*(x_i) + 1(z_i = 0) \cdot f_0^*(x_i) + \epsilon_i,$$
where $i = 1, \dots, n$, and the $z_i$'s are i.i.d. with $P(z_i = 1) = \delta$ and $P(z_i = 0) = 1 - \delta$ for $\delta \in (0, 1)$.
$z_i$: latent variables.
Supervised? Unsupervised?
5 Linear Regression
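6 Linear Regression
Given $x_1, \dots, x_n \in \mathbb{R}^d$, $y_1, \dots, y_n \in \mathbb{R}$, and $\theta^* \in \mathbb{R}^d$,
$$y_i = x_i^\top \theta^* + \epsilon_i \quad \text{for } i = 1, \dots, n,$$
where the $\epsilon_i$'s are i.i.d. with $E\epsilon_i = 0$, $E\epsilon_i^2 = \sigma^2 < \infty$.
Ordinary Least Squares Regression:
$$\hat\theta^{OLS} = \arg\min_\theta \frac{1}{2n} \sum_{i=1}^n (y_i - x_i^\top \theta)^2.$$
Least Absolute Deviation Regression:
$$\hat\theta^{LAD} = \arg\min_\theta \frac{1}{n} \sum_{i=1}^n |y_i - x_i^\top \theta|.$$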
7 Robust Regression
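8 Linear Regression: Matrix Notation
$X = [x_1, \dots, x_n]^\top \in \mathbb{R}^{n \times d}$, $y = [y_1, \dots, y_n]^\top \in \mathbb{R}^n$,
$$y = X\theta^* + \epsilon,$$
where $E\epsilon = 0$, $E\epsilon\epsilon^\top = \sigma^2 I_n$.
Ordinary Least Squares Regression:
$$\hat\theta^{OLS} = \arg\min_\theta \frac{1}{2n} \|y - X\theta\|_2^2.$$
Least Absolute Deviation Regression:
$$\hat\theta^{LAD} = \arg\min_\theta \frac{1}{n} \|y - X\theta\|_1.$$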
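9 Least Squares Regression: Analytical Solution
Ordinary Least Squares Regression:
$$\hat\theta^{OLS} = \arg\min_\theta \underbrace{\frac{1}{2n} \|y - X\theta\|_2^2}_{L(\theta)}.$$
First order optimality condition:
$$\nabla L(\hat\theta) = \frac{1}{n} X^\top (X\hat\theta - y) = 0 \;\Rightarrow\; X^\top X \hat\theta = X^\top y.$$
Analytical solution and unbiasedness:
$$\hat\theta = (X^\top X)^{-1} X^\top y = (X^\top X)^{-1} X^\top (X\theta^* + \epsilon) = \theta^* + (X^\top X)^{-1} X^\top \epsilon \;\Rightarrow\; E_\epsilon \hat\theta = \theta^*.$$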
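The closed-form solution can be checked numerically. The following is a minimal NumPy sketch under assumed synthetic data; the function name `ols_normal_equations`, the sample sizes, and the noise level are illustrative choices, not part of the lecture.

```python
import numpy as np

def ols_normal_equations(X, y):
    """Solve the normal equations X^T X theta = X^T y for theta."""
    # Solving the linear system avoids forming the explicit inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic data (assumed for illustration): y = X theta_star + noise.
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
theta_star = np.array([1.0, -2.0, 0.5])
y = X @ theta_star + 0.1 * rng.normal(size=n)

theta_hat = ols_normal_equations(X, y)

# First order optimality: the gradient (1/n) X^T (X theta_hat - y) vanishes.
grad_at_hat = X.T @ (X @ theta_hat - y) / n
```

With this setup, `grad_at_hat` should be numerically zero and `theta_hat` should land close to `theta_star`, up to the statistical error caused by the noise.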
10 Least Squares Regression: Convexity
Ordinary Least Squares Regression:
$$\hat\theta^{OLS} = \arg\min_\theta \underbrace{\frac{1}{2n} \|y - X\theta\|_2^2}_{L(\theta)}.$$
Second order optimality condition:
$$\nabla^2 L(\hat\theta) = \frac{1}{n} X^\top X \succeq 0.$$
Convex function:
$$L(\theta') \ge L(\theta) + \nabla L(\theta)^\top (\theta' - \theta).$$
11 Convex v.s. Nonconvex Optimization
Stationary solutions: $\nabla L(\hat\theta) = 0$.
Convex optimization: easy but restrictive.
Nonconvex optimization: difficult but flexible.
[Figure labels: local maximum, saddle points, local optimum, global optimum.]
We may get stuck at a local optimum or saddle point for nonconvex optimization.
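12 Maximum Likelihood Estimation
$X = [x_1, \dots, x_n]^\top \in \mathbb{R}^{n \times d}$, $y = [y_1, \dots, y_n]^\top \in \mathbb{R}^n$,
$$y = X\theta^* + \epsilon,$$
where $\epsilon \sim N(0, \sigma^2 I_n)$.
Likelihood function:
$$\mathcal{L}(\theta) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2} (y - X\theta)^\top (y - X\theta)\right).$$
Maximum log-likelihood estimation:
$$\hat\theta^{MLE} = \arg\max_\theta \log \mathcal{L}(\theta) = \arg\max_\theta \; -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \|y - X\theta\|_2^2.$$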
13 Maximum Likelihood Estimation
Maximum log-likelihood estimation:
$$\hat\theta^{MLE} = \arg\max_\theta \; -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \|y - X\theta\|_2^2.$$
Given $\sigma^2$ as some unknown constant,
$$\hat\theta^{MLE} = \arg\max_\theta \; -\frac{1}{2n} \|y - X\theta\|_2^2 = \arg\min_\theta \frac{1}{2n} \|y - X\theta\|_2^2.$$
Probabilistic interpretation:
Simple and illustrative.
Restrictive and potentially misleading.
Remember the t-test? What if the model is wrong?
14 Computational Cost of OLS
Cost is measured by the number of basic operations, e.g., addition, subtraction, multiplication, division.
Matrix multiplication $X^\top X$: $O(nd^2)$
Matrix inverse $(X^\top X)^{-1}$: $O(d^3)$
Matrix-vector multiplication $X^\top y$: $O(nd)$
Matrix-vector multiplication $[(X^\top X)^{-1}][X^\top y]$: $O(d^2)$
Overall computational cost: $O(nd^2)$, given $n \gg d$.
15 Scalability and Efficiency of OLS
Simple closed form.
Overall computational cost: $O(nd^2)$.
Massive data: both $n$ and $d$ are large.
Not very efficient or scalable.
Are there better ways to improve the computation?
16 Optimization for Linear Regression
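17 Vanilla Gradient Descent
$$\theta^{(k+1)} = \theta^{(k)} - \eta_k \nabla L(\theta^{(k)}).$$
$\eta_k > 0$: the step size parameter (fixed or chosen by line search).
Stop when the gradient is small: $\|\nabla L(\theta^{(K)})\|_2 \le \delta$.
[Figure: objective $f(\theta)$ with the iterate $\theta^{(k)}$, the gradient $\nabla f(\theta^{(k)})$, and the minimizer where $\nabla f(\hat\theta) = 0$.]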
18 Computational Cost of VGD
Gradient: $\nabla L(\theta^{(k)}) = \frac{1}{n} X^\top (X\theta^{(k)} - y)$
Matrix-vector multiplication $X\theta^{(k)}$: $O(nd)$
Vector subtraction $X\theta^{(k)} - y$: $O(n)$
Matrix-vector multiplication $X^\top (X\theta^{(k)} - y)$: $O(nd)$
Overall computational cost per iteration: $O(nd)$
Better than $O(nd^2)$, but how many iterations?
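As a rough illustration of the update and its $O(nd)$ gradient, here is a minimal NumPy sketch of vanilla gradient descent for least squares. The function names, the fixed step size $1/L$ (with $L$ the largest eigenvalue of $X^\top X / n$), and the synthetic data are assumptions made for this example.

```python
import numpy as np

def gradient(X, y, theta):
    """Least squares gradient (1/n) X^T (X theta - y); two O(nd) products."""
    return X.T @ (X @ theta - y) / X.shape[0]

def gradient_descent(X, y, eta, tol=1e-8, max_iter=10_000):
    """Fixed-step gradient descent; stops when the gradient norm is <= tol."""
    theta = np.zeros(X.shape[1])
    for k in range(max_iter):
        g = gradient(X, y, theta)
        if np.linalg.norm(g) <= tol:
            return theta, k
        theta = theta - eta * g
    return theta, max_iter

# Synthetic data (assumed for illustration).
rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Step size 1/L, where L is the largest eigenvalue of the Hessian X^T X / n.
L_smooth = np.linalg.eigvalsh(X.T @ X / n).max()
theta_gd, iters = gradient_descent(X, y, eta=1.0 / L_smooth)
theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

On a well-conditioned design like this one, `theta_gd` should agree with the least squares solution `theta_ols` after a modest number of iterations.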
19 Rate of Convergence
What are good algorithms?
Asymptotic convergence: $\theta^{(k)} \to \hat\theta$ as $k \to \infty$?
Nonasymptotic rate of convergence: the optimization error after $k$ iterations.
Example: gap in objective value (sublinear convergence):
$$f(\theta^{(k)}) - f(\hat\theta) = O(L/k^2) \quad \text{v.s.} \quad O(L/k),$$
where $L$ is some constant depending on the problem.
Example: gap in parameter (linear convergence):
$$\|\theta^{(k)} - \hat\theta\|_2^2 = O\big((1 - 1/\kappa)^k\big) \quad \text{v.s.} \quad O\big((1 - 1/\sqrt{\kappa})^k\big),$$
where $\kappa$ is some constant depending on the problem.
20 Iteration Complexity of Gradient Descent
Iteration complexity: we need at most
$$K = O\left(\kappa \log\left(\frac{1}{\varepsilon}\right)\right)$$
iterations such that
$$\|\theta^{(K)} - \hat\theta\|_2^2 \le \varepsilon,$$
where $\kappa$ is some constant depending on the problem.
What is $\kappa$? It is related to smoothness and convexity.
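21 Strong Convexity
There exists a constant $\mu$ such that for any $\theta$ and $\theta'$, we have
$$L(\theta') \ge L(\theta) + \nabla L(\theta)^\top (\theta' - \theta) + \frac{\mu}{2} \|\theta' - \theta\|_2^2.$$
[Figure: $L(\theta')$ lies above the quadratic lower bound $L(\theta) + \nabla L(\theta)^\top (\theta' - \theta) + \frac{\mu}{2} \|\theta' - \theta\|_2^2$, which lies above the linear bound $L(\theta) + \nabla L(\theta)^\top (\theta' - \theta)$.]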
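22 Strong Smoothness
There exists a constant $L$ such that for any $\theta$ and $\theta'$, we have
$$L(\theta') \le L(\theta) + \nabla L(\theta)^\top (\theta' - \theta) + \frac{L}{2} \|\theta' - \theta\|_2^2.$$
[Figure: $L(\theta')$ lies below the quadratic upper bound $L(\theta) + \nabla L(\theta)^\top (\theta' - \theta) + \frac{L}{2} \|\theta' - \theta\|_2^2$ and above the linear bound $L(\theta) + \nabla L(\theta)^\top (\theta' - \theta)$.]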
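23 Condition Number
$$\kappa = L/\mu$$
[Figure: two panels showing quadratic functions $f(\theta)$ of $(\theta_1, \theta_2)$ with leading coefficients 0.9 and 0.5; the remaining terms are not preserved in the transcription.]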
24 Vector Field Representation
[Figure: gradient vector fields of the same two quadratic functions $f(\theta)$, with leading coefficients 0.9 and 0.5; the remaining terms are not preserved in the transcription.]
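25 Understanding Regularity Conditions
Mean Value Theorem: for any $\theta$ and $\theta'$, there exists a constant $z \in [0, 1]$ such that
$$L(\theta') - L(\theta) - \nabla L(\theta)^\top (\theta' - \theta) = \frac{1}{2} (\theta' - \theta)^\top \nabla^2 L(\tilde\theta) (\theta' - \theta),$$
where $\tilde\theta$ is a convex combination: $\tilde\theta = z\theta + (1 - z)\theta'$.
Hessian matrix for OLS:
$$\nabla^2 L(\tilde\theta) = \frac{1}{n} X^\top X.$$
Control the remainder:
$$\underbrace{\Lambda_{\min}\left(\tfrac{1}{n} X^\top X\right)}_{\mu} \le \frac{(\theta' - \theta)^\top \nabla^2 L(\tilde\theta) (\theta' - \theta)}{\|\theta' - \theta\|_2^2} \le \underbrace{\Lambda_{\max}\left(\tfrac{1}{n} X^\top X\right)}_{L}.$$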
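For least squares, the constants $\mu$, $L$, and the condition number $\kappa = L/\mu$ can be read off the eigenvalues of $\frac{1}{n} X^\top X$. The sketch below is a small NumPy illustration; the function name `regularity_constants` and the badly scaled synthetic design are assumptions for this example.

```python
import numpy as np

def regularity_constants(X):
    """Return (mu, L, kappa) from the eigenvalues of the Hessian X^T X / n."""
    n = X.shape[0]
    eigs = np.linalg.eigvalsh(X.T @ X / n)  # sorted in ascending order
    mu, L = eigs[0], eigs[-1]
    return mu, L, L / mu

# A design with very different column scales (assumed for illustration),
# which makes the problem ill-conditioned.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2)) * np.array([1.0, 0.1])
mu, L, kappa = regularity_constants(X)

# The normalized remainder (a Rayleigh quotient) is sandwiched by mu and L.
H = X.T @ X / X.shape[0]
delta = rng.normal(size=2)  # plays the role of theta' - theta
rayleigh = delta @ H @ delta / (delta @ delta)
```

Because the second column is scaled by 0.1, `kappa` should come out on the order of 100, which by the iteration complexity bound means many more gradient descent iterations than a well-scaled design would need.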
26 Understanding Gradient Descent Algorithms
Iteratively minimize a quadratic approximation.
At the $(k+1)$-th iteration, we consider
$$Q(\theta; \theta^{(k)}) = L(\theta^{(k)}) + \nabla L(\theta^{(k)})^\top (\theta - \theta^{(k)}) + \frac{L}{2} \|\theta - \theta^{(k)}\|_2^2.$$
We have
$$Q(\theta; \theta^{(k)}) \ge L(\theta) \quad \text{and} \quad Q(\theta^{(k)}; \theta^{(k)}) = L(\theta^{(k)}).$$
We take
$$\theta^{(k+1)} = \arg\min_\theta Q(\theta; \theta^{(k)}) = \theta^{(k)} - \frac{1}{L} \nabla L(\theta^{(k)}).$$
27 Backtracking Line Search
The worst case fixed step size: $\eta_k = 1/L$.
At the $(k+1)$-th iteration, we first try $\eta_k = \eta_{k-1}$, i.e.,
$$\theta^{(k+1)} = \theta^{(k)} - \eta_k \nabla L(\theta^{(k)}),$$
and accept it if $Q(\theta^{(k+1)}; \theta^{(k)}) \ge L(\theta^{(k+1)})$, where $Q$ is formed with $1/\eta_k$ in place of $L$.
Otherwise, we take $\eta_k = (1 - \delta)^m \eta_{k-1}$, where $\delta \in (0, 1)$ and $m$ is the smallest positive integer such that $Q(\theta^{(k+1)}; \theta^{(k)}) \ge L(\theta^{(k+1)})$.
28 Backtracking Line Search
[Figure: successive iterates $\theta^{(k)}$ with the step size kept at $\eta_k = \eta_{k-1}$ for two steps and then shrunk to $\eta_k = 0.75\,\eta_{k-1}$.]
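The backtracking rule can be sketched in a few lines of NumPy for the least squares objective. This is one possible reading of the rule, not a definitive implementation: the step size is carried over from the previous iteration and multiplied by $(1 - \delta)$ until the quadratic model built with $1/\eta$ upper-bounds the objective at the trial point. The function names, the initial step `eta0`, and `delta = 0.25` (matching the factor 0.75 in the figure) are assumptions.

```python
import numpy as np

def loss(X, y, theta):
    """Least squares objective (1/2n) ||y - X theta||^2."""
    r = X @ theta - y
    return 0.5 * np.mean(r ** 2)

def grad(X, y, theta):
    return X.T @ (X @ theta - y) / X.shape[0]

def gd_backtracking(X, y, eta0=10.0, delta=0.25, tol=1e-6, max_iter=5000):
    """Gradient descent that reuses the last step size and shrinks it on demand."""
    theta = np.zeros(X.shape[1])
    eta = eta0
    for _ in range(max_iter):
        g = grad(X, y, theta)
        if np.linalg.norm(g) <= tol:
            break
        f = loss(X, y, theta)
        while True:
            theta_new = theta - eta * g
            step = theta_new - theta
            # Quadratic model Q with curvature 1/eta in place of L.
            q = f + g @ step + (step @ step) / (2.0 * eta)
            if loss(X, y, theta_new) <= q:
                break
            eta *= 1.0 - delta  # shrink and try again
        theta = theta_new
    return theta, eta

# Synthetic data (assumed for illustration).
rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

theta_bt, eta_final = gd_backtracking(X, y)
theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

Starting from a deliberately large `eta0`, the step size should settle near $1/L$ without $L$ ever being computed, which is the practical appeal of line search.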
29 Tradeoff Statistics and Computation
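30 High Precision or Low Precision
Can we tolerate a large $\varepsilon$?
From a learning perspective, our interest is $\theta^*$, not $\hat\theta$.
Error decomposition:
$$\|\theta^{(K)} - \theta^*\|_2 \le \underbrace{\|\theta^{(K)} - \hat\theta\|_2}_{\text{Opt. Error}} + \underbrace{\|\hat\theta - \theta^*\|_2}_{\text{Stat. Error}}.$$
High precision expects something like $\|\theta^{(K)} - \hat\theta\|_2 \le \varepsilon$ for a tiny $\varepsilon$.
Does it make any difference?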
31 High Precision or Low Precision
The statistical error is undefeatable!
[Figure: error in logarithmic scale against the number of iterations, with curves for the statistical error and the optimization error.]
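32 Tradeoff: Statistical and Optimization Errors
The statistical error is undefeatable!
The statistical error of the optimal solution:
$$E\|\hat\theta - \theta^*\|_2^2 = E\|(X^\top X)^{-1} X^\top \epsilon\|_2^2 = \sigma^2 \,\mathrm{tr}[(X^\top X)^{-1}] = O\left(\frac{\sigma^2 d}{n}\right).$$
We only need
$$\|\theta^{(K)} - \hat\theta\|_2^2 \lesssim \|\hat\theta - \theta^*\|_2^2.$$
Given $K = O\left(\kappa \log\left(\frac{n}{\sigma^2 d}\right)\right)$, we have
$$E\|\theta^{(K)} - \theta^*\|_2^2 = O\left(\frac{\sigma^2 d}{n}\right).$$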
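The identity $E\|\hat\theta - \theta^*\|_2^2 = \sigma^2 \,\mathrm{tr}[(X^\top X)^{-1}]$ can be checked by simulation. The sketch below holds the design fixed and averages over fresh Gaussian noise draws; the sizes, noise level, and number of trials are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 5, 0.5
X = rng.normal(size=(n, d))          # fixed design
theta_star = rng.normal(size=d)

# A = (X^T X)^{-1} X^T maps y to the OLS solution.
A = np.linalg.solve(X.T @ X, X.T)
closed_form = sigma ** 2 * np.trace(np.linalg.inv(X.T @ X))

# Monte Carlo estimate of E || theta_hat - theta_star ||^2 over the noise.
trials = 2000
errs = np.empty(trials)
for t in range(trials):
    eps = sigma * rng.normal(size=n)
    theta_hat = A @ (X @ theta_star + eps)
    errs[t] = np.sum((theta_hat - theta_star) ** 2)
monte_carlo = errs.mean()
```

Both `closed_form` and `monte_carlo` should be close to each other and of the same order as $\sigma^2 d / n$, so driving the optimization error far below this level buys little.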
33 Agnostic Learning
All models are wrong, but some are useful!
Data generating process: $(X, Y) \sim D$.
The oracle model: $f_{oracle}(X) = X^\top \theta_{oracle}$, where
$$\theta_{oracle} = \arg\min_\theta E_D (Y - X^\top \theta)^2.$$
The estimated model: $\hat f(X) = X^\top \hat\theta$, where
$$\hat\theta = \arg\min_\theta \frac{1}{2n} \|y - X\theta\|_2^2, \quad \text{and } (x_1, y_1), \dots, (x_n, y_n) \sim D.$$
At the $K$-th iteration: $f^{(K)}(X) = X^\top \theta^{(K)}$.
34 Agnostic Learning (See more details in CS-7545)
All models are wrong, but some are useful!
[Figure: the estimated model, the true model, and the oracle model, shown relative to the set of all linear models.]
35 Agnostic Learning
Approximation error: $E_D (Y - f_{oracle}(X))^2$
Estimation error: $E_D (\hat f(X) - f_{oracle}(X))^2$
Optimization error: $E_D (f^{(K)}(X) - \hat f(X))^2$
Decomposition of the statistical error:
$$E_D (Y - f^{(K)}(X))^2 \lesssim E_D (Y - f_{oracle}(X))^2 + E_D (\hat f(X) - f_{oracle}(X))^2 + E_D (f^{(K)}(X) - \hat f(X))^2.$$
How should we choose $\varepsilon$?
36 Scalable Computation of Linear Regression
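37 Stochastic Approximation
What if $n$ is too large?
Empirical risk minimization: $L(\theta) = \frac{1}{n} \sum_{i=1}^n \ell_i(\theta)$.
For least squares regression:
$$\ell_i(\theta) = \frac{1}{2} (y_i - x_i^\top \theta)^2 \quad \text{or} \quad \ell_i(\theta) = \frac{1}{2|M_i|} \sum_{j \in M_i} (y_j - x_j^\top \theta)^2.$$
Randomly sample $i$ from $1, \dots, n$ with equal probability:
$$E_i \nabla \ell_i(\theta) = \nabla L(\theta) \quad \text{and} \quad E\|\nabla \ell_i(\theta) - \nabla L(\theta)\|_2^2 \le M^2.$$
Stochastic Gradient (SG): replace $\nabla L(\theta)$ with $\nabla \ell_i(\theta)$,
$$\theta^{(k+1)} = \theta^{(k)} - \eta_k \nabla \ell_{i_k}(\theta^{(k)}).$$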
38 Why Stochastic Gradient?
[Figure: perturbed descent directions.]
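39 Convergence of Stochastic Gradient Algorithms
How many iterations do we need?
A sequence of decreasing step size parameters: $\eta_k \asymp \frac{1}{k\mu}$.
Given a pre-specified error $\varepsilon$, we need
$$K = O\left(\frac{M^2 + L^2}{\mu^2 \varepsilon}\right)$$
iterations such that
$$E\|\bar\theta^{(K)} - \hat\theta\|_2^2 \le \varepsilon, \quad \text{where } \bar\theta^{(K)} = \frac{1}{K} \sum_{k=1}^K \theta^{(k)}.$$
When $\mu^2 \varepsilon n \gg (M^2 + L^2)$, i.e., $n$ is super large, the total costs of SG, gradient descent, and the closed form compare as
$$O\left(\frac{d(M^2 + L^2)}{\mu^2 \varepsilon}\right) \quad \text{v.s.} \quad \tilde{O}(\kappa n d) \quad \text{v.s.} \quad O(nd^2).$$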
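A minimal NumPy sketch of stochastic gradient for least squares with a decreasing step size and iterate averaging follows. The step size $\eta_k = 1/(\mu (k + k_0))$ uses an offset `k0` that is an assumption added here to tame the first few steps; the function name and the synthetic data are likewise illustrative.

```python
import numpy as np

def sgd_least_squares(X, y, mu, num_iters, k0=20, seed=0):
    """Single-sample SGD with step 1 / (mu * (k + k0)); returns the averaged iterate."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    theta_sum = np.zeros(d)
    for k in range(1, num_iters + 1):
        i = rng.integers(n)                      # sample i uniformly from {0, ..., n-1}
        g = (X[i] @ theta - y[i]) * X[i]         # gradient of l_i, O(d) work
        theta = theta - g / (mu * (k + k0))      # decreasing step size
        theta_sum += theta
    return theta_sum / num_iters                 # uniform average of the iterates

# Synthetic data (assumed for illustration).
rng = np.random.default_rng(1)
n, d = 500, 5
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
y = X @ theta_star + 0.1 * rng.normal(size=n)

mu = np.linalg.eigvalsh(X.T @ X / n).min()
theta_bar = sgd_least_squares(X, y, mu, num_iters=20_000)
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
```

Each iteration touches one row of `X`, so the per-iteration cost is $O(d)$ rather than $O(nd)$; the price is the slower, sublinear decay of the error of `theta_bar`.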
40 Why decreasing step size?
Control variance + sufficient descent $\Rightarrow$ convergence.
Intuition:
$$\theta^{(k+1)} = \theta^{(k)} - \underbrace{\eta \nabla L(\theta^{(k)})}_{\text{Descent}} + \underbrace{\eta\big(\nabla L(\theta^{(k)}) - \nabla \ell_i(\theta^{(k)})\big)}_{\text{Error}}.$$
Not summable (sufficient exploration): $\sum_{k=1}^\infty \eta_k = \infty$.
Square summable (diminishing variance): $\sum_{k=1}^\infty \eta_k^2 < \infty$.
41 Minibatch Variance Reduction
Mini-batch SGD: if $|M_i| \uparrow$, then $M^2 \downarrow$.
$|M_i| \uparrow$ means more computational cost per iteration.
$M^2 \downarrow$ means fewer iterations.
42 Variance Reduction by Control Variates
Stochastic Variance Reduced Gradient Algorithm (SVRG):
At the $s$-th epoch, $\tilde\theta = \theta^{[s]}$, $\theta^{(0)} = \theta^{[s]}$.
At the $k$-th iteration of the $s$-th epoch,
$$\theta^{(k+1)} = \theta^{(k)} - \eta_k \big(\nabla \ell_i(\theta^{(k)}) - \nabla \ell_i(\tilde\theta) + \nabla L(\tilde\theta)\big).$$
After $m$ iterations of the $s$-th epoch, $\theta^{[s+1]} = \theta^{(m)}$.
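The SVRG recursion can be sketched in NumPy for least squares as follows. The function name, the fixed step size `0.1 / L_max` with `L_max` taken as the largest squared row norm, the inner loop length `m = 2n`, and the synthetic data are assumptions made for this illustration.

```python
import numpy as np

def svrg_least_squares(X, y, eta, num_epochs, m, seed=0):
    """SVRG: each epoch computes one full gradient at a snapshot, then runs m cheap steps."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta_tilde = np.zeros(d)                     # snapshot
    for _ in range(num_epochs):
        full_grad = X.T @ (X @ theta_tilde - y) / n   # O(nd), once per epoch
        theta = theta_tilde.copy()
        for _ in range(m):
            i = rng.integers(n)
            xi = X[i]
            g_cur = (xi @ theta - y[i]) * xi           # grad of l_i at current iterate
            g_snap = (xi @ theta_tilde - y[i]) * xi    # grad of l_i at the snapshot
            theta = theta - eta * (g_cur - g_snap + full_grad)
        theta_tilde = theta                        # last inner iterate becomes the new snapshot
    return theta_tilde

# Synthetic data (assumed for illustration).
rng = np.random.default_rng(2)
n, d = 500, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

L_max = np.max(np.sum(X ** 2, axis=1))   # smoothness constant of the worst l_i
theta_svrg = svrg_least_squares(X, y, eta=0.1 / L_max, num_epochs=30, m=2 * n)
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
```

Unlike plain SGD, the step size stays fixed: the corrected gradient has variance that shrinks as both the iterate and the snapshot approach the minimizer, so `theta_svrg` should reach high accuracy.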
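43 Strong Smoothness and Convexity
Regularity conditions:
(Strong Smoothness) There exist constants $L_i$ such that for any $\theta$ and $\theta'$, we have
$$\ell_i(\theta') - \ell_i(\theta) - \nabla \ell_i(\theta)^\top (\theta' - \theta) \le \frac{L_i}{2} \|\theta' - \theta\|_2^2.$$
(Strong Convexity) There exists a constant $\mu$ such that for any $\theta$ and $\theta'$, we have
$$L(\theta') - L(\theta) - \nabla L(\theta)^\top (\theta' - \theta) \ge \frac{\mu}{2} \|\theta' - \theta\|_2^2.$$
Condition number:
$$\kappa_{\max} = \frac{\max_i L_i}{\mu} \ge \kappa = \frac{L}{\mu}.$$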
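44 Why does SVRG work?
The strong smoothness implies:
$$\|\nabla \ell_i(\theta^{(k)}) - \nabla \ell_i(\tilde\theta)\|_2 \le L_{\max} \|\theta^{(k)} - \tilde\theta\|_2.$$
Bias correction:
$$E[\nabla \ell_i(\theta^{(k)}) - \nabla \ell_i(\tilde\theta)] = \nabla L(\theta^{(k)}) - \nabla L(\tilde\theta).$$
Variance reduction: as $\theta^{(k)} \to \hat\theta$ and $\tilde\theta \to \hat\theta$,
$$E\|\nabla \ell_i(\theta^{(k)}) - \nabla \ell_i(\tilde\theta) + \nabla L(\tilde\theta) - \nabla L(\theta^{(k)})\|_2^2 \le E\|\nabla \ell_i(\theta^{(k)}) - \nabla \ell_i(\tilde\theta)\|_2^2 \to 0.$$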
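45 Convergence of SVRG
How many iterations do we need?
Fixed step size parameter: $\eta_k \asymp \frac{1}{L_{\max}}$.
Given a pre-specified error $\varepsilon$ and $m \asymp \kappa_{\max}$, we need
$$K = O\left(\log\left(\frac{1}{\varepsilon}\right)\right)$$
epochs such that
$$E\|\theta^{[K]} - \hat\theta\|_2^2 \le \varepsilon.$$
Total number of operations (SVRG v.s. SG v.s. gradient descent):
$$\tilde{O}(nd + d\kappa_{\max}) \quad \text{v.s.} \quad O\left(\frac{dM^2}{\mu^2 \varepsilon} + \frac{d\kappa^2}{\varepsilon}\right) \quad \text{v.s.} \quad \tilde{O}(nd\kappa).$$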
46 Comparison of GD, SGD and SVRG
47 Summary
The empirical performance highly depends on the implementation.
Cyclic or shuffled order (not stochastic) is used in practice.
Too many tuning parameters means an algorithm might only work in theory.
Theoretical bounds can be very loose. The constant may matter a lot in practice.
Good software engineers with B.S./M.S. degrees can earn much more than Ph.D.s, if they know how to code efficient algorithms.
48 Classification Analysis
49 Classification v.s. Regression
50 Logistic Regression
Given $x_1, \dots, x_n \in \mathbb{R}^d$ and $\theta^* \in \mathbb{R}^d$,
$$y_i \sim \mathrm{Bernoulli}(h(x_i^\top \theta^*)) \quad \text{for } i = 1, \dots, n,$$
where $h: (-\infty, \infty) \to [0, 1]$ is the logistic/sigmoid function
$$h(z) = \frac{1}{1 + \exp(-z)}.$$
Remark: $h(0) = 0.5$, $h(-\infty) = 0$ and $h(\infty) = 1$.
51 Logistic/Sigmoid Function
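52 Logistic Regression
Maximum likelihood estimation:
$$\hat\theta = \arg\max_\theta L(\theta)$$
$$= \arg\max_\theta \log \prod_{i=1}^n h(x_i^\top \theta)^{y_i} \big(1 - h(x_i^\top \theta)\big)^{1 - y_i}$$
$$= \arg\max_\theta \sum_{i=1}^n \big[ y_i \log h(x_i^\top \theta) + (1 - y_i) \log(1 - h(x_i^\top \theta)) \big]$$
$$= \arg\max_\theta \frac{1}{n} \sum_{i=1}^n \big[ y_i x_i^\top \theta - \log(1 + \exp(x_i^\top \theta)) \big]$$
$$= \arg\min_\theta \frac{1}{n} \sum_{i=1}^n \big[ \log(1 + \exp(x_i^\top \theta)) - y_i x_i^\top \theta \big].$$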
53 Optimization for Logistic Regression
Convex problem? Let $F(\theta) = -L(\theta)$, the negative log-likelihood (normalized by $1/n$). Then
$$\nabla^2 F(\theta) = \frac{1}{n} \sum_{i=1}^n h(x_i^\top \theta)\big(1 - h(x_i^\top \theta)\big) x_i x_i^\top \succeq 0.$$
No closed form solution.
Gradient descent and stochastic gradient algorithms are applicable.
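As an illustration, the normalized negative log-likelihood, its gradient $\frac{1}{n} \sum_i (h(x_i^\top \theta) - y_i) x_i$, and its Hessian can be written in NumPy and minimized with plain gradient descent. The function names, the step size $4 / \Lambda_{\max}(\frac{1}{n} X^\top X)$ (using $h(1 - h) \le 1/4$), and the synthetic data are assumptions for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(X, y, theta):
    """F(theta) = (1/n) sum [log(1 + exp(x_i^T theta)) - y_i x_i^T theta], y_i in {0, 1}."""
    z = X @ theta
    return np.mean(np.logaddexp(0.0, z) - y * z)   # logaddexp avoids overflow

def logistic_grad(X, y, theta):
    return X.T @ (sigmoid(X @ theta) - y) / X.shape[0]

def logistic_hessian(X, theta):
    h = sigmoid(X @ theta)
    w = h * (1.0 - h)                               # per-sample weights in [0, 1/4]
    return (X * w[:, None]).T @ X / X.shape[0]

# Synthetic data drawn from the logistic model (assumed for illustration).
rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.normal(size=(n, d))
theta_star = np.array([1.0, -2.0, 0.5])
y = (rng.random(n) < sigmoid(X @ theta_star)).astype(float)

# Since h(1 - h) <= 1/4, the Hessian is bounded by (1/4) X^T X / n.
eta = 4.0 / np.linalg.eigvalsh(X.T @ X / n).max()
theta = np.zeros(d)
for _ in range(2000):
    theta = theta - eta * logistic_grad(X, y, theta)

final_grad_norm = np.linalg.norm(logistic_grad(X, y, theta))
min_hessian_eig = np.linalg.eigvalsh(logistic_hessian(X, theta)).min()
```

The smallest Hessian eigenvalue should be nonnegative, consistent with convexity, and the gradient norm at the final iterate should be numerically zero.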
54 Prediction for Logistic Regression
Prediction: given $x^*$, we predict $y^* = 1$ when
$$P(y^* = 1) = \frac{1}{1 + \exp(-\hat\theta^\top x^*)} \ge 0.5.$$
Why linear classification?
$$P(y^* = 1) \ge 0.5 \iff \hat\theta^\top x^* \ge 0 \implies \hat y^* = \mathrm{sign}(\hat\theta^\top x^*).$$
55 Logistic Loss
Given $x_1, \dots, x_n \in \mathbb{R}^d$, $y_1, \dots, y_n \in \{-1, 1\}$, and $\theta^* \in \mathbb{R}^d$,
$$P(y_i = 1) = \frac{1}{1 + \exp(-x_i^\top \theta^*)} \quad \text{for } i = 1, \dots, n.$$
An alternative formulation:
$$\hat\theta = \arg\min_\theta \frac{1}{n} \sum_{i=1}^n \log(1 + \exp(-y_i x_i^\top \theta)).$$
We can also use the 0-1 loss:
$$\hat\theta = \arg\min_\theta \frac{1}{n} \sum_{i=1}^n 1(\mathrm{sign}(x_i^\top \theta) \ne y_i).$$
56 Loss Functions for Classification
57 Newton's Method
At the $k$-th iteration, we take
$$\theta^{(k+1)} = \theta^{(k)} - \eta_k \big[\nabla^2 F(\theta^{(k)})\big]^{-1} \nabla F(\theta^{(k)}),$$
where $\eta_k > 0$ is a step size parameter.
The second order Taylor approximation:
$$\theta^{(k+0.5)} = \arg\min_\theta F(\theta^{(k)}) + \nabla F(\theta^{(k)})^\top (\theta - \theta^{(k)}) + \frac{1}{2} (\theta - \theta^{(k)})^\top \nabla^2 F(\theta^{(k)}) (\theta - \theta^{(k)}).$$
Backtracking line search:
$$\theta^{(k+1)} = \theta^{(k)} + \eta_k (\theta^{(k+0.5)} - \theta^{(k)}).$$
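A damped Newton iteration for the logistic regression objective might look like the NumPy sketch below. It solves a linear system rather than inverting the Hessian, and it uses a simple halving rule that only requires the objective not to increase; that rule, the function names, and the synthetic data are assumptions for this example rather than the lecture's exact line search.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def F(X, y, theta):
    """Normalized negative log-likelihood for labels y_i in {0, 1}."""
    z = X @ theta
    return np.mean(np.logaddexp(0.0, z) - y * z)

def newton_logistic(X, y, max_iter=20, tol=1e-8):
    n, d = X.shape
    theta = np.zeros(d)
    for k in range(max_iter):
        h = sigmoid(X @ theta)
        g = X.T @ (h - y) / n
        if np.linalg.norm(g) <= tol:
            return theta, k
        H = (X * (h * (1.0 - h))[:, None]).T @ X / n
        direction = np.linalg.solve(H, g)      # theta^(k+0.5) = theta - direction
        eta = 1.0
        # Halve the step until the objective does not increase.
        while F(X, y, theta - eta * direction) > F(X, y, theta) and eta > 1e-8:
            eta *= 0.5
        theta = theta - eta * direction
    return theta, max_iter

# Synthetic data drawn from the logistic model (assumed for illustration).
rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.normal(size=(n, d))
theta_star = np.array([1.0, -2.0, 0.5])
y = (rng.random(n) < sigmoid(X @ theta_star)).astype(float)

theta_newton, newton_iters = newton_logistic(X, y)
newton_grad_norm = np.linalg.norm(X.T @ (sigmoid(X @ theta_newton) - y) / n)
```

On a small problem like this, Newton's method typically needs only a handful of iterations, compared with the hundreds or thousands a first order method may take to reach the same gradient norm; each iteration, however, costs $O(nd^2 + d^3)$.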
58 Newton's Method
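59 Newton's Method
Sublinear + quadratic convergence:
Given $\|\theta^{(k)} - \hat\theta\|_2^2 \le R$ for some $R \le 1$, we have
$$\|\theta^{(k+1)} - \hat\theta\|_2^2 \le (1 - \delta) \|\theta^{(k)} - \hat\theta\|_2^4.$$
Given $\|\theta^{(k)} - \hat\theta\|_2^2 \ge R$, we have
$$\|\theta^{(k+1)} - \hat\theta\|_2^2 = O(1/k).$$
Iteration complexity (some parameters hidden for simplicity):
$$O\left(\log\left(\frac{1}{R}\right) + \log\log\left(\frac{1}{\epsilon}\right)\right).$$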
60 Newton's Method
Advantage: more efficient for highly accurate solutions.
Avoids extensively calculating log or exp functions: Taylor expansions combined with a table.
Fewer line search steps, due to quadratic convergence.
Often more efficient than gradient descent.
61 Newton's Method
62 Newton's Method
Disadvantage:
Computing inverse Hessian matrices is expensive!
Storing inverse Hessian matrices is expensive!
Subsampled Newton:
$$\theta^{(k+1)} = \theta^{(k)} - \eta_k \big[H(\theta^{(k)})\big]^{-1} \nabla F(\theta^{(k)}).$$
Quasi-Newton methods: DFP, BFGS, Broyden, SR1. They use differences of gradient vectors to approximate Hessian matrices.
More informationStochastic Optimization Methods for Machine Learning. Jorge Nocedal
Stochastic Optimization Methods for Machine Learning Jorge Nocedal Northwestern University SIAM CSE, March 2017 1 Collaborators Richard Byrd R. Bollagragada N. Keskar University of Colorado Northwestern
More informationPart 4: Conditional Random Fields
Part 4: Conditional Random Fields Sebastian Nowozin and Christoph H. Lampert Colorado Springs, 25th June 2011 1 / 39 Problem (Probabilistic Learning) Let d(y x) be the (unknown) true conditional distribution.
More informationLogistic Regression. Mohammad Emtiyaz Khan EPFL Oct 8, 2015
Logistic Regression Mohammad Emtiyaz Khan EPFL Oct 8, 2015 Mohammad Emtiyaz Khan 2015 Classification with linear regression We can use y = 0 for C 1 and y = 1 for C 2 (or vice-versa), and simply use least-squares
More informationNeural Networks and Deep Learning
Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost
More informationMaximum Likelihood, Logistic Regression, and Stochastic Gradient Training
Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions
More informationBeyond stochastic gradient descent for large-scale machine learning
Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines - October 2014 Big data revolution? A new
More informationFast Stochastic Optimization Algorithms for ML
Fast Stochastic Optimization Algorithms for ML Aaditya Ramdas April 20, 205 This lecture is about efficient algorithms for minimizing finite sums min w R d n i= f i (w) or min w R d n f i (w) + λ 2 w 2
More informationTowards stability and optimality in stochastic gradient descent
Towards stability and optimality in stochastic gradient descent Panos Toulis, Dustin Tran and Edoardo M. Airoldi August 26, 2016 Discussion by Ikenna Odinaka Duke University Outline Introduction 1 Introduction
More informationLecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.
Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods
More informationMachine Learning: Chenhao Tan University of Colorado Boulder LECTURE 5
Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 5 Slides adapted from Jordan Boyd-Graber, Tom Mitchell, Ziv Bar-Joseph Machine Learning: Chenhao Tan Boulder 1 of 27 Quiz question For
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More informationr=1 r=1 argmin Q Jt (20) After computing the descent direction d Jt 2 dt H t d + P (x + d) d i = 0, i / J
7 Appendix 7. Proof of Theorem Proof. There are two main difficulties in proving the convergence of our algorithm, and none of them is addressed in previous works. First, the Hessian matrix H is a block-structured
More informationLecture 4: Types of errors. Bayesian regression models. Logistic regression
Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture
More informationAlgorithmic Stability and Generalization Christoph Lampert
Algorithmic Stability and Generalization Christoph Lampert November 28, 2018 1 / 32 IST Austria (Institute of Science and Technology Austria) institute for basic research opened in 2009 located in outskirts
More informationComments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms
Neural networks Comments Assignment 3 code released implement classification algorithms use kernels for census dataset Thought questions 3 due this week Mini-project: hopefully you have started 2 Example:
More informationOptimization Methods for Machine Learning
Optimization Methods for Machine Learning Sathiya Keerthi Microsoft Talks given at UC Santa Cruz February 21-23, 2017 The slides for the talks will be made available at: http://www.keerthis.com/ Introduction
More informationOptimization for Machine Learning
Optimization for Machine Learning (Lecture 3-A - Convex) SUVRIT SRA Massachusetts Institute of Technology Special thanks: Francis Bach (INRIA, ENS) (for sharing this material, and permitting its use) MPI-IS
More informationCSC321 Lecture 7: Optimization
CSC321 Lecture 7: Optimization Roger Grosse Roger Grosse CSC321 Lecture 7: Optimization 1 / 25 Overview We ve talked a lot about how to compute gradients. What do we actually do with them? Today s lecture:
More informationLinear classifiers: Overfitting and regularization
Linear classifiers: Overfitting and regularization Emily Fox University of Washington January 25, 2017 Logistic regression recap 1 . Thus far, we focused on decision boundaries Score(x i ) = w 0 h 0 (x
More informationStatistical Machine Learning Hilary Term 2018
Statistical Machine Learning Hilary Term 2018 Pier Francesco Palamara Department of Statistics University of Oxford Slide credits and other course material can be found at: http://www.stats.ox.ac.uk/~palamara/sml18.html
More informationWarm up: risk prediction with logistic regression
Warm up: risk prediction with logistic regression Boss gives you a bunch of data on loans defaulting or not: {(x i,y i )} n i= x i 2 R d, y i 2 {, } You model the data as: P (Y = y x, w) = + exp( yw T
More informationIFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent
IFT 6085 - Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s):
More informationGRADIENT DESCENT. CSE 559A: Computer Vision GRADIENT DESCENT GRADIENT DESCENT [0, 1] Pr(y = 1) w T x. 1 f (x; θ) = 1 f (x; θ) = exp( w T x)
0 x x x CSE 559A: Computer Vision For Binary Classification: [0, ] f (x; ) = σ( x) = exp( x) + exp( x) Output is interpreted as probability Pr(y = ) x are the log-odds. Fall 207: -R: :30-pm @ Lopata 0
More informationComparison of Modern Stochastic Optimization Algorithms
Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,
More informationLecture 2 Machine Learning Review
Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things
More informationmin f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;
Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many
More informationConvex Optimization Lecture 16
Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean
More informationLecture 5: Logistic Regression. Neural Networks
Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture
More informationSequence Modelling with Features: Linear-Chain Conditional Random Fields. COMP-599 Oct 6, 2015
Sequence Modelling with Features: Linear-Chain Conditional Random Fields COMP-599 Oct 6, 2015 Announcement A2 is out. Due Oct 20 at 1pm. 2 Outline Hidden Markov models: shortcomings Generative vs. discriminative
More informationAnnouncements Kevin Jamieson
Announcements Project proposal due next week: Tuesday 10/24 Still looking for people to work on deep learning Phytolith project, join #phytolith slack channel 2017 Kevin Jamieson 1 Gradient Descent Machine
More informationECE521 lecture 4: 19 January Optimization, MLE, regularization
ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity
More informationLecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data
Lecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data DD2424 March 23, 2017 Binary classification problem given labelled training data Have labelled training examples? Given
More informationOn Markov chain Monte Carlo methods for tall data
On Markov chain Monte Carlo methods for tall data Remi Bardenet, Arnaud Doucet, Chris Holmes Paper review by: David Carlson October 29, 2016 Introduction Many data sets in machine learning and computational
More informationStochastic and online algorithms
Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem
More informationIntroduction to Machine Learning HW6
CS 189 Spring 2018 Introduction to Machine Learning HW6 Your self-grade URL is http://eecs189.org/self_grade?question_ids=1_1,1_ 2,2_1,2_2,3_1,3_2,3_3,4_1,4_2,4_3,4_4,4_5,4_6,5_1,5_2,6. This homework is
More informationNeural Networks: Optimization & Regularization
Neural Networks: Optimization & Regularization Shan-Hung Wu shwu@cs.nthu.edu.tw Department of Computer Science, National Tsing Hua University, Taiwan Machine Learning Shan-Hung Wu (CS, NTHU) NN Opt & Reg
More information