Lecture 1: Supervised Learning


1 Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech

2 ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine Learning: (Supervised) Regression Analysis. Let's start by talking about a few examples of supervised learning problems. Suppose we have a dataset giving the living areas and prices of 47 houses from Portland, Oregon (living area in feet², price in $1000s). We can plot this data. Given data like this, how can we learn to predict the prices of other houses in Portland, as a function of the size of their living areas? Given x_1, ..., x_n ∈ R^d, y_1, ..., y_n ∈ R, and f: R^d → R, y_i = f(x_i) + ε_i for i = 1, ..., n, where the ε_i's are i.i.d. with Eε_i = 0, Eε_i² = σ² < ∞. Simple Linear Function: f(x_i) = x_i^T θ. Why is it called supervised learning? Tuo Zhao Lecture 1: Supervised Learning 2/62

3 Why Supervised? Tuo Zhao Lecture 1: Supervised Learning 3/62

4 Play on Words? Two unknown functions f_0, f_1: R^d → R: y_i = 1(z_i = 1) f_1(x_i) + 1(z_i = 0) f_0(x_i) + ε_i, where i = 1, ..., n, and the z_i's are i.i.d. with P(z_i = 1) = δ and P(z_i = 0) = 1 − δ for δ ∈ (0, 1). z_i: Latent Variables. Supervised? Unsupervised? Tuo Zhao Lecture 1: Supervised Learning 4/62

5 Linear Regression

6 Linear Regression Given x_1, ..., x_n ∈ R^d, y_1, ..., y_n ∈ R, and θ ∈ R^d, y_i = x_i^T θ + ε_i for i = 1, ..., n, where the ε_i's are i.i.d. with Eε_i = 0, Eε_i² = σ² < ∞. Ordinary Least Square Regression: θ̂_OLS = arg min_θ (1/2n) Σ_{i=1}^n (y_i − x_i^T θ)². Least Absolute Deviation Regression: θ̂_LAD = arg min_θ (1/n) Σ_{i=1}^n |y_i − x_i^T θ|. Tuo Zhao Lecture 1: Supervised Learning 6/62

7 Robust Regression Tuo Zhao Lecture 1: Supervised Learning 7/62

8 Linear Regression Matrix Notation X = [x_1, ..., x_n]^T ∈ R^{n×d}, y = [y_1, ..., y_n]^T ∈ R^n, y = Xθ + ε, where Eε = 0, Eεε^T = σ² I_n. Ordinary Least Square Regression: θ̂_OLS = arg min_θ (1/2n) ||y − Xθ||₂². Least Absolute Deviation Regression: θ̂_LAD = arg min_θ (1/n) ||y − Xθ||₁. Tuo Zhao Lecture 1: Supervised Learning 8/62

9 Least Square Regression Analytical Solution Ordinary Least Square Regression: θ̂_OLS = arg min_θ (1/2n) ||y − Xθ||₂² =: arg min_θ L(θ). First order optimality condition: ∇L(θ̂) = (1/n) X^T(Xθ̂ − y) = 0, i.e., X^T X θ̂ = X^T y. Analytical solution and unbiasedness: θ̂ = (X^T X)^{-1} X^T y = (X^T X)^{-1} X^T(Xθ + ε) = θ + (X^T X)^{-1} X^T ε, so E_ε θ̂ = θ. Tuo Zhao Lecture 1: Supervised Learning 9/62
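
A minimal numpy sketch of the closed-form estimator above (not from the slides); the synthetic data setup (n, d, sigma) is an assumption for illustration only.

import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 5, 0.5                      # assumed problem sizes
X = rng.standard_normal((n, d))
theta_true = rng.standard_normal(d)
y = X @ theta_true + sigma * rng.standard_normal(n)

# Solve the normal equations X^T X theta = X^T y; solving the linear system
# is preferable to forming (X^T X)^{-1} explicitly.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print("estimation error:", np.linalg.norm(theta_hat - theta_true))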

10 Least Square Regression Convexity Ordinary Least Square Regression: θ̂_OLS = arg min_θ (1/2n) ||y − Xθ||₂² =: arg min_θ L(θ). Second order optimality condition: ∇²L(θ̂) = (1/n) X^T X ⪰ 0. Convex Function: L(θ') ≥ L(θ) + ∇L(θ)^T(θ' − θ). Tuo Zhao Lecture 1: Supervised Learning 10/62

11 Convex vs. Nonconvex Optimization Stationary Solutions: ∇L(θ̂) = 0. Convex: every stationary solution is a global optimum (easy but restrictive). Nonconvex: stationary solutions include local maxima, saddle points, local optima, and global optima (difficult but flexible). We may get stuck at a local optimum or saddle point for nonconvex optimization. Tuo Zhao Lecture 1: Supervised Learning 11/62

12 Maximum Likelihood Estimation X = [x_1, ..., x_n]^T ∈ R^{n×d}, y = [y_1, ..., y_n]^T ∈ R^n, y = Xθ + ε, where ε ~ N(0, σ² I_n). Likelihood Function: L(θ) = (2πσ²)^{−n/2} exp(−(1/(2σ²))(y − Xθ)^T(y − Xθ)). Maximum Log-Likelihood Estimation: θ̂_MLE = arg max_θ log L(θ) = arg max_θ −(n/2) log(2πσ²) − (1/(2σ²))||y − Xθ||₂². Tuo Zhao Lecture 1: Supervised Learning 12/62

13 Maximum Likelihood Estimation Maximum Log-Likelihood Estimation: θ̂_MLE = arg max_θ −(n/2) log(2πσ²) − (1/(2σ²))||y − Xθ||₂². Given σ² as some unknown constant, θ̂_MLE = arg max_θ −(1/2n)||y − Xθ||₂² = arg min_θ (1/2n)||y − Xθ||₂². Probabilistic Interpretation: Simple and illustrative. Restrictive and potentially misleading. Remember t-test? What if the model is wrong? Tuo Zhao Lecture 1: Supervised Learning 13/62

14 Computational Cost of OLS The number of basic operations, e.g., addition, subtraction, multiplication, division. Matrix Multiplication X^T X: O(nd²). Matrix Inverse (X^T X)^{-1}: O(d³). Matrix-Vector Multiplication X^T y: O(nd). Matrix-Vector Multiplication [(X^T X)^{-1}][X^T y]: O(d²). Overall computational cost: O(nd²), given n ≥ d. Tuo Zhao Lecture 1: Supervised Learning 14/62

15 Scalability and Efficiency of OLS Simple closed form. Overall computational cost: O(nd²). Massive data: Both n and d are large. Not very efficient and scalable. Better ways to improve the computation? Tuo Zhao Lecture 1: Supervised Learning 15/62

16 Optimization for Linear Regression

17 Vanilla Gradient Descent θ^(k+1) = θ^(k) − η_k ∇L(θ^(k)), where η_k > 0 is the step size parameter (fixed or chosen by line search). Stop when the gradient is small: ||∇L(θ^(K))||₂ ≤ δ. (Figure: iterates θ^(k) descending along −∇f(θ^(k)) toward the point where ∇f(θ̂) = 0.) Tuo Zhao Lecture 1: Supervised Learning 17/62
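
A short sketch of vanilla gradient descent on the OLS objective L(θ) = (1/2n)||y − Xθ||₂²; the synthetic data, the step size η = 1/L, and the tolerance are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def grad(theta):
    # gradient of L(theta) = (1/2n)||y - X theta||_2^2
    return X.T @ (X @ theta - y) / n

eta = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()   # worst-case step size 1/L
theta = np.zeros(d)
for k in range(1000):
    g = grad(theta)
    if np.linalg.norm(g) <= 1e-8:                   # stop when the gradient is small
        break
    theta = theta - eta * g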

18 Computational Cost of VGD Gradient: ∇L(θ^(k)) = (1/n) X^T(Xθ^(k) − y). Matrix-Vector Multiplication Xθ^(k): O(nd). Vector Subtraction Xθ^(k) − y: O(n). Matrix-Vector Multiplication X^T(Xθ^(k) − y): O(nd). Overall computational cost per iteration: O(nd). Better than O(nd²), but how many iterations? Tuo Zhao Lecture 1: Supervised Learning 18/62

19 Rate of Convergence What are Good Algorithms? Asymptotic Convergence: θ^(k) → θ̂ as k → ∞? Nonasymptotic Rate of Convergence: The optimization error after k iterations. Example: Gap in Objective Value, Sublinear Convergence: f(θ^(k)) − f(θ̂) = O(L/k²) vs. O(L/k), where L is some constant depending on the problem. Example: Gap in Parameter, Linear Convergence: ||θ^(k) − θ̂||₂² = O((1 − 1/κ)^k) vs. O((1 − 1/√κ)^k), where κ is some constant depending on the problem. Tuo Zhao Lecture 1: Supervised Learning 19/62

20 Iteration Complexity of Gradient Descent Iteration Complexity: We need at most K = O(κ log(1/ε)) iterations such that ||θ^(K) − θ̂||₂² ≤ ε, where κ is some constant depending on the problem. What is κ? It is related to smoothness and convexity. Tuo Zhao Lecture 1: Supervised Learning 20/62

21 Strong Convexity There exists a constant µ such that for any θ and θ', we have L(θ') ≥ L(θ) + ∇L(θ)^T(θ' − θ) + (µ/2)||θ' − θ||₂². (Figure: L(θ') lies above the quadratic lower bound L(θ) + ∇L(θ)^T(θ' − θ) + (µ/2)||θ' − θ||₂², which lies above the linear lower bound L(θ) + ∇L(θ)^T(θ' − θ).) Tuo Zhao Lecture 1: Supervised Learning 21/62

22 Strong Smoothness There exists a constant L such that for any θ and θ', we have L(θ') ≤ L(θ) + ∇L(θ)^T(θ' − θ) + (L/2)||θ' − θ||₂². (Figure: L(θ') lies below the quadratic upper bound L(θ) + ∇L(θ)^T(θ' − θ) + (L/2)||θ' − θ||₂².) Tuo Zhao Lecture 1: Supervised Learning 22/62

23 Condition Number κ = L/µ. (Figure: contour plots of two quadratics, f(θ) = 0.9θ₁² + … and f(θ) = 0.5θ₁² + …, comparing an ill-conditioned problem with a well-conditioned one.) Tuo Zhao Lecture 1: Supervised Learning 23/62

24 Vector Field Representation (Figure: gradient vector fields of the same two quadratics, f(θ) = 0.9θ₁² + … and f(θ) = 0.5θ₁² + ….) Tuo Zhao Lecture 1: Supervised Learning 24/62

25 Understanding Regularity Conditions Mean Value Theorem: There exists a constant z ∈ [0, 1] such that for any θ and θ', we have L(θ') − L(θ) − ∇L(θ)^T(θ' − θ) = (1/2)(θ' − θ)^T ∇²L(θ̃)(θ' − θ), where θ̃ is a convex combination: θ̃ = zθ' + (1 − z)θ. Hessian Matrix for OLS: ∇²L(θ̃) = (1/n) X^T X. Control the remainder: Λ_min((1/n) X^T X) ≤ (θ' − θ)^T ∇²L(θ̃)(θ' − θ) / ||θ' − θ||₂² ≤ Λ_max((1/n) X^T X), where the lower bound is µ and the upper bound is L. Tuo Zhao Lecture 1: Supervised Learning 25/62
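
As a quick numerical check of these quantities, the sketch below computes µ, L, and κ as the extreme eigenvalues of the OLS Hessian (1/n) X^T X on assumed synthetic data.

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))

H = X.T @ X / n                       # Hessian of the OLS objective
eigs = np.linalg.eigvalsh(H)
mu, L = eigs[0], eigs[-1]             # strong convexity / smoothness constants
print("mu =", mu, "L =", L, "kappa =", L / mu)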

26 Understanding Gradient Descent Algorithms Iteratively Minimize Quadratic Approximation At the (k + 1)-th iteration, we consider Q(θ; θ^(k)) = L(θ^(k)) + ∇L(θ^(k))^T(θ − θ^(k)) + (L/2)||θ − θ^(k)||₂². We have Q(θ; θ^(k)) ≥ L(θ) and Q(θ^(k); θ^(k)) = L(θ^(k)). We take θ^(k+1) = arg min_θ Q(θ; θ^(k)) = θ^(k) − (1/L) ∇L(θ^(k)). Tuo Zhao Lecture 1: Supervised Learning 26/62

27 Backtracking Line Search The worst case fixed step size: η_k = 1/L. At the (k + 1)-th iteration, we first try η_k = η_{k−1}, i.e., θ^(k+1) = θ^(k) − η_k ∇L(θ^(k)), if L(θ^(k+1)) ≤ Q(θ^(k+1); θ^(k)). Otherwise, we take η_k = (1 − δ)^m η_{k−1}, where δ > 0 and m is the smallest positive integer such that L(θ^(k+1)) ≤ Q(θ^(k+1); θ^(k)). Tuo Zhao Lecture 1: Supervised Learning 27/62
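
A sketch of backtracking line search for gradient descent on the OLS objective: the candidate step is accepted once L(θ − ηg) ≤ L(θ) − (η/2)||g||₂², i.e., the quadratic-upper-bound condition above with the curvature constant replaced by 1/η. The shrinkage factor 0.75 and the data are assumptions.

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def loss(theta):
    r = y - X @ theta
    return r @ r / (2 * n)

def grad(theta):
    return X.T @ (X @ theta - y) / n

theta, eta = np.zeros(d), 1.0
for k in range(500):
    g = grad(theta)
    if np.linalg.norm(g) <= 1e-8:
        break
    # Start from the previous step size and shrink until the new point
    # falls below the quadratic upper bound.
    while loss(theta - eta * g) > loss(theta) - 0.5 * eta * g @ g:
        eta *= 0.75
    theta = theta - eta * g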

28 Backtracking Line Search (Figure: at θ^(k), the step size starts from η_k = η_{k−1} and is shrunk, e.g., to η_k = 0.75 η_{k−1}, until the objective falls below the quadratic approximation.) Tuo Zhao Lecture 1: Supervised Learning 28/62

29 Tradeoff Statistics and Computation

30 High Precision or Low Precision Can we tolerate a large ε? From a learning perspective, our interest is θ, not θ̂. Error decomposition: ||θ^(K) − θ||₂ ≤ ||θ^(K) − θ̂||₂ (Opt. Error) + ||θ̂ − θ||₂ (Stat. Error). High precision expects something like ||θ^(K) − θ̂||₂ ≈ 0. Does it make any difference? Tuo Zhao Lecture 1: Supervised Learning 30/62

31 High Precision or Low Precision The statistical error is undefeatable! (Figure: error in logarithmic scale, statistical error vs. optimization error, as a function of the number of iterations.) Tuo Zhao Lecture 1: Supervised Learning 31/62

32 Tradeoff Statistical and Optimization Errors The statistical error is undefeatable! The statistical error of the optimal solution: E||θ̂ − θ||₂² = E||(X^T X)^{-1} X^T ε||₂² = σ² tr[(X^T X)^{-1}] = O(σ² d / n). We only need ||θ^(K) − θ̂||₂² ≲ ||θ̂ − θ||₂². Given K = O(κ log(n/(σ² d))), we have E||θ^(K) − θ||₂² = O(σ² d / n). Tuo Zhao Lecture 1: Supervised Learning 32/62

33 Agnostic Learning All models are wrong, but some are useful! Data generating process: (X, Y) ~ D. The oracle model: f_oracle(X) = X^T θ_oracle, where θ_oracle = arg min_θ E_D (Y − X^T θ)². The estimated model: f̂(X) = X^T θ̂, where θ̂ = arg min_θ (1/2n)||y − Xθ||₂², and (x_1, y_1), ..., (x_n, y_n) ~ D. At the K-th iteration: f^(K)(X) = X^T θ^(K). Tuo Zhao Lecture 1: Supervised Learning 33/62

34 Agnostic Learning (See more details in CS-7545) All models are wrong, but some are useful! (Diagram: the true model lies outside the class of all linear models, which contains the oracle model and the estimated model.) Tuo Zhao Lecture 1: Supervised Learning 34/62

35 Agnostic Learning Approximation error: E_D (Y − f_oracle(X))². Estimation error: E_D (f̂(X) − f_oracle(X))². Optimization error: E_D (f^(K)(X) − f̂(X))². Decomposition of the statistical error: E_D (Y − f^(K)(X))² ≲ E_D (Y − f_oracle(X))² + E_D (f̂(X) − f_oracle(X))² + E_D (f^(K)(X) − f̂(X))². How should we choose ε? Tuo Zhao Lecture 1: Supervised Learning 35/62

36 Scalable Computation of Linear Regression

37 Stochastic Approximation What if n is too large? Empirical Risk Minimization: L(θ) = (1/n) Σ_{i=1}^n l_i(θ). For Least Square Regression: l_i(θ) = (1/2)(y_i − x_i^T θ)² or l_i(θ) = (1/(2|M_i|)) Σ_{j∈M_i} (y_j − x_j^T θ)². Randomly sample i from 1, ..., n with equal probability, so that E_i ∇l_i(θ) = ∇L(θ) and E||∇l_i(θ) − ∇L(θ)||₂² ≤ M². Stochastic Gradient (SG): replace ∇L(θ) with ∇l_i(θ), θ^(k+1) = θ^(k) − η_k ∇l_{i_k}(θ^(k)). Tuo Zhao Lecture 1: Supervised Learning 37/62
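
A minimal sketch of the stochastic gradient update for least squares; the decreasing step size roughly follows the η_k ≍ 1/(kµ) schedule discussed on the following slides, with a small offset added to keep the first few steps stable (an implementation choice, not from the slides).

import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

mu = np.linalg.eigvalsh(X.T @ X / n).min()       # strong convexity constant
theta = np.zeros(d)
for k in range(50_000):
    i = rng.integers(n)                          # sample one index uniformly
    g_i = (X[i] @ theta - y[i]) * X[i]           # gradient of l_i(theta)
    theta -= g_i / (mu * (k + 10))               # decreasing step size ~ 1/(k mu)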

38 Why Stochastic Gradient? Perturbed Descent Directions Tuo Zhao Lecture 1: Supervised Learning 38/62

39 Convergence of Stochastic Gradient Algorithms How many iterations do we need? A sequence of decreasing step size parameters: η_k ≍ 1/(kµ). Given a pre-specified error ε, we need K = O((M² + L²)/(µ² ε)) iterations such that E||θ̄^(K) − θ̂||₂² ≤ ε, where θ̄^(K) = (1/K) Σ_{k=1}^K θ^(k). When µ² ε n ≥ M² + L², i.e., n is super large, O(d(M² + L²)/(µ² ε)) vs. Õ(κnd) vs. O(nd²). Tuo Zhao Lecture 1: Supervised Learning 39/62

40 Why decreasing step size? Control Variance + Sufficient Descent ⇒ Convergence Intuition: θ^(k+1) = θ^(k) − η∇L(θ^(k)) (Descent) + η(∇L(θ^(k)) − ∇l_i(θ^(k))) (Error). Not summable: Sufficient exploration, Σ_{k=1}^∞ η_k = ∞. Square summable: Diminishing variance, Σ_{k=1}^∞ η_k² < ∞. Tuo Zhao Lecture 1: Supervised Learning 40/62

41 Minibatch Variance Reduction Mini-batch SGD: If |M_i| increases, then M² decreases. Larger |M_i| means more computational cost per iteration. Smaller M² means fewer iterations. Tuo Zhao Lecture 1: Supervised Learning 41/62
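
A sketch of the mini-batch variant: averaging the stochastic gradient over a batch of size B reduces the variance bound M² at B times the per-iteration cost. The batch size and the fixed step size are assumptions.

import numpy as np

rng = np.random.default_rng(0)
n, d, B = 10_000, 5, 32
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

theta, eta = np.zeros(d), 0.1
for k in range(5_000):
    idx = rng.choice(n, size=B, replace=False)    # sample a mini-batch M_i
    g = X[idx].T @ (X[idx] @ theta - y[idx]) / B  # averaged stochastic gradient
    theta -= eta * g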

42 Variance Reduction by Control Variates Stochastic Variance Reduced Gradient Algorithm (SVRG): At the k-th epoch, set θ̃ = θ^[k] and θ^(0) = θ^[k]. At the t-th iteration of the k-th epoch, θ^(t+1) = θ^(t) − η_t (∇l_i(θ^(t)) − ∇l_i(θ̃) + ∇L(θ̃)). After m iterations of the k-th epoch, θ^[k+1] = θ^(m). Tuo Zhao Lecture 1: Supervised Learning 42/62
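
A sketch of the SVRG updates above for least squares; the step size η ≈ 0.1/L_max, the epoch length m = n, and the number of epochs are assumptions.

import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def full_grad(theta):
    return X.T @ (X @ theta - y) / n

L_max = (X ** 2).sum(axis=1).max()     # max_i ||x_i||^2 bounds the L_i's
eta, m = 0.1 / L_max, n
theta_snap = np.zeros(d)               # theta tilde, the epoch snapshot
for epoch in range(5):
    mu_snap = full_grad(theta_snap)    # full gradient at the snapshot
    theta = theta_snap.copy()
    for t in range(m):
        i = rng.integers(n)
        g_i = (X[i] @ theta - y[i]) * X[i]
        g_snap_i = (X[i] @ theta_snap - y[i]) * X[i]
        theta -= eta * (g_i - g_snap_i + mu_snap)   # variance-reduced update
    theta_snap = theta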

43 Strong Smoothness and Convexity Regularity Conditions (Strong Smoothness) There exist constants L_i's such that for any θ and θ', we have l_i(θ') − l_i(θ) − ∇l_i(θ)^T(θ' − θ) ≤ (L_i/2)||θ' − θ||₂². (Strong Convexity) There exists a constant µ such that for any θ and θ', we have L(θ') − L(θ) − ∇L(θ)^T(θ' − θ) ≥ (µ/2)||θ' − θ||₂². Condition Number: κ_max = max_i L_i / µ ≥ κ = L/µ. Tuo Zhao Lecture 1: Supervised Learning 43/62

44 Why does SVRG work? The strong smoothness implies: ||∇l_i(θ^(k)) − ∇l_i(θ̃)||₂ ≤ L_max ||θ^(k) − θ̃||₂. Bias correction: E[∇l_i(θ^(k)) − ∇l_i(θ̃)] = ∇L(θ^(k)) − ∇L(θ̃). Variance Reduction: As θ^(k) → θ̂ and θ̃ → θ̂, E||∇l_i(θ^(k)) − ∇l_i(θ̃) + ∇L(θ̃) − ∇L(θ^(k))||₂² ≤ E||∇l_i(θ^(k)) − ∇l_i(θ̃)||₂² → 0. Tuo Zhao Lecture 1: Supervised Learning 44/62

45 Convergence of SVRG How many iterations do we need? Fixed step size parameter: η_k ≍ 1/L_max. Given a pre-specified error ε and m ≳ κ_max, we need K = O(log(1/ε)) epochs such that E||θ^[K] − θ̂||₂² ≤ ε. Total number of operations: Õ(nd + dκ_max) vs. O(dM²/(µ² ε) + dκ²/ε) vs. Õ(ndκ). Tuo Zhao Lecture 1: Supervised Learning 45/62

46 Comparison of GD, SGD and SVRG Tuo Zhao Lecture 1: Supervised Learning 46/62

47 Summary The empirical performance highly depends on the implementation. Cyclic or shuffled order (not stochastic) is often used in practice. Too many tuning parameters means my algorithm might only work in theory. Theoretical bounds can be very loose. The constant may matter a lot in practice. Good software engineers with B.S./M.S. degrees can earn much more than Ph.D.s, if they know how to code efficient algorithms. Tuo Zhao Lecture 1: Supervised Learning 47/62

48 Classification Analysis

49 Classification vs. Regression Tuo Zhao Lecture 1: Supervised Learning 49/62

50 Logistic Regression Given x_1, ..., x_n ∈ R^d and θ ∈ R^d, y_i ~ Bernoulli(h(x_i^T θ)) for i = 1, ..., n, where h: (−∞, ∞) → [0, 1] is the Logistic/Sigmoid function h(z) = 1/(1 + exp(−z)). Remark: h(0) = 0.5, h(−∞) = 0 and h(∞) = 1. Tuo Zhao Lecture 1: Supervised Learning 50/62

51 Logistic/Sigmoid Function Tuo Zhao Lecture 1: Supervised Learning 51/62

52 Logistic Regression Maximum Likelihood Estimation θ̂ = arg max_θ log L(θ) = arg max_θ log Π_{i=1}^n h(x_i^T θ)^{y_i} (1 − h(x_i^T θ))^{1−y_i} = arg max_θ Σ_{i=1}^n [y_i log h(x_i^T θ) + (1 − y_i) log(1 − h(x_i^T θ))] = arg max_θ (1/n) Σ_{i=1}^n [y_i x_i^T θ − log(1 + exp(x_i^T θ))] = arg min_θ (1/n) Σ_{i=1}^n [log(1 + exp(x_i^T θ)) − y_i x_i^T θ]. Tuo Zhao Lecture 1: Supervised Learning 52/62
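
A sketch of fitting this model by gradient descent on the final objective above; the synthetic {0, 1} labels and the step size are assumptions.

import numpy as np

rng = np.random.default_rng(0)
n, d = 1_000, 5
X = rng.standard_normal((n, d))
theta_true = rng.standard_normal(d)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ theta_true))).astype(float)

def h(z):
    # logistic/sigmoid function
    return 1.0 / (1.0 + np.exp(-z))

theta, eta = np.zeros(d), 0.5
for k in range(2_000):
    # gradient of (1/n) sum_i [log(1 + exp(x_i'theta)) - y_i x_i'theta]
    g = X.T @ (h(X @ theta) - y) / n
    theta -= eta * g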

53 Optimization for Logistic Regression Convex Problem? Let F(θ) = −(1/n) log L(θ). Then ∇²F(θ) = (1/n) Σ_{i=1}^n h(x_i^T θ)(1 − h(x_i^T θ)) x_i x_i^T ⪰ 0. No closed form solution. Gradient descent and stochastic gradient algorithms are applicable. Tuo Zhao Lecture 1: Supervised Learning 53/62

54 Prediction for Logistic Regression Prediction: Given x, we predict y = 1 if P(y = 1) = 1/(1 + exp(−θ̂^T x)) ≥ 0.5. Why linear classification? P(y = 1) ≥ 0.5 ⟺ θ̂^T x ≥ 0, so ŷ = sign(θ̂^T x). Tuo Zhao Lecture 1: Supervised Learning 54/62

55 Logistic Loss Given x_1, ..., x_n ∈ R^d, y_1, ..., y_n ∈ {−1, 1}, and θ ∈ R^d, P(y_i = 1) = 1/(1 + exp(−x_i^T θ)) for i = 1, ..., n. An alternative formulation: θ̂ = arg min_θ (1/n) Σ_{i=1}^n log(1 + exp(−y_i x_i^T θ)). We can also use the 0-1 loss: θ̂ = arg min_θ (1/n) Σ_{i=1}^n 1(sign(x_i^T θ) ≠ y_i). Tuo Zhao Lecture 1: Supervised Learning 55/62

56 Loss Functions for Classification Tuo Zhao Lecture 1: Supervised Learning 56/62

57 Newton's Method At the k-th iteration, we take θ^(k+1) = θ^(k) − η_k [∇²F(θ^(k))]^{-1} ∇F(θ^(k)), where η_k > 0 is a step size parameter. The Second Order Taylor Approximation: θ^(k+0.5) = arg min_θ F(θ^(k)) + ∇F(θ^(k))^T(θ − θ^(k)) + (1/2)(θ − θ^(k))^T ∇²F(θ^(k))(θ − θ^(k)). Backtracking Line Search: θ^(k+1) = θ^(k) + η_k (θ^(k+0.5) − θ^(k)). Tuo Zhao Lecture 1: Supervised Learning 57/62
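
A sketch of Newton's method for the logistic regression objective F(θ); here the step size η_k = 1 is used throughout, and the synthetic data are an assumption.

import numpy as np

rng = np.random.default_rng(0)
n, d = 1_000, 5
X = rng.standard_normal((n, d))
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ rng.standard_normal(d)))).astype(float)

def h(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.zeros(d)
for k in range(20):
    p = h(X @ theta)
    g = X.T @ (p - y) / n                        # gradient of F(theta)
    if np.linalg.norm(g) <= 1e-10:
        break
    H = (X * (p * (1 - p))[:, None]).T @ X / n   # Hessian of F(theta)
    theta -= np.linalg.solve(H, g)               # Newton step with eta_k = 1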

58 Newton's Method Tuo Zhao Lecture 1: Supervised Learning 58/62

59 Newton's Method Sublinear + Quadratic Convergence: Given ||θ^(k) − θ̂||₂² ≤ R ≤ 1, we have ||θ^(k+1) − θ̂||₂ ≤ (1 − δ)||θ^(k) − θ̂||₂². Given ||θ^(k) − θ̂||₂² ≥ R, we have ||θ^(k+1) − θ̂||₂² = O(1/k). Iteration complexity (some parameters hidden for simplicity): O(log(1/R) + log log(1/ε)). Tuo Zhao Lecture 1: Supervised Learning 59/62

60 Newton's Method Advantage: More efficient for highly accurate solutions. Avoid extensively calculating log or exp functions: Taylor expansions combined with a table. Fewer line search steps, due to quadratic convergence. Often more efficient than gradient descent. Tuo Zhao Lecture 1: Supervised Learning 60/62

61 Newton's Method Tuo Zhao Lecture 1: Supervised Learning 61/62

62 Newton's Method Disadvantage: Computing inverse Hessian matrices is expensive! Storing inverse Hessian matrices is expensive! Subsampled Newton: θ^(k+1) = θ^(k) − η_k [H̃(θ^(k))]^{-1} ∇F(θ^(k)), where H̃(θ^(k)) is a subsampled Hessian. Quasi-Newton Methods: DFP, BFGS, Broyden, SR1. Use differences of gradient vectors to approximate Hessian matrices. Tuo Zhao Lecture 1: Supervised Learning 62/62
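
In practice one often hands such a problem to an off-the-shelf quasi-Newton solver; below is a sketch using scipy's L-BFGS-B on the logistic loss (the data and the choice of solver are assumptions, not from the slides).

import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(0)
n, d = 1_000, 5
X = rng.standard_normal((n, d))
y = (rng.random(n) < expit(X @ rng.standard_normal(d))).astype(float)

def f_and_grad(theta):
    z = X @ theta
    loss = np.mean(np.logaddexp(0.0, z) - y * z)   # (1/n) sum [log(1+exp(z_i)) - y_i z_i]
    grad = X.T @ (expit(z) - y) / n
    return loss, grad

# jac=True tells minimize that f_and_grad returns (objective, gradient).
res = minimize(f_and_grad, np.zeros(d), jac=True, method="L-BFGS-B")
theta_hat = res.x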
