
Optimization for well-behaved problems

For statistical learning problems, "well-behaved" means:
- the signal-to-noise ratio is decently high
- correlations between the predictor variables are under control
- the number of predictors p can be larger than the number of observations n, but not absurdly so

For well-behaved learning problems, people have observed that gradient or generalized gradient descent can converge extremely quickly, much faster than the O(1/k) rate predicts. This is largely unexplained by theory and is a topic of current research. For example, recent work (Agarwal et al. 2012, "Fast global convergence of gradient methods for high-dimensional statistical recovery") shows that for some well-behaved problems, with high probability,

    ||x^(k) - x*||_2 <= c^k ||x^(0) - x*||_2 + o(||x* - x_true||_2)
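As a concrete (hypothetical) illustration of this fast convergence, the MATLAB sketch below runs plain gradient descent on a small, well-conditioned least-squares problem; the error to the least-squares solution shrinks geometrically, i.e., like c^k rather than 1/k. All names and constants are illustrative choices, not from the lecture.

% Hypothetical illustration: gradient descent on a well-conditioned
% least-squares problem converges linearly (error drops by a constant
% factor per iteration), much faster than the worst-case O(1/k) bound.
rng(0);                              % assumed seed, for reproducibility
n = 200; p = 20;
X = randn(n, p);                     % predictors with low correlation
beta_true = randn(p, 1);
y = X*beta_true + 0.1*randn(n, 1);   % decent signal-to-noise ratio
beta_star = X \ y;                   % exact least-squares solution
t = 1 / max(eig(X'*X));              % step size 1/L
b = zeros(p, 1);
err = zeros(50, 1);
for k = 1:50
    b = b - t * X'*(X*b - y);        % gradient step for (1/2)||y - X b||^2
    err(k) = norm(b - beta_star);
end
semilogy(err)                        % roughly a straight line on a log scale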

Administrivia

- HW2 out as of this past Tuesday; due 10/9
- Scribing: scribes 1-6 ready soon; handling errata; missing days: 11/6, 12/4, and 12/6
- Projects: you should expect to be contacted by your TA mentor in the coming weeks; project milestone: 10/30

Matrix differential calculus
10-725 Optimization
Geoff Gordon
Ryan Tibshirani

Matrix calculus pain

Taking derivatives of functions involving matrices, the painful way:
- write everything as huge multiple summations with lots of terms
- take derivatives as usual, introducing even more terms
- handle case statements for i = j vs. i ≠ j
- try to recognize that the output is equivalent to some human-readable form
- hope for no indexing errors

Is there a better way?

Differentials

Assume f is sufficiently nice. Taylor expansion:

    f(y) = f(x) + f'(x)(y - x) + r(y - x),  with  ||r(y - x)|| / ||y - x|| -> 0  as  y -> x

Notation: ___

Definition

Write dx = y - x, df = f(y) - f(x).

Suppose df = ___ with ___ and ___. Then: ___
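The standard way this definition is completed (a hedged reconstruction; the slide leaves the blanks for the lecture, and this follows the usual Magnus & Neudecker-style setup):

\[
f(y) - f(x) = A\,dx + r(dx), \qquad \frac{\|r(dx)\|}{\|dx\|} \to 0 \ \text{as } dx \to 0,
\]

with A linear in dx; then the differential of f at x is the linear part, df = A dx.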

Matrix differentials

For a matrix X, or a matrix-valued function F(X):

dX = ___    dF = ___    where ___ and ___

Examples: ___

Working with differentials

Linearity:
    d(f(x) + g(x)) = df(x) + dg(x)
    d(k f(x)) = k df(x)

Common linear operators:
    reshape(A, [m n k ...])
    vec(A) = A(:) = reshape(A, [], 1)
    tr(A) = sum_i A_ii
    A^T
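A quick MATLAB check, added here as a sketch (not part of the slides), that vec and tr really do act as linear operators:

A = reshape(1:6, 2, 3);                        % 2x3 example matrix
B = [7 8 9; 10 11 12];                         % another 2x3 matrix
isequal(A(:), reshape(A, [], 1))               % vec(A) = A(:) = reshape(A, [], 1)
isequal(reshape(A + B, [], 1), A(:) + B(:))    % vec(A + B) = vec(A) + vec(B)
C = randn(4); D = randn(4); k = 3;
abs(trace(C + D) - (trace(C) + trace(D))) < 1e-12   % tr(C + D) = tr(C) + tr(D)
abs(trace(k*C) - k*trace(C)) < 1e-12                % tr(k C) = k tr(C)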

Reshape

>> reshape(1:24, [2 3 4])
ans(:,:,1) =
     1     3     5
     2     4     6
ans(:,:,2) =
     7     9    11
     8    10    12
ans(:,:,3) =
    13    15    17
    14    16    18
ans(:,:,4) =
    19    21    23
    20    22    24

Working with differentials

Chain rule: L(x) = f(g(x))
    want: ___
    have: ___
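One standard way these blanks get filled (a hedged reconstruction, not the lecturer's board work): the differentials simply compose.

\[
\text{have } df = A\,dg \ \text{and}\ dg = B\,dx
\quad\Longrightarrow\quad
dL = A\,B\,dx .
\]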

Working with differentials

Product rule: L(x) = c(f(x), g(x)), where c is bilinear, i.e., linear in each argument (with the other argument held fixed).
E.g., L(x) = f(x) g(x) with f, g scalars, vectors, or matrices.
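The bilinear product rule, written out explicitly (a standard identity, added here for reference):

\[
dL = c\big(df(x),\, g(x)\big) + c\big(f(x),\, dg(x)\big),
\qquad\text{e.g.}\quad
d\big(F(x)\,G(x)\big) = dF\,G + F\,dG .
\]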

Lots of products

Cross product:      d(a × b) = ___
Hadamard product:   A ∘ B = A .* B,  (A ∘ B)_ij = A_ij B_ij,  d(A ∘ B) = ___
Kronecker product:  d(A ⊗ B) = ___
Frobenius product:  A : B = ___
Khatri-Rao product: d(A ∗ B) = ___
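Since all of these products are bilinear, the same product rule fills the blanks (standard identities, added here to make the pattern concrete):

\[
d(a \times b) = da \times b + a \times db, \qquad
d(A \circ B) = dA \circ B + A \circ dB, \qquad
d(A \otimes B) = dA \otimes B + A \otimes dB,
\]

and likewise for the Frobenius and Khatri-Rao products.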

Kronecker product

>> A = reshape(1:6, 2, 3)
A =
     1     3     5
     2     4     6
>> B = 2*ones(2)
B =
     2     2
     2     2
>> kron(A, B)
ans =
     2     2     6     6    10    10
     2     2     6     6    10    10
     4     4     8     8    12    12
     4     4     8     8    12    12
>> kron(B, A)
ans =
     2     6    10     2     6    10
     4     8    12     4     8    12
     2     6    10     2     6    10
     4     8    12     4     8    12

Hadamard product

For vectors a, b:
    a ∘ b = ___
    diag(a) diag(b) = ___
    tr(diag(a) diag(b)) = ___
    tr(diag(b)) = ___
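The blanks are most likely the standard diag identities (a hedged reconstruction):

\[
a \circ b = \mathrm{diag}(a)\,b, \qquad
\mathrm{diag}(a)\,\mathrm{diag}(b) = \mathrm{diag}(a \circ b), \qquad
\mathrm{tr}\big(\mathrm{diag}(a)\,\mathrm{diag}(b)\big) = a^{\mathsf T} b, \qquad
\mathrm{tr}\big(\mathrm{diag}(b)\big) = \mathbf{1}^{\mathsf T} b .
\]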

Some examples

L = (Y - XW)^T (Y - XW):  differential with respect to W
    dL = ___
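One way to fill in this blank using linearity and the product rule (a worked sketch, not the lecturer's board work):

\[
dL = d(Y - XW)^{\mathsf T}\,(Y - XW) + (Y - XW)^{\mathsf T}\, d(Y - XW)
   = -(X\,dW)^{\mathsf T}(Y - XW) - (Y - XW)^{\mathsf T} X\,dW .
\]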

Some examples

L = ___        dL = ___
L = ___        dL = ___

Trace

tr(A) = sum_i A_ii
d tr(F(X)) = ___
tr(X) = tr(X^T) = ___
Frobenius product: A : B = ___
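The standard facts that go in these blanks (added for reference):

\[
\mathrm{tr}(A) = \sum_i A_{ii}, \qquad
d\,\mathrm{tr}\big(F(X)\big) = \mathrm{tr}\big(dF\big), \qquad
\mathrm{tr}(X) = \mathrm{tr}(X^{\mathsf T}), \qquad
A{:}B = \mathrm{tr}(A^{\mathsf T} B) = \sum_{i,j} A_{ij} B_{ij} .
\]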

Trace rotation

tr(AB) = tr(BA)
tr(ABC) = tr(BCA) = tr(CAB)
size(A): ___    size(B): ___    size(C): ___

More

Identities: for a matrix X,
    d(X^{-1}) = ___
    d(det X) = ___
    d(ln |det X|) = ___
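These are the standard identities (see, e.g., the Magnus & Neudecker reference at the end of the deck); stated here for reference, assuming X is invertible:

\[
d(X^{-1}) = -X^{-1}\,dX\,X^{-1}, \qquad
d(\det X) = \det(X)\,\mathrm{tr}\big(X^{-1}\,dX\big), \qquad
d\big(\ln|\det X|\big) = \mathrm{tr}\big(X^{-1}\,dX\big) .
\]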

Example: linear regression

Training examples: ___
Input feature vectors: ___
Target vectors: ___
Weight matrix: ___
min_W L = ___
As a matrix: ___

Linear regression

L = ||Y - WX||_F^2 = ___
dL = ___
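One way the blank works out (a worked sketch added here, not the lecture's board work): expanding with the Frobenius product, dL = 2(Y - WX) : d(Y - WX) = -2(Y - WX) : (dW X), which identifies the gradient with respect to W as -2(Y - WX)X^T. A quick finite-difference check of that formula in MATLAB (sizes and names are assumed for illustration):

% Finite-difference check that grad_W ||Y - W*X||_F^2 = -2*(Y - W*X)*X'
% (a numerical sketch under assumed sizes, not from the slides)
rng(1);
d = 3; p = 4; n = 10;
X = randn(p, n); Y = randn(d, n); W = randn(d, p);
G = -2*(Y - W*X)*X';                     % claimed gradient
E = zeros(d, p); h = 1e-6;
for i = 1:d
    for j = 1:p
        Wp = W; Wp(i,j) = Wp(i,j) + h;
        Wm = W; Wm(i,j) = Wm(i,j) - h;
        E(i,j) = (norm(Y - Wp*X, 'fro')^2 - norm(Y - Wm*X, 'fro')^2) / (2*h);
    end
end
max(abs(E(:) - G(:)))                    % should be tiny (~1e-7 or smaller)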

Identification theorems

Sometimes it is useful to go back and forth between differential and ordinary derivative notation. This is not always possible: e.g., d(X^T X) = ___

Six common cases (identification theorems) for df(x):

              scalar x          vector x           matrix X
  scalar f    df = a dx         df = a^T dx        df = tr(A^T dX)
  vector f    df = a dx         df = A dx
  matrix F    dF = A dx
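A small worked instance of the scalar-f, matrix-X case (added for illustration): for f(X) = tr(X^T X),

\[
df = \mathrm{tr}\big(dX^{\mathsf T} X\big) + \mathrm{tr}\big(X^{\mathsf T}\, dX\big)
   = \mathrm{tr}\big((2X)^{\mathsf T}\, dX\big)
\quad\Longrightarrow\quad
\nabla_X f = 2X .
\]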

Ex: Infomax ICA

Training examples x_i in R^d, i = 1:n
Transformation y_i = g(W x_i), W in R^{d×d}, g(z) = ___
[figure: scatter plots of the inputs x_i, the transformed points W x_i, and the squashed outputs y_i]

Ex: Infomax ICA

y_i = g(W x_i)
dy_i = ___
Method: max_W sum_i ln P(y_i), where P(y_i) = ___
[figure: scatter plots of the inputs x_i, the transformed points W x_i, and the squashed outputs y_i]

Gradient

L = sum_i ln |det J_i|
y_i = g(W x_i)
dy_i = J_i dx_i

Gradient

J_i = diag(u_i) W
dJ_i = diag(u_i) dW + diag(v_i) diag(dW x_i) W
dL = ___
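One way to fill in dL using only quantities already on the slide (a hedged reconstruction of the board work): since L = sum_i ln |det J_i| and d ln|det J| = tr(J^{-1} dJ),

\[
dL = \sum_i \mathrm{tr}\big(J_i^{-1}\, dJ_i\big)
   = \sum_i \Big[ \mathrm{tr}\big(W^{-1}\, dW\big)
     + \mathrm{tr}\big(\mathrm{diag}(u_i)^{-1}\mathrm{diag}(v_i)\,\mathrm{diag}(dW\,x_i)\big) \Big]
   = \sum_i \Big[ \mathrm{tr}\big(W^{-1}\, dW\big) + (v_i \oslash u_i)^{\mathsf T}\,(dW\, x_i) \Big],
\]

where \(\oslash\) denotes elementwise division; identifying terms gives a per-example gradient of \(W^{-\mathsf T} + (v_i \oslash u_i)\, x_i^{\mathsf T}\).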

Natural gradient

Matrix W, gradient dL = G : dW = tr(G^T dW)
Step: S = arg max_S L(S) = tr(G^T S) - ||S W^{-1}||_F^2 / 2
Scalar case: L = g s - s^2 / (2 w^2)
L = ___    dL = ___
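Setting the differential of this step objective to zero gives the natural-gradient step (a short derivation added here; it reduces to s = g w^2 in the scalar case, matching the slide):

\[
0 = d\Big[\mathrm{tr}(G^{\mathsf T} S) - \tfrac{1}{2}\|S W^{-1}\|_F^2\Big]
  = \mathrm{tr}\big(G^{\mathsf T}\, dS\big) - \mathrm{tr}\big(W^{-1} W^{-\mathsf T} S^{\mathsf T}\, dS\big)
  \ \ \text{for all } dS
\quad\Longrightarrow\quad
G = S\, W^{-1} W^{-\mathsf T}
\quad\Longrightarrow\quad
S = G\, W^{\mathsf T} W .
\]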

ICA natural gradient

Natural-gradient step: [W^{-T} + C] W^T W, written in terms of y_i and W x_i
Start with W_0 = I

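A minimal MATLAB sketch of this update, assuming the logistic nonlinearity g(z) = 1/(1 + e^{-z}) from Bell & Sejnowski (under that assumption the bracketed term becomes W^{-T} + (1 - 2 y_i) x_i^T, so the natural-gradient step is (I + (1 - 2 y_i)(W x_i)^T) W). The function name, step size, and loop structure are illustrative choices, not the lecture's:

% Minimal natural-gradient infomax ICA sketch (assumes a logistic
% nonlinearity; not verbatim from the lecture). X is d x n, columns are
% the training examples x_i.
function W = infomax_ica(X, eta, epochs)
    [d, n] = size(X);
    W = eye(d);                               % start with W0 = I
    for epoch = 1:epochs
        for i = randperm(n)                   % stochastic pass over examples
            x = X(:, i);
            u = W * x;                        % W x_i
            y = 1 ./ (1 + exp(-u));           % y_i = g(W x_i)
            % natural-gradient step: [W^{-T} + (1 - 2y) x'] * (W' * W)
            %                      = (I + (1 - 2y) u') * W
            W = W + eta * (eye(d) + (1 - 2*y) * u') * W;
        end
    end
end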

ICA on natural image patches
[figures]

More info

- Minka's cheat sheet: http://research.microsoft.com/en-us/um/people/minka/papers/matrix/
- Magnus & Neudecker. Matrix Differential Calculus. Wiley, 1999, 2nd ed. http://www.amazon.com/differential-calculus-Applications-Statistics-Econometrics/dp/047198633X
- Bell & Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, v7, 1995.