Relative Loss Bounds for Multidimensional Regression Problems
Jyrki Kivinen and Manfred Warmuth
Presented by: Arindam Banerjee
A Single Neuron

For a training example (x, y), with x ∈ R^d and y ∈ [0, 1], learning solves

    min_w (y − φ(w^T x))^2

- The transfer function φ : R → R is increasing
- A popular choice is the logistic transfer function φ(z) = 1/(1 + exp(−z))
- The squared loss and the logistic transfer function do not match
- The mismatch may lead to exponentially many local minima: with n examples in d dimensions, the squared loss can have on the order of (n/d)^d local minima
- This motivates matching the loss function to the transfer function
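The mismatch can be seen numerically. Below is a small sketch (the two-point data set and all constants are our own, not from the slides) that scans the squared loss of a single logistic neuron over a 1-D weight grid and counts strict local minima:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two 1-D examples chosen (by us) so that the two squared-loss terms prefer
# very different weights: one gentle sigmoid and one steep one.
x = np.array([1.0, 20.0])
y = sigmoid(np.array([2.0, -2.0]))   # targets ~0.881 and ~0.119

def squared_loss(w):
    return float(np.sum((y - sigmoid(w * x)) ** 2))

# Scan the weight on a grid and count strict interior local minima.
ws = np.linspace(-2.0, 4.0, 3001)
losses = np.array([squared_loss(w) for w in ws])
is_min = (losses[1:-1] < losses[:-2]) & (losses[1:-1] < losses[2:])
n_local_minima = int(is_min.sum())
print(n_local_minima)  # more than one basin on this data => non-convex
```

Even with only two examples the loss curve has separate basins, which is the small-scale version of the exponential blow-up claimed above.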
Generalized Linear Models

- Employ a matching loss and transfer function: minimize Σ_i L_ψ(y_i, φ(w^T x_i))
- The loss is a convex function of w with a single minimum
- For the logistic transfer function, the matching loss is the relative entropy
    L_ψ(y, ŷ) = KL(y ‖ ŷ) = y log(y/ŷ) + (1 − y) log((1 − y)/(1 − ŷ))
- The empirical loss is convex in w:
    w* = argmin_w Σ_i KL(y_i ‖ 1/(1 + exp(−w^T x_i)))
- We ignore the statistical connections for now
Bregman Divergences

Let f be a differentiable, strictly convex function with gradient ψ, i.e., ∇f(x) = ψ(x). The Bregman divergence L_ψ is defined as

    L_ψ(x, y) = f(x) − f(y) − (x − y)^T ψ(y)

Examples:

- f(z) = ‖z‖²/2 gives L_ψ(x, y) = ‖x − y‖²/2: the squared Euclidean distance is a Bregman divergence
- f(z) = z log z gives L_ψ(x, y) = x log(x/y) − x + y: the relative entropy (KL divergence) is another Bregman divergence
- f(z) = −log z gives L_ψ(x, y) = x/y − log(x/y) − 1: the Itakura–Saito distance is another Bregman divergence
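The three examples above can be checked directly against the general definition. A minimal sketch (the helper name `bregman` and the test points are our own):

```python
import numpy as np

def bregman(f, grad_f, x, y):
    """Bregman divergence L_psi(x, y) = f(x) - f(y) - (x - y)^T grad_f(y)."""
    return f(x) - f(y) - np.dot(x - y, grad_f(y))

x = np.array([0.6, 1.4])
y = np.array([1.0, 2.0])

# f(z) = ||z||^2 / 2  ->  squared Euclidean distance ||x - y||^2 / 2
sq = bregman(lambda z: 0.5 * np.dot(z, z), lambda z: z, x, y)
assert np.isclose(sq, 0.5 * np.sum((x - y) ** 2))

# f(z) = sum z log z  ->  (unnormalized) relative entropy
kl = bregman(lambda z: np.sum(z * np.log(z)), lambda z: np.log(z) + 1, x, y)
assert np.isclose(kl, np.sum(x * np.log(x / y) - x + y))

# f(z) = -sum log z  ->  Itakura-Saito distance
is_ = bregman(lambda z: -np.sum(np.log(z)), lambda z: -1.0 / z, x, y)
assert np.isclose(is_, np.sum(x / y - np.log(x / y) - 1))
```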
Properties of Bregman Divergences

- L_ψ(x, y) ≥ 0, with equality iff x = y
- Not necessarily symmetric: in general L_ψ(x, y) ≠ L_ψ(y, x)
- Not a metric: the triangle inequality does not hold
- Strictly convex in the first argument
- Not necessarily convex in the second argument
- Three-point property:
    L_ψ(x, y) = L_ψ(x, z) + L_ψ(z, y) − (x − z)^T (ψ(y) − ψ(z))
- Generalized Pythagorean theorem: L_ψ(x, y) ≥ L_ψ(x, z) + L_ψ(z, y), where z is the Bregman projection of y onto a convex set containing x
Bregman Divergences and the Conjugate

Recall the convex conjugate f* of f:

    f*(λ) = sup_x (λ^T x − f(x)) = λ^T x − f(x) at the maximizing x
    f(x) = sup_λ (λ^T x − f*(λ)) = λ^T x − f*(λ) at the maximizing λ

- Hence f(x) + f*(λ) = x^T λ for such matched pairs (x, λ)
- Further, with ψ(x) = ∇f(x) and φ(λ) = ∇f*(λ), the pairs are matched exactly when
    λ = ∇f(x) = ψ(x)  ⟺  x = ∇f*(λ) = φ(λ)
- As a result, writing λ_i = ψ(x_i),
    L_ψ(x_1, x_2) = f(x_1) + f*(λ_2) − x_1^T λ_2 = L_φ(λ_2, λ_1)
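The identity L_ψ(x_1, x_2) = L_φ(λ_2, λ_1) can be verified numerically. A sketch using one assumed conjugate pair, f(x) = Σ exp(x_i) with f*(λ) = Σ (λ_i log λ_i − λ_i); the concrete choice of f is ours, for illustration only:

```python
import numpy as np

# Conjugate pair: f(x) = sum(exp(x)) with psi = grad f = exp, and
# f*(l) = sum(l*log(l) - l) with phi = grad f* = log; phi, psi are inverses.
f = lambda x: np.sum(np.exp(x))
psi = np.exp
fstar = lambda l: np.sum(l * np.log(l) - l)
phi = np.log

def bregman(g, grad_g, a, b):
    return g(a) - g(b) - np.dot(a - b, grad_g(b))

x1 = np.array([0.2, -0.5])
x2 = np.array([1.0, 0.3])
l1, l2 = psi(x1), psi(x2)   # dual (conjugate) parameters

# Duality of divergences: L_psi(x1, x2) = L_phi(lambda2, lambda1)
lhs = bregman(f, psi, x1, x2)
rhs = bregman(fstar, phi, l2, l1)
assert np.isclose(lhs, rhs)

# Fenchel equality at matched points: f(x) + f*(psi(x)) = x^T psi(x)
assert np.isclose(f(x1) + fstar(psi(x1)), np.dot(x1, psi(x1)))
```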
Generalized Linear Models, Take 2

- For a matching loss and transfer function,
    L_ψ(y_i, φ(w^T x_i)) = L_φ(w^T x_i, ψ(y_i))
- The loss is a convex function of w with a single minimum
- The strategy for obtaining a matching loss:
  - Choose your favorite increasing transfer function φ
  - Let ψ = φ^{−1}
  - Let L_ψ be the Bregman divergence derived from ψ
  - Set up the problem as min_w Σ_i L_ψ(y_i, φ(w^T x_i))
- Note that ψ = ∇f and φ = ∇f* are inverses of each other
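Following this recipe for the logistic transfer function recovers the relative-entropy loss from the earlier slide. A scalar-case sketch (function names are ours; the potential f with ∇f = logit is f(y) = y log y + (1 − y) log(1 − y)):

```python
import math

# Matching-loss recipe for the logistic transfer function:
# phi(z) = 1/(1+e^{-z}), psi = phi^{-1} = logit, and f is the potential
# whose derivative is psi: f(y) = y*log(y) + (1-y)*log(1-y).
phi = lambda z: 1.0 / (1.0 + math.exp(-z))
psi = lambda y: math.log(y / (1.0 - y))
f = lambda y: y * math.log(y) + (1 - y) * math.log(1 - y)

def matching_loss(y, yhat):
    # Bregman divergence L_psi(y, yhat) = f(y) - f(yhat) - (y - yhat)*psi(yhat)
    return f(y) - f(yhat) - (y - yhat) * psi(yhat)

def binary_kl(y, yhat):
    return y * math.log(y / yhat) + (1 - y) * math.log((1 - y) / (1 - yhat))

y, z = 0.3, 0.8          # target and linear activation w^T x (our test values)
yhat = phi(z)
assert math.isclose(matching_loss(y, yhat), binary_kl(y, yhat))
```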
Online Learning: The Basic Setting

- Prediction proceeds in trials t = 1, ..., T
- At trial t, the algorithm:
  - Gets an input x_t
  - Makes a prediction ŷ_t = φ(w_t^T x_t)
  - Receives the true output y_t
  - Updates the weight vector to w_{t+1}
- Consider the squared loss (identity transfer function):
    L_ψ(y_t, φ(w_t^T x_t)) = L_φ(w_t^T x_t, ψ(y_t)) = (y_t − w_t^T x_t)²
- The corresponding gradient descent update, with ŷ_t = w_t^T x_t, is
    w_{t+1} = w_t − η(ŷ_t − y_t) x_t
- What happens in the general case?
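The basic squared-loss protocol above can be sketched as a short loop (the synthetic data, learning rate, and comparator vector are our own, for illustration only):

```python
import numpy as np

# Minimal online gradient descent for the squared-loss / identity-transfer
# case, implementing the update w_{t+1} = w_t - eta*(yhat_t - y_t)*x_t.
rng = np.random.default_rng(0)
d, T, eta = 3, 200, 0.1
w_true = np.array([1.0, -2.0, 0.5])   # fixed comparator generating the data

w = np.zeros(d)
cumulative_loss = 0.0
for t in range(T):
    x_t = rng.normal(size=d)
    y_t = w_true @ x_t               # noiseless target
    y_hat = w @ x_t                  # predict
    cumulative_loss += (y_t - y_hat) ** 2
    w = w - eta * (y_hat - y_t) * x_t   # update

print(cumulative_loss)
```

On this noiseless data the cumulative loss stays bounded and w approaches the comparator, which is the behavior the relative loss bounds formalize later.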
The General Setting

- General means: multidimensional output, arbitrary matching loss
- The loss on trial t is
    L_ψ(y_t, φ(W_t x_t)) = L_φ(W_t x_t, ψ(y_t))
  where W_t is the weight matrix at trial t
- Consider the parameter matrix Θ_t = φ(W_t), so that W_t = ψ(Θ_t)
- We want to obtain W = W_{t+1} as a regularized update:
  - It should have low loss on the current example (x_t, y_t)
  - It should stay close to the current weight matrix W_t
- The cost function to minimize is
    U(W) = L_φ(W, W_t) + η L_ψ(y_t, φ(W x_t)) = L_φ(W, W_t) + η L_φ(W x_t, ψ(y_t))
The Gradient Update

- Setting the gradient of U with respect to row w^j to 0 gives
    φ(w^j_{t+1}) = φ(w^j_t) − η (φ_j(W_{t+1} x_t) − y_t^j) x_t
- Recall that Θ_{t+1} = φ(W_{t+1}), with rows θ^j_{t+1} = φ(w^j_{t+1})
- The update equation, approximating the implicit update by an explicit one:
    θ^j_{t+1} = θ^j_t − η (φ_j(W_{t+1} x_t) − y_t^j) x_t
              ≈ θ^j_t − η (φ_j(W_t x_t) − y_t^j) x_t
              = θ^j_t − η (ŷ_t^j − y_t^j) x_t
- The updated weight matrix is W_{t+1} = ψ(Θ_{t+1})
- So the weight matrix W_{t+1} is updated through the Θ_{t+1} representation: the additive update takes place in the conjugate space
- Gradient descent (GD) and exponentiated gradient (EG) are special cases
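The conjugate-space update can be sketched for a single output unit: keep a dual parameter θ, update it additively, and map back to weights. Only the dual-to-weight map changes between GD (identity) and EG (normalized exponentials, i.e., multiplicative weight updates). The data, learning rates, and helper names below are our own simplifications:

```python
import numpy as np

def run(psi_w, eta, X, y):
    """Additive update in the dual: theta -= eta*(yhat - y)*x; w = psi_w(theta)."""
    theta = np.zeros(X.shape[1])
    total = 0.0
    for x_t, y_t in zip(X, y):
        w = psi_w(theta)                 # map dual parameters to weights
        y_hat = w @ x_t
        total += (y_t - y_hat) ** 2
        theta = theta - eta * (y_hat - y_t) * x_t
    return total

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
w_true = np.array([0.6, 0.1, 0.2, 0.1])   # a probability vector, EG-friendly
y = X @ w_true

# GD: identity map. EG: softmax of theta, i.e., multiplicative updates on w.
gd_loss = run(lambda th: th, eta=0.05, X=X, y=y)
eg_loss = run(lambda th: np.exp(th) / np.exp(th).sum(), eta=0.1, X=X, y=y)
print(gd_loss, eg_loss)
```

With θ initialized to zero, the EG variant starts at uniform weights and reproduces the usual w_i ∝ w_i exp(−η(ŷ − y) x_i) update.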
Relative Loss Bounds

- Cumulative loss of the General Additive (GA) algorithm on a trial sequence S:
    Loss_φ(GA, S) = Σ_{t=1}^T L_ψ(y_t, φ(W_t x_t))
- Best cumulative loss in hindsight, for a fixed comparator W:
    Loss_φ(W, S) = Σ_{t=1}^T L_ψ(y_t, φ(W x_t))
- Two types of relative loss bounds:
    Loss_φ(GA, S) ≤ p · Loss_φ(W, S) + q
    Loss_φ(GA, S) ≤ Loss_φ(W, S) + q_1 √(Loss_φ(W, S)) + q_2
The Statistics Connection: Overview

- Exponential family distribution:
    p_θ(z) = exp(z^T θ − f(θ)) p_0(z)
- Exponential families correspond to Bregman divergences:
    p_θ(z) = exp(z^T θ − f(θ)) p_0(z) = exp(−L_ψ(y, μ)) f_0(y)
  where ψ = φ^{−1} = (∇f)^{−1}, y = φ(z), μ = φ(θ)
- Consider models where θ = w^T x, so that μ = φ(w^T x)
- Maximum likelihood estimation then boils down to
    max_θ Σ_i log p(z_i) = min_w Σ_i L_ψ(y_i, φ(w^T x_i))
- This is the problem we discussed today
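For the Bernoulli family this correspondence can be checked exactly: the negative log-likelihood equals the binary relative entropy KL(z ‖ μ), so maximizing likelihood minimizes the logistic matching loss. A sketch (function names are ours):

```python
import math

# Bernoulli as an exponential family: p_theta(z) = exp(z*theta - f(theta))
# for z in {0,1}, with log-partition f(theta) = log(1 + e^theta) and
# mean parameter mu = phi(theta) = sigmoid(theta).
f = lambda th: math.log(1.0 + math.exp(th))
phi = lambda th: 1.0 / (1.0 + math.exp(-th))

def nll(z, th):
    # negative log-likelihood of z under p_theta
    return -(z * th - f(th))

def kl(y, mu):
    # binary relative entropy, with the convention 0*log(0) = 0
    t1 = 0.0 if y == 0 else y * math.log(y / mu)
    t2 = 0.0 if y == 1 else (1 - y) * math.log((1 - y) / (1 - mu))
    return t1 + t2

for z in (0.0, 1.0):
    for th in (-1.5, 0.0, 2.0):
        assert math.isclose(nll(z, th), kl(z, phi(th)))
```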
Reference: J. Kivinen and M. K. Warmuth. Relative Loss Bounds for Multidimensional Regression Problems. Machine Learning, 45:301–329, 2001.
More informationGenerative Adversarial Networks
Generative Adversarial Networks Stefano Ermon, Aditya Grover Stanford University Lecture 10 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 10 1 / 17 Selected GANs https://github.com/hindupuravinash/the-gan-zoo
More informationSequential Supervised Learning
Sequential Supervised Learning Many Application Problems Require Sequential Learning Part-of of-speech Tagging Information Extraction from the Web Text-to to-speech Mapping Part-of of-speech Tagging Given
More informationCoordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /
Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods
More informationConvex optimization COMS 4771
Convex optimization COMS 4771 1. Recap: learning via optimization Soft-margin SVMs Soft-margin SVM optimization problem defined by training data: w R d λ 2 w 2 2 + 1 n n [ ] 1 y ix T i w. + 1 / 15 Soft-margin
More informationOn Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:
A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition
More information1 Overview. 2 A Characterization of Convex Functions. 2.1 First-order Taylor approximation. AM 221: Advanced Optimization Spring 2016
AM 221: Advanced Optimization Spring 2016 Prof. Yaron Singer Lecture 8 February 22nd 1 Overview In the previous lecture we saw characterizations of optimality in linear optimization, and we reviewed the
More informationOptimized first-order minimization methods
Optimized first-order minimization methods Donghwan Kim & Jeffrey A. Fessler EECS Dept., BME Dept., Dept. of Radiology University of Michigan web.eecs.umich.edu/~fessler UM AIM Seminar 2014-10-03 1 Disclosure
More informationLearning Binary Classifiers for Multi-Class Problem
Research Memorandum No. 1010 September 28, 2006 Learning Binary Classifiers for Multi-Class Problem Shiro Ikeda The Institute of Statistical Mathematics 4-6-7 Minami-Azabu, Minato-ku, Tokyo, 106-8569,
More informationLecture 5: Logistic Regression. Neural Networks
Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture
More informationApprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning
Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Nicolas Thome Prenom.Nom@cnam.fr http://cedric.cnam.fr/vertigo/cours/ml2/ Département Informatique Conservatoire
More informationQualifying Exam in Machine Learning
Qualifying Exam in Machine Learning October 20, 2009 Instructions: Answer two out of the three questions in Part 1. In addition, answer two out of three questions in two additional parts (choose two parts
More informationCS229T/STATS231: Statistical Learning Theory. Lecturer: Tengyu Ma Lecture 11 Scribe: Jongho Kim, Jamie Kang October 29th, 2018
CS229T/STATS231: Statistical Learning Theory Lecturer: Tengyu Ma Lecture 11 Scribe: Jongho Kim, Jamie Kang October 29th, 2018 1 Overview This lecture mainly covers Recall the statistical theory of GANs
More informationRobustness and duality of maximum entropy and exponential family distributions
Chapter 7 Robustness and duality of maximum entropy and exponential family distributions In this lecture, we continue our study of exponential families, but now we investigate their properties in somewhat
More informationLecture 2: Convex Sets and Functions
Lecture 2: Convex Sets and Functions Hyang-Won Lee Dept. of Internet & Multimedia Eng. Konkuk University Lecture 2 Network Optimization, Fall 2015 1 / 22 Optimization Problems Optimization problems are
More informationProbabilistic Graphical Models for Image Analysis - Lecture 4
Probabilistic Graphical Models for Image Analysis - Lecture 4 Stefan Bauer 12 October 2018 Max Planck ETH Center for Learning Systems Overview 1. Repetition 2. α-divergence 3. Variational Inference 4.
More informationJune 21, Peking University. Dual Connections. Zhengchao Wan. Overview. Duality of connections. Divergence: general contrast functions
Dual Peking University June 21, 2016 Divergences: Riemannian connection Let M be a manifold on which there is given a Riemannian metric g =,. A connection satisfying Z X, Y = Z X, Y + X, Z Y (1) for all
More informationNN V: The generalized delta learning rule
NN V: The generalized delta learning rule We now focus on generalizing the delta learning rule for feedforward layered neural networks. The architecture of the two-layer network considered below is shown
More informationStatistical Learning Theory and the C-Loss cost function
Statistical Learning Theory and the C-Loss cost function Jose Principe, Ph.D. Distinguished Professor ECE, BME Computational NeuroEngineering Laboratory and principe@cnel.ufl.edu Statistical Learning Theory
More informationLogistic Regression. Seungjin Choi
Logistic Regression Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationConvex Optimization Lecture 16
Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean
More informationLecture 21: Minimax Theory
Lecture : Minimax Theory Akshay Krishnamurthy akshay@cs.umass.edu November 8, 07 Recap In the first part of the course, we spent the majority of our time studying risk minimization. We found many ways
More informationStatistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks
Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks Jan Drchal Czech Technical University in Prague Faculty of Electrical Engineering Department of Computer Science Topics covered
More informationAdaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade
Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 Announcements: HW3 posted Dual coordinate ascent (some review of SGD and random
More informationLogistic Regression. INFO-2301: Quantitative Reasoning 2 Michael Paul and Jordan Boyd-Graber SLIDES ADAPTED FROM HINRICH SCHÜTZE
Logistic Regression INFO-2301: Quantitative Reasoning 2 Michael Paul and Jordan Boyd-Graber SLIDES ADAPTED FROM HINRICH SCHÜTZE INFO-2301: Quantitative Reasoning 2 Paul and Boyd-Graber Logistic Regression
More informationStatistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation
Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider
More information