Relative Loss Bounds for Multidimensional Regression Problems
Jyrki Kivinen and Manfred Warmuth
Presented by: Arindam Banerjee

A Single Neuron

- For a training example $(x, y)$ with $x \in \mathbb{R}^d$, $y \in [0, 1]$, learning solves $\min_w \, (y - \phi(w^T x))^2$
- The transfer function $\phi : \mathbb{R} \to \mathbb{R}$ is increasing
- Popular choice: $\phi(z) = \dfrac{1}{1 + \exp(-z)}$
- Squared loss and logistic transfer do not match
- May lead to exponentially many local minima
  - Each dimension has a local minimum for each example
  - With $n$ examples in $d$ dimensions, $\lfloor n/d \rfloor^d$ local minima
- Motivates matching of loss function and transfer function
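
The mismatch can be seen on a single scalar example. The sketch below is an illustration only: the example $(x, y) = (1, 0)$ and the finite-difference helper are assumptions, not taken from the slides. It estimates the curvature of $w \mapsto (y - \phi(wx))^2$ at two points and finds opposite signs, so the loss is not convex in $w$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squared_loss(w, x=1.0, y=0.0):
    """Squared loss of a logistic neuron on one scalar example (x, y)."""
    return (y - sigmoid(w * x)) ** 2

def second_difference(g, w, h=1e-3):
    """Central finite-difference estimate of g''(w)."""
    return (g(w + h) - 2.0 * g(w) + g(w - h)) / h ** 2

# Curvature changes sign along w, which rules out convexity.
curv_left = second_difference(squared_loss, -3.0)   # positive
curv_right = second_difference(squared_loss, 3.0)   # negative
```

For this example the exact second derivative is $2\sigma^2(1-\sigma)(2-3\sigma)$ with $\sigma = \phi(w)$, which turns negative once $\sigma > 2/3$.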

Generalized Linear Models

- Employ matching loss and transfer functions: $L_\psi(y_i, \phi(w^T x_i))$
- Convex function of $w$ with a single minimum
- For the logistic transfer function, the matching loss is
  $$L_\psi(y, \hat{y}) = \mathrm{KL}(y \,\|\, \hat{y}) = y \log \frac{y}{\hat{y}} + (1 - y) \log \frac{1 - y}{1 - \hat{y}}$$
- The empirical loss is convex in $w$:
  $$w^* = \arg\min_w \sum_i \mathrm{KL}\!\left( y_i \,\Big\|\, \frac{1}{1 + \exp(-w^T x_i)} \right)$$
- We ignore the statistical connections for now
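
A minimal sketch of the empirical logistic matching loss, assuming NumPy and synthetic data (the arrays `X`, `y`, `w` are hypothetical). The gradient reduces to $\sum_i (\hat{y}_i - y_i)\, x_i$, and the Hessian $X^T \mathrm{diag}(\hat{y}_i (1 - \hat{y}_i))\, X$ is positive semidefinite, which is one way to see the convexity claimed above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_kl(y, p):
    """KL(y || p) for Bernoulli parameters, with the convention 0 log 0 = 0."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    with np.errstate(divide="ignore", invalid="ignore"):
        a = np.where(y > 0, y * np.log(y / p), 0.0)
        b = np.where(y < 1, (1 - y) * np.log((1 - y) / (1 - p)), 0.0)
    return a + b

def empirical_loss(w, X, y):
    return binary_kl(y, sigmoid(X @ w)).sum()

def gradient(w, X, y):
    # d/dw KL(y_i || sigmoid(w^T x_i)) = (sigmoid(w^T x_i) - y_i) x_i
    return X.T @ (sigmoid(X @ w) - y)

def hessian(w, X):
    p = sigmoid(X @ w)
    return X.T @ (X * (p * (1 - p))[:, None])

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.uniform(size=20)
w = rng.normal(size=3)
min_eig = np.linalg.eigvalsh(hessian(w, X)).min()  # nonnegative up to rounding
```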

Bregman Divergences

- Let $f$ be a differentiable, strictly convex function with gradient $\psi$, i.e., $\nabla f(x) = \psi(x)$
- The Bregman divergence $L_\psi$ is defined as
  $$L_\psi(x, y) = f(x) - f(y) - (x - y)^T \psi(y)$$
- Squared Euclidean distance is a Bregman divergence: $f(z) = \tfrac{1}{2}\|z\|^2$ gives $L_\psi(x, y) = \tfrac{1}{2}\|x - y\|^2$
- Relative entropy (KL-divergence) is another Bregman divergence: $f(z) = z \log z$ gives $L_\psi(x, y) = x \log \frac{x}{y} - x + y$
- Itakura-Saito distance is another Bregman divergence: $f(z) = -\log z$ gives $L_\psi(x, y) = \frac{x}{y} - \log \frac{x}{y} - 1$
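
The definition translates directly into code. The sketch below assumes NumPy and sums each scalar generator over coordinates; the helper name `bregman` and the test vectors are illustrative choices, not part of the slides.

```python
import numpy as np

def bregman(f, grad_f, x, y):
    """L_psi(x, y) = f(x) - f(y) - (x - y)^T grad_f(y)."""
    x = np.atleast_1d(x).astype(float)
    y = np.atleast_1d(y).astype(float)
    return float(f(x) - f(y) - (x - y) @ grad_f(y))

# Each pair is (generator f, gradient psi), applied coordinate-wise and summed.
half_sq = (lambda z: 0.5 * z @ z, lambda z: z)                        # squared Euclidean
neg_ent = (lambda z: np.sum(z * np.log(z)), lambda z: np.log(z) + 1)  # relative entropy
burg = (lambda z: -np.sum(np.log(z)), lambda z: -1.0 / z)             # Itakura-Saito

x, y = np.array([0.2, 0.5]), np.array([0.4, 0.1])
d_euclid = bregman(*half_sq, x, y)
d_kl = bregman(*neg_ent, x, y)
d_is = bregman(*burg, x, y)
```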
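
Properties of Bregman Divergences

- $L_\psi(x, y) \ge 0$, and equals 0 iff $x = y$
- Not necessarily symmetric, i.e., $L_\psi(x, y) \ne L_\psi(y, x)$ in general
- Not a metric, i.e., the triangle inequality does not hold
- Strictly convex in the first argument
- Not necessarily convex in the second argument
- Three-point property:
  $$L_\psi(x, y) = L_\psi(x, z) + L_\psi(z, y) - (x - z)^T (\psi(y) - \psi(z))$$
- Generalized Pythagoras Theorem: when the last term vanishes, e.g., when $z$ is the Bregman projection of $y$ onto an affine set containing $x$,
  $$L_\psi(x, y) = L_\psi(x, z) + L_\psi(z, y)$$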
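
A quick numerical check of two of these properties, using the unnormalized relative entropy as the divergence. The vectors below are arbitrary positive values chosen for illustration.

```python
import numpy as np

def psi(z):
    """Gradient of f(z) = sum_i z_i log z_i."""
    return np.log(z) + 1.0

def kl(x, y):
    """Unnormalized relative entropy, the Bregman divergence of f above."""
    return float(np.sum(x * np.log(x / y) - x + y))

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.4, 0.1, 0.6])
z = np.array([0.3, 0.3, 0.3])

# Three-point property: both sides agree up to rounding.
lhs = kl(x, y)
rhs = kl(x, z) + kl(z, y) - (x - z) @ (psi(y) - psi(z))

# Asymmetry: swapping the arguments changes the value.
asymmetry = kl(x, y) - kl(y, x)
```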

Bregman Divergence and The Conjugate

- Recall that, with the suprema attained at a matching pair $(x, \lambda)$,
  $$f^*(\lambda) = \sup_{\tilde{x}} \left( \lambda^T \tilde{x} - f(\tilde{x}) \right) = \lambda^T x - f(x)$$
  $$f(x) = \sup_{\tilde{\lambda}} \left( \tilde{\lambda}^T x - f^*(\tilde{\lambda}) \right) = \lambda^T x - f^*(\lambda)$$
- Hence, $f(x) + f^*(\lambda) = x^T \lambda$
- Further, with $\psi(x) = \nabla f(x)$ and $\phi(\lambda) = \nabla f^*(\lambda)$,
  $$\lambda = \nabla f(x) = \psi(x), \qquad x = \nabla f^*(\lambda) = \phi(\lambda)$$
- As a result,
  $$L_\psi(x_1, x_2) = f(x_1) + f^*(\lambda_2) - x_1^T \lambda_2 = L_\phi(\lambda_2, \lambda_1)$$
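
The identities can be checked numerically for the logistic pair. The sketch assumes the scalar case with $f$ the negative binary entropy on $(0, 1)$, so that $\psi$ is the logit, $f^*$ is the softplus, and $\phi$ is the logistic function; the test points are arbitrary.

```python
import numpy as np

def f(x):            # negative binary entropy on (0, 1)
    return x * np.log(x) + (1 - x) * np.log(1 - x)

def psi(x):          # psi = grad f, the logit
    return np.log(x / (1 - x))

def f_star(lam):     # conjugate of f, the softplus
    return np.log1p(np.exp(lam))

def phi(lam):        # phi = grad f*, the logistic function
    return 1.0 / (1.0 + np.exp(-lam))

def L_psi(x1, x2):
    return f(x1) - f(x2) - (x1 - x2) * psi(x2)

def L_phi(l1, l2):
    return f_star(l1) - f_star(l2) - (l1 - l2) * phi(l2)

x1, x2 = 0.3, 0.8
lam1, lam2 = psi(x1), psi(x2)

fenchel_gap = f(x1) + f_star(lam1) - x1 * lam1    # zero at a matching pair
duality_gap = L_psi(x1, x2) - L_phi(lam2, lam1)   # zero: the arguments swap
```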

Generalized Linear Models, Take 2

- For matching loss and transfer function,
  $$L_\psi(y_i, \phi(w^T x_i)) = L_\phi(w^T x_i, \psi(y_i))$$
- The loss is a convex function of $w$ with a single minimum
- The strategy for getting a matching loss:
  - Choose your favorite increasing transfer function $\phi$
  - Let $\psi = \phi^{-1}$
  - Let $L_\psi$ be the Bregman divergence derived from $\psi$
  - Set up the problem as $\min_w \sum_i L_\psi(y_i, \phi(w^T x_i))$
- Note that $\psi = \nabla f$ and $\phi = \nabla f^*$ are inverses of each other
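
The recipe above can be sketched numerically. Since $\phi = \nabla f^*$, the quantity $L_\phi(a, \psi(y))$ can also be written as the integral $\int_{\psi(y)}^{a} (\phi(z) - y)\, dz$; the integral form is not stated on the slides and is used here as an assumption-labeled convenience. The helper `matching_loss` and its grid size are illustrative choices.

```python
import numpy as np

def matching_loss(phi, psi, y, a, n=20001):
    """Integral of (phi(z) - y) for z from psi(y) to a, by the trapezoid rule."""
    z = np.linspace(psi(y), a, n)
    vals = phi(z) - y
    return float(np.sum((vals[1:] + vals[:-1]) * np.diff(z)) / 2.0)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
logit = lambda p: np.log(p / (1.0 - p))
identity = lambda z: z

y, a = 0.3, 1.5                        # target y and linear activation a = w^T x
loss_logistic = matching_loss(sigmoid, logit, y, a)      # approximates KL(y || sigmoid(a))
loss_identity = matching_loss(identity, identity, y, a)  # approximates 0.5 * (a - y)^2
```

Choosing the identity transfer recovers half the squared error, and choosing the logistic transfer recovers the binary KL divergence, matching the two cases discussed earlier.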

Online Learning: The Basic Setting

- Prediction proceeds in trials $t = 1, \ldots, T$
- At trial $t$, the algorithm
  - Gets an input $x_t$
  - Makes a prediction $\hat{y}_t = \phi(w_t^T x_t)$
  - Receives the true output $y_t$
  - Updates the weight vector to $w_{t+1}$
- Consider the loss function for the identity transfer $\phi(z) = z$:
  $$L_\psi(y_t, \phi(w_t^T x_t)) = L_\phi(w_t^T x_t, \psi(y_t)) = \tfrac{1}{2}(y_t - w_t^T x_t)^2$$
- The corresponding gradient descent update, with $\hat{y}_t = w_t^T x_t$:
  $$w_{t+1} = w_t - \eta (\hat{y}_t - y_t)\, x_t$$
- What happens in the general case?
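
A minimal sketch of this protocol for the identity transfer, assuming NumPy. The target weights `u`, the noiseless synthetic data, and the learning rate are hypothetical choices made only so the loop has something to run on.

```python
import numpy as np

def online_gd(X, Y, eta):
    """Run w_{t+1} = w_t - eta * (yhat_t - y_t) * x_t; return final w and all predictions."""
    w = np.zeros(X.shape[1])
    preds = np.empty(len(Y))
    for t, (x_t, y_t) in enumerate(zip(X, Y)):
        y_hat = w @ x_t                      # predict before seeing y_t
        preds[t] = y_hat
        w = w - eta * (y_hat - y_t) * x_t    # gradient step on 0.5 * (y_t - w^T x_t)^2
    return w, preds

rng = np.random.default_rng(0)
u = np.array([0.5, -1.0, 2.0])               # hypothetical target weights
X = rng.uniform(-1.0, 1.0, size=(1000, 3))
Y = X @ u                                    # noiseless outputs, for illustration only
w_final, preds = online_gd(X, Y, eta=0.1)
```

On this noiseless stream the weights approach `u`; with noisy outputs the iterates would instead fluctuate around a good comparator.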

The General Setting

- General: multidimensional output, arbitrary matching loss
- The loss on trial $t$ is
  $$L_\psi(y_t, \phi(W_t x_t)) = L_\phi(W_t x_t, \psi(y_t))$$
  where $W_t$ is the weight matrix at trial $t$
- Consider the parameter matrix $\Theta_t = \phi(W_t)$, so that $W_t = \psi(\Theta_t)$
- We want to get $W = W_{t+1}$ as a regularized update
  - It should have low loss on the current example $(x_t, y_t)$
  - It should be close to the current weight matrix $W_t$
- The cost function to minimize:
  $$U(W) = L_\phi(W, W_t) + \eta\, L_\psi(y_t, \phi(W x_t)) = L_\phi(W, W_t) + \eta\, L_\phi(W x_t, \psi(y_t))$$

The Gradient Update

- Set the gradient w.r.t. $w^j$ to 0:
  $$\phi(w^j_{t+1}) = \phi(w^j_t) - \eta \left( \phi_j(W_{t+1} x_t) - y^j_t \right) x_t$$
- Recall that $\Theta_{t+1} = \phi(W_{t+1})$ with $\theta^j_{t+1} = \phi(w^j_{t+1})$
- The update equation:
  $$\theta^j_{t+1} = \theta^j_t - \eta \left( \phi_j(W_{t+1} x_t) - y^j_t \right) x_t \approx \theta^j_t - \eta \left( \phi_j(W_t x_t) - y^j_t \right) x_t = \theta^j_t - \eta (\hat{y}^j_t - y^j_t)\, x_t$$
- The updated $W_{t+1} = \psi(\Theta_{t+1})$
- The weight matrix $W_{t+1}$ gets updated using the $\Theta_{t+1}$ representation
- The additive update is on the conjugate space
- GD and EG are special cases
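
The approximate update can be sketched as follows, assuming NumPy. Two choices here are assumptions rather than statements from the slides: GD is taken as the identity map between $\Theta$ and $W$, and EG is taken as its normalized form, where each row of $W$ is the softmax of the corresponding row of $\Theta$. The function names are illustrative.

```python
import numpy as np

def softmax_rows(theta):
    """Map each row of Theta to a probability vector (normalized exponential)."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def ga_step(theta, x, y, eta, weights_from_theta, transfer):
    """One approximate additive step: Theta <- Theta - eta * (yhat - y) x^T."""
    W = weights_from_theta(theta)
    y_hat = transfer(W @ x)                          # prediction with current weights
    theta_next = theta - eta * np.outer(y_hat - y, x)
    return theta_next, weights_from_theta(theta_next)

identity = lambda z: z
rng = np.random.default_rng(0)
x, y, eta = rng.uniform(size=4), np.array([0.2, 0.7]), 0.5

# GD: Theta and W coincide, so the step is plain gradient descent on W.
theta_gd = rng.normal(size=(2, 4))
_, W_gd = ga_step(theta_gd, x, y, eta, identity, identity)

# EG (normalized, assumed form): the additive step on Theta is multiplicative on W.
theta_eg = rng.normal(size=(2, 4))
W_old = softmax_rows(theta_eg)
_, W_eg = ga_step(theta_eg, x, y, eta, softmax_rows, identity)
```

In the EG case the new weights equal `W_old * exp(-eta * (yhat - y) x^T)` renormalized per row, which is why the additive step in $\Theta$ appears as a multiplicative step in $W$.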
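
Relative Loss Bounds

- Cumulative loss of the General Additive (GA) algorithm:
  $$\mathrm{Loss}_\phi(\mathrm{GA}, S) = \sum_{t=1}^{T} L_\psi(y_t, \phi(W_t x_t))$$
- Best cumulative loss looking back:
  $$\mathrm{Loss}_\phi(W, S) = \sum_{t=1}^{T} L_\psi(y_t, \phi(W x_t))$$
- Two types of relative loss bounds:
  $$\mathrm{Loss}_\phi(\mathrm{GA}, S) \le p\, \mathrm{Loss}_\phi(W, S) + q$$
  $$\mathrm{Loss}_\phi(\mathrm{GA}, S) \le \mathrm{Loss}_\phi(W, S) + q_1 \sqrt{\mathrm{Loss}_\phi(W, S)} + q_2$$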
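
The two cumulative losses can be computed side by side. The sketch assumes the scalar-output case with identity transfer and loss $\tfrac{1}{2}(y - \hat{y})^2$; the comparator `u`, the noise level, and the learning rate are hypothetical. It only measures the losses and does not verify any particular constants $p$, $q$, $q_1$, $q_2$.

```python
import numpy as np

def cumulative_losses(X, Y, eta, comparator):
    """Return (Loss(GA, S), Loss(comparator, S)) with loss 0.5 * (y - yhat)^2."""
    w = np.zeros(X.shape[1])
    loss_alg = loss_cmp = 0.0
    for x_t, y_t in zip(X, Y):
        loss_alg += 0.5 * (y_t - w @ x_t) ** 2            # algorithm pays before updating
        loss_cmp += 0.5 * (y_t - comparator @ x_t) ** 2   # fixed comparator W
        w = w - eta * (w @ x_t - y_t) * x_t
    return loss_alg, loss_cmp

rng = np.random.default_rng(1)
u = np.array([1.0, -2.0])                        # hypothetical comparator
X = rng.uniform(-1.0, 1.0, size=(500, 2))
Y = X @ u + 0.1 * rng.normal(size=500)           # noisy outputs, for illustration only
loss_alg, loss_cmp = cumulative_losses(X, Y, eta=0.1, comparator=u)
excess = loss_alg - loss_cmp                     # the quantity a relative bound controls
```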

The Statistics Connection: Overview

- Exponential family distribution:
  $$p_\theta(z) = \exp(z^T \theta - f(\theta))\, p_0(z)$$
- Exponential families $\leftrightarrow$ Bregman divergences:
  $$p_\theta(z) = \exp(z^T \theta - f(\theta))\, p_0(z) = \exp(-L_\psi(y, \mu))\, f_0(y)$$
  where $\psi = \phi^{-1} = (\nabla f)^{-1}$, $y = \phi(z)$, $\mu = \phi(\theta)$
- Consider models where $\theta = w^T x$, so that $\mu = \phi(w^T x)$
- Maximum likelihood estimation boils down to
  $$\max_\theta \sum_i \log p_\theta(z_i) \;\Longleftrightarrow\; \min_w \sum_i L_\psi(y_i, \phi(w^T x_i))$$
- This is the problem we discussed today
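
For the Bernoulli family with binary labels the correspondence is exact: the negative log-likelihood under mean $\mu = \phi(w^T x)$ equals $\mathrm{KL}(y \,\|\, \mu)$, because the $y \log y$ terms vanish under the convention $0 \log 0 = 0$. The sketch below assumes NumPy and synthetic data; the weights `w` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli_nll(y, mu):
    """Negative log-likelihood of binary labels y under Bernoulli means mu."""
    return -np.sum(np.where(y == 1, np.log(mu), np.log(1.0 - mu)))

def binary_kl(y, mu):
    """Sum of KL(y_i || mu_i), using the convention 0 log 0 = 0."""
    with np.errstate(divide="ignore", invalid="ignore"):
        a = np.where(y > 0, y * np.log(y / mu), 0.0)
        b = np.where(y < 1, (1 - y) * np.log((1 - y) / (1 - mu)), 0.0)
    return np.sum(a + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w = np.array([0.3, -0.7, 1.1])                   # hypothetical weights
y = (rng.uniform(size=50) < 0.5).astype(float)   # binary labels, for illustration only
mu = sigmoid(X @ w)

gap = bernoulli_nll(y, mu) - binary_kl(y, mu)    # zero up to rounding
```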