Relative Loss Bounds for Multidimensional Regression Problems

Jyrki Kivinen and Manfred Warmuth
Presented by: Arindam Banerjee

A Single Neuron

For a training example $(x, y)$ with $x \in \mathbb{R}^d$ and $y \in [0, 1]$, learning solves
$$\min_w \left(y - \phi(w^T x)\right)^2$$
The transfer function $\phi : \mathbb{R} \to \mathbb{R}$ is increasing.
A popular choice is the logistic function $\phi(z) = \frac{1}{1 + \exp(-z)}$.
Squared loss and the logistic transfer function do not match:
- This may lead to exponentially many local minima.
- Each dimension has a local minimum for each example.
- With $n$ examples in $d$ dimensions, there can be $(n/d)^d$ local minima.
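
The mismatch between squared loss and the logistic transfer function can be seen even in one dimension. The sketch below is only an illustration under assumed values (a single example with x = 1, y = 0, and three hand-picked weights); it checks the midpoint inequality that every convex function must satisfy and finds it violated.

```python
import math

def logistic(z):
    """Logistic transfer function phi(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def squared_loss(w, x, y):
    """Squared loss of a one-dimensional logistic neuron: (y - phi(w * x))^2."""
    return (y - logistic(w * x)) ** 2

# Assumed toy setup: one example with x = 1 and target y = 0.
x, y = 1.0, 0.0
left, mid, right = (squared_loss(w, x, y) for w in (2.0, 4.0, 6.0))

# A convex function satisfies loss(midpoint) <= average of the endpoint losses.
violates_convexity = mid > 0.5 * (left + right)
print(left, mid, right, violates_convexity)
```

A single violated midpoint inequality is enough to show that the loss is not convex in w; it does not by itself exhibit the local minima counted above.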

Generalized Linear Models

Employ matching loss and transfer functions:
$$L_\psi(y_i, \phi(w^T x_i))$$
This is a convex function of $w$ with a single minimum.
For the logistic transfer function, the matching loss is
$$L_\psi(y, \hat{y}) = \mathrm{KL}(y \,\|\, \hat{y}) = y \log\frac{y}{\hat{y}} + (1 - y) \log\frac{1 - y}{1 - \hat{y}}$$
The empirical loss is convex in $w$:
$$w^* = \arg\min_w \sum_i \mathrm{KL}\left(y_i \,\Big\|\, \frac{1}{1 + \exp(-w^T x_i)}\right)$$
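
As a rough check of the convexity claim, the following sketch evaluates the KL matching loss for a one-dimensional logistic neuron. The helper names (`xlogy_ratio`, `kl_loss`, `kl_loss_grad`) and the test values are assumptions made for illustration. It verifies the midpoint inequality at the same three weights where squared loss fails, and compares the analytic derivative $(\hat{y} - y)x$ with a finite difference.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def xlogy_ratio(p, q):
    """p * log(p / q), using the convention 0 * log(0 / q) = 0."""
    return 0.0 if p == 0.0 else p * math.log(p / q)

def kl_loss(w, x, y):
    """Matching loss KL(y || phi(w * x)) for the logistic transfer function."""
    y_hat = logistic(w * x)
    return xlogy_ratio(y, y_hat) + xlogy_ratio(1.0 - y, 1.0 - y_hat)

def kl_loss_grad(w, x, y):
    """Derivative of kl_loss with respect to w: (y_hat - y) * x."""
    return (logistic(w * x) - y) * x

# Midpoint inequality at w = 2, 4, 6 for the example x = 1, y = 0.
left, mid, right = (kl_loss(w, 1.0, 0.0) for w in (2.0, 4.0, 6.0))
midpoint_ok = mid <= 0.5 * (left + right)

# Central finite-difference check of the derivative at an arbitrary point.
w0, x0, y0, h = 0.7, 1.5, 0.3, 1e-6
numeric_grad = (kl_loss(w0 + h, x0, y0) - kl_loss(w0 - h, x0, y0)) / (2 * h)
analytic_grad = kl_loss_grad(w0, x0, y0)
print(midpoint_ok, numeric_grad, analytic_grad)
```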

Bregman Divergences

Let $f$ be a differentiable, strictly convex function with gradient $\psi$, i.e., $\nabla f(x) = \psi(x)$. The Bregman divergence $L_\psi$ is defined as
$$L_\psi(x, y) = f(x) - f(y) - (x - y)^T \psi(y)$$

Example: $f(z) = \frac{1}{2}\|z\|^2$ gives $L_\psi(x, y) = \frac{1}{2}\|x - y\|^2$. Squared Euclidean distance is a Bregman divergence.
[Figure: plot of $f(z) = \frac{1}{2}\|z\|^2$ and $h(z)$, with the points $y$ and $x$ marked]

Example: $f(z) = z \log z$ gives $L_\psi(x, y) = x \log\frac{x}{y} - x + y$. Relative entropy (KL-divergence) is another Bregman divergence.
[Figure: plot of $f(z) = z \log z$ and $h(z)$, with the points $y$ and $x$ marked]

Example: $f(z) = -\log z$ gives $L_\psi(x, y) = \frac{x}{y} - \log\frac{x}{y} - 1$. Itakura-Saito distance is another Bregman divergence.
[Figure: plot of $f(z) = -\log z$ and $h(z)$, with the points $y$ and $x$ marked]
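
The three examples above can be checked against the general definition with a small scalar sketch. The function `bregman` and the test points x = 2, y = 0.5 are assumptions chosen for illustration; each generator $f$ is passed together with its derivative.

```python
import math

def bregman(f, grad_f, x, y):
    """Scalar Bregman divergence f(x) - f(y) - (x - y) * f'(y)."""
    return f(x) - f(y) - (x - y) * grad_f(y)

x, y = 2.0, 0.5  # arbitrary positive test points

# f(z) = z^2 / 2  ->  squared Euclidean distance (with the factor 1/2)
squared = bregman(lambda z: 0.5 * z * z, lambda z: z, x, y)

# f(z) = z log z  ->  relative entropy x log(x/y) - x + y
relative_entropy = bregman(lambda z: z * math.log(z),
                           lambda z: math.log(z) + 1.0, x, y)

# f(z) = -log z  ->  Itakura-Saito distance x/y - log(x/y) - 1
itakura_saito = bregman(lambda z: -math.log(z), lambda z: -1.0 / z, x, y)

print(squared, relative_entropy, itakura_saito)
```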

Properties of Bregman Divergences

- $L_\psi(x, y) \ge 0$, with equality iff $x = y$.
- Not necessarily symmetric, i.e., in general $L_\psi(x, y) \ne L_\psi(y, x)$.
- Not a metric, i.e., the triangle inequality need not hold.
- Strictly convex in the first argument.
- Not necessarily convex in the second argument.
- Three-point property:
$$L_\psi(x, y) = L_\psi(x, z) + L_\psi(z, y) - (x - z)^T \left(\psi(y) - \psi(z)\right)$$
- Generalized Pythagoras theorem: when the cross term $(x - z)^T(\psi(y) - \psi(z))$ vanishes,
$$L_\psi(x, y) = L_\psi(x, z) + L_\psi(z, y)$$
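
A quick numerical sanity check of two of these properties, sketched for the scalar relative-entropy generator $f(z) = z \log z$. The points x, y, z are arbitrary assumed values; the check confirms the three-point identity and shows that swapping the arguments changes the value.

```python
import math

def f(z):
    return z * math.log(z)

def psi(z):
    """Derivative of f(z) = z log z."""
    return math.log(z) + 1.0

def L(x, y):
    """Bregman divergence generated by f."""
    return f(x) - f(y) - (x - y) * psi(y)

x, y, z = 2.0, 0.5, 1.2  # arbitrary positive test points

# Three-point property: L(x,y) = L(x,z) + L(z,y) - (x - z)(psi(y) - psi(z))
lhs = L(x, y)
rhs = L(x, z) + L(z, y) - (x - z) * (psi(y) - psi(z))

# Asymmetry: L(x,y) and L(y,x) differ for this generator and these points.
asymmetric = not math.isclose(L(x, y), L(y, x))
print(lhs, rhs, asymmetric)
```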

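Bregman Divergence and the Conjugate

Recall that, with $x$ and $\lambda$ attaining the respective suprema,
$$f^*(\lambda) = \sup_{\tilde{x}} \left(\lambda^T \tilde{x} - f(\tilde{x})\right) = \lambda^T x - f(x)$$
$$f(x) = \sup_{\tilde{\lambda}} \left(\tilde{\lambda}^T x - f^*(\tilde{\lambda})\right) = \lambda^T x - f^*(\lambda)$$
Hence $f(x) + f^*(\lambda) = x^T \lambda$.
Further, with $\psi(x) = \nabla f(x)$ and $\phi(\lambda) = \nabla f^*(\lambda)$,
$$\lambda = \nabla f(x) = \psi(x), \qquad x = \nabla f^*(\lambda) = \phi(\lambda)$$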
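
For a concrete instance of these identities, the sketch below uses the conjugate pair behind the logistic transfer function: $f(x) = x \log x + (1 - x)\log(1 - x)$ on $(0, 1)$ and $f^*(\lambda) = \log(1 + e^\lambda)$. This choice of pair and the test point x = 0.3 are assumptions made for illustration; the gradients are then the logit $\psi$ and the logistic function $\phi$.

```python
import math

def f(x):
    """Negative binary entropy on (0, 1)."""
    return x * math.log(x) + (1.0 - x) * math.log(1.0 - x)

def f_conj(lam):
    """Convex conjugate of f: log(1 + exp(lam))."""
    return math.log(1.0 + math.exp(lam))

def psi(x):
    """Gradient of f: the logit function."""
    return math.log(x / (1.0 - x))

def phi(lam):
    """Gradient of f_conj: the logistic function."""
    return 1.0 / (1.0 + math.exp(-lam))

x = 0.3
lam = psi(x)

# f(x) + f*(lam) - x * lam should be zero when lam = psi(x).
gap = f(x) + f_conj(lam) - x * lam

# phi inverts psi, so mapping lam back should recover x.
recovered_x = phi(lam)
print(gap, recovered_x)
```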

Generalized Linear Models, Take 2

For matching loss and transfer function,
$$L_\psi(y_i, \phi(w^T x_i)) = L_\phi(w^T x_i, \psi(y_i))$$
The loss is a convex function of $w$ with a single minimum.
The strategy for getting a matching loss:
- Choose your favorite increasing transfer function $\phi$.
- Let $\psi = \phi^{-1}$.
- Let $L_\psi$ be the Bregman divergence derived from $\psi$.
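
The recipe can be tried on a transfer function other than the logistic one. The sketch below assumes $\phi(a) = e^a$, so that $\psi = \log$, with generators $f(y) = y \log y - y$ and $f^*(a) = e^a$; these choices and the test values are illustrative assumptions. It checks numerically that the two forms of the matching loss agree.

```python
import math

def bregman(F, grad_F, p, q):
    """Scalar Bregman divergence F(p) - F(q) - (p - q) * F'(q)."""
    return F(p) - F(q) - (p - q) * grad_F(q)

# Step 1: an increasing transfer function (assumed here: the exponential).
phi = math.exp
# Step 2: its inverse.
psi = math.log

def f(y):
    """Generator whose derivative is psi: y log y - y."""
    return y * math.log(y) - y

def f_conj(a):
    """Conjugate generator whose derivative is phi: exp(a)."""
    return math.exp(a)

y, a = 3.0, 0.4  # target y > 0 and linear activation a = w^T x

# Step 3: the matching loss, in its two equivalent forms.
loss_primal = bregman(f, psi, y, phi(a))     # L_psi(y, phi(a))
loss_dual = bregman(f_conj, phi, a, psi(y))  # L_phi(a, psi(y))
print(loss_primal, loss_dual)
```

For this choice the matching loss works out to $y \log(y / \hat{y}) - y + \hat{y}$ with $\hat{y} = e^a$.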

Online Learning: The Basic Setting

Prediction proceeds in trials $t = 1, \ldots, T$. At trial $t$, the algorithm
- gets an input $x_t$
- makes a prediction $\hat{y}_t = \phi(w_t^T x_t)$
- receives the true output $y_t$
- updates the weight vector to $w_{t+1}$
Consider the loss function for the identity transfer function $\phi(z) = z$:
$$L_\psi(y_t, \phi(w_t^T x_t)) = L_\phi(w_t^T x_t, \psi(y_t)) = \tfrac{1}{2}\left(y_t - w_t^T x_t\right)^2$$
The corresponding gradient descent update, with $\hat{y}_t = w_t^T x_t$, is
$$w_{t+1} = w_t - \eta(\hat{y}_t - y_t)\,x_t$$
What happens in the general case?
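
The trial protocol and the gradient descent update for the linear case can be sketched as a short loop. The function name `online_gradient_descent`, the zero start vector, the learning rate, and the toy data (generated noise-free from an assumed target vector) are all illustrative choices, not part of the talk.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def online_gradient_descent(examples, eta, dim):
    """Run the linear online protocol; return final weights and per-trial losses."""
    w = [0.0] * dim                               # assumed start vector w_1 = 0
    losses = []
    for x, y in examples:                         # trial t: get input x_t
        y_hat = dot(w, x)                         # predict with identity transfer
        losses.append(0.5 * (y - y_hat) ** 2)     # loss once y_t is revealed
        # update: w_{t+1} = w_t - eta * (y_hat_t - y_t) * x_t
        w = [wi - eta * (y_hat - y) * xi for wi, xi in zip(w, x)]
    return w, losses

# Toy sequence: outputs generated exactly by an assumed target vector.
target = [1.0, -2.0]
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]] * 50
examples = [(x, dot(target, x)) for x in inputs]

w, losses = online_gradient_descent(examples, eta=0.25, dim=2)
print(w, losses[0], losses[-1])
```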

The General Setting

General: multidimensional output, arbitrary matching loss.
The loss on trial $t$ is
$$L_\psi(y_t, \phi(W_t x_t)) = L_\phi(W_t x_t, \psi(y_t))$$
where $W_t$ is the weight matrix at trial $t$.
Consider the parameter matrix $\Theta_t = \phi(W_t)$, so that $W_t = \psi(\Theta_t)$.
We want to get $W = W_{t+1}$ as a regularized update:
- It should have low loss on the current example $(x_t, y_t)$.
- It should be close to the current weight matrix $W_t$.

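The Gradient Update

Set the gradient w.r.t. $w^j$ to 0:
$$\phi(w^j_{t+1}) = \phi(w^j_t) - \eta\left(\phi_j(W_{t+1} x_t) - y^j_t\right) x_t$$
Recall that $\Theta_{t+1} = \phi(W_{t+1})$ with $\theta^j_{t+1} = \phi(w^j_{t+1})$.
The update equation:
$$\theta^j_{t+1} = \theta^j_t - \eta\left(\phi_j(W_{t+1} x_t) - y^j_t\right) x_t \approx \theta^j_t - \eta\left(\phi_j(W_t x_t) - y^j_t\right) x_t = \theta^j_t - \eta\left(\hat{y}^j_t - y^j_t\right) x_t$$
The updated weight matrix is $W_{t+1} = \psi(\Theta_{t+1})$.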
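
One step of the approximate update can be sketched as follows. Everything named here is an assumption for illustration: `ga_step` is a hypothetical helper, `theta_to_w` stands in for the map from $\Theta$ to $W$ and defaults to the identity (which reduces the step to plain gradient descent), and the transfer function is a component-wise logistic.

```python
import math

def logistic_vector(a):
    """Assumed transfer function: component-wise logistic."""
    return [1.0 / (1.0 + math.exp(-ai)) for ai in a]

def identity(theta):
    """Assumed parameterization: W equals Theta."""
    return theta

def ga_step(theta, x, y, eta, transfer=logistic_vector, theta_to_w=identity):
    """One additive update: theta^j <- theta^j - eta * (y_hat^j - y^j) * x."""
    W = theta_to_w(theta)
    activations = [sum(wi * xi for wi, xi in zip(row, x)) for row in W]
    y_hat = transfer(activations)                  # prediction phi(W_t x_t)
    new_theta = [
        [th - eta * (yh - yj) * xi for th, xi in zip(row, x)]
        for row, yh, yj in zip(theta, y_hat, y)
    ]
    return new_theta, y_hat

# Two outputs, two inputs, starting from Theta = 0.
theta = [[0.0, 0.0], [0.0, 0.0]]
x, y = [1.0, 2.0], [1.0, 0.0]
new_theta, y_hat = ga_step(theta, x, y, eta=0.5)
print(y_hat, new_theta)
```

Passing a different `theta_to_w` changes how the additively updated parameters are turned back into weights, while the update of $\Theta$ itself stays the same.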

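Relative Loss Bounds

Cumulative loss of the General Additive (GA) algorithm:
$$\mathrm{Loss}_\phi(\mathrm{GA}, S) = \sum_{t=1}^{T} L_\psi(y_t, \phi(W_t x_t))$$
Best cumulative loss looking back, for a fixed weight matrix $W$:
$$\mathrm{Loss}_\phi(W, S) = \sum_{t=1}^{T} L_\psi(y_t, \phi(W x_t))$$
Two types of relative loss bounds:
$$\mathrm{Loss}_\phi(\mathrm{GA}, S) \le p\,\mathrm{Loss}_\phi(W, S) + q$$
$$\mathrm{Loss}_\phi(\mathrm{GA}, S) \le \mathrm{Loss}_\phi(W, S) + q_1 \sqrt{\mathrm{Loss}_\phi(W, S)} + q_2$$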
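
To make the two cumulative losses concrete, the sketch below computes both for the linear case with loss $\frac{1}{2}(y - \hat{y})^2$. It is a toy illustration under stated assumptions, not the bound from the talk: the data are noise-free, generated by a comparator $u$, and $\eta \|x_t\|^2 \le 1$ on every trial. Under those assumptions an elementary potential argument on $\|w_t - u\|^2$ gives a cumulative loss of at most $\|u - w_1\|^2 / (2\eta)$ for the algorithm, which the code checks.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cumulative_losses(examples, comparator, eta):
    """Return (loss of online gradient descent, loss of the fixed comparator)."""
    w = [0.0] * len(comparator)          # assumed start vector w_1 = 0
    loss_alg, loss_cmp = 0.0, 0.0
    for x, y in examples:
        y_hat = dot(w, x)
        loss_alg += 0.5 * (y - y_hat) ** 2
        loss_cmp += 0.5 * (y - dot(comparator, x)) ** 2
        w = [wi - eta * (y_hat - y) * xi for wi, xi in zip(w, x)]
    return loss_alg, loss_cmp

u = [1.0, -2.0]                                     # assumed comparator
inputs = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]] * 20  # ||x||^2 <= 2
examples = [(x, dot(u, x)) for x in inputs]         # noise-free outputs
eta = 0.25                                          # eta * ||x||^2 <= 0.5

loss_alg, loss_cmp = cumulative_losses(examples, u, eta)
q = dot(u, u) / (2 * eta)                           # ||u - w_1||^2 / (2 eta)
print(loss_alg, loss_cmp, q)
```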

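The Statistics Connection: Overview

Exponential family distribution:
$$p_\theta(z) = \exp\left(z^T \theta - f(\theta)\right) p_0(z)$$
Exponential families $\leftrightarrow$ Bregman divergences:
$$p_\theta(z) = \exp\left(z^T \theta - f(\theta)\right) p_0(z) = \exp\left(-L_\psi(y, \mu)\right) f_0(y)$$
where $\psi = \phi^{-1} = (\nabla f)^{-1}$, $y = \phi(z)$, $\mu = \phi(\theta)$.
Consider models where $\theta = w^T x$, so that $\mu = \phi(w^T x)$.
Maximum likelihood estimation boils down to
$$\max_\theta \log p_\theta(z) \;\Longleftrightarrow\; \min_w \sum_i L_\psi(y_i, \phi(w^T x_i))$$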
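
The Bernoulli family gives a small worked instance of this connection. In the sketch below, the log-partition function is $\log(1 + e^\theta)$, the base measure is 1, the mean is $\mu = 1 / (1 + e^{-\theta})$, and the binary observation itself is used as the first argument of the divergence; that last choice and the test values are assumptions of this sketch. The negative log-likelihood then coincides with the KL matching loss.

```python
import math

def logistic(theta):
    return 1.0 / (1.0 + math.exp(-theta))

def bernoulli_log_prob(z, theta):
    """log p_theta(z) = z * theta - log(1 + exp(theta)) for z in {0, 1}."""
    return z * theta - math.log(1.0 + math.exp(theta))

def xlogy_ratio(p, q):
    """p * log(p / q), using the convention 0 * log(0 / q) = 0."""
    return 0.0 if p == 0.0 else p * math.log(p / q)

def kl(y, mu):
    """Binary relative entropy KL(y || mu)."""
    return xlogy_ratio(y, mu) + xlogy_ratio(1.0 - y, 1.0 - mu)

# Compare -log p_theta(z) with KL(z || mu) over a few assumed test values.
pairs = [
    (-bernoulli_log_prob(z, theta), kl(z, logistic(theta)))
    for z in (0.0, 1.0)
    for theta in (-1.3, 0.4, 2.0)
]
print(pairs)
```

Since the two quantities agree term by term, maximizing the Bernoulli likelihood over $w$ with $\theta = w^T x$ is the same problem as minimizing the summed KL loss.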