Relative Loss Bounds for Multidimensional Regression Problems
Jyrki Kivinen and Manfred Warmuth
Presented by: Arindam Banerjee
A Single Neuron

For a training example (x, y), with x ∈ R^d and y ∈ [0, 1], learning solves

    min_w (y − φ(w^T x))^2

- The transfer function φ : R → R is increasing
- A popular choice is the logistic transfer function φ(z) = 1/(1 + exp(−z))
- The squared loss and the logistic transfer function do not match
- The mismatch may lead to exponentially many local minima: with n examples in d dimensions, the squared loss can have on the order of (n/d)^d local minima
- This motivates matching the loss function to the transfer function
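The mismatch can be seen numerically. Below is a small sketch (the two-point data set and all constants are our own, not from the slides) that scans the squared loss of a single logistic neuron over a 1-D weight grid and counts strict local minima:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two 1-D examples chosen (by us) so that the two squared-loss terms prefer
# very different weights: one gentle sigmoid and one steep one.
x = np.array([1.0, 20.0])
y = sigmoid(np.array([2.0, -2.0]))   # targets ~0.881 and ~0.119

def squared_loss(w):
    return float(np.sum((y - sigmoid(w * x)) ** 2))

# Scan the weight on a grid and count strict interior local minima.
ws = np.linspace(-2.0, 4.0, 3001)
losses = np.array([squared_loss(w) for w in ws])
is_min = (losses[1:-1] < losses[:-2]) & (losses[1:-1] < losses[2:])
n_local_minima = int(is_min.sum())
print(n_local_minima)  # more than one basin on this data => non-convex
```

Even with only two examples the loss curve has separate basins, which is the small-scale version of the exponential blow-up claimed above.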
Generalized Linear Models

- Employ a matching loss and transfer function: minimize Σ_i L_ψ(y_i, φ(w^T x_i))
- The loss is a convex function of w with a single minimum
- For the logistic transfer function, the matching loss is the relative entropy
    L_ψ(y, ŷ) = KL(y ‖ ŷ) = y log(y/ŷ) + (1 − y) log((1 − y)/(1 − ŷ))
- The empirical loss is convex in w:
    w* = argmin_w Σ_i KL(y_i ‖ 1/(1 + exp(−w^T x_i)))
- We ignore the statistical connections for now
Bregman Divergences

Let f be a differentiable, strictly convex function with gradient ψ, i.e., ∇f(x) = ψ(x). The Bregman divergence L_ψ is defined as

    L_ψ(x, y) = f(x) − f(y) − (x − y)^T ψ(y)

Examples:

- f(z) = ‖z‖²/2 gives L_ψ(x, y) = ‖x − y‖²/2: the squared Euclidean distance is a Bregman divergence
- f(z) = z log z gives L_ψ(x, y) = x log(x/y) − x + y: the relative entropy (KL divergence) is another Bregman divergence
- f(z) = −log z gives L_ψ(x, y) = x/y − log(x/y) − 1: the Itakura–Saito distance is another Bregman divergence
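The three examples above can be checked directly against the general definition. A minimal sketch (the helper name `bregman` and the test points are our own):

```python
import numpy as np

def bregman(f, grad_f, x, y):
    """Bregman divergence L_psi(x, y) = f(x) - f(y) - (x - y)^T grad_f(y)."""
    return f(x) - f(y) - np.dot(x - y, grad_f(y))

x = np.array([0.6, 1.4])
y = np.array([1.0, 2.0])

# f(z) = ||z||^2 / 2  ->  squared Euclidean distance ||x - y||^2 / 2
sq = bregman(lambda z: 0.5 * np.dot(z, z), lambda z: z, x, y)
assert np.isclose(sq, 0.5 * np.sum((x - y) ** 2))

# f(z) = sum z log z  ->  (unnormalized) relative entropy
kl = bregman(lambda z: np.sum(z * np.log(z)), lambda z: np.log(z) + 1, x, y)
assert np.isclose(kl, np.sum(x * np.log(x / y) - x + y))

# f(z) = -sum log z  ->  Itakura-Saito distance
is_ = bregman(lambda z: -np.sum(np.log(z)), lambda z: -1.0 / z, x, y)
assert np.isclose(is_, np.sum(x / y - np.log(x / y) - 1))
```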
Properties of Bregman Divergences

- L_ψ(x, y) ≥ 0, with equality iff x = y
- Not necessarily symmetric: in general L_ψ(x, y) ≠ L_ψ(y, x)
- Not a metric: the triangle inequality does not hold
- Strictly convex in the first argument
- Not necessarily convex in the second argument
- Three-point property:
    L_ψ(x, y) = L_ψ(x, z) + L_ψ(z, y) − (x − z)^T (ψ(y) − ψ(z))
- Generalized Pythagorean theorem: L_ψ(x, y) ≥ L_ψ(x, z) + L_ψ(z, y), where z is the Bregman projection of y onto a convex set containing x
Bregman Divergences and the Conjugate

Recall the convex conjugate f* of f:

    f*(λ) = sup_x (λ^T x − f(x)) = λ^T x − f(x) at the maximizing x
    f(x) = sup_λ (λ^T x − f*(λ)) = λ^T x − f*(λ) at the maximizing λ

- Hence f(x) + f*(λ) = x^T λ for such matched pairs (x, λ)
- Further, with ψ(x) = ∇f(x) and φ(λ) = ∇f*(λ), the pairs are matched exactly when
    λ = ∇f(x) = ψ(x)  ⟺  x = ∇f*(λ) = φ(λ)
- As a result, writing λ_i = ψ(x_i),
    L_ψ(x_1, x_2) = f(x_1) + f*(λ_2) − x_1^T λ_2 = L_φ(λ_2, λ_1)
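The identity L_ψ(x_1, x_2) = L_φ(λ_2, λ_1) can be verified numerically. A sketch using one assumed conjugate pair, f(x) = Σ exp(x_i) with f*(λ) = Σ (λ_i log λ_i − λ_i); the concrete choice of f is ours, for illustration only:

```python
import numpy as np

# Conjugate pair: f(x) = sum(exp(x)) with psi = grad f = exp, and
# f*(l) = sum(l*log(l) - l) with phi = grad f* = log; phi, psi are inverses.
f = lambda x: np.sum(np.exp(x))
psi = np.exp
fstar = lambda l: np.sum(l * np.log(l) - l)
phi = np.log

def bregman(g, grad_g, a, b):
    return g(a) - g(b) - np.dot(a - b, grad_g(b))

x1 = np.array([0.2, -0.5])
x2 = np.array([1.0, 0.3])
l1, l2 = psi(x1), psi(x2)   # dual (conjugate) parameters

# Duality of divergences: L_psi(x1, x2) = L_phi(lambda2, lambda1)
lhs = bregman(f, psi, x1, x2)
rhs = bregman(fstar, phi, l2, l1)
assert np.isclose(lhs, rhs)

# Fenchel equality at matched points: f(x) + f*(psi(x)) = x^T psi(x)
assert np.isclose(f(x1) + fstar(psi(x1)), np.dot(x1, psi(x1)))
```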
Generalized Linear Models, Take 2

- For a matching loss and transfer function,
    L_ψ(y_i, φ(w^T x_i)) = L_φ(w^T x_i, ψ(y_i))
- The loss is a convex function of w with a single minimum
- The strategy for obtaining a matching loss:
  - Choose your favorite increasing transfer function φ
  - Let ψ = φ^{−1}
  - Let L_ψ be the Bregman divergence derived from ψ
  - Set up the problem as min_w Σ_i L_ψ(y_i, φ(w^T x_i))
- Note that ψ = ∇f and φ = ∇f* are inverses of each other
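Following this recipe for the logistic transfer function recovers the relative-entropy loss from the earlier slide. A scalar-case sketch (function names are ours; the potential f with ∇f = logit is f(y) = y log y + (1 − y) log(1 − y)):

```python
import math

# Matching-loss recipe for the logistic transfer function:
# phi(z) = 1/(1+e^{-z}), psi = phi^{-1} = logit, and f is the potential
# whose derivative is psi: f(y) = y*log(y) + (1-y)*log(1-y).
phi = lambda z: 1.0 / (1.0 + math.exp(-z))
psi = lambda y: math.log(y / (1.0 - y))
f = lambda y: y * math.log(y) + (1 - y) * math.log(1 - y)

def matching_loss(y, yhat):
    # Bregman divergence L_psi(y, yhat) = f(y) - f(yhat) - (y - yhat)*psi(yhat)
    return f(y) - f(yhat) - (y - yhat) * psi(yhat)

def binary_kl(y, yhat):
    return y * math.log(y / yhat) + (1 - y) * math.log((1 - y) / (1 - yhat))

y, z = 0.3, 0.8          # target and linear activation w^T x (our test values)
yhat = phi(z)
assert math.isclose(matching_loss(y, yhat), binary_kl(y, yhat))
```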
Online Learning: The Basic Setting

- Prediction proceeds in trials t = 1, ..., T
- At trial t, the algorithm:
  - Gets an input x_t
  - Makes a prediction ŷ_t = φ(w_t^T x_t)
  - Receives the true output y_t
  - Updates the weight vector to w_{t+1}
- Consider the squared loss (identity transfer function):
    L_ψ(y_t, φ(w_t^T x_t)) = L_φ(w_t^T x_t, ψ(y_t)) = (y_t − w_t^T x_t)²
- The corresponding gradient descent update, with ŷ_t = w_t^T x_t, is
    w_{t+1} = w_t − η(ŷ_t − y_t) x_t
- What happens in the general case?
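The basic squared-loss protocol above can be sketched as a short loop (the synthetic data, learning rate, and comparator vector are our own, for illustration only):

```python
import numpy as np

# Minimal online gradient descent for the squared-loss / identity-transfer
# case, implementing the update w_{t+1} = w_t - eta*(yhat_t - y_t)*x_t.
rng = np.random.default_rng(0)
d, T, eta = 3, 200, 0.1
w_true = np.array([1.0, -2.0, 0.5])   # fixed comparator generating the data

w = np.zeros(d)
cumulative_loss = 0.0
for t in range(T):
    x_t = rng.normal(size=d)
    y_t = w_true @ x_t               # noiseless target
    y_hat = w @ x_t                  # predict
    cumulative_loss += (y_t - y_hat) ** 2
    w = w - eta * (y_hat - y_t) * x_t   # update

print(cumulative_loss)
```

On this noiseless data the cumulative loss stays bounded and w approaches the comparator, which is the behavior the relative loss bounds formalize later.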
The General Setting

- General means: multidimensional output, arbitrary matching loss
- The loss on trial t is
    L_ψ(y_t, φ(W_t x_t)) = L_φ(W_t x_t, ψ(y_t))
  where W_t is the weight matrix at trial t
- Consider the parameter matrix Θ_t = φ(W_t), so that W_t = ψ(Θ_t)
- We want to obtain W = W_{t+1} as a regularized update:
  - It should have low loss on the current example (x_t, y_t)
  - It should stay close to the current weight matrix W_t
- The cost function to minimize is
    U(W) = L_φ(W, W_t) + η L_ψ(y_t, φ(W x_t)) = L_φ(W, W_t) + η L_φ(W x_t, ψ(y_t))
The Gradient Update

- Setting the gradient of U with respect to row w^j to 0 gives
    φ(w^j_{t+1}) = φ(w^j_t) − η (φ_j(W_{t+1} x_t) − y_t^j) x_t
- Recall that Θ_{t+1} = φ(W_{t+1}), with rows θ^j_{t+1} = φ(w^j_{t+1})
- The update equation, approximating the implicit update by an explicit one:
    θ^j_{t+1} = θ^j_t − η (φ_j(W_{t+1} x_t) − y_t^j) x_t
              ≈ θ^j_t − η (φ_j(W_t x_t) − y_t^j) x_t
              = θ^j_t − η (ŷ_t^j − y_t^j) x_t
- The updated weight matrix is W_{t+1} = ψ(Θ_{t+1})
- So the weight matrix W_{t+1} is updated through the Θ_{t+1} representation: the additive update takes place in the conjugate space
- Gradient descent (GD) and exponentiated gradient (EG) are special cases
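The conjugate-space update can be sketched for a single output unit: keep a dual parameter θ, update it additively, and map back to weights. Only the dual-to-weight map changes between GD (identity) and EG (normalized exponentials, i.e., multiplicative weight updates). The data, learning rates, and helper names below are our own simplifications:

```python
import numpy as np

def run(psi_w, eta, X, y):
    """Additive update in the dual: theta -= eta*(yhat - y)*x; w = psi_w(theta)."""
    theta = np.zeros(X.shape[1])
    total = 0.0
    for x_t, y_t in zip(X, y):
        w = psi_w(theta)                 # map dual parameters to weights
        y_hat = w @ x_t
        total += (y_t - y_hat) ** 2
        theta = theta - eta * (y_hat - y_t) * x_t
    return total

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
w_true = np.array([0.6, 0.1, 0.2, 0.1])   # a probability vector, EG-friendly
y = X @ w_true

# GD: identity map. EG: softmax of theta, i.e., multiplicative updates on w.
gd_loss = run(lambda th: th, eta=0.05, X=X, y=y)
eg_loss = run(lambda th: np.exp(th) / np.exp(th).sum(), eta=0.1, X=X, y=y)
print(gd_loss, eg_loss)
```

With θ initialized to zero, the EG variant starts at uniform weights and reproduces the usual w_i ∝ w_i exp(−η(ŷ − y) x_i) update.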
Relative Loss Bounds

- Cumulative loss of the General Additive (GA) algorithm on a trial sequence S:
    Loss_φ(GA, S) = Σ_{t=1}^T L_ψ(y_t, φ(W_t x_t))
- Best cumulative loss in hindsight, for a fixed comparator W:
    Loss_φ(W, S) = Σ_{t=1}^T L_ψ(y_t, φ(W x_t))
- Two types of relative loss bounds:
    Loss_φ(GA, S) ≤ p · Loss_φ(W, S) + q
    Loss_φ(GA, S) ≤ Loss_φ(W, S) + q_1 √(Loss_φ(W, S)) + q_2
The Statistics Connection: Overview

- Exponential family distribution:
    p_θ(z) = exp(z^T θ − f(θ)) p_0(z)
- Exponential families correspond to Bregman divergences:
    p_θ(z) = exp(z^T θ − f(θ)) p_0(z) = exp(−L_ψ(y, μ)) f_0(y)
  where ψ = φ^{−1} = (∇f)^{−1}, y = φ(z), μ = φ(θ)
- Consider models where θ = w^T x, so that μ = φ(w^T x)
- Maximum likelihood estimation then boils down to
    max_θ Σ_i log p(z_i) = min_w Σ_i L_ψ(y_i, φ(w^T x_i))
- This is the problem we discussed today
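For the Bernoulli family this correspondence can be checked exactly: the negative log-likelihood equals the binary relative entropy KL(z ‖ μ), so maximizing likelihood minimizes the logistic matching loss. A sketch (function names are ours):

```python
import math

# Bernoulli as an exponential family: p_theta(z) = exp(z*theta - f(theta))
# for z in {0,1}, with log-partition f(theta) = log(1 + e^theta) and
# mean parameter mu = phi(theta) = sigmoid(theta).
f = lambda th: math.log(1.0 + math.exp(th))
phi = lambda th: 1.0 / (1.0 + math.exp(-th))

def nll(z, th):
    # negative log-likelihood of z under p_theta
    return -(z * th - f(th))

def kl(y, mu):
    # binary relative entropy, with the convention 0*log(0) = 0
    t1 = 0.0 if y == 0 else y * math.log(y / mu)
    t2 = 0.0 if y == 1 else (1 - y) * math.log((1 - y) / (1 - mu))
    return t1 + t2

for z in (0.0, 1.0):
    for th in (-1.5, 0.0, 2.0):
        assert math.isclose(nll(z, th), kl(z, phi(th)))
```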
Reference: J. Kivinen and M. K. Warmuth. Relative Loss Bounds for Multidimensional Regression Problems. Machine Learning, 45:301–329, 2001.
More informationGenerative Adversarial Networks
Generative Adversarial Networks Stefano Ermon, Aditya Grover Stanford University Lecture 10 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 10 1 / 17 Selected GANs https://github.com/hindupuravinash/the-gan-zoo
More informationSequential Supervised Learning
Sequential Supervised Learning Many Application Problems Require Sequential Learning Part-of of-speech Tagging Information Extraction from the Web Text-to to-speech Mapping Part-of of-speech Tagging Given
More informationCoordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /
Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods
More informationConvex optimization COMS 4771
Convex optimization COMS 4771 1. Recap: learning via optimization Soft-margin SVMs Soft-margin SVM optimization problem defined by training data: w R d λ 2 w 2 2 + 1 n n [ ] 1 y ix T i w. + 1 / 15 Soft-margin
More informationOn Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:
A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition
More information1 Overview. 2 A Characterization of Convex Functions. 2.1 First-order Taylor approximation. AM 221: Advanced Optimization Spring 2016
AM 221: Advanced Optimization Spring 2016 Prof. Yaron Singer Lecture 8 February 22nd 1 Overview In the previous lecture we saw characterizations of optimality in linear optimization, and we reviewed the
More informationOptimized first-order minimization methods
Optimized first-order minimization methods Donghwan Kim & Jeffrey A. Fessler EECS Dept., BME Dept., Dept. of Radiology University of Michigan web.eecs.umich.edu/~fessler UM AIM Seminar 2014-10-03 1 Disclosure
More informationLearning Binary Classifiers for Multi-Class Problem
Research Memorandum No. 1010 September 28, 2006 Learning Binary Classifiers for Multi-Class Problem Shiro Ikeda The Institute of Statistical Mathematics 4-6-7 Minami-Azabu, Minato-ku, Tokyo, 106-8569,
More informationLecture 5: Logistic Regression. Neural Networks
Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture
More informationApprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning
Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Nicolas Thome Prenom.Nom@cnam.fr http://cedric.cnam.fr/vertigo/cours/ml2/ Département Informatique Conservatoire
More informationQualifying Exam in Machine Learning
Qualifying Exam in Machine Learning October 20, 2009 Instructions: Answer two out of the three questions in Part 1. In addition, answer two out of three questions in two additional parts (choose two parts
More informationCS229T/STATS231: Statistical Learning Theory. Lecturer: Tengyu Ma Lecture 11 Scribe: Jongho Kim, Jamie Kang October 29th, 2018
CS229T/STATS231: Statistical Learning Theory Lecturer: Tengyu Ma Lecture 11 Scribe: Jongho Kim, Jamie Kang October 29th, 2018 1 Overview This lecture mainly covers Recall the statistical theory of GANs
More informationRobustness and duality of maximum entropy and exponential family distributions
Chapter 7 Robustness and duality of maximum entropy and exponential family distributions In this lecture, we continue our study of exponential families, but now we investigate their properties in somewhat
More informationLecture 2: Convex Sets and Functions
Lecture 2: Convex Sets and Functions Hyang-Won Lee Dept. of Internet & Multimedia Eng. Konkuk University Lecture 2 Network Optimization, Fall 2015 1 / 22 Optimization Problems Optimization problems are
More informationProbabilistic Graphical Models for Image Analysis - Lecture 4
Probabilistic Graphical Models for Image Analysis - Lecture 4 Stefan Bauer 12 October 2018 Max Planck ETH Center for Learning Systems Overview 1. Repetition 2. α-divergence 3. Variational Inference 4.
More informationJune 21, Peking University. Dual Connections. Zhengchao Wan. Overview. Duality of connections. Divergence: general contrast functions
Dual Peking University June 21, 2016 Divergences: Riemannian connection Let M be a manifold on which there is given a Riemannian metric g =,. A connection satisfying Z X, Y = Z X, Y + X, Z Y (1) for all
More informationNN V: The generalized delta learning rule
NN V: The generalized delta learning rule We now focus on generalizing the delta learning rule for feedforward layered neural networks. The architecture of the two-layer network considered below is shown
More informationStatistical Learning Theory and the C-Loss cost function
Statistical Learning Theory and the C-Loss cost function Jose Principe, Ph.D. Distinguished Professor ECE, BME Computational NeuroEngineering Laboratory and principe@cnel.ufl.edu Statistical Learning Theory
More informationLogistic Regression. Seungjin Choi
Logistic Regression Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationConvex Optimization Lecture 16
Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean
More informationLecture 21: Minimax Theory
Lecture : Minimax Theory Akshay Krishnamurthy akshay@cs.umass.edu November 8, 07 Recap In the first part of the course, we spent the majority of our time studying risk minimization. We found many ways
More informationStatistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks
Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks Jan Drchal Czech Technical University in Prague Faculty of Electrical Engineering Department of Computer Science Topics covered
More informationAdaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade
Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 Announcements: HW3 posted Dual coordinate ascent (some review of SGD and random
More informationLogistic Regression. INFO-2301: Quantitative Reasoning 2 Michael Paul and Jordan Boyd-Graber SLIDES ADAPTED FROM HINRICH SCHÜTZE
Logistic Regression INFO-2301: Quantitative Reasoning 2 Michael Paul and Jordan Boyd-Graber SLIDES ADAPTED FROM HINRICH SCHÜTZE INFO-2301: Quantitative Reasoning 2 Paul and Boyd-Graber Logistic Regression
More informationStatistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation
Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider
More information