Multi Task Averaging: Theory and Practice

1 Multi Task Averaging: Theory and Practice. Maya R. Gupta (Google Research, Univ. Washington), Sergey Feldman (Univ. Washington), Bela Frigyik (Univ. Pecs)

2 Aristotle. The idea of a mean is old: "By the mean of a thing I denote a point equally distant from either extreme..." (Aristotle). (Figure: a midpoint $v$ between $y_{\min}$ and $y_{\max}$, with $v = \frac{y_{\min} + y_{\max}}{2}$.)

3 Tycho Brahe (16th century) averaged to reduce measurement error: $\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$

4 Legendre (1805) noted that the mean minimizes squared error: $\bar{y} = \arg\min_{\hat{\mu}} \sum_{i=1}^{N} (y_i - \hat{\mu})^2$

5 Legendre (1805) noted that the mean minimizes squared error: $\bar{y} = \arg\min_{\hat{\mu}} \sum_{i=1}^{N} (y_i - \hat{\mu})^2$. (Banerjee et al. 2005: the mean minimizes any Bregman divergence, $\bar{y} = \arg\min_{\hat{\mu}} \sum_{i=1}^{N} d_{\phi}(y_i, \hat{\mu})$. Frigyik et al. 2008: the mean minimizes any functional Bregman divergence.)

6 Gauss (1809). The average was central to Gauss's construction of the normal distribution. His goals: a smooth distribution whose likelihood peak is at the sample mean.

7 Fisher (1922): "...no other statistic which can be calculated from the same sample provides any additional information as to the value of the parameter..." (R. A. Fisher)

8 Stein's Paradox (1956). Total squared error can be reduced by estimating each of the means of T Gaussian random variables using data sampled from all of them, even if the random variables are independent and have different means. (Figure: means $\mu_1, \mu_2, \ldots, \mu_T$ for tasks $t = 1, 2, \ldots, T$.)

9 Stein Estimation: One-Sample Case. Problem: estimate the means $\{\mu_t\}$ of T Gaussian random variables. Given: one random sample $Y_t \sim \mathcal{N}(\mu_t, \sigma^2)$ for $t = 1, \ldots, T$.

10 Stein Estimation: One-Sample Case. Problem: estimate the means $\{\mu_t\}$ of T Gaussian random variables. Given: one random sample $Y_t \sim \mathcal{N}(\mu_t, \sigma^2)$ for $t = 1, \ldots, T$. Maximum likelihood estimate: $\hat{\mu}_t = Y_t$.

11 Stein Estimation: One-Sample Case. Problem: estimate the means $\{\mu_t\}$ of T Gaussian random variables. Given: one random sample $Y_t \sim \mathcal{N}(\mu_t, \sigma^2)$ for $t = 1, \ldots, T$. Maximum likelihood estimate: $\hat{\mu}_t = Y_t$. James-Stein estimate: $\hat{\mu}_t^{JS} = \left(1 - \frac{(T-2)\,\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right) Y_t$.
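
To make the shrinkage concrete, here is a minimal NumPy sketch of the one-sample James-Stein estimate shown above; the function and variable names are mine, not from the talk.

```python
import numpy as np

def james_stein_one_sample(y, sigma2):
    """One-sample James-Stein estimate: shrink the T observations y
    (one draw per task, common known variance sigma2) toward zero."""
    y = np.asarray(y, dtype=float)
    T = y.size
    shrink = 1.0 - (T - 2) * sigma2 / np.sum(y ** 2)
    return shrink * y

# Example: T = 5 tasks, one noisy observation each.
print(james_stein_one_sample([1.2, -0.3, 0.7, 2.1, -1.5], sigma2=1.0))
```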

12 James-Stein Estimator Derivation: Efron and Morris (1972) Empirical Bayes Argument. Key assumptions: $\mu_t \sim \mathcal{N}(0, \tau^2)$ with $\tau^2$ unknown; $Y_t \sim \mathcal{N}(\mu_t, \sigma^2)$ with $\sigma^2$ known. James-Stein estimate: $\hat{\mu}_t^{JS} = \left(1 - \frac{(T-2)\,\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right) Y_t$.

13 James-Stein Estimator Derivation: Efron and Morris (1972) Empirical Bayes Argument. Key assumptions: $\mu_t \sim \mathcal{N}(0, \tau^2)$ with $\tau^2$ unknown; $Y_t \sim \mathcal{N}(\mu_t, \sigma^2)$ with $\sigma^2$ known. Posterior mean: $\mathbb{E}[\mu_t \mid Y_t] = \left(1 - \frac{\sigma^2}{\tau^2 + \sigma^2}\right) Y_t$. James-Stein estimate: $\hat{\mu}_t^{JS} = \left(1 - \frac{(T-2)\,\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right) Y_t$.

14 James-Stein Estimator Derivation: Efron and Morris (1972) Empirical Bayes Argument. Key assumptions: $\mu_t \sim \mathcal{N}(0, \tau^2)$ with $\tau^2$ unknown; $Y_t \sim \mathcal{N}(\mu_t, \sigma^2)$ with $\sigma^2$ known. Posterior mean: $\mathbb{E}[\mu_t \mid Y_t] = \left(1 - \frac{\sigma^2}{\tau^2 + \sigma^2}\right) Y_t$. Key fact: $\mathbb{E}\!\left[\frac{(T-2)\,\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right] = \frac{\sigma^2}{\tau^2 + \sigma^2}$. James-Stein estimate: $\hat{\mu}_t^{JS} = \left(1 - \frac{(T-2)\,\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right) Y_t$.
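
Putting the two displayed facts together (a small worked step, spelled out here for readability): replacing the unknown shrinkage factor in the posterior mean by its unbiased estimate gives the James-Stein estimate.

```latex
\mathbb{E}[\mu_t \mid Y_t]
  = \left(1 - \frac{\sigma^2}{\tau^2 + \sigma^2}\right) Y_t
\quad\text{and}\quad
\mathbb{E}\!\left[\frac{(T-2)\,\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right]
  = \frac{\sigma^2}{\tau^2 + \sigma^2}
\;\;\Longrightarrow\;\;
\hat{\mu}_t^{JS}
  = \left(1 - \frac{(T-2)\,\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right) Y_t .
```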

15 A More General JSE (Bock, 1972). Assumptions: $\mu_t \sim \mathcal{N}(\xi, \tau^2)$ with $\tau^2$ and $\xi$ unknown; $Y_{ti} \sim \mathcal{N}(\mu_t, \sigma_t^2)$, $i = 1, \ldots, N_t$, with $\sigma_t^2$ unknown. General James-Stein estimate: $\hat{\mu}_t^{JS} = \hat{\xi} + \left(1 - \frac{T-3}{(\bar{Y} - \hat{\xi})^\top \Sigma^{-1} (\bar{Y} - \hat{\xi})}\right)^+ (\bar{Y}_t - \hat{\xi})$, where $\hat{\xi} = \frac{1}{T}\sum_{r=1}^{T} \bar{Y}_r$; the positive part is $(x)^+ = \max(x, 0)$; $\Sigma$ is diagonal with $\Sigma_{tt} = \hat{\sigma}_t^2 / N_t$; and $\bar{Y}$ is a $T$-length vector with $t$th entry $\bar{Y}_t$.
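
A hedged NumPy sketch of Bock's positive-part estimator above, plugging in the sample quantities exactly as defined in the bullets; the names are mine.

```python
import numpy as np

def general_james_stein(samples):
    """Positive-part general James-Stein estimate from a list of 1-D sample
    arrays, one array per task, following the formula on this slide."""
    arrs = [np.asarray(s, dtype=float) for s in samples]
    ybar = np.array([s.mean() for s in arrs])                 # per-task sample means
    var = np.array([s.var(ddof=1) / len(s) for s in arrs])    # Sigma_tt = sigma_t^2 / N_t
    T = len(arrs)
    xi = ybar.mean()                                          # xi_hat = average of the sample means
    d = ybar - xi
    mahal = np.sum(d ** 2 / var)                              # (Ybar - xi)^T Sigma^{-1} (Ybar - xi)
    shrink = max(1.0 - (T - 3) / mahal, 0.0)                  # positive part
    return xi + shrink * d
```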

16 James-Stein Dominates. James and Stein's theorem (1961): for $T > 3$ the general JSE dominates the sample average (MLE): $\mathbb{E}[\|\mu - \hat{\mu}^{JS}\|_2^2] \le \mathbb{E}[\|\mu - \bar{Y}\|_2^2]$ for every choice of $\mu$. This is also written as $R(\hat{\mu}^{JS}) \le R(\bar{Y})$.

17 James-Stein Estimation $\leftrightarrow$ Empirical Bayes. Multi-task Averaging (MTA) $\leftrightarrow$ Empirical Loss Minimization with Regularization (Tikhonov Regularization; "Empirical Vapnik").

18 Multi-Task Averaging (Feldman et al.). Problem: estimate the means $\{\mu_t\}$ of $T$ random variables. Data model: $Y_{ti}$ drawn IID from $\nu_t$ with finite mean $\mu_t$. Given: $N_t$ IID samples $\{y_{ti}\}_{i=1}^{N_t}$ from each random variable. (Figure: tasks $t = 1, 2, \ldots, T$ with means $\mu_1, \mu_2, \ldots, \mu_T$.)

19 Building the MTA Objective. "Single-task" averaging: $\bar{y}_t = \arg\min_{\hat{\mu}_t} \sum_{i=1}^{N_t} (y_{ti} - \hat{\mu}_t)^2$

20 Building the MTA Objective. "Single-task" averaging, summed across tasks: $\{\bar{y}_t\}_{t=1}^{T} = \arg\min_{\{\hat{\mu}_t\}_{t=1}^{T}} \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N_t} (y_{ti} - \hat{\mu}_t)^2$

21 Building the MTA Objective. "Single-task" averaging with a Mahalanobis distance: $\{\bar{y}_t\}_{t=1}^{T} = \arg\min_{\{\hat{\mu}_t\}_{t=1}^{T}} \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N_t} \frac{(y_{ti} - \hat{\mu}_t)^2}{\sigma_t^2}$

22 The MTA Objective. "Multi-task" averaging: $\{\hat{\mu}_t^{MTA}\}_{t=1}^{T} = \arg\min_{\{\hat{\mu}_t\}_{t=1}^{T}} \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N_t} \frac{(y_{ti} - \hat{\mu}_t)^2}{\sigma_t^2} + \frac{\gamma}{T^2} \sum_{r=1}^{T} \sum_{s=1}^{T} A_{rs} (\hat{\mu}_r - \hat{\mu}_s)^2$. The first term is the Mahalanobis distance to the samples from task $t$; $A_{rs}$ is the similarity between tasks $r$ and $s$.

23 The MTA Objective (repeated). Task 1: estimate the average movie ticket price. Task 2: estimate the mean age of kids at summer camp.

24 The MTA Objective (repeated). Task 1: estimate the average movie ticket price. Task 2: estimate the price of tea in China?

25-30 The MTA Objective (animation). Figures: the sample averages $\bar{y}_1$ ($t = 1$) and $\bar{y}_2$ ($t = 2$) are pulled toward each other by the regularizer, giving the MTA estimates $\hat{\mu}_1^{MTA}$ and $\hat{\mu}_2^{MTA}$.

31 The MTA Objective. $\{\hat{\mu}_t^{MTA}\}_{t=1}^{T} = \arg\min_{\{\hat{\mu}_t\}_{t=1}^{T}} \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N_t} \frac{(y_{ti} - \hat{\mu}_t)^2}{\sigma_t^2} + \frac{\gamma}{T^2} \sum_{r=1}^{T} \sum_{s=1}^{T} A_{rs} (\hat{\mu}_r - \hat{\mu}_s)^2$. The empirical loss lowers bias; the regularizer lowers estimation variance.

32 MTA Closed-Form Solution. For non-negative $A$: $\hat{\mu}^{MTA} = \left(I + \frac{\gamma}{T} \Sigma L\right)^{-1} \bar{y}$, where $\hat{\mu}^{MTA}$ is the vector of $T$ MTA solutions, $\Sigma$ is the diagonal matrix of sample-mean variances $\sigma_t^2 / N_t$, $L$ is the graph Laplacian of $A + A^\top$, and $\bar{y}$ is the vector of $T$ sample averages.

33 MTA Closed-Form Solution. $\hat{\mu}^{MTA} = \left(I + \frac{\gamma}{T} \Sigma L\right)^{-1} \bar{y}$. Lemma: this inverse always exists if $A_{rs} \ge 0$, $\gamma \ge 0$, and $N_t \ge 1$.

34 MTA Closed-Form Solution. $\hat{\mu}^{MTA} = \left(I + \frac{\gamma}{T} \Sigma L\right)^{-1} \bar{y} = W \bar{y}$: the MTA estimates are a linear combination of the sample averages, with a $T \times T$ weight matrix $W$ applied to the $T$ sample averages.

35 MTA Closed-Form Solution. Theorem: the MTA estimates are a convex combination of the sample averages; $W$ is a right-stochastic matrix.
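
A minimal NumPy sketch of the closed form above, assuming a symmetric, nonnegative similarity matrix $A$ is already given; the function names and the gamma default are mine.

```python
import numpy as np

def mta_estimate(samples, A, gamma=1.0):
    """MTA closed form: mu_hat = (I + (gamma / T) * Sigma * L)^{-1} ybar."""
    arrs = [np.asarray(s, dtype=float) for s in samples]
    T = len(arrs)
    ybar = np.array([s.mean() for s in arrs])                   # sample averages
    Sigma = np.diag([s.var(ddof=1) / len(s) for s in arrs])     # sample-mean variances sigma_t^2 / N_t
    L = np.diag(A.sum(axis=1)) - A                              # graph Laplacian of (symmetric) A
    W = np.linalg.inv(np.eye(T) + (gamma / T) * Sigma @ L)
    return W @ ybar
```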

36 When is MTA Better than the Sample Means? Two tasks: $Y_{1i} = \mu_1 + \epsilon_1$ with $N_1$ samples and variance $\sigma_1^2$; $Y_{2i} = \mu_2 + \epsilon_2$ with $N_2$ samples and variance $\sigma_2^2$. Task 1: estimate the average movie ticket price. Task 2: estimate the mean age of kids at summer camp.

37 When is MTA Better than the Sample Means? The MTA estimate $\left(I + \frac{\gamma}{T} \Sigma L\right)^{-1} \bar{Y}$ written out for task 1: $\hat{\mu}_1^{MTA} = \left(\frac{T + \frac{\sigma_2^2}{N_2} A_{12}}{T + \frac{\sigma_1^2}{N_1} A_{12} + \frac{\sigma_2^2}{N_2} A_{12}}\right) \bar{Y}_1 + \left(\frac{\frac{\sigma_1^2}{N_1} A_{12}}{T + \frac{\sigma_1^2}{N_1} A_{12} + \frac{\sigma_2^2}{N_2} A_{12}}\right) \bar{Y}_2$.
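
For concreteness, a tiny sketch that evaluates the two displayed weights for task 1; all inputs are assumed known here, and var1, var2 denote the sample-mean variances $\sigma_t^2 / N_t$.

```python
def mta_two_task(ybar1, ybar2, var1, var2, A12, T=2):
    """Two-task MTA estimate of mu_1 via the convex weights on this slide."""
    denom = T + var1 * A12 + var2 * A12
    w11 = (T + var2 * A12) / denom   # weight on task 1's own average
    w12 = var1 * A12 / denom         # weight borrowed from task 2 (w11 + w12 = 1)
    return w11 * ybar1 + w12 * ybar2

# Example: two similar tasks; the noisier task 1 borrows more from task 2.
print(mta_two_task(ybar1=10.0, ybar2=12.0, var1=0.5, var2=0.25, A12=0.5))
```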

38 When is MTA Better than the Sample Means? MTA is biased, but has smaller error variance than the sample averages: $\mathrm{Risk}[\hat{\mu}_1^{MTA}] < \mathrm{Risk}[\bar{Y}_1]$ if $(\mu_1 - \mu_2)^2 < \frac{4}{A_{12}} + \frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2}$.

39 Equivalently: $\mathrm{Risk}[\hat{\mu}^{MTA}] < \mathrm{Risk}[\bar{Y}]$ if $(\mu_1 - \mu_2)^2 - \frac{\sigma_1^2}{N_1} - \frac{\sigma_2^2}{N_2} < \frac{4}{A_{12}}$.

40 Optimal A for T = 2. Example: two tasks with means $\mu_1$, $\mu_2$ and observations $Y_1$, $Y_2$. Answer: the optimal task similarity in terms of MSE (with $\gamma = 1$) is $A_{12}^* = \frac{2}{(\mu_1 - \mu_2)^2}$.

41 Estimated Optimal A for T = 2. The optimal task similarity in terms of MSE is $A_{12}^* = \frac{2}{(\mu_1 - \mu_2)^2}$; estimated: $\hat{A}_{12} = \frac{2}{(\bar{y}_1 - \bar{y}_2)^2}$.

42 (Figure: risk as a function of the similarity $A_{12}$ for $\sigma_1^2 = 1$, $\sigma_2^2 = 1$, $\mu_1 = 0$, $\mu_2 = 1$; the optimal similarity is $A_{12}^* = \frac{2}{(\mu_1 - \mu_2)^2}$.)

43 Optimal A for T > 2. Bad news: no simple analytical minimization of the risk for $T > 2$.

44 Optimal A for T > 2. Bad news: no simple analytical minimization of the risk for $T > 2$. One solution: use the pairwise estimate to populate $A$: $\hat{A}_{rs} = \frac{2}{(\bar{y}_r - \bar{y}_s)^2}$. (Figure: a matrix with entries $\hat{A}_{12}, \hat{A}_{13}, \hat{A}_{21}, \hat{A}_{23}, \hat{A}_{31}, \hat{A}_{32}$.)

45 Optimal A for T > 2. Bad news: no simple analytical minimization of the risk for $T > 2$. One solution: use the pairwise estimate to populate $A$. Better solution: constrain $A = a \mathbf{1}\mathbf{1}^\top$ and optimize over $a$. (Figure: a matrix with every entry equal to $a$.)

46 Optimal A for T > 2. Better solution: constrain $A = a \mathbf{1}\mathbf{1}^\top$ and optimize over $a$. Analyzable: the optimal similarity is $a^* = \frac{2}{\frac{1}{T(T-1)} \sum_{r=1}^{T} \sum_{s=1}^{T} (\mu_r - \mu_s)^2}$, i.e., 2 divided by the average squared distance between the $T$ means.

47 Optimal A for T > 2. In practice, we estimate $a$: $\hat{a} = \frac{2}{\frac{1}{T(T-1)} \sum_{r=1}^{T} \sum_{s=1}^{T} (\bar{y}_r - \bar{y}_s)^2}$.
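
A hedged sketch of the constant-similarity recipe above: estimate $\hat{a}$ from the pairwise differences of the sample averages and plug $A = \hat{a}\mathbf{1}\mathbf{1}^\top$ into the mta_estimate sketch given earlier.

```python
import numpy as np

def constant_similarity(samples):
    """a_hat = 2 / (average squared distance between the T sample averages)."""
    ybar = np.array([np.mean(s) for s in samples])
    T = ybar.size
    diffs = ybar[:, None] - ybar[None, :]
    avg_sq_dist = np.sum(diffs ** 2) / (T * (T - 1))   # average over ordered pairs r != s
    return 2.0 / avg_sq_dist

def mta_constant(samples):
    """Constant-similarity MTA: A = a_hat * 1 1^T, then the MTA closed form."""
    T = len(samples)
    A = constant_similarity(samples) * np.ones((T, T))
    return mta_estimate(samples, A)   # closed-form sketch from the earlier slide
```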

48 How to Set the Similarity Matrix A? Choose $A$ to minimize expected total squared error, or choose $A$ to minimize worst-case total squared error.

49-50 Minimax A for T = 2. To find the minimax $A$, we need: 1. A constraint set for $\{\mu_t\}$: $\mu_t \in [b_l, b_u]$. (Figure: the square with corners $(b_l, b_l)$, $(b_l, b_u)$, $(b_u, b_l)$, $(b_u, b_u)$.)

51 Minimax A for T = 2. To find the minimax $A$, we need: 1. A constraint set for $\{\mu_t\}$: $\mu_t \in [b_l, b_u]$. 2. A least favorable prior (LFP): $p(\mu_1, \mu_2) = \frac{1}{2}$ if $(\mu_1, \mu_2) = (b_l, b_u)$; $\frac{1}{2}$ if $(\mu_1, \mu_2) = (b_u, b_l)$; $0$ otherwise.

52 Minimax A for T = 2. To find the minimax $A$, we need: 1. A constraint set for $\{\mu_t\}$: $\mu_t \in [b_l, b_u]$. 2. A least favorable prior (LFP). 3. A Bayes-optimal estimator with constant risk w.r.t. the LFP: MTA with similarity $A_{12}^{mm} = \frac{2}{(b_u - b_l)^2}$.

53 Minimax A for T = 2. To find the minimax $A$, we need: 1. A constraint set for $\{\mu_t\}$: $\mu_t \in [b_l, b_u]$. 2. A least favorable prior (LFP). 3. A Bayes-optimal estimator with constant risk w.r.t. the LFP: MTA with similarity $A_{12}^{mm} = \frac{2}{(b_u - b_l)^2}$. In practice, estimate the bounds from the sample averages: $\hat{b}_l = \min_t \bar{y}_t$, $\hat{b}_u = \max_t \bar{y}_t$.

54 Minimax A for T > 2. Same recipe: 1. A constraint set for $\{\mu_t\}$: $\mu_t \in [b_l, b_u]$. 2. A least favorable prior (LFP). 3. A Bayes-optimal estimator with constant risk w.r.t. the LFP: MTA with constant similarity $a^{mm} = \frac{2}{(b_u - b_l)^2}$, using $\hat{b}_l = \min_t \bar{y}_t$ and $\hat{b}_u = \max_t \bar{y}_t$.
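
A minimal sketch of step 3 with the plug-in bounds described above; it reuses the mta_estimate sketch from the closed-form slides.

```python
import numpy as np

def mta_minimax(samples):
    """Minimax MTA: constant similarity a_mm = 2 / (b_u - b_l)^2, with the
    range estimated from the sample averages (b_l_hat = min, b_u_hat = max)."""
    ybar = np.array([np.mean(s) for s in samples])
    b_l, b_u = ybar.min(), ybar.max()
    a_mm = 2.0 / (b_u - b_l) ** 2
    T = len(samples)
    A = a_mm * np.ones((T, T))
    return mta_estimate(samples, A)
```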

55 Estimator Summary.

56 Simulations.
Gaussian simulations: $\mu_t \sim \mathcal{N}(0, \sigma_\mu^2)$; $\sigma_t^2 \sim \mathrm{Gamma}(0.9, 1.0) + 0.1$; $N_t \sim U\{2, \ldots, 100\}$; $y_{ti} \sim \mathcal{N}(\mu_t, \sigma_t^2)$.
Uniform simulations: $\mu_t \sim U(-\sqrt{3\sigma_\mu^2}, \sqrt{3\sigma_\mu^2})$; $\sigma_t^2 \sim U(0.1, 2.0)$; $N_t \sim U\{2, \ldots, 100\}$; $y_{ti} \sim U[\mu_t - \sqrt{3\sigma_t^2}, \mu_t + \sqrt{3\sigma_t^2}]$.
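
A sketch of a data generator following the Gaussian simulation column above; $\sigma_\mu^2$ is left as a parameter because its value is not shown on the slide, and the function name is mine.

```python
import numpy as np

def gaussian_simulation(T, sigma_mu2, rng=None):
    """Draw one synthetic data set following the Gaussian simulation recipe."""
    if rng is None:
        rng = np.random.default_rng(0)
    mu = rng.normal(0.0, np.sqrt(sigma_mu2), size=T)           # mu_t ~ N(0, sigma_mu^2)
    sigma2 = rng.gamma(shape=0.9, scale=1.0, size=T) + 0.1     # sigma_t^2 ~ Gamma(0.9, 1.0) + 0.1
    N = rng.integers(2, 101, size=T)                           # N_t ~ U{2, ..., 100}
    samples = [rng.normal(mu[t], np.sqrt(sigma2[t]), size=N[t]) for t in range(T)]
    return mu, samples
```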

57 5-fold randomized cross-validation.

58 Gaussian Simulation, T = 5. (Lower is better.)

59 Gaussian Simulation, T = 25.

60 Gaussian Simulation, T =

61 Uniform Simulation, T = 5.

62 Uniform Simulation, T = 25.

63 Uniform Simulation, T =

64 Scales O(T). The constant-similarity and minimax MTA weight matrices can be rewritten with the Sherman-Morrison formula: $W = \left(I + \frac{\gamma}{T} \Sigma L(a\mathbf{1}\mathbf{1}^\top)\right)^{-1} = \left(I + \frac{\gamma}{T} \Sigma (aT I - a\mathbf{1}\mathbf{1}^\top)\right)^{-1} = \left(I + \gamma a \Sigma - \frac{\gamma a}{T} \Sigma \mathbf{1}\mathbf{1}^\top\right)^{-1} = (Z - z\mathbf{1}^\top)^{-1} = Z^{-1} + \frac{Z^{-1} z \mathbf{1}^\top Z^{-1}}{1 - \mathbf{1}^\top Z^{-1} z}$, where $Z = I + \gamma a \Sigma$ is diagonal and $z = \frac{\gamma a}{T} \Sigma \mathbf{1}$. Since $Z$ is diagonal, $Z^{-1}$, $Z^{-1} z$, and $\mathbf{1}^\top Z^{-1}$ can all be computed in $O(T)$, so computing $W \bar{Y}$ is $O(T)$.
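
A sketch of the O(T) computation described above for the constant-similarity case $A = a\mathbf{1}\mathbf{1}^\top$; Z and z follow the decomposition on the slide, with the names and the gamma default being mine.

```python
import numpy as np

def mta_constant_fast(ybar, var, a, gamma=1.0):
    """O(T) constant-similarity MTA via Sherman-Morrison.
    ybar: sample averages; var: sample-mean variances sigma_t^2 / N_t; a: similarity."""
    ybar = np.asarray(ybar, dtype=float)
    var = np.asarray(var, dtype=float)
    T = ybar.size
    Zdiag = 1.0 + gamma * a * var              # Z = I + gamma * a * Sigma (diagonal)
    z = (gamma * a / T) * var                  # z = (gamma * a / T) * Sigma * 1
    Zinv_y = ybar / Zdiag                      # Z^{-1} ybar, O(T)
    Zinv_z = z / Zdiag                         # Z^{-1} z, O(T)
    return Zinv_y + Zinv_z * Zinv_y.sum() / (1.0 - Zinv_z.sum())
```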

65 Application: Class Grades. Problem: estimate the final grades $\{\mu_t\}$ of $T$ students. Given: $N$ homework grades $\{y_{ti}\}_{i=1}^{N}$ from each student.

66 Application: Class Grades. Problem: estimate the final grades $\{\mu_t\}$ of $T$ students. Given: $N$ homework grades $\{y_{ti}\}_{i=1}^{N}$ from each student. 16 classrooms, so 16 datasets. Uncurved grades normalized to be between 0 and 100. Pooled variance used for all tasks in a dataset. Final class grades include homeworks, projects, labs, quizzes, midterms, and the final exam.

67-70 Percent change in risk vs. single-task. (Results figures.)

71 Application: Product Sales. Experiment 1: how much will the $t$th customer spend on their next order? Given: the dollar amounts $\{y_{ti}\}_{i=1}^{N_t}$ that the $t$th customer spent on $N_t$ orders. $T = 477$; $y_{ti}$ ranged from \$15 to \$480; $N_t$ ranged from 2 to

72 Application: Product Sales. Experiment 2: if you bought the $t$th puzzle, how much will you spend on your next order? Given: the dollar amounts $\{y_{ti}\}_{i=1}^{N_t}$ that each of $N_t$ customers spent after buying the $t$th puzzle. $T = 77$; $y_{ti}$ ranged from \$0 to \$480; $N_t$ ranged from 8 to

73-75 Application: Product Sales. No ground truth, so use the sample means computed from all of the data as the targets $\mu_t$, and use a random half of the data to get the estimates $\bar{y}_t$. (Figures: for each customer $t$, $\mu_t$ is computed from all orders and $\bar{y}_t$ from a random half.)

76-77 Application: Product Sales. Percent change in risk vs. single-task, averaged over 1000 random splits. (Lower is better.)

78 Model Mismatch: 2008 Election. Problem: what percent of the $t$th state's vote will go to Obama and McCain on election day? Given: $N_t$ pre-election polls $\{y_{ti}\}_{i=1}^{N_t}$ from each state.

79 Model Mismatch: 2008 Election. Problem: what percent of the $t$th state's vote will go to Obama and McCain on election day? Given: $N_t$ pre-election polls $\{y_{ti}\}_{i=1}^{N_t}$ from each state. Percent change in average risk vs. single-task. (Lower is better.)

80-81 MTA Applied to Kernel Density Estimation. KDE: given that events $\{x_i\}$ happened, estimate the probability of event $z$ as $\hat{p}(z) = \frac{1}{N} \sum_{i=1}^{N} K(x_i, z)$. (Figure: kernels centered at $x_1, \ldots, x_5$, evaluated at a query point $z$.)

82 MTA Applied to Kernel Density Estimation. KDE: given that events $\{x_i\}$ happened, estimate the probability of event $z$ as $\hat{p}(z) = \frac{1}{N} \sum_{i=1}^{N} K(x_i, z)$. Equivalently, $\hat{p}(z) = \arg\min_{\hat{y}(z)} \sum_{i=1}^{N} (K(x_i, z) - \hat{y}(z))^2$.

83 MTA Applied to Kernel Density Estimation. Use MTA to form a multi-task KDE: $\arg\min_{\{\hat{y}_t(z_t)\}_{t=1}^{T}} \sum_{t=1}^{T} \sum_{i=1}^{N_t} (K_t(x_{ti}, z_t) - \hat{y}_t(z_t))^2 + \sum_{r=1}^{T} \sum_{s=1}^{T} A_{rs} (\hat{y}_r(z_r) - \hat{y}_s(z_s))^2$.
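
A hedged sketch of one way to compute the multi-task KDE values from the displayed objective: for a symmetric $A$, setting the gradient to zero gives a T-by-T linear system in the $\hat{y}_t(z_t)$. The Gaussian kernel and all names here are my assumptions, not necessarily the talk's implementation.

```python
import numpy as np

def gaussian_kernel(x, z, bandwidth=1.0):
    """Isotropic Gaussian kernel between rows of x (shape (n, d)) and a query z (shape (d,))."""
    sq = np.sum((x - z) ** 2, axis=1)
    d = x.shape[1]
    return np.exp(-0.5 * sq / bandwidth ** 2) / ((2.0 * np.pi * bandwidth ** 2) ** (d / 2.0))

def multitask_kde(task_data, queries, A, bandwidth=1.0):
    """Jointly estimate the T density values y_t(z_t) by solving the first-order
    conditions of the multi-task KDE objective (symmetric A assumed):
    (diag(N) + 2 L) y = diag(N) kbar, where kbar_t is the single-task KDE value."""
    T = len(task_data)
    kbar = np.array([gaussian_kernel(np.asarray(task_data[t]), np.asarray(queries[t]),
                                     bandwidth).mean() for t in range(T)])
    N = np.array([len(task_data[t]) for t in range(T)], dtype=float)
    L = np.diag(A.sum(axis=1)) - A                 # graph Laplacian of A
    return np.linalg.solve(np.diag(N) + 2.0 * L, N * kbar)
```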

84 MT-KDE for Terrorism Risk Assessment. Problem: estimate the probability of terrorist events at 40,000 locations in Jerusalem, with each location $z, x_i \in \mathbb{R}^{74}$. $T = 7$ terrorist groups. Task similarity matrix $A$ from terrorism expert Mohammed Hafez.

85 MT-KDE for Terrorism Risk Assessment. (Table: mean reciprocal rank of a left-out event, for suicides (T = 17) and bombings (T = 11), comparing single task, James-Stein, MTA constant, MTA minimax, and expert similarity.)

86 Last Slide. MTA is an intuitive, simple, accurate approach to estimating multiple means jointly. When can you estimate multiple means at once? Can you estimate the task similarities better? Learn more: see our 2012 NIPS paper, or email me for the journal paper.

87 Linear estimators: $\hat{\mu} = W \bar{y}$ with right-stochastic $W$. MTA: $\hat{\mu} = W \bar{y}$ with $W = \left(I + \frac{\gamma}{T} \Sigma L\right)^{-1}$, $\Sigma$ diagonal with $\Sigma_{tt} \ge 0$, $A_{rs} \ge 0$, $\gamma \ge 0$. James-Stein (and more): $\hat{\mu}_t = \alpha \bar{y}_t + (1 - \alpha) \sum_{r=1}^{T} \beta_r \bar{y}_r$ with $0 \le \alpha \le 1$, $\sum_{r=1}^{T} \beta_r = 1$, and $\beta_r \ge 0$ for all $r$.

88 Bayesian Analysis: IGMRFs. Recall that $\frac{1}{2} \sum_{r=1}^{T} \sum_{s=1}^{T} A_{rs} (\hat{y}_r - \hat{y}_s)^2 = \hat{y}^\top L_S \hat{y}$. The above regularizer can be thought of as coming from an intrinsic (improper) GMRF prior (Rue and Held, '05): $p(\hat{y}) = (2\pi)^{-T/2} |L_S|^{1/2} \exp\left(-\frac{1}{2} \hat{y}^\top L_S \hat{y}\right)$, where $L = D - A$. Usually used for graphical models when $L_S$ is sparse.

89 Bayesian Analysis. Assuming the differences are independent: $p(\hat{y}) \propto \prod_{r=1}^{T} \prod_{s=1}^{T} e^{-A_{rs} (\hat{y}_r - \hat{y}_s)^2}$. For $T = 3$: $\hat{Y}_1 - \hat{Y}_2 \sim \mathcal{N}(0, 1/(2A_{12}))$, $\hat{Y}_2 - \hat{Y}_3 \sim \mathcal{N}(0, 1/(2A_{23}))$, $\hat{Y}_1 - \hat{Y}_3 \sim \mathcal{N}(0, 1/(2A_{13}))$. But $\hat{Y}_1 - \hat{Y}_3 = (\hat{Y}_1 - \hat{Y}_2) + (\hat{Y}_2 - \hat{Y}_3)$ implies $\frac{1}{A_{13}} = \frac{1}{A_{12}} + \frac{1}{A_{23}}$; $\hat{Y}_1 - \hat{Y}_2 = (\hat{Y}_1 - \hat{Y}_3) + (\hat{Y}_3 - \hat{Y}_2)$ implies $\frac{1}{A_{12}} = \frac{1}{A_{13}} + \frac{1}{A_{32}}$; $\hat{Y}_2 - \hat{Y}_3 = (\hat{Y}_2 - \hat{Y}_1) + (\hat{Y}_1 - \hat{Y}_3)$ implies $\frac{1}{A_{23}} = \frac{1}{A_{21}} + \frac{1}{A_{13}}$. Impossible to satisfy all of these with any finite $A$!

90 Related Multi-Task Regularizers.
- $\sum_{r=1}^{T} \|\theta_r - \frac{1}{T} \sum_{s=1}^{T} \theta_s\|_2^2$: distance to the mean (Evgeniou and Pontil, 2004).
- $\|\Theta\|_*$: trace norm (Abernethy et al., 2009).
- $\mathrm{tr}(\Theta^\top D^{-1} \Theta)$: learned, shared feature covariance matrix (Argyriou et al., 2008).
- $\mathrm{tr}(\Theta \Omega^{-1} \Theta^\top)$: learned task covariance matrix (Jacob et al., 2008; Zhang and Yeung, 2010).
- $\sum_{r=1}^{T} \sum_{s=1}^{T} A_{rs} \|\theta_r - \theta_s\|_2^2$: pairwise distance regularizer (Sheldon, 2008) or constraint (Kato et al., 2007).
Here $\theta_r$ denotes task $r$'s parameter vector and $\Theta$ stacks the $\theta_r$.

91 Gaussian Simulation, T = 2. (Lower is better.)

92 Uniform Simulation, T = 2. (Lower is better.)

93 Pairwise T = 5 Results. (Lower is better.)

94 Oracle T = 5 Results. (Lower is better.)

95 Stein's Unbiased Risk Estimate. The true optimal $A_{12}$ depends on the unknown $\mu_t$; we plugged in $\bar{y}_t$ to get $\hat{A}_{12} = \frac{2}{(\bar{y}_1 - \bar{y}_2)^2}$. Another approach: minimize Stein's unbiased risk estimate (SURE), an empirical proxy $Q$ such that $\mathbb{E}[Q] = \mathrm{risk}$. Result: $\hat{A}_{12}^{SURE} = \left(\frac{2}{(\bar{y}_1 - \bar{y}_2)^2 - \frac{\hat{\sigma}_1^2}{N_1} - \frac{\hat{\sigma}_2^2}{N_2}}\right)^+$.

96 SURE T = 2 Experiments. (Lower is better.)

97 Alternative Formulation. MTA is $\left(I + \frac{\gamma}{T} \Sigma L\right)^{-1} \bar{Y}$, with optimal similarity $A_{12}^* = \frac{2}{(\mu_1 - \mu_2)^2}$. An MTA variant is $\Sigma^{1/2} \left(I + \frac{\gamma}{T} L\right)^{-1} \Sigma^{-1/2} \bar{Y}$, with optimal similarity $A_{12}^* = \frac{2}{\left(\frac{\mu_1}{\sigma_1} - \frac{\mu_2}{\sigma_2}\right)^2}$. These encode different notions of distance between tasks. What if $\mu_1 = 2$, $\sigma_1 = 1$, $\mu_2 = 4$, $\sigma_2 = 2$?

98 Alternative Formulation T = 2 Results. (Lower is better.)

99 MTA Closed-Form Solution. $\hat{\mu}^{MTA} = \left(I + \frac{\gamma}{T} \Sigma L\right)^{-1} \bar{y}$: a more general form of the regularized Laplacian kernel (Smola and Kondor, 2003).
