Multi Task Averaging: Theory and Practice

1 Multi Task Averaging: Theory and Practice. Maya R. Gupta (Google Research, Univ. Washington), Sergey Feldman (Univ. Washington), Bela Frigyik (Univ. Pecs)

2 Aristotle. The idea of a mean is old: "By the mean of a thing I denote a point equally distant from either extreme..." (Aristotle). (Figure: a midpoint $v$ between $y_{\min}$ and $y_{\max}$, with $v = \frac{y_{\min} + y_{\max}}{2}$.)

3 Tycho Brahe (16th century) averaged to reduce measurement error: $\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$

4 Legendre (1805) noted that the mean minimizes squared error: $\bar{y} = \arg\min_{\hat{\mu}} \sum_{i=1}^{N} (y_i - \hat{\mu})^2$

5 Legendre (1805) noted that the mean minimizes squared error: $\bar{y} = \arg\min_{\hat{\mu}} \sum_{i=1}^{N} (y_i - \hat{\mu})^2$. (Banerjee et al. 2005: the mean minimizes any Bregman divergence, $\bar{y} = \arg\min_{\hat{\mu}} \sum_{i=1}^{N} d_{\phi}(y_i, \hat{\mu})$. Frigyik et al. 2008: the mean minimizes any functional Bregman divergence.)

6 Gauss (1809). The average was central to Gauss's construction of the normal distribution. His goals: a smooth distribution whose likelihood peak is at the sample mean.

7 Fisher (1922): "...no other statistic which can be calculated from the same sample provides any additional information as to the value of the parameter..." (R. A. Fisher)

8 Stein's Paradox (1956). Total squared error can be reduced by estimating each of the means of T Gaussian random variables using data sampled from all of them, even if the random variables are independent and have different means. (Figure: means $\mu_1, \mu_2, \ldots, \mu_T$ for tasks $t = 1, 2, \ldots, T$.)

9 Stein Estimation: One-Sample Case. Problem: estimate the means $\{\mu_t\}$ of T Gaussian random variables. Given: one random sample $Y_t \sim \mathcal{N}(\mu_t, \sigma^2)$ for $t = 1, \ldots, T$.

10 Stein Estimation: One-Sample Case. Problem: estimate the means $\{\mu_t\}$ of T Gaussian random variables. Given: one random sample $Y_t \sim \mathcal{N}(\mu_t, \sigma^2)$ for $t = 1, \ldots, T$. Maximum likelihood estimate: $\hat{\mu}_t = Y_t$.

11 Stein Estimation: One-Sample Case. Problem: estimate the means $\{\mu_t\}$ of T Gaussian random variables. Given: one random sample $Y_t \sim \mathcal{N}(\mu_t, \sigma^2)$ for $t = 1, \ldots, T$. Maximum likelihood estimate: $\hat{\mu}_t = Y_t$. James-Stein estimate: $\hat{\mu}_t^{JS} = \left(1 - \frac{(T-2)\,\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right) Y_t$.
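
To make the shrinkage concrete, here is a minimal NumPy sketch of the one-sample James-Stein estimate shown above; the function and variable names are mine, not from the talk.

```python
import numpy as np

def james_stein_one_sample(y, sigma2):
    """One-sample James-Stein estimate: shrink the T observations y
    (one draw per task, common known variance sigma2) toward zero."""
    y = np.asarray(y, dtype=float)
    T = y.size
    shrink = 1.0 - (T - 2) * sigma2 / np.sum(y ** 2)
    return shrink * y

# Example: T = 5 tasks, one noisy observation each.
print(james_stein_one_sample([1.2, -0.3, 0.7, 2.1, -1.5], sigma2=1.0))
```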

12 James-Stein Estimator Derivation: Efron and Morris (1972) Empirical Bayes Argument. Key assumptions: $\mu_t \sim \mathcal{N}(0, \tau^2)$ with $\tau^2$ unknown; $Y_t \sim \mathcal{N}(\mu_t, \sigma^2)$ with $\sigma^2$ known. James-Stein estimate: $\hat{\mu}_t^{JS} = \left(1 - \frac{(T-2)\,\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right) Y_t$.

13 James-Stein Estimator Derivation: Efron and Morris (1972) Empirical Bayes Argument. Key assumptions: $\mu_t \sim \mathcal{N}(0, \tau^2)$ with $\tau^2$ unknown; $Y_t \sim \mathcal{N}(\mu_t, \sigma^2)$ with $\sigma^2$ known. Posterior mean: $\mathbb{E}[\mu_t \mid Y_t] = \left(1 - \frac{\sigma^2}{\tau^2 + \sigma^2}\right) Y_t$. James-Stein estimate: $\hat{\mu}_t^{JS} = \left(1 - \frac{(T-2)\,\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right) Y_t$.

14 James-Stein Estimator Derivation: Efron and Morris (1972) Empirical Bayes Argument. Key assumptions: $\mu_t \sim \mathcal{N}(0, \tau^2)$ with $\tau^2$ unknown; $Y_t \sim \mathcal{N}(\mu_t, \sigma^2)$ with $\sigma^2$ known. Posterior mean: $\mathbb{E}[\mu_t \mid Y_t] = \left(1 - \frac{\sigma^2}{\tau^2 + \sigma^2}\right) Y_t$. Key fact: $\mathbb{E}\!\left[\frac{(T-2)\,\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right] = \frac{\sigma^2}{\tau^2 + \sigma^2}$. James-Stein estimate: $\hat{\mu}_t^{JS} = \left(1 - \frac{(T-2)\,\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right) Y_t$.
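
Putting the two displayed facts together (a small worked step, spelled out here for readability): replacing the unknown shrinkage factor in the posterior mean by its unbiased estimate gives the James-Stein estimate.

```latex
\mathbb{E}[\mu_t \mid Y_t]
  = \left(1 - \frac{\sigma^2}{\tau^2 + \sigma^2}\right) Y_t
\quad\text{and}\quad
\mathbb{E}\!\left[\frac{(T-2)\,\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right]
  = \frac{\sigma^2}{\tau^2 + \sigma^2}
\;\;\Longrightarrow\;\;
\hat{\mu}_t^{JS}
  = \left(1 - \frac{(T-2)\,\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right) Y_t .
```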

15 A More General JSE (Bock, 1972). Assumptions: $\mu_t \sim \mathcal{N}(\xi, \tau^2)$ with $\tau^2$ and $\xi$ unknown; $Y_{ti} \sim \mathcal{N}(\mu_t, \sigma_t^2)$, $i = 1, \ldots, N_t$, with $\sigma_t^2$ unknown. General James-Stein estimate: $\hat{\mu}_t^{JS} = \hat{\xi} + \left(1 - \frac{T-3}{(\bar{Y} - \hat{\xi})^\top \Sigma^{-1} (\bar{Y} - \hat{\xi})}\right)^+ (\bar{Y}_t - \hat{\xi})$, where $\hat{\xi} = \frac{1}{T}\sum_{r=1}^{T} \bar{Y}_r$; the positive part is $(x)^+ = \max(x, 0)$; $\Sigma$ is diagonal with $\Sigma_{tt} = \hat{\sigma}_t^2 / N_t$; and $\bar{Y}$ is a $T$-length vector with $t$th entry $\bar{Y}_t$.
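
A hedged NumPy sketch of Bock's positive-part estimator above, plugging in the sample quantities exactly as defined in the bullets; the names are mine.

```python
import numpy as np

def general_james_stein(samples):
    """Positive-part general James-Stein estimate from a list of 1-D sample
    arrays, one array per task, following the formula on this slide."""
    arrs = [np.asarray(s, dtype=float) for s in samples]
    ybar = np.array([s.mean() for s in arrs])                 # per-task sample means
    var = np.array([s.var(ddof=1) / len(s) for s in arrs])    # Sigma_tt = sigma_t^2 / N_t
    T = len(arrs)
    xi = ybar.mean()                                          # xi_hat = average of the sample means
    d = ybar - xi
    mahal = np.sum(d ** 2 / var)                              # (Ybar - xi)^T Sigma^{-1} (Ybar - xi)
    shrink = max(1.0 - (T - 3) / mahal, 0.0)                  # positive part
    return xi + shrink * d
```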

16 James-Stein Dominates. James and Stein's theorem (1961): for $T > 3$ the general JSE dominates the sample average (MLE): $\mathbb{E}[\|\mu - \hat{\mu}^{JS}\|_2^2] \le \mathbb{E}[\|\mu - \bar{Y}\|_2^2]$ for every choice of $\mu$. This is also written as $R(\hat{\mu}^{JS}) \le R(\bar{Y})$.

17 James-Stein Estimation $\leftrightarrow$ Empirical Bayes. Multi-task Averaging (MTA) $\leftrightarrow$ Empirical Loss Minimization with Regularization (Tikhonov Regularization; "Empirical Vapnik").

18 Multi-Task Averaging (Feldman et al.). Problem: estimate the means $\{\mu_t\}$ of $T$ random variables. Data model: $Y_{ti}$ drawn IID from $\nu_t$ with finite mean $\mu_t$. Given: $N_t$ IID samples $\{y_{ti}\}_{i=1}^{N_t}$ from each random variable. (Figure: tasks $t = 1, 2, \ldots, T$ with means $\mu_1, \mu_2, \ldots, \mu_T$.)

19 Building the MTA Objective. "Single-task" averaging: $\bar{y}_t = \arg\min_{\hat{\mu}_t} \sum_{i=1}^{N_t} (y_{ti} - \hat{\mu}_t)^2$

20 Building the MTA Objective. "Single-task" averaging, summed across tasks: $\{\bar{y}_t\}_{t=1}^{T} = \arg\min_{\{\hat{\mu}_t\}_{t=1}^{T}} \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N_t} (y_{ti} - \hat{\mu}_t)^2$

21 Building the MTA Objective. "Single-task" averaging with a Mahalanobis distance: $\{\bar{y}_t\}_{t=1}^{T} = \arg\min_{\{\hat{\mu}_t\}_{t=1}^{T}} \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N_t} \frac{(y_{ti} - \hat{\mu}_t)^2}{\sigma_t^2}$

22 The MTA Objective. "Multi-task" averaging: $\{\hat{\mu}_t^{MTA}\}_{t=1}^{T} = \arg\min_{\{\hat{\mu}_t\}_{t=1}^{T}} \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N_t} \frac{(y_{ti} - \hat{\mu}_t)^2}{\sigma_t^2} + \frac{\gamma}{T^2} \sum_{r=1}^{T} \sum_{s=1}^{T} A_{rs} (\hat{\mu}_r - \hat{\mu}_s)^2$. The first term is the Mahalanobis distance to the samples from task $t$; $A_{rs}$ is the similarity between tasks $r$ and $s$.

23 The MTA Objective (repeated). Task 1: estimate the average movie ticket price. Task 2: estimate the mean age of kids at summer camp.

24 The MTA Objective (repeated). Task 1: estimate the average movie ticket price. Task 2: estimate the price of tea in China?

25-30 The MTA Objective (animation). Figures: the sample averages $\bar{y}_1$ ($t = 1$) and $\bar{y}_2$ ($t = 2$) are pulled toward each other by the regularizer, giving the MTA estimates $\hat{\mu}_1^{MTA}$ and $\hat{\mu}_2^{MTA}$.

31 The MTA Objective. $\{\hat{\mu}_t^{MTA}\}_{t=1}^{T} = \arg\min_{\{\hat{\mu}_t\}_{t=1}^{T}} \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N_t} \frac{(y_{ti} - \hat{\mu}_t)^2}{\sigma_t^2} + \frac{\gamma}{T^2} \sum_{r=1}^{T} \sum_{s=1}^{T} A_{rs} (\hat{\mu}_r - \hat{\mu}_s)^2$. The empirical loss lowers bias; the regularizer lowers estimation variance.

32 MTA Closed-Form Solution. For non-negative $A$: $\hat{\mu}^{MTA} = \left(I + \frac{\gamma}{T} \Sigma L\right)^{-1} \bar{y}$, where $\hat{\mu}^{MTA}$ is the vector of $T$ MTA solutions, $\Sigma$ is the diagonal matrix of sample-mean variances $\sigma_t^2 / N_t$, $L$ is the graph Laplacian of $A + A^\top$, and $\bar{y}$ is the vector of $T$ sample averages.

33 MTA Closed-Form Solution. $\hat{\mu}^{MTA} = \left(I + \frac{\gamma}{T} \Sigma L\right)^{-1} \bar{y}$. Lemma: this inverse always exists if $A_{rs} \ge 0$, $\gamma \ge 0$, and $N_t \ge 1$.

34 MTA Closed-Form Solution. $\hat{\mu}^{MTA} = \left(I + \frac{\gamma}{T} \Sigma L\right)^{-1} \bar{y} = W \bar{y}$: the MTA estimates are a linear combination of the sample averages, with a $T \times T$ weight matrix $W$ applied to the $T$ sample averages.

35 MTA Closed-Form Solution. Theorem: the MTA estimates are a convex combination of the sample averages; $W$ is a right-stochastic matrix.
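
A minimal NumPy sketch of the closed form above, assuming a symmetric, nonnegative similarity matrix $A$ is already given; the function names and the gamma default are mine.

```python
import numpy as np

def mta_estimate(samples, A, gamma=1.0):
    """MTA closed form: mu_hat = (I + (gamma / T) * Sigma * L)^{-1} ybar."""
    arrs = [np.asarray(s, dtype=float) for s in samples]
    T = len(arrs)
    ybar = np.array([s.mean() for s in arrs])                   # sample averages
    Sigma = np.diag([s.var(ddof=1) / len(s) for s in arrs])     # sample-mean variances sigma_t^2 / N_t
    L = np.diag(A.sum(axis=1)) - A                              # graph Laplacian of (symmetric) A
    W = np.linalg.inv(np.eye(T) + (gamma / T) * Sigma @ L)
    return W @ ybar
```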

36 When is MTA Better than the Sample Means? Two tasks: $Y_{1i} = \mu_1 + \epsilon_1$ with $N_1$ samples and variance $\sigma_1^2$; $Y_{2i} = \mu_2 + \epsilon_2$ with $N_2$ samples and variance $\sigma_2^2$. Task 1: estimate the average movie ticket price. Task 2: estimate the mean age of kids at summer camp.

37 When is MTA Better than the Sample Means? The MTA estimate $\left(I + \frac{\gamma}{T} \Sigma L\right)^{-1} \bar{Y}$ written out for task 1: $\hat{\mu}_1^{MTA} = \left(\frac{T + \frac{\sigma_2^2}{N_2} A_{12}}{T + \frac{\sigma_1^2}{N_1} A_{12} + \frac{\sigma_2^2}{N_2} A_{12}}\right) \bar{Y}_1 + \left(\frac{\frac{\sigma_1^2}{N_1} A_{12}}{T + \frac{\sigma_1^2}{N_1} A_{12} + \frac{\sigma_2^2}{N_2} A_{12}}\right) \bar{Y}_2$.
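
For concreteness, a tiny sketch that evaluates the two displayed weights for task 1; all inputs are assumed known here, and var1, var2 denote the sample-mean variances $\sigma_t^2 / N_t$.

```python
def mta_two_task(ybar1, ybar2, var1, var2, A12, T=2):
    """Two-task MTA estimate of mu_1 via the convex weights on this slide."""
    denom = T + var1 * A12 + var2 * A12
    w11 = (T + var2 * A12) / denom   # weight on task 1's own average
    w12 = var1 * A12 / denom         # weight borrowed from task 2 (w11 + w12 = 1)
    return w11 * ybar1 + w12 * ybar2

# Example: two similar tasks; the noisier task 1 borrows more from task 2.
print(mta_two_task(ybar1=10.0, ybar2=12.0, var1=0.5, var2=0.25, A12=0.5))
```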

38 When is MTA Better than the Sample Means? MTA is biased, but has smaller error variance than the sample averages: $\mathrm{Risk}[\hat{\mu}_1^{MTA}] < \mathrm{Risk}[\bar{Y}_1]$ if $(\mu_1 - \mu_2)^2 < \frac{4}{A_{12}} + \frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2}$.

39 Equivalently: $\mathrm{Risk}[\hat{\mu}^{MTA}] < \mathrm{Risk}[\bar{Y}]$ if $(\mu_1 - \mu_2)^2 - \frac{\sigma_1^2}{N_1} - \frac{\sigma_2^2}{N_2} < \frac{4}{A_{12}}$.

40 Optimal A for T = 2. Example: two tasks with means $\mu_1$, $\mu_2$ and observations $Y_1$, $Y_2$. Answer: the optimal task similarity in terms of MSE (with $\gamma = 1$) is $A_{12}^* = \frac{2}{(\mu_1 - \mu_2)^2}$.

41 Estimated Optimal A for T = 2. The optimal task similarity in terms of MSE is $A_{12}^* = \frac{2}{(\mu_1 - \mu_2)^2}$; estimated: $\hat{A}_{12} = \frac{2}{(\bar{y}_1 - \bar{y}_2)^2}$.

42 (Figure: risk as a function of the similarity $A_{12}$ for $\sigma_1^2 = 1$, $\sigma_2^2 = 1$, $\mu_1 = 0$, $\mu_2 = 1$; the optimal similarity is $A_{12}^* = \frac{2}{(\mu_1 - \mu_2)^2}$.)

43 Optimal A for T > 2. Bad news: no simple analytical minimization of the risk for $T > 2$.

44 Optimal A for T > 2. Bad news: no simple analytical minimization of the risk for $T > 2$. One solution: use the pairwise estimate to populate $A$: $\hat{A}_{rs} = \frac{2}{(\bar{y}_r - \bar{y}_s)^2}$. (Figure: a matrix with entries $\hat{A}_{12}, \hat{A}_{13}, \hat{A}_{21}, \hat{A}_{23}, \hat{A}_{31}, \hat{A}_{32}$.)

45 Optimal A for T > 2. Bad news: no simple analytical minimization of the risk for $T > 2$. One solution: use the pairwise estimate to populate $A$. Better solution: constrain $A = a \mathbf{1}\mathbf{1}^\top$ and optimize over $a$. (Figure: a matrix with every entry equal to $a$.)

46 Optimal A for T > 2. Better solution: constrain $A = a \mathbf{1}\mathbf{1}^\top$ and optimize over $a$. Analyzable: the optimal similarity is $a^* = \frac{2}{\frac{1}{T(T-1)} \sum_{r=1}^{T} \sum_{s=1}^{T} (\mu_r - \mu_s)^2}$, i.e., 2 divided by the average squared distance between the $T$ means.

47 Optimal A for T > 2. In practice, we estimate $a$: $\hat{a} = \frac{2}{\frac{1}{T(T-1)} \sum_{r=1}^{T} \sum_{s=1}^{T} (\bar{y}_r - \bar{y}_s)^2}$.
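
A hedged sketch of the constant-similarity recipe above: estimate $\hat{a}$ from the pairwise differences of the sample averages and plug $A = \hat{a}\mathbf{1}\mathbf{1}^\top$ into the mta_estimate sketch given earlier.

```python
import numpy as np

def constant_similarity(samples):
    """a_hat = 2 / (average squared distance between the T sample averages)."""
    ybar = np.array([np.mean(s) for s in samples])
    T = ybar.size
    diffs = ybar[:, None] - ybar[None, :]
    avg_sq_dist = np.sum(diffs ** 2) / (T * (T - 1))   # average over ordered pairs r != s
    return 2.0 / avg_sq_dist

def mta_constant(samples):
    """Constant-similarity MTA: A = a_hat * 1 1^T, then the MTA closed form."""
    T = len(samples)
    A = constant_similarity(samples) * np.ones((T, T))
    return mta_estimate(samples, A)   # closed-form sketch from the earlier slide
```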

48 How to Set the Similarity Matrix A? Choose $A$ to minimize expected total squared error, or choose $A$ to minimize worst-case total squared error.

49-50 Minimax A for T = 2. To find the minimax $A$, we need: 1. A constraint set for $\{\mu_t\}$: $\mu_t \in [b_l, b_u]$. (Figure: the square with corners $(b_l, b_l)$, $(b_l, b_u)$, $(b_u, b_l)$, $(b_u, b_u)$.)

51 Minimax A for T = 2. To find the minimax $A$, we need: 1. A constraint set for $\{\mu_t\}$: $\mu_t \in [b_l, b_u]$. 2. A least favorable prior (LFP): $p(\mu_1, \mu_2) = \frac{1}{2}$ if $(\mu_1, \mu_2) = (b_l, b_u)$; $\frac{1}{2}$ if $(\mu_1, \mu_2) = (b_u, b_l)$; $0$ otherwise.

52 Minimax A for T = 2. To find the minimax $A$, we need: 1. A constraint set for $\{\mu_t\}$: $\mu_t \in [b_l, b_u]$. 2. A least favorable prior (LFP). 3. A Bayes-optimal estimator with constant risk w.r.t. the LFP: MTA with similarity $A_{12}^{mm} = \frac{2}{(b_u - b_l)^2}$.

53 Minimax A for T = 2. To find the minimax $A$, we need: 1. A constraint set for $\{\mu_t\}$: $\mu_t \in [b_l, b_u]$. 2. A least favorable prior (LFP). 3. A Bayes-optimal estimator with constant risk w.r.t. the LFP: MTA with similarity $A_{12}^{mm} = \frac{2}{(b_u - b_l)^2}$. In practice, estimate the bounds from the sample averages: $\hat{b}_l = \min_t \bar{y}_t$, $\hat{b}_u = \max_t \bar{y}_t$.

54 Minimax A for T > 2. Same recipe: 1. A constraint set for $\{\mu_t\}$: $\mu_t \in [b_l, b_u]$. 2. A least favorable prior (LFP). 3. A Bayes-optimal estimator with constant risk w.r.t. the LFP: MTA with constant similarity $a^{mm} = \frac{2}{(b_u - b_l)^2}$, using $\hat{b}_l = \min_t \bar{y}_t$ and $\hat{b}_u = \max_t \bar{y}_t$.
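
A minimal sketch of step 3 with the plug-in bounds described above; it reuses the mta_estimate sketch from the closed-form slides.

```python
import numpy as np

def mta_minimax(samples):
    """Minimax MTA: constant similarity a_mm = 2 / (b_u - b_l)^2, with the
    range estimated from the sample averages (b_l_hat = min, b_u_hat = max)."""
    ybar = np.array([np.mean(s) for s in samples])
    b_l, b_u = ybar.min(), ybar.max()
    a_mm = 2.0 / (b_u - b_l) ** 2
    T = len(samples)
    A = a_mm * np.ones((T, T))
    return mta_estimate(samples, A)
```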

55 Estimator Summary.

56 Simulations.
Gaussian simulations: $\mu_t \sim \mathcal{N}(0, \sigma_\mu^2)$; $\sigma_t^2 \sim \mathrm{Gamma}(0.9, 1.0) + 0.1$; $N_t \sim U\{2, \ldots, 100\}$; $y_{ti} \sim \mathcal{N}(\mu_t, \sigma_t^2)$.
Uniform simulations: $\mu_t \sim U(-\sqrt{3\sigma_\mu^2}, \sqrt{3\sigma_\mu^2})$; $\sigma_t^2 \sim U(0.1, 2.0)$; $N_t \sim U\{2, \ldots, 100\}$; $y_{ti} \sim U[\mu_t - \sqrt{3\sigma_t^2}, \mu_t + \sqrt{3\sigma_t^2}]$.
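
A sketch of a data generator following the Gaussian simulation column above; $\sigma_\mu^2$ is left as a parameter because its value is not shown on the slide, and the function name is mine.

```python
import numpy as np

def gaussian_simulation(T, sigma_mu2, rng=None):
    """Draw one synthetic data set following the Gaussian simulation recipe."""
    if rng is None:
        rng = np.random.default_rng(0)
    mu = rng.normal(0.0, np.sqrt(sigma_mu2), size=T)           # mu_t ~ N(0, sigma_mu^2)
    sigma2 = rng.gamma(shape=0.9, scale=1.0, size=T) + 0.1     # sigma_t^2 ~ Gamma(0.9, 1.0) + 0.1
    N = rng.integers(2, 101, size=T)                           # N_t ~ U{2, ..., 100}
    samples = [rng.normal(mu[t], np.sqrt(sigma2[t]), size=N[t]) for t in range(T)]
    return mu, samples
```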

57 5-fold randomized cross-validation.

58 Gaussian Simulation, T = 5. (Lower is better.)

59 Gaussian Simulation, T = 25.

60 Gaussian Simulation, T =

61 Uniform Simulation, T = 5.

62 Uniform Simulation, T = 25.

63 Uniform Simulation, T =

64 Scales O(T). The constant-similarity and minimax MTA weight matrices can be rewritten with the Sherman-Morrison formula: $W = \left(I + \frac{\gamma}{T} \Sigma L(a\mathbf{1}\mathbf{1}^\top)\right)^{-1} = \left(I + \frac{\gamma}{T} \Sigma (aT I - a\mathbf{1}\mathbf{1}^\top)\right)^{-1} = \left(I + \gamma a \Sigma - \frac{\gamma a}{T} \Sigma \mathbf{1}\mathbf{1}^\top\right)^{-1} = (Z - z\mathbf{1}^\top)^{-1} = Z^{-1} + \frac{Z^{-1} z \mathbf{1}^\top Z^{-1}}{1 - \mathbf{1}^\top Z^{-1} z}$, where $Z = I + \gamma a \Sigma$ is diagonal and $z = \frac{\gamma a}{T} \Sigma \mathbf{1}$. Since $Z$ is diagonal, $Z^{-1}$, $Z^{-1} z$, and $\mathbf{1}^\top Z^{-1}$ can all be computed in $O(T)$, so computing $W \bar{Y}$ is $O(T)$.
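
A sketch of the O(T) computation described above for the constant-similarity case $A = a\mathbf{1}\mathbf{1}^\top$; Z and z follow the decomposition on the slide, with the names and the gamma default being mine.

```python
import numpy as np

def mta_constant_fast(ybar, var, a, gamma=1.0):
    """O(T) constant-similarity MTA via Sherman-Morrison.
    ybar: sample averages; var: sample-mean variances sigma_t^2 / N_t; a: similarity."""
    ybar = np.asarray(ybar, dtype=float)
    var = np.asarray(var, dtype=float)
    T = ybar.size
    Zdiag = 1.0 + gamma * a * var              # Z = I + gamma * a * Sigma (diagonal)
    z = (gamma * a / T) * var                  # z = (gamma * a / T) * Sigma * 1
    Zinv_y = ybar / Zdiag                      # Z^{-1} ybar, O(T)
    Zinv_z = z / Zdiag                         # Z^{-1} z, O(T)
    return Zinv_y + Zinv_z * Zinv_y.sum() / (1.0 - Zinv_z.sum())
```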

65 Application: Class Grades. Problem: estimate the final grades $\{\mu_t\}$ of $T$ students. Given: $N$ homework grades $\{y_{ti}\}_{i=1}^{N}$ from each student.

66 Application: Class Grades. Problem: estimate the final grades $\{\mu_t\}$ of $T$ students. Given: $N$ homework grades $\{y_{ti}\}_{i=1}^{N}$ from each student. 16 classrooms, so 16 datasets. Uncurved grades normalized to be between 0 and 100. Pooled variance used for all tasks in a dataset. Final class grades include homeworks, projects, labs, quizzes, midterms, and the final exam.

67-70 Percent change in risk vs. single-task. (Results figures.)

71 Application: Product Sales. Experiment 1: how much will the $t$th customer spend on their next order? Given: the dollar amounts $\{y_{ti}\}_{i=1}^{N_t}$ that the $t$th customer spent on $N_t$ orders. $T = 477$; $y_{ti}$ ranged from \$15 to \$480; $N_t$ ranged from 2 to

72 Application: Product Sales. Experiment 2: if you bought the $t$th puzzle, how much will you spend on your next order? Given: the dollar amounts $\{y_{ti}\}_{i=1}^{N_t}$ that each of $N_t$ customers spent after buying the $t$th puzzle. $T = 77$; $y_{ti}$ ranged from \$0 to \$480; $N_t$ ranged from 8 to

73-75 Application: Product Sales. No ground truth, so use the sample means computed from all of the data as the targets $\mu_t$, and use a random half of the data to get the estimates $\bar{y}_t$. (Figures: for each customer $t$, $\mu_t$ is computed from all orders and $\bar{y}_t$ from a random half.)

76-77 Application: Product Sales. Percent change in risk vs. single-task, averaged over 1000 random splits. (Lower is better.)

78 Model Mismatch: 2008 Election. Problem: what percent of the $t$th state's vote will go to Obama and McCain on election day? Given: $N_t$ pre-election polls $\{y_{ti}\}_{i=1}^{N_t}$ from each state.

79 Model Mismatch: 2008 Election. Problem: what percent of the $t$th state's vote will go to Obama and McCain on election day? Given: $N_t$ pre-election polls $\{y_{ti}\}_{i=1}^{N_t}$ from each state. Percent change in average risk vs. single-task. (Lower is better.)

80-81 MTA Applied to Kernel Density Estimation. KDE: given that events $\{x_i\}$ happened, estimate the probability of event $z$ as $\hat{p}(z) = \frac{1}{N} \sum_{i=1}^{N} K(x_i, z)$. (Figure: kernels centered at $x_1, \ldots, x_5$, evaluated at a query point $z$.)

82 MTA Applied to Kernel Density Estimation. KDE: given that events $\{x_i\}$ happened, estimate the probability of event $z$ as $\hat{p}(z) = \frac{1}{N} \sum_{i=1}^{N} K(x_i, z)$. Equivalently, $\hat{p}(z) = \arg\min_{\hat{y}(z)} \sum_{i=1}^{N} (K(x_i, z) - \hat{y}(z))^2$.

83 MTA Applied to Kernel Density Estimation. Use MTA to form a multi-task KDE: $\arg\min_{\{\hat{y}_t(z_t)\}_{t=1}^{T}} \sum_{t=1}^{T} \sum_{i=1}^{N_t} (K_t(x_{ti}, z_t) - \hat{y}_t(z_t))^2 + \sum_{r=1}^{T} \sum_{s=1}^{T} A_{rs} (\hat{y}_r(z_r) - \hat{y}_s(z_s))^2$.
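
A hedged sketch of one way to compute the multi-task KDE values from the displayed objective: for a symmetric $A$, setting the gradient to zero gives a T-by-T linear system in the $\hat{y}_t(z_t)$. The Gaussian kernel and all names here are my assumptions, not necessarily the talk's implementation.

```python
import numpy as np

def gaussian_kernel(x, z, bandwidth=1.0):
    """Isotropic Gaussian kernel between rows of x (shape (n, d)) and a query z (shape (d,))."""
    sq = np.sum((x - z) ** 2, axis=1)
    d = x.shape[1]
    return np.exp(-0.5 * sq / bandwidth ** 2) / ((2.0 * np.pi * bandwidth ** 2) ** (d / 2.0))

def multitask_kde(task_data, queries, A, bandwidth=1.0):
    """Jointly estimate the T density values y_t(z_t) by solving the first-order
    conditions of the multi-task KDE objective (symmetric A assumed):
    (diag(N) + 2 L) y = diag(N) kbar, where kbar_t is the single-task KDE value."""
    T = len(task_data)
    kbar = np.array([gaussian_kernel(np.asarray(task_data[t]), np.asarray(queries[t]),
                                     bandwidth).mean() for t in range(T)])
    N = np.array([len(task_data[t]) for t in range(T)], dtype=float)
    L = np.diag(A.sum(axis=1)) - A                 # graph Laplacian of A
    return np.linalg.solve(np.diag(N) + 2.0 * L, N * kbar)
```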

84 MT-KDE for Terrorism Risk Assessment. Problem: estimate the probability of terrorist events at 40,000 locations in Jerusalem, with each location $z, x_i \in \mathbb{R}^{74}$. $T = 7$ terrorist groups. Task similarity matrix $A$ from terrorism expert Mohammed Hafez.

85 MT-KDE for Terrorism Risk Assessment. (Table: mean reciprocal rank of a left-out event, for suicides (T = 17) and bombings (T = 11), comparing single task, James-Stein, MTA constant, MTA minimax, and expert similarity.)

86 Last Slide. MTA is an intuitive, simple, accurate approach to estimating multiple means jointly. When can you estimate multiple means at once? Can you estimate the task similarities better? Learn more: see our 2012 NIPS paper, or email me for the journal paper.

87 Linear estimators: $\hat{\mu} = W \bar{y}$ with right-stochastic $W$. MTA: $\hat{\mu} = W \bar{y}$ with $W = \left(I + \frac{\gamma}{T} \Sigma L\right)^{-1}$, $\Sigma$ diagonal with $\Sigma_{tt} \ge 0$, $A_{rs} \ge 0$, $\gamma \ge 0$. James-Stein (and more): $\hat{\mu}_t = \alpha \bar{y}_t + (1 - \alpha) \sum_{r=1}^{T} \beta_r \bar{y}_r$ with $0 \le \alpha \le 1$, $\sum_{r=1}^{T} \beta_r = 1$, and $\beta_r \ge 0$ for all $r$.

88 Bayesian Analysis: IGMRFs. Recall that $\frac{1}{2} \sum_{r=1}^{T} \sum_{s=1}^{T} A_{rs} (\hat{y}_r - \hat{y}_s)^2 = \hat{y}^\top L_S \hat{y}$. The above regularizer can be thought of as coming from an intrinsic (improper) GMRF prior (Rue and Held, '05): $p(\hat{y}) = (2\pi)^{-T/2} |L_S|^{1/2} \exp\left(-\frac{1}{2} \hat{y}^\top L_S \hat{y}\right)$, where $L = D - A$. Usually used for graphical models when $L_S$ is sparse.

89 Bayesian Analysis. Assuming the differences are independent: $p(\hat{y}) \propto \prod_{r=1}^{T} \prod_{s=1}^{T} e^{-A_{rs} (\hat{y}_r - \hat{y}_s)^2}$. For $T = 3$: $\hat{Y}_1 - \hat{Y}_2 \sim \mathcal{N}(0, 1/(2A_{12}))$, $\hat{Y}_2 - \hat{Y}_3 \sim \mathcal{N}(0, 1/(2A_{23}))$, $\hat{Y}_1 - \hat{Y}_3 \sim \mathcal{N}(0, 1/(2A_{13}))$. But $\hat{Y}_1 - \hat{Y}_3 = (\hat{Y}_1 - \hat{Y}_2) + (\hat{Y}_2 - \hat{Y}_3)$ implies $\frac{1}{A_{13}} = \frac{1}{A_{12}} + \frac{1}{A_{23}}$; $\hat{Y}_1 - \hat{Y}_2 = (\hat{Y}_1 - \hat{Y}_3) + (\hat{Y}_3 - \hat{Y}_2)$ implies $\frac{1}{A_{12}} = \frac{1}{A_{13}} + \frac{1}{A_{32}}$; $\hat{Y}_2 - \hat{Y}_3 = (\hat{Y}_2 - \hat{Y}_1) + (\hat{Y}_1 - \hat{Y}_3)$ implies $\frac{1}{A_{23}} = \frac{1}{A_{21}} + \frac{1}{A_{13}}$. Impossible to satisfy all of these with any finite $A$!

90 Related Multi-Task Regularizers.
- $\sum_{r=1}^{T} \|\theta_r - \frac{1}{T} \sum_{s=1}^{T} \theta_s\|_2^2$: distance to the mean (Evgeniou and Pontil, 2004).
- $\|\Theta\|_*$: trace norm (Abernethy et al., 2009).
- $\mathrm{tr}(\Theta^\top D^{-1} \Theta)$: learned, shared feature covariance matrix (Argyriou et al., 2008).
- $\mathrm{tr}(\Theta \Omega^{-1} \Theta^\top)$: learned task covariance matrix (Jacob et al., 2008; Zhang and Yeung, 2010).
- $\sum_{r=1}^{T} \sum_{s=1}^{T} A_{rs} \|\theta_r - \theta_s\|_2^2$: pairwise distance regularizer (Sheldon, 2008) or constraint (Kato et al., 2007).
Here $\theta_r$ denotes task $r$'s parameter vector and $\Theta$ stacks the $\theta_r$.

91 Gaussian Simulation, T = 2. (Lower is better.)

92 Uniform Simulation, T = 2. (Lower is better.)

93 Pairwise T = 5 Results. (Lower is better.)

94 Oracle T = 5 Results. (Lower is better.)

95 Stein's Unbiased Risk Estimate. The true optimal $A_{12}$ depends on the unknown $\mu_t$; we plugged in $\bar{y}_t$ to get $\hat{A}_{12} = \frac{2}{(\bar{y}_1 - \bar{y}_2)^2}$. Another approach: minimize Stein's unbiased risk estimate (SURE), an empirical proxy $Q$ such that $\mathbb{E}[Q] = \mathrm{risk}$. Result: $\hat{A}_{12}^{SURE} = \left(\frac{2}{(\bar{y}_1 - \bar{y}_2)^2 - \frac{\hat{\sigma}_1^2}{N_1} - \frac{\hat{\sigma}_2^2}{N_2}}\right)^+$.

96 SURE T = 2 Experiments. (Lower is better.)

97 Alternative Formulation. MTA is $\left(I + \frac{\gamma}{T} \Sigma L\right)^{-1} \bar{Y}$, with optimal similarity $A_{12}^* = \frac{2}{(\mu_1 - \mu_2)^2}$. An MTA variant is $\Sigma^{1/2} \left(I + \frac{\gamma}{T} L\right)^{-1} \Sigma^{-1/2} \bar{Y}$, with optimal similarity $A_{12}^* = \frac{2}{\left(\frac{\mu_1}{\sigma_1} - \frac{\mu_2}{\sigma_2}\right)^2}$. These encode different notions of distance between tasks. What if $\mu_1 = 2$, $\sigma_1 = 1$, $\mu_2 = 4$, $\sigma_2 = 2$?

98 Alternative Formulation T = 2 Results. (Lower is better.)

99 MTA Closed-Form Solution. $\hat{\mu}^{MTA} = \left(I + \frac{\gamma}{T} \Sigma L\right)^{-1} \bar{y}$: a more general form of the regularized Laplacian kernel (Smola and Kondor, 2003).
