Multi-Task Averaging: Theory and Practice
1 Multi-Task Averaging: Theory and Practice
Maya R. Gupta, Google Research / Univ. Washington
Sergey Feldman, Univ. Washington
Bela Frigyik, Univ. Pecs
2 Aristotle
The idea of a mean is old: "By the mean of a thing I denote a point equally distant from either extreme..." -Aristotle
$v - y_{\min} = y_{\max} - v \;\Rightarrow\; v = \frac{y_{\min} + y_{\max}}{2}$
3 Tycho Brahe (16th century)
Averaged to reduce measurement error: $\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$
4 Legendre (1805)
Legendre noted the mean minimizes squared error: $\bar{y} = \arg\min_{\hat{\mu}} \sum_{i=1}^{N} (y_i - \hat{\mu})^2$
5 Legendre (1805)
Legendre noted the mean minimizes squared error: $\bar{y} = \arg\min_{\hat{\mu}} \sum_{i=1}^{N} (y_i - \hat{\mu})^2$
(Banerjee et al. 2005: the mean minimizes any Bregman divergence, $\bar{y} = \arg\min_{\hat{\mu}} \sum_{i=1}^{N} d_{\phi}(y_i, \hat{\mu})$.
Frigyik et al. 2008: the mean minimizes any functional Bregman divergence.)
6 Gauss (1809)
The average was central to Gauss's construction of the normal distribution. His goals:
- a smooth distribution
- whose likelihood peak was at the sample mean.
7 Fisher 1922
"...no other statistic which can be calculated from the same sample provides any additional information as to the value of the parameter..." -R. A. Fisher
8 Stein's Paradox 1956
Total squared error can be reduced by estimating each of the means of T Gaussian random variables using data sampled from all of them, even if the random variables are independent and have different means.
[Figure: tasks $t = 1, 2, \ldots, T$ with means $\mu_1, \mu_2, \ldots, \mu_T$.]
9-11 Stein Estimation: One-Sample Case
Problem: estimate the means $\{\mu_t\}$ of T Gaussian random variables.
Given: one random sample $Y_t \sim \mathcal{N}(\mu_t, \sigma^2)$ for $t = 1, \ldots, T$.
Maximum likelihood estimate: $\hat{\mu}_t = Y_t$
James-Stein estimate: $\hat{\mu}_t^{JS} = \left(1 - \frac{(T-2)\,\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right) Y_t$
12-14 James-Stein Estimator Derivation: Efron and Morris 1972, Empirical Bayes Argument
Key assumptions:
- $\mu_t \sim \mathcal{N}(0, \tau^2)$, with $\tau^2$ unknown
- $Y_t \sim \mathcal{N}(\mu_t, \sigma^2)$, with $\sigma^2$ known
Posterior mean: $E[\mu_t \mid Y_t] = \left(1 - \frac{\sigma^2}{\tau^2 + \sigma^2}\right) Y_t$
Key identity: $E\!\left[\frac{(T-2)\,\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right] = \frac{\sigma^2}{\tau^2 + \sigma^2}$
James-Stein estimate: $\hat{\mu}_t^{JS} = \left(1 - \frac{(T-2)\,\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right) Y_t$
15 A More General JSE (Bock, 1972)
Assumptions:
- $\mu_t \sim \mathcal{N}(\xi, \tau^2)$, with $\tau^2$ and $\xi$ unknown
- $Y_{ti} \sim \mathcal{N}(\mu_t, \sigma_t^2)$, $i = 1, \ldots, N_t$, with $\sigma_t^2$ unknown
General James-Stein estimate:
$\hat{\mu}_t^{JS} = \hat{\xi} + \left(1 - \frac{T-3}{(\bar{Y} - \hat{\xi}\mathbf{1})^\top \Sigma^{-1} (\bar{Y} - \hat{\xi}\mathbf{1})}\right)^{+} (\bar{Y}_t - \hat{\xi})$
where
- $\hat{\xi} = \frac{1}{T}\sum_{r=1}^{T} \bar{Y}_r$
- positive part: $(x)^+ = \max(x, 0)$
- $\Sigma$ is diagonal with $\Sigma_{tt} = \frac{\hat{\sigma}_t^2}{N_t}$
- $\bar{Y}$ is a length-T vector with t-th entry $\bar{Y}_t$.
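A minimal numerical sketch of the general James-Stein estimate above, written in NumPy; the function name, the toy data, and the use of sample variances for $\hat{\sigma}_t^2$ are illustrative assumptions, not the authors' code:

```python
import numpy as np

def general_james_stein(ybar, sigma2, N):
    """General (Bock, 1972) James-Stein estimate computed from per-task sample means.

    ybar   : length-T array of sample averages
    sigma2 : length-T array of estimated variances sigma_t^2
    N      : length-T array of sample sizes N_t
    """
    T = len(ybar)
    xi_hat = ybar.mean()                       # shrinkage target: mean of the sample averages
    Sigma_diag = sigma2 / N                    # variance of each sample average
    d = ybar - xi_hat
    mahal = np.sum(d ** 2 / Sigma_diag)        # (Ybar - xi 1)^T Sigma^{-1} (Ybar - xi 1)
    shrink = max(1.0 - (T - 3) / mahal, 0.0)   # positive-part shrinkage factor
    return xi_hat + shrink * d

# toy example: 5 tasks, 10 samples per task
rng = np.random.default_rng(0)
y = rng.normal(loc=[0.0, 1.0, 2.0, 3.0, 4.0], scale=1.0, size=(10, 5))
print(general_james_stein(y.mean(axis=0), y.var(axis=0, ddof=1), np.full(5, 10)))
```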
16 James-Stein Dominates
James and Stein's theorem (1961): for $T > 3$ the general JSE dominates the sample average (MLE):
$E[\|\mu - \hat{\mu}^{JS}\|_2^2] \leq E[\|\mu - \bar{Y}\|_2^2]$ for every choice of $\mu$.
This is also written as $R(\hat{\mu}^{JS}) \leq R(\bar{Y})$.
17 James-Stein Estimation ↔ Empirical Bayes
Multi-Task Averaging (MTA) ↔ Empirical Loss Minimization with Regularization (Tikhonov Regularization) (Empirical Vapnik)
18 Multi-Task Averaging (Feldman et al. 2012)
Problem: estimate the means $\{\mu_t\}$ of T random variables.
Data model: $Y_{ti}$ drawn IID from $\nu_t$ with finite mean $\mu_t$.
Given: $N_t$ IID samples $\{y_{ti}\}_{i=1}^{N_t}$ from each random variable.
[Figure: tasks $t = 1, 2, \ldots, T$ with means $\mu_1, \mu_2, \ldots, \mu_T$.]
19-21 Building the MTA Objective
"Single-task" averaging: $\bar{y}_t = \arg\min_{\hat{\mu}_t} \sum_{i=1}^{N_t} (y_{ti} - \hat{\mu}_t)^2$
Add across tasks: $\{\bar{y}_t\}_{t=1}^{T} = \arg\min_{\{\hat{\mu}_t\}_{t=1}^{T}} \frac{1}{T}\sum_{t=1}^{T} \sum_{i=1}^{N_t} (y_{ti} - \hat{\mu}_t)^2$
Use a Mahalanobis distance: $\{\bar{y}_t\}_{t=1}^{T} = \arg\min_{\{\hat{\mu}_t\}_{t=1}^{T}} \frac{1}{T}\sum_{t=1}^{T} \sum_{i=1}^{N_t} \frac{(y_{ti} - \hat{\mu}_t)^2}{\sigma_t^2}$
22 \Multi-task" averaging: The MTA Objective f^¹ MTA t g T t=1 =arg min f^¹ t g T t=1 1 T TX t=1 XN t i=1 (y ti ^¹ t ) 2 ¾ 2 t + T 2 TX r=1 TX A rs (^¹ r ^¹ s ) 2 s=1 Mahalanobis distance to samples from task t similarity between task r and s 22
23 \Multi-task" averaging: The MTA Objective f^¹ MTA t g T t=1 =arg min f^¹ t g T t=1 1 T TX t=1 XN t i=1 (y ti ^¹ t ) 2 ¾ 2 t + T 2 TX r=1 TX A rs (^¹ r ^¹ s ) 2 s=1 Task 1: Estimate average movie ticket price Task 2: Estimate mean age of kids at summer camp 23
24 \Multi-task" averaging: The MTA Objective f^¹ MTA t g T t=1 =arg min f^¹ t g T t=1 1 T TX t=1 XN t i=1 (y ti ^¹ t ) 2 ¾ 2 t + T 2 TX r=1 TX A rs (^¹ r ^¹ s ) 2 s=1 Task 1: Estimate average movie ticket price Task 2: Estimate price of tea in China? 24
25 \Multi-task" averaging: The MTA Objective f^¹ MTA t g T t=1 =arg min f^¹ t g T t=1 1 T TX t=1 XN t i=1 (y ti ^¹ t ) 2 ¾ 2 t + T 2 TX r=1 TX A rs (^¹ r ^¹ s ) 2 s=1 ¹y 1 t =1 ¹y 2 t =2 25
26 \Multi-task" averaging: The MTA Objective f^¹ MTA t g T t=1 =arg min f^¹ t g T t=1 1 T TX t=1 XN t i=1 (y ti ^¹ t ) 2 ¾ 2 t + T 2 TX r=1 TX A rs (^¹ r ^¹ s ) 2 s=1 ¹y 1 ^¹ MTA 1 t =1 ^¹ MTA 2 ¹y 2 t =2 26
27 \Multi-task" averaging: The MTA Objective f^¹ MTA t g T t=1 =arg min f^¹ t g T t=1 1 T TX t=1 XN t i=1 (y ti ^¹ t ) 2 ¾ 2 t + T 2 TX r=1 TX A rs (^¹ r ^¹ s ) 2 s=1 ¹y 1 t =1 ¹y 2 t =2 27
28 \Multi-task" averaging: The MTA Objective f^¹ MTA t g T t=1 =arg min f^¹ t g T t=1 1 T TX t=1 XN t i=1 (y ti ^¹ t ) 2 ¾ 2 t + T 2 TX r=1 TX A rs (^¹ r ^¹ s ) 2 s=1 ¹y 1 ^¹ MTA 1 t =1 ^¹ MTA 2 ¹y 2 t =2 28
29 \Multi-task" averaging: The MTA Objective f^¹ MTA t g T t=1 =arg min f^¹ t g T t=1 1 T TX t=1 XN t i=1 (y ti ^¹ t ) 2 ¾ 2 t + T 2 TX r=1 TX A rs (^¹ r ^¹ s ) 2 s=1 ¹y 1 t =1 ¹y 2 t =2 29
30 \Multi-task" averaging: The MTA Objective f^¹ MTA t g T t=1 =arg min f^¹ t g T t=1 1 T TX t=1 XN t i=1 (y ti ^¹ t ) 2 ¾ 2 t + T 2 TX r=1 TX A rs (^¹ r ^¹ s ) 2 s=1 ¹y 1 ^¹ MTA 1 t =1 ^¹ MTA 2 ¹y 2 t =2 30
31 \Multi-task" averaging: The MTA Objective f^¹ MTA t g T t=1 =arg min f^¹ t g T t=1 1 T TX t=1 XN t i=1 (y ti ^¹ t ) 2 ¾ 2 t + T 2 TX r=1 TX A rs (^¹ r ^¹ s ) 2 s=1 empirical loss lowers bias regularizer: lowers estimation variance 31
32-35 MTA Closed Form Solution
For non-negative A:
$\hat{\mu}^{MTA} = \left(I + \frac{\gamma}{T}\Sigma L\right)^{-1} \bar{y}$
where $\hat{\mu}^{MTA}$ is the vector of T MTA solutions, $\Sigma$ is the diagonal matrix of sample-mean variances $\Sigma_{tt} = \frac{\sigma_t^2}{N_t}$, $L$ is the graph Laplacian of $\frac{1}{2}(A + A^\top)$, and $\bar{y}$ is the vector of T sample averages.
Lemma: this inverse always exists if $A_{rs} \geq 0$, $\gamma \geq 0$, and $N_t \geq 1$.
MTA estimates are a linear combination of the sample averages: $\hat{\mu}^{MTA} = W\bar{y}$ for a $T \times T$ matrix $W$.
Theorem: they are in fact a convex combination of the sample averages; $W$ is a right-stochastic matrix.
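A small NumPy sketch of the closed form above (an illustration under my own naming and a default $\gamma = 1$, not the authors' code); it builds W and checks that its rows sum to one:

```python
import numpy as np

def mta_estimate(ybar, sigma2, N, A, gamma=1.0):
    """MTA closed form: mu_hat = (I + (gamma/T) Sigma L)^{-1} ybar."""
    T = len(ybar)
    A_sym = 0.5 * (A + A.T)                      # symmetrized task-similarity matrix
    L = np.diag(A_sym.sum(axis=1)) - A_sym       # graph Laplacian
    Sigma = np.diag(sigma2 / N)                  # sample-mean variances on the diagonal
    W = np.linalg.inv(np.eye(T) + (gamma / T) * Sigma @ L)
    assert np.allclose(W.sum(axis=1), 1.0)       # right-stochastic: each row sums to 1
    return W @ ybar

# toy example: tasks 1 and 2 are similar, task 3 is not
ybar = np.array([1.0, 1.2, 5.0])
sigma2 = np.array([1.0, 1.0, 1.0])
N = np.array([10, 10, 10])
A = np.array([[0.0, 2.0, 0.1],
              [2.0, 0.0, 0.1],
              [0.1, 0.1, 0.0]])
print(mta_estimate(ybar, sigma2, N, A))
```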
36-39 When is MTA Better than the Sample Means?
$Y_{1i} = \mu_1 + \epsilon_1$: $N_1$ samples, variance $\sigma_1^2$ (Task 1: estimate average movie ticket price)
$Y_{2i} = \mu_2 + \epsilon_2$: $N_2$ samples, variance $\sigma_2^2$ (Task 2: estimate mean age of kids at summer camp)
The MTA estimate $\left(I + \frac{\gamma}{T}\Sigma L\right)^{-1}\bar{Y}$ for task 1 (taking $\gamma = 1$, since $\gamma$ can be absorbed into A):
$\hat{\mu}_1^{MTA} = \left(\frac{T + \frac{\sigma_2^2}{N_2} A_{12}}{T + \frac{\sigma_1^2}{N_1} A_{12} + \frac{\sigma_2^2}{N_2} A_{12}}\right)\bar{Y}_1 + \left(\frac{\frac{\sigma_1^2}{N_1} A_{12}}{T + \frac{\sigma_1^2}{N_1} A_{12} + \frac{\sigma_2^2}{N_2} A_{12}}\right)\bar{Y}_2$
Biased, but with smaller error variance than the sample averages:
$\mathrm{Risk}[\hat{\mu}_1^{MTA}] < \mathrm{Risk}[\bar{Y}_1]$ if $(\mu_1 - \mu_2)^2 < \frac{4}{A_{12}} + \frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2}$
For the total risk: $\mathrm{Risk}[\hat{\mu}^{MTA}] < \mathrm{Risk}[\bar{Y}]$ if $(\mu_1 - \mu_2)^2 - \frac{\sigma_1^2}{N_1} - \frac{\sigma_2^2}{N_2} < \frac{4}{A_{12}}$
40-42 Optimal A for T = 2
Example: two tasks with means $\mu_1$, $\mu_2$ and samples $Y_1$, $Y_2$.
Answer: the optimal task similarity in terms of MSE is $A_{12} = \frac{2}{(\mu_1 - \mu_2)^2}$.
In practice, plug in the sample averages: $\hat{A}_{12} = \frac{2}{(\bar{y}_1 - \bar{y}_2)^2}$.
[Figure: example with $\sigma_1^2 = 1$, $\sigma_2^2 = 1$, $\mu_1 = 0$, $\mu_2 = 1$, and the optimal similarity $A_{12} = \frac{2}{(\mu_1 - \mu_2)^2}$.]
43-47 Optimal A for T > 2
- Bad news: no simple analytical minimization of the risk for $T > 2$.
- One solution: use the pairwise estimate to populate A: $\hat{A}_{rs} = \frac{2}{(\bar{y}_r - \bar{y}_s)^2}$.
- Better solution: constrain $A = a\mathbf{1}\mathbf{1}^\top$ and optimize over the scalar a.
- Analyzable: the optimal constant similarity is $a^* = \frac{2}{\frac{1}{T(T-1)}\sum_{r=1}^{T}\sum_{s=1}^{T} (\mu_r - \mu_s)^2}$, i.e., 2 divided by the average squared distance between the T means.
- In practice, we estimate $\hat{a} = \frac{2}{\frac{1}{T(T-1)}\sum_{r=1}^{T}\sum_{s=1}^{T} (\bar{y}_r - \bar{y}_s)^2}$ (a sketch follows below).
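A minimal sketch of the resulting MTA constant estimator (estimate $\hat{a}$ from the sample averages, then apply the closed form); the function name, the $\gamma = 1$ default, and the toy data are assumptions:

```python
import numpy as np

def mta_constant(ybar, sigma2, N, gamma=1.0):
    """MTA with a single constant similarity a estimated from the sample averages."""
    T = len(ybar)
    sq_dists = (ybar[:, None] - ybar[None, :]) ** 2
    a_hat = 2.0 / (sq_dists.sum() / (T * (T - 1)))   # 2 / average squared pairwise distance
    A = a_hat * np.ones((T, T))
    L = np.diag(A.sum(axis=1)) - A                   # graph Laplacian of the constant similarity
    Sigma = np.diag(sigma2 / N)
    return np.linalg.inv(np.eye(T) + (gamma / T) * Sigma @ L) @ ybar

ybar = np.array([1.0, 1.5, 2.0, 10.0])
print(mta_constant(ybar, sigma2=np.ones(4), N=np.full(4, 20)))
```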
48 How to Set the Similarity Matrix A?
- Choose A to minimize the expected total squared error.
- Choose A to minimize the worst-case total squared error.
49-54 Minimax A
To find the minimax A for T = 2, we need:
1. A constraint set for $\{\mu_t\}$: $\mu_t \in [b_l, b_u]$.
2. A least favorable prior (LFP): $p(\mu_1, \mu_2) = \frac{1}{2}$ if $(\mu_1, \mu_2) = (b_l, b_u)$; $\frac{1}{2}$ if $(\mu_1, \mu_2) = (b_u, b_l)$; 0 otherwise.
3. A Bayes-optimal estimator with constant risk w.r.t. the LFP: MTA with similarity $A_{12}^{mm} = \frac{2}{(b_u - b_l)^2}$.
For T > 2: MTA with constant similarity $a^{mm} = \frac{2}{(b_u - b_l)^2}$.
In practice the range is estimated from the data: $\hat{b}_l = \min_t \bar{y}_t$, $\hat{b}_u = \max_t \bar{y}_t$ (see the sketch below).
[Figures: the square $[b_l, b_u]^2$ of possible mean pairs, and the sample averages $\bar{y}_1, \ldots, \bar{y}_T$ plotted on the interval $[\hat{b}_l, \hat{b}_u]$.]
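A corresponding sketch of MTA minimax; relative to the previous sketch only the choice of the constant similarity changes, and the names are again my own:

```python
import numpy as np

def mta_minimax(ybar, sigma2, N, gamma=1.0):
    """MTA with the minimax constant similarity a = 2 / (b_u - b_l)^2,
    where [b_l, b_u] is estimated by the range of the sample averages."""
    T = len(ybar)
    b_l, b_u = ybar.min(), ybar.max()
    a_mm = 2.0 / (b_u - b_l) ** 2
    A = a_mm * np.ones((T, T))
    L = np.diag(A.sum(axis=1)) - A
    Sigma = np.diag(sigma2 / N)
    return np.linalg.inv(np.eye(T) + (gamma / T) * Sigma @ L) @ ybar
```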
55 Estimator Summary
[Summary table of the compared estimators; not transcribed.]
56 Simulations
Gaussian simulations: $\mu_t \sim \mathcal{N}(0, \sigma_\mu^2)$; $\sigma_t^2 \sim \mathrm{Gamma}(0.9, 1.0) + 0.1$; $N_t \sim U\{2, \ldots, 100\}$; $y_{ti} \sim \mathcal{N}(\mu_t, \sigma_t^2)$.
Uniform simulations: $\mu_t \sim U(-\sqrt{3\sigma_\mu^2}, \sqrt{3\sigma_\mu^2})$; $\sigma_t^2 \sim U(0.1, 2.0)$; $N_t \sim U\{2, \ldots, 100\}$; $y_{ti} \sim U[\mu_t - \sqrt{3\sigma_t^2}, \mu_t + \sqrt{3\sigma_t^2}]$.
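A sketch of one draw of the Gaussian simulation setup above; $\sigma_\mu^2$ is left as a parameter, and the function name and the shape/scale reading of Gamma(0.9, 1.0) are my assumptions:

```python
import numpy as np

def gaussian_simulation(T, sigma_mu2=1.0, rng=None):
    """One draw of the Gaussian simulation: per-task means, variances, sample sizes, samples."""
    rng = rng or np.random.default_rng()
    mu = rng.normal(0.0, np.sqrt(sigma_mu2), size=T)         # mu_t ~ N(0, sigma_mu^2)
    sigma2 = rng.gamma(shape=0.9, scale=1.0, size=T) + 0.1   # sigma_t^2 ~ Gamma(0.9, 1.0) + 0.1
    N = rng.integers(2, 101, size=T)                         # N_t ~ U{2, ..., 100}
    y = [rng.normal(m, np.sqrt(s2), size=n) for m, s2, n in zip(mu, sigma2, N)]
    return mu, sigma2, N, y

mu, sigma2, N, y = gaussian_simulation(T=25, rng=np.random.default_rng(0))
```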
57 5-Fold Randomized Cross-Validation
58 Gaussian Simulation, T = 5 (lower is better) [results figure]
59 Gaussian Simulation, T = 25 [results figure]
60 Gaussian Simulation, T = [not transcribed] [results figure]
61 Uniform Simulation, T = 5 [results figure]
62 Uniform Simulation, T = 25 [results figure]
63 Uniform Simulation, T = [not transcribed] [results figure]
64 Scales O(T)
The constant and minimax MTA weight matrices can be rewritten via the Sherman-Morrison formula:
$W = \left(I + \tfrac{\gamma}{T}\Sigma L(a\mathbf{1}\mathbf{1}^\top)\right)^{-1} = \left(I + \tfrac{\gamma}{T}\Sigma(aT I - a\mathbf{1}\mathbf{1}^\top)\right)^{-1} = \left(I + \gamma a \Sigma - \tfrac{\gamma a}{T}\Sigma\mathbf{1}\mathbf{1}^\top\right)^{-1} = \left(Z - z\mathbf{1}^\top\right)^{-1} = Z^{-1} + \frac{Z^{-1} z \mathbf{1}^\top Z^{-1}}{1 - \mathbf{1}^\top Z^{-1} z}$
where $Z = I + \gamma a \Sigma$ is diagonal and $z = \tfrac{\gamma a}{T}\Sigma\mathbf{1}$.
$Z$ is diagonal, so $Z^{-1}$, $Z^{-1}z$, and $\mathbf{1}^\top Z^{-1}$ can all be computed in O(T); therefore $W\bar{Y}$ is O(T).
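A sketch of that O(T) computation, checked against the dense closed form; all names are mine:

```python
import numpy as np

def mta_constant_fast(ybar, s, a, gamma=1.0):
    """O(T) MTA estimate for constant similarity A = a 11^T via Sherman-Morrison.
    s is the length-T vector of sample-mean variances sigma_t^2 / N_t."""
    T = len(ybar)
    Zdiag = 1.0 + gamma * a * s             # Z = I + gamma * a * Sigma  (diagonal)
    z = (gamma * a / T) * s                 # z = (gamma * a / T) * Sigma * 1
    Zinv_y = ybar / Zdiag
    Zinv_z = z / Zdiag
    # W ybar = Z^{-1} ybar + Z^{-1} z (1^T Z^{-1} ybar) / (1 - 1^T Z^{-1} z)
    return Zinv_y + Zinv_z * Zinv_y.sum() / (1.0 - Zinv_z.sum())

# check against the dense closed form for a small T
T, a = 6, 0.7
rng = np.random.default_rng(1)
ybar = rng.normal(size=T)
s = rng.uniform(0.1, 1.0, size=T)
L = a * T * np.eye(T) - a * np.ones((T, T))                  # Laplacian of A = a 11^T
W = np.linalg.inv(np.eye(T) + (1.0 / T) * np.diag(s) @ L)
print(np.allclose(mta_constant_fast(ybar, s, a), W @ ybar))  # True
```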
65-66 Application: Class Grades
Problem: estimate the final grades $\{\mu_t\}$ of T students.
Given: N homework grades $\{y_{ti}\}_{i=1}^{N}$ from each student.
- 16 classrooms → 16 datasets.
- Uncurved grades normalized to be between 0 and 100.
- Pooled variance used for all tasks in a dataset.
- Final class grades include homeworks, projects, labs, quizzes, midterms, and the final exam.
67-70 Percent change in risk vs. single-task. [Results figures.]
71 Application: Product Sales
Exp 1: How much will the t-th customer spend on their next order?
Given: dollar amounts $\{y_{ti}\}_{i=1}^{N_t}$ that the t-th customer spent on $N_t$ orders.
- T = 477
- $y_{ti}$ ranged from $15 to $480.
- $N_t$ ranged from 2 to [not transcribed].
72 Application: Product Sales
Exp 2: If you bought the t-th puzzle, how much will you spend on your next order?
Given: dollar amounts $\{y_{ti}\}_{i=1}^{N_t}$ that each of $N_t$ customers spent after buying the t-th puzzle.
- T = 77
- $y_{ti}$ ranged from $0 to $480.
- $N_t$ ranged from 8 to [not transcribed].
73-75 Application: Product Sales
No ground truth! Use the sample means computed from all of the data as $\mu_t$, and use a random half of the data to get $\bar{y}_t$.
[Figure: per-customer order amounts for customers $1, 2, \ldots, T$, with $\mu_1, \ldots, \mu_T$ from all the data and $\bar{y}_1, \ldots, \bar{y}_T$ from the random half.]
76-77 Application: Product Sales
Percent change in risk vs. single-task, averaged over 1000 random splits. (Lower is better.) [Results figures.]
78 Model Mismatch: 2008 Election
Problem: What percent of the t-th state's vote will go to Obama and McCain on election day?
Given: $N_t$ pre-election polls $\{y_{ti}\}_{i=1}^{N_t}$ from each state.
79 Model Mismatch: 2008 Election
Percent change in average risk vs. single-task. (Lower is better.) [Results figure.]
80-83 MTA Applied to Kernel Density Estimation
KDE: Given that events $\{x_i\}$ happened, estimate the probability of event z as $\hat{p}(z) = \frac{1}{N}\sum_{i=1}^{N} K(x_i, z)$.
Equivalently, $\hat{p}(z) = \arg\min_{\hat{y}(z)} \sum_{i=1}^{N} (K(x_i, z) - \hat{y}(z))^2$.
Use MTA to form a multi-task KDE:
$\arg\min_{\{\hat{y}_t(z_t)\}_{t=1}^{T}} \sum_{t=1}^{T} \sum_{i=1}^{N_t} (K_t(x_{ti}, z_t) - \hat{y}_t(z_t))^2 + \sum_{r=1}^{T} \sum_{s=1}^{T} A_{rs} (\hat{y}_r(z_r) - \hat{y}_s(z_s))^2$
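A rough sketch of that multi-task KDE objective for fixed query points: with a symmetric A, setting the gradient to zero gives $(\mathrm{diag}(N) + 2L)\,\hat{y} = \mathrm{diag}(N)\,\bar{k}$, where $\bar{k}_t$ is the single-task KDE value at $z_t$. The Gaussian kernel, bandwidth, and names below are my own illustrative assumptions:

```python
import numpy as np

def gauss_kernel(x, z, h=1.0):
    """Gaussian kernel contribution of each sample in x at the query point z."""
    return np.exp(-0.5 * ((x - z) / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

def mt_kde(samples, queries, A):
    """Jointly smoothed KDE values y_1..y_T minimizing
       sum_t sum_i (K_t(x_ti, z_t) - y_t)^2 + sum_r sum_s A_rs (y_r - y_s)^2
    for a symmetric similarity matrix A."""
    kbar = np.array([gauss_kernel(x, z).mean() for x, z in zip(samples, queries)])
    N = np.array([len(x) for x in samples], dtype=float)
    L = np.diag(A.sum(axis=1)) - A                   # graph Laplacian of A
    return np.linalg.solve(np.diag(N) + 2.0 * L, N * kbar)

# toy example: two related tasks, the second with very few samples
rng = np.random.default_rng(0)
samples = [rng.normal(0.0, 1.0, 200), rng.normal(0.2, 1.0, 10)]
A = np.array([[0.0, 5.0],
              [5.0, 0.0]])
print(mt_kde(samples, queries=np.array([0.0, 0.0]), A=A))
```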
84 MT-KDE for Terrorism Risk Assessment
Problem: Estimate the probability of terrorist events at 40,000 locations in Jerusalem, each location $z, x_i \in \mathbb{R}^{74}$.
T = 7 terrorist groups.
Task similarity matrix A from terrorism expert Mohammed Hafez. [Matrix figure not transcribed.]
85 MT-KDE for Terrorism Risk Assessment
Mean reciprocal rank of a left-out event, for Suicides (T = 17) and Bombings (T = 11), comparing: single-task, James-Stein, MTA constant, MTA minimax, and expert similarity. [Table values not transcribed.]
86 Last Slide
MTA is an intuitive, simple, accurate approach to estimating multiple means jointly.
When can you estimate multiple means at once?
Can you estimate the task similarities better?
Learn more: see our 2012 NIPS paper, or email me for the journal paper.
87 Estimator classes
- Linear estimators: $\hat{\mu} = W\bar{y}$.
- Right-stochastic $W$: $\hat{\mu} = W\bar{y}$ with nonnegative rows of $W$ summing to 1.
- MTA: $\hat{\mu} = W\bar{y}$ with $W = \left(I + \frac{\gamma}{T}\Sigma L\right)^{-1}$, $\Sigma$ diagonal with $\Sigma_{tt} \geq 0$, $A_{rs} \geq 0$, $\gamma \geq 0$.
- Convex shrinkage to a weighted average (J-S, more): $\hat{\mu}_t = \alpha\bar{y}_t + (1 - \alpha)\sum_{r=1}^{T} \beta_r \bar{y}_r$, with $0 \leq \alpha \leq 1$, $\sum_{r=1}^{T} \beta_r = 1$, $\beta_r \geq 0$ for all r.
88 Bayesian Analysis: IGMRFs
Recall that $\sum_{r=1}^{T}\sum_{s=1}^{T} A_{rs}(\hat{y}_r - \hat{y}_s)^2 = 2\,\hat{y}^\top L\,\hat{y}$, where $L = D - A$ is the graph Laplacian, so the MTA regularizer is a quadratic form $\frac{1}{2}\hat{y}^\top L_S\,\hat{y}$ in a scaled Laplacian $L_S$.
This regularizer can be thought of as coming from an intrinsic (improper) GMRF prior (Rue and Held, 2005):
$p(\hat{y}) = (2\pi)^{-T/2}\, |L_S|^{1/2}\, \exp\!\left(-\tfrac{1}{2}\hat{y}^\top L_S\, \hat{y}\right)$
Usually used for graphical models when $L_S$ is sparse.
89 Bayesian Analysis
Assuming the differences are independent:
$p(\hat{y}) \propto \prod_{r=1}^{T} \prod_{s=1}^{T} e^{-A_{rs}(\hat{y}_r - \hat{y}_s)^2}$
For T = 3 this says:
$\hat{Y}_1 - \hat{Y}_2 \sim \mathcal{N}(0, 1/(2A_{12}))$, $\hat{Y}_2 - \hat{Y}_3 \sim \mathcal{N}(0, 1/(2A_{23}))$, $\hat{Y}_1 - \hat{Y}_3 \sim \mathcal{N}(0, 1/(2A_{13}))$
But then:
$\hat{Y}_1 - \hat{Y}_3 = (\hat{Y}_1 - \hat{Y}_2) + (\hat{Y}_2 - \hat{Y}_3) \;\Rightarrow\; \frac{1}{A_{13}} = \frac{1}{A_{12}} + \frac{1}{A_{23}}$
$\hat{Y}_1 - \hat{Y}_2 = (\hat{Y}_1 - \hat{Y}_3) + (\hat{Y}_3 - \hat{Y}_2) \;\Rightarrow\; \frac{1}{A_{12}} = \frac{1}{A_{13}} + \frac{1}{A_{32}}$
$\hat{Y}_2 - \hat{Y}_3 = (\hat{Y}_2 - \hat{Y}_1) + (\hat{Y}_1 - \hat{Y}_3) \;\Rightarrow\; \frac{1}{A_{23}} = \frac{1}{A_{21}} + \frac{1}{A_{13}}$
Impossible to satisfy all three right-hand sides with any finite A!
90 Related Multi-Task Regularizers
- $\sum_{r=1}^{T} \|\theta_r - \frac{1}{T}\sum_{s=1}^{T} \theta_s\|_2^2$: distance to the mean (Evgeniou and Pontil, 2004)
- $\|\Theta\|_*$: trace norm (Abernethy et al., 2009)
- $\mathrm{tr}(\Theta^\top D^{-1} \Theta)$: learned, shared feature covariance matrix (Argyriou et al., 2008)
- $\mathrm{tr}(\Theta\, \Omega^{-1} \Theta^\top)$: learned task covariance matrix (Jacob et al., 2008; Zhang and Yeung, 2010)
- $\sum_{r=1}^{T} \sum_{s=1}^{T} A_{rs}\|\theta_r - \theta_s\|_2^2$: pairwise distance regularizer (Sheldon, 2008) or constraint (Kato et al., 2007)
91 Gaussian Simulation, T = 2 (lower is better) [results figure]
92 Uniform Simulation, T = 2 (lower is better) [results figure]
93 Pairwise T = 5 Results (lower is better) [results figure]
94 Oracle T = 5 Results (lower is better) [results figure]
95 Stein's Unbiased Risk Estimate
- The true optimal $A_{12}$ depends on the unknown $\mu_t$. We plugged in $\bar{y}_t$ to get $\hat{A}_{12} = \frac{2}{(\bar{y}_1 - \bar{y}_2)^2}$.
- Another approach: minimize Stein's unbiased risk estimate (SURE), an empirical proxy Q such that $E[Q] = \text{risk}$.
- Result: $\hat{A}_{12}^{SURE} = \left(\frac{2}{(\bar{y}_1 - \bar{y}_2)^2 - \frac{\hat{\sigma}_1^2}{N_1} - \frac{\hat{\sigma}_2^2}{N_2}}\right)^{+}$
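A tiny sketch comparing the two T = 2 similarity estimates above; how to handle the positive part hitting a non-positive denominator (here, returning an infinite similarity, i.e., fully tied tasks) is my own choice:

```python
import numpy as np

def a12_plugin(ybar1, ybar2):
    """Plug-in estimate of the optimal T = 2 similarity."""
    return 2.0 / (ybar1 - ybar2) ** 2

def a12_sure(ybar1, ybar2, var1, var2, n1, n2):
    """SURE-based estimate: subtract the sample-average variances before inverting."""
    denom = (ybar1 - ybar2) ** 2 - var1 / n1 - var2 / n2
    return 2.0 / denom if denom > 0 else np.inf   # non-positive denominator -> tie the tasks

print(a12_plugin(1.0, 1.5))                       # 8.0
print(a12_sure(1.0, 1.5, 1.0, 1.0, 20, 20))       # ~13.3
```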
96 SURE T = 2 Experiments (lower is better) [results figure]
97 Alternative Formulation
MTA is $\left(I + \frac{\gamma}{T}\Sigma L\right)^{-1}\bar{Y}$, with optimal similarity $A_{12} = \frac{2}{(\mu_1 - \mu_2)^2}$.
An MTA variant is $\Sigma^{1/2}\left(I + \frac{\gamma}{T} L\right)^{-1}\Sigma^{-1/2}\bar{Y}$, with optimal similarity $A_{12} = \frac{2}{\left(\frac{\mu_1}{\sigma_1} - \frac{\mu_2}{\sigma_2}\right)^2}$.
Different notions of distance! What if $\mu_1 = 2$, $\sigma_1 = 1$, $\mu_2 = 4$, $\sigma_2 = 2$?
98 Alternative Formulation T = 2 Results (lower is better) [results figure]
99 MTA Closed Form Solution
$\hat{\mu}^{MTA} = \left(I + \frac{\gamma}{T}\Sigma L\right)^{-1}\bar{y}$ is a more general form of the regularized Laplacian kernel (Smola and Kondor, 2003).