Minimax Estimation of Kernel Mean Embeddings
1 Minimax Estimation of Kernel Mean Embeddings
Bharath K. Sriperumbudur, Department of Statistics, Pennsylvania State University.
Gatsby Computational Neuroscience Unit, May 4, 2016.
2 Collaborators
- Dr. Ilya Tolstikhin, Max Planck Institute for Intelligent Systems, Tübingen.
- Dr. Krikamol Muandet, Max Planck Institute for Intelligent Systems, Tübingen.
3 Kernel Mean Embedding (KME)
Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a positive definite kernel with RKHS $\mathcal{H}_k$.
Kernel trick: $y \mapsto k(\cdot, y) =: \phi(y)$. Equivalently, $\delta_y \mapsto \int_{\mathcal{X}} k(\cdot, x)\, d\delta_y(x)$.
Generalization: $P \mapsto \int_{\mathcal{X}} k(\cdot, x)\, dP(x) =: \mu_P$, the kernel mean embedding of $P$.
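In practice $\mu_P$ is approximated from samples. A minimal sketch (not from the talk; the kernel bandwidth and data below are illustrative) that represents the empirical embedding $\frac{1}{n}\sum_i k(\cdot, X_i)$ as a callable function:

```python
import numpy as np

def gaussian_kernel(x, y, eta=1.0):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 eta^2)), vectorized."""
    d = np.atleast_2d(x)[:, None, :] - np.atleast_2d(y)[None, :, :]
    return np.exp(-np.sum(d**2, axis=-1) / (2 * eta**2))

def empirical_kme(samples, eta=1.0):
    """Return the function t -> (1/n) sum_i k(t, X_i), i.e. the empirical KME."""
    samples = np.atleast_2d(samples)
    return lambda t: gaussian_kernel(np.atleast_2d(t), samples, eta).mean(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))           # samples from P = N(0, 1)
mu_hat = empirical_kme(X, eta=1.0)
print(mu_hat(np.array([[0.0]])))        # evaluate the embedding at t = 0
```

For $P = N(0, 1)$ and $\eta = 1$ the population value at $t = 0$ is $\mathbb{E}\,e^{-X^2/2} = 1/\sqrt{2}$, which the printed value should approximate.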
4 Properties
The KME generalizes, to arbitrary $\mathcal{X}$:
- the characteristic function: $k(\cdot, x) = e^{\sqrt{-1}\langle \cdot, x \rangle}$, $x \in \mathbb{R}^d$;
- the moment generating function: $k(\cdot, x) = e^{\langle \cdot, x \rangle}$, $x \in \mathbb{R}^d$.
In general, many $P$ can yield the same KME! For $k(x, y) = \langle x, y \rangle$, the map $P \mapsto \mu_P$ is not one-to-one.
Characteristic kernels ensure that no two different $P$ can have the same KME, i.e., $P \mapsto \int_{\mathcal{X}} k(\cdot, x)\, dP(x)$ is one-to-one. Examples: Gaussian, Matérn, ... (infinite-dimensional RKHS).
5 Application: Two-Sample Problem
Given random samples $\{X_1, \ldots, X_m\} \overset{\text{i.i.d.}}{\sim} P$ and $\{Y_1, \ldots, Y_n\} \overset{\text{i.i.d.}}{\sim} Q$, determine: $P = Q$ or $P \neq Q$?
Let $\gamma(P, Q)$ be a distance metric between $P$ and $Q$. Then $H_0 : P = Q$ corresponds to $H_0 : \gamma(P, Q) = 0$, and $H_1 : P \neq Q$ to $H_1 : \gamma(P, Q) > 0$.
Test: say $H_0$ if $\hat\gamma(\{X_i\}_{i=1}^m, \{Y_j\}_{j=1}^n) < \varepsilon$; otherwise say $H_1$.
Idea: use $\gamma(P, Q) = \left\| \int k(\cdot, x)\, dP(x) - \int k(\cdot, x)\, dQ(x) \right\|_{\mathcal{H}_k}$ with $k$ characteristic.
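The squared distance $\gamma^2(P, Q)$ expands entirely in kernel evaluations, so it can be estimated from samples. A minimal sketch (not the talk's construction; the biased V-statistic form and the parameter values are illustrative):

```python
import numpy as np

def mmd2_biased(X, Y, eta=1.0):
    """Biased (V-statistic) estimate of gamma^2(P, Q) = ||mu_P - mu_Q||_H^2
    with the Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 eta^2))."""
    def gram(A, B):
        d2 = np.sum((A[:, None, :] - B[None, :, :])**2, axis=-1)
        return np.exp(-d2 / (2 * eta**2))
    return gram(X, X).mean() + gram(Y, Y).mean() - 2 * gram(X, Y).mean()

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(300, 1))   # sample from P = N(0, 1)
Y = rng.normal(1.0, 1.0, size=(300, 1))   # sample from Q = N(1, 1)
Z = rng.normal(0.0, 1.0, size=(300, 1))   # another sample from P
print(mmd2_biased(X, Y))                  # clearly positive: P != Q
print(mmd2_biased(X, Z))                  # near zero: same distribution
```

Comparing the statistic to a threshold $\varepsilon$ (e.g. chosen by permutation) gives the test described above.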
6 More Applications
- Testing for independence (Gretton et al., 2008)
- Conditional independence tests (Fukumizu et al., 2008)
- Feature selection (Song et al., 2012)
- Distribution regression (Szabó et al., 2015)
- Causal inference (Lopez-Paz et al., 2015)
- Mixture density estimation (Sriperumbudur, 2011), ...
7 Estimators of KME
In applications, $P$ is unknown and only samples $\{X_i\}_{i=1}^n$ from it are known. A popular estimator of the KME, employed in all these applications, is the empirical estimator
$\hat\mu_P = \frac{1}{n} \sum_{i=1}^n k(\cdot, X_i)$.
Theorem (Smola et al., 2007; Gretton et al., 2012; Lopez-Paz et al., 2015). Suppose $\sup_{x \in \mathcal{X}} k(x, x) \le C < \infty$ where $k$ is continuous. Then for any $\tau > 0$,
$P^n\left( \|\hat\mu_P - \mu_P\|_{\mathcal{H}_k} \le \sqrt{\tfrac{C}{n}} + \sqrt{\tfrac{2C\tau}{n}} \right) \ge 1 - e^{-\tau}$.
Alternatively, $\mathbb{E}\|\hat\mu_P - \mu_P\|_{\mathcal{H}_k} \le \frac{C'}{\sqrt{n}}$ for some $C' > 0$.
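The $n^{-1/2}$ rate can be checked numerically. A small Monte Carlo sketch (illustrative support, probabilities, and bandwidth; not from the talk): for a discrete $P$ with known atoms, $\|\hat\mu_P - \mu_P\|_{\mathcal{H}_k}$ has a closed form via Gram matrices, and the scaled error $\sqrt{n}\,\mathbb{E}\|\hat\mu_P - \mu_P\|_{\mathcal{H}_k}$ should stay roughly constant as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(2)
eta = 1.0
support = np.array([[-1.0], [0.0], [2.0]])   # atoms of a discrete P (illustrative)
probs = np.array([0.2, 0.5, 0.3])

def gram(A, B):
    d2 = np.sum((A[:, None, :] - B[None, :, :])**2, axis=-1)
    return np.exp(-d2 / (2 * eta**2))

def rkhs_error(n):
    """||hat{mu}_P - mu_P||_{H_k} for one sample of size n, computed exactly
    by expanding the squared RKHS norm in kernel evaluations."""
    idx = rng.choice(len(probs), size=n, p=probs)
    X = support[idx]
    sq = (gram(X, X).mean()
          - 2 * probs @ gram(support, X).mean(axis=1)
          + probs @ gram(support, support) @ probs)
    return np.sqrt(max(sq, 0.0))

for n in [100, 400, 1600]:
    err = np.mean([rkhs_error(n) for _ in range(50)])
    print(n, err * np.sqrt(n))   # roughly constant across n => n^{-1/2} rate
```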
8 Shrinkage Estimator
Given $(X_i)_{i=1}^n \overset{\text{i.i.d.}}{\sim} N(\mu, \sigma^2 I)$, suppose we are interested in estimating $\mu \in \mathbb{R}^d$.
Maximum likelihood estimator: $\hat\mu = \frac{1}{n}\sum_{i=1}^n X_i$, which is the empirical estimator.
James and Stein (1961) constructed an estimator $\check\mu$ such that for $d \ge 3$, $\mathbb{E}\|\check\mu - \mu\|^2 \le \mathbb{E}\|\hat\mu - \mu\|^2$ for all $\mu \in \mathbb{R}^d$, with strict inequality for at least one $\mu$.
Kernel setting: based on this motivation, Muandet et al. (2015) proposed a shrinkage estimator $\check\mu_P$ of $\mu_P$ and showed that $\mathbb{E}\|\check\mu_P - \mu_P\|^2_{\mathcal{H}_k} < \mathbb{E}\|\hat\mu_P - \mu_P\|^2_{\mathcal{H}_k} + O(n^{-3/2})$ as $n \to \infty$, and $\mathbb{E}\|\check\mu_P - \mu_P\|_{\mathcal{H}_k} \le C n^{-1/2}$ for some $C > 0$.
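The classical Euclidean phenomenon that motivates the kernel shrinkage estimator is easy to reproduce. A sketch (this is the positive-part James–Stein estimator shrinking the sample mean toward the origin, not the kernel estimator of Muandet et al.; all parameter values are illustrative):

```python
import numpy as np

def james_stein(X, sigma2=1.0):
    """Positive-part James-Stein estimate of mu from X_1..X_n ~ N(mu, sigma2 I),
    shrinking the sample mean toward the origin (requires d >= 3)."""
    X = np.atleast_2d(X)
    n, d = X.shape
    xbar = X.mean(axis=0)
    shrink = 1.0 - (d - 2) * sigma2 / (n * np.sum(xbar**2))
    return max(shrink, 0.0) * xbar

rng = np.random.default_rng(3)
mu = np.zeros(10)                           # a mean where shrinkage helps most
risks_mle, risks_js = [], []
for _ in range(2000):
    X = rng.normal(mu, 1.0, size=(5, 10))
    risks_mle.append(np.sum((X.mean(axis=0) - mu)**2))
    risks_js.append(np.sum((james_stein(X) - mu)**2))
print(np.mean(risks_mle), np.mean(risks_js))   # JS risk is smaller at this mu
```

Note this improves constants, not the rate: both estimators still converge at $n^{-1/2}$, which is the point of the minimax analysis that follows.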
11 Main Message
Question: can we do better using some other estimator?
Answer: for a large class of kernels, NO. We can do better in terms of constant factors (Muandet et al., 2015), but not in terms of rates with respect to the sample size $n$ or the dimensionality $d$ (if $\mathcal{X} = \mathbb{R}^d$).
Tool: minimax theory.
12 Estimation Theory: Setup
Given: a class of distributions $\mathcal{P}$ on a sample space $\mathcal{X}$, and a mapping $\theta : \mathcal{P} \to \Theta$, $P \mapsto \theta(P)$.
Goal: estimate $\theta(P)$ based on i.i.d. observations $(X_i)_{i=1}^n$ drawn from the unknown distribution $P$.
Examples:
- $\mathcal{P} = \{N(\theta, \sigma^2) : \theta \in \mathbb{R}\}$ with known variance: $\theta(P) = \int x\, dP(x)$.
- $\mathcal{P} = $ the set of all distributions: $\theta(P) = \int k(\cdot, x)\, dP(x)$.
Estimator: $\hat\theta(X_1, \ldots, X_n)$.
16 Minimax Risk
How good is the estimator $\hat\theta$? Define a distance $\rho : \Theta \times \Theta \to \mathbb{R}$ to measure the error of $\hat\theta$ for the parameter $\theta$. The average performance of $\hat\theta$ is measured by the risk
$R(\hat\theta; P) = \mathbb{E}\big[\rho(\hat\theta, \theta(P))\big]$.
Obviously, we would want an estimator that has the smallest risk for every $P$: not achievable!
Global view: minimize the average risk (Bayesian view) or the maximum risk $\sup_{P \in \mathcal{P}} \mathbb{E}\big[\rho(\hat\theta, \theta(P))\big]$.
$\hat\theta$ is called a minimax estimator if
$\sup_{P \in \mathcal{P}} \mathbb{E}\big[\rho(\hat\theta, \theta(P))\big] = \inf_{\hat\theta} \sup_{P \in \mathcal{P}} \mathbb{E}\big[\rho(\hat\theta, \theta(P))\big] =: M_n(\theta(\mathcal{P}))$, the minimax risk.
21 Minimax Estimator
Statistical decision theory has two goals: find the minimax risk $M_n(\theta(\mathcal{P}))$, and find the minimax estimator that achieves this risk. Except in simple cases, finding both is usually very hard. So we settle for an estimator that achieves the minimax rate:
$\sup_{P \in \mathcal{P}} \mathbb{E}\big[\rho(\hat\theta_a, \theta(P))\big] \asymp M_n(\theta(\mathcal{P}))$,
where $a_n \asymp b_n$ means that $\frac{a_n}{b_n}$ and $\frac{b_n}{a_n}$ are bounded.
23 Minimax Estimator
Suppose we have an estimator $\hat\theta$ such that $\sup_{P \in \mathcal{P}} \mathbb{E}\big[\rho(\hat\theta, \theta(P))\big] \le C\psi_n$ for some $C > 0$ and $\psi_n \to 0$ as $n \to \infty$. If $M_n(\theta(\mathcal{P})) \ge c\psi_n$ for some $c > 0$, then $\hat\theta$ is minimax $\psi_n$-rate optimal.
Our problem: $\theta(P) = \mu_P = \int k(\cdot, x)\, dP(x)$ and $\rho = \|\cdot - \cdot\|_{\mathcal{H}_k}$. We have $\sup_{P \in \mathcal{P}} \mathbb{E}\big[\rho(\hat\theta, \theta(P))\big] \le \frac{C}{\sqrt{n}}$ for $\hat\theta$ being the empirical estimator, the shrinkage estimator, and a kernel-density-based estimator. What is $M_n(\mu(\mathcal{P}))$?
25 From Estimation to Testing
Key idea: reduce the estimation problem to a testing problem and bound $M_n(\theta(\mathcal{P}))$ in terms of the probability of error in testing problems.
Setup: let $\{P_v\}_{v \in \mathcal{V}} \subset \mathcal{P}$ where $\mathcal{V} = \{1, \ldots, M\}$. The family induces a collection of parameters $\{\theta(P_v)\}_{v \in \mathcal{V}}$. Choose $\{P_v\}_{v \in \mathcal{V}}$ such that $\rho(\theta(P_v), \theta(P_{v'})) \ge 2\delta$ for all $v \neq v'$. Suppose we observe $(X_i)_{i=1}^n$ drawn from the $n$-fold product distribution $P_v^n$ for some $v \in \mathcal{V}$, and construct $\hat\theta(X_1, \ldots, X_n)$.
Testing problem: based on $(X_i)_{i=1}^n$, test which of the $M$ hypotheses is true.
28 From Estimation to Testing
For a measurable mapping $\Psi : \mathcal{X}^n \to \mathcal{V}$, the error probability is defined as $\max_{v \in \mathcal{V}} P_v^n(\Psi(X_1, \ldots, X_n) \neq v)$.
Minimum distance test: $\Psi = \arg\min_{v \in \mathcal{V}} \rho(\hat\theta, \theta(P_v))$. If $\rho(\hat\theta, \theta(P_v)) < \delta$ then $\Psi = v$; conversely, $\Psi \neq v$ implies $\rho(\hat\theta, \theta(P_v)) \ge \delta$, so
$P_v^n\big(\rho(\hat\theta, \theta(P_v)) \ge \delta\big) \ge P_v^n(\Psi \neq v)$.
32 From Estimation to Testing
$M_n(\theta(\mathcal{P})) = \inf_{\hat\theta} \sup_{P \in \mathcal{P}} \mathbb{E}\big[\rho(\hat\theta, \theta(P))\big] \ge \delta \inf_{\hat\theta} \sup_{P \in \mathcal{P}} P^n\big(\rho(\hat\theta, \theta(P)) \ge \delta\big) \ge \delta \inf_{\hat\theta} \max_{v \in \mathcal{V}} P_v^n\big(\rho(\hat\theta, \theta(P_v)) \ge \delta\big) \ge \delta \inf_{\Psi} \max_{v \in \mathcal{V}} P_v^n(\Psi \neq v)$,
where $\inf_{\Psi} \max_{v \in \mathcal{V}} P_v^n(\Psi \neq v)$ is the minimax probability of error. (The first inequality is Markov's; the last uses the minimum distance test from the previous slide.)
37 Minimax Probability of Error
Suppose $M = 2$, i.e., $\mathcal{V} = \{1, 2\}$. Then
$\inf_{\Psi} \max_{v \in \mathcal{V}} P_v^n(\Psi \neq v) \ge \frac{1}{2} \inf_{\Psi} \big[P_1^n(\Psi \neq 1) + P_2^n(\Psi \neq 2)\big]$.
The minimizer is the likelihood ratio test, and so
$\inf_{\Psi} \max_{v \in \mathcal{V}} P_v^n(\Psi \neq v) \ge \frac{1}{2} \int \min(dP_1^n, dP_2^n) = \frac{1 - \|P_1^n - P_2^n\|_{TV}}{2}$,
hence $M_n(\theta(\mathcal{P})) \ge \frac{\delta}{2}\big(1 - \|P_1^n - P_2^n\|_{TV}\big)$.
Recipe (Le Cam, 1973): pick $P_1$ and $P_2$ in $\mathcal{P}$ such that $\|P_1^n - P_2^n\|_{TV} \le \frac{1}{2}$ and $\rho(\theta(P_1), \theta(P_2)) \ge 2\delta$.
General theme: the minimax risk is related to the distance between distributions.
42 Le Cam's Method
Theorem. Suppose there exist $P_1, P_2 \in \mathcal{P}$ such that: $\rho(\theta(P_1), \theta(P_2)) \ge 2\delta > 0$, and $KL(P_1^n \| P_2^n) \le \alpha < \infty$. Then
$M_n(\theta(\mathcal{P})) \ge \delta \max\left( \frac{e^{-\alpha}}{4},\; \frac{1 - \sqrt{\alpha/2}}{2} \right)$.
Strategy: choose $\delta$ and guess two elements $P_1$ and $P_2$ so that the conditions are satisfied with $\alpha$ independent of $n$.
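The mechanics of the theorem can be illustrated on a toy problem. A sketch (an illustrative choice, Bernoulli mean estimation with $\rho(\theta_1, \theta_2) = |\theta_1 - \theta_2|$, not the KME setting of the talk): a separation of order $n^{-1/2}$ keeps $\alpha = n\,KL(P_1 \| P_2)$ bounded, which yields a lower bound of order $n^{-1/2}$:

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

n = 100
p1, p2 = 0.5, 0.5 + 1.0 / (3.0 * math.sqrt(n))    # separation ~ n^{-1/2}
delta = abs(p1 - p2) / 2                          # rho(theta_1, theta_2) = 2 delta
alpha = n * kl_bernoulli(p1, p2)                  # KL(P1^n || P2^n) = n KL(P1 || P2)
lower = delta * max(math.exp(-alpha) / 4, (1 - math.sqrt(alpha / 2)) / 2)
print(alpha, lower)    # alpha stays O(1) as n grows, so M_n >= c / sqrt(n)
```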
43 Main Results
44 Gaussian Kernel
Let $k(x, y) = \exp\left( -\frac{\|x - y\|^2}{2\eta^2} \right)$, $\eta > 0$. Choose
$P_1 = p_1 \delta_y + (1 - p_1)\delta_z$ and $P_2 = p_2 \delta_y + (1 - p_2)\delta_z$,
where $y, z \in \mathbb{R}^d$, $p_1 > 0$ and $p_2 > 0$. Then
$\rho^2(\theta(P_1), \theta(P_2)) = \|\mu_{P_1} - \mu_{P_2}\|^2_{\mathcal{H}_k} = 2(p_1 - p_2)^2 \left( 1 - \exp\left( -\frac{\|y - z\|^2}{2\eta^2} \right) \right) \gtrsim (p_1 - p_2)^2 \frac{\|y - z\|^2}{2\eta^2}$ if $\|y - z\|^2 \le 2\eta^2$,
and $KL(P_1^n \| P_2^n) \le \frac{n(p_1 - p_2)^2}{p_2(1 - p_2)}$.
Choose $p_2 = \frac{1}{2}$ and $p_1$ such that $(p_1 - p_2)^2 = \frac{1}{9n}$ (so that $\alpha \le \frac{4}{9}$), and $y, z$ such that $2\left(1 - \exp\left( -\frac{\|y - z\|^2}{2\eta^2} \right)\right) \ge \beta > 0$, giving $\delta = \frac{1}{2}\sqrt{\frac{\beta}{9n}}$.
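The closed form for $\rho^2$ follows because $\mu_{P_1} - \mu_{P_2} = (p_1 - p_2)\big(k(\cdot, y) - k(\cdot, z)\big)$, so the RKHS norm reduces to a $2 \times 2$ Gram computation. A numerical check (parameter values are illustrative):

```python
import numpy as np

# Check the closed form for ||mu_P1 - mu_P2||_H^2 with the Gaussian kernel
# and two-point mixtures P_i = p_i delta_y + (1 - p_i) delta_z.
eta = 1.5
y, z = np.array([0.0, 0.0]), np.array([1.0, 2.0])
p1, p2 = 0.6, 0.5

k_yz = np.exp(-np.sum((y - z)**2) / (2 * eta**2))
w = np.array([p1 - p2, p2 - p1])            # weights of mu_P1 - mu_P2 on k(.,y), k(.,z)
K = np.array([[1.0, k_yz], [k_yz, 1.0]])    # Gram matrix of {y, z}
rho2_gram = w @ K @ w
rho2_closed = 2 * (p1 - p2)**2 * (1 - k_yz)
print(rho2_gram, rho2_closed)               # the two agree exactly
```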
49 Gaussian Kernel
In words: if $\mathcal{P}$ is the set of all discrete distributions, then $M_n(\mu(\mathcal{P})) \ge \frac{1}{12}\sqrt{\frac{\beta}{n}}$. For any estimator $\hat\theta$, there always exists a discrete distribution $P$ such that $\mu_P$ cannot be estimated at a rate faster than $n^{-1/2}$.
Is such a result true if $\mathcal{P}$ is a class of distributions with smooth densities?
50 Gaussian Kernel
Let $k(x, y) = \exp\left( -\frac{\|x - y\|^2}{2\eta^2} \right)$, $\eta > 0$. Choose
$P_1 = N(\mu_1, \sigma^2 I)$ and $P_2 = N(\mu_2, \sigma^2 I)$,
where $\mu_1, \mu_2 \in \mathbb{R}^d$ and $\sigma > 0$. Then
$\rho^2(\theta(P_1), \theta(P_2)) = 2\left( \frac{2\eta^2}{2\eta^2 + 4\sigma^2} \right)^{d/2} \left( 1 - \exp\left( -\frac{\|\mu_1 - \mu_2\|^2}{2\eta^2 + 4\sigma^2} \right) \right) \gtrsim \left( \frac{2\eta^2}{2\eta^2 + 4\sigma^2} \right)^{d/2} \frac{\|\mu_1 - \mu_2\|^2}{2\eta^2 + 4\sigma^2}$ if $\|\mu_1 - \mu_2\|^2 \le 2\eta^2 + 4\sigma^2$,
and $KL(P_1^n \| P_2^n) = \frac{n\|\mu_1 - \mu_2\|^2}{2\sigma^2}$.
Choose $\mu_1$ and $\mu_2$ such that $\|\mu_1 - \mu_2\|^2 = \frac{2\sigma^2 \alpha}{n}$, and $\sigma^2 = \frac{\eta^2}{2d}$, giving $\delta = \frac{C}{\sqrt{n}}$. (With $\sigma^2 = \frac{\eta^2}{2d}$, the factor $\left( \frac{2\eta^2}{2\eta^2 + 4\sigma^2} \right)^{d/2} = (1 + 1/d)^{-d/2} \ge e^{-1/2}$ is bounded away from zero, so $C$ does not degrade with $d$.)
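The Gaussian-on-Gaussian closed form can be verified by Monte Carlo, using $\|\mu_{P_1} - \mu_{P_2}\|^2_{\mathcal{H}_k} = \mathbb{E}\,k(X, X') + \mathbb{E}\,k(Y, Y') - 2\,\mathbb{E}\,k(X, Y)$ with $X, X' \sim P_1$ and $Y, Y' \sim P_2$ independent. A sketch (all parameter values are illustrative):

```python
import numpy as np

# Monte Carlo check of the closed form for ||mu_{N(m1, s^2 I)} - mu_{N(m2, s^2 I)}||_H^2
# with the Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 eta^2)).
rng = np.random.default_rng(4)
eta, sigma, d, n = 1.0, 0.5, 2, 200_000
m1, m2 = np.zeros(d), np.array([0.7, -0.3])

def mean_k(A, B):
    """Average of k(a_i, b_i) over paired rows, estimating E k(X, Y)."""
    return np.exp(-np.sum((A - B)**2, axis=1) / (2 * eta**2)).mean()

X, Xp = (rng.normal(m1, sigma, size=(n, d)) for _ in range(2))
Y, Yp = (rng.normal(m2, sigma, size=(n, d)) for _ in range(2))
rho2_mc = mean_k(X, Xp) + mean_k(Y, Yp) - 2 * mean_k(X, Y)

c = 2 * eta**2 / (2 * eta**2 + 4 * sigma**2)
rho2_closed = 2 * c**(d / 2) * (
    1 - np.exp(-np.sum((m1 - m2)**2) / (2 * eta**2 + 4 * sigma**2)))
print(rho2_mc, rho2_closed)   # agree up to Monte Carlo error
```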
55 General Result
Theorem. Suppose $\mathcal{P}$ is the set of all discrete distributions on $\mathbb{R}^d$. Let $k$ be shift-invariant, i.e., $k(x, y) = \psi(x - y)$ with $\psi \in C_b(\mathbb{R}^d)$, and characteristic. Assume there exist $x_0 \in \mathbb{R}^d$ and $\beta > 0$ such that $\psi(0) - \psi(x_0) \ge \beta$. Then
$M_n(\mu(\mathcal{P})) \ge \frac{1}{24}\sqrt{\frac{2\beta}{n}}$.
56 General Result
Theorem. Suppose $\mathcal{P}$ is the set of all distributions with infinitely differentiable densities on $\mathbb{R}^d$. Let $k$ be shift-invariant, i.e., $k(x, y) = \psi(x - y)$ with $\psi \in C_b(\mathbb{R}^d)$, and characteristic. Then there exist constants $c_\psi, \epsilon_\psi > 0$ depending only on $\psi$ such that for any $n \ge \frac{1}{\epsilon_\psi}$:
$M_n(\mu(\mathcal{P})) \ge \frac{1}{8}\sqrt{\frac{c_\psi}{2n}}$.
Idea: exactly the same as for the Gaussian kernel. But the crucial work is in showing that there exist constants $\epsilon_{\psi,\sigma^2}$ and $c_{\psi,\sigma^2}$ such that if $\|\mu_1 - \mu_2\|^2 \le \epsilon_{\psi,\sigma^2}$ then
$\|\mu(N(\mu_1, \sigma^2 I)) - \mu(N(\mu_2, \sigma^2 I))\|_{\mathcal{H}_k} \ge c_{\psi,\sigma^2}\, \|\mu_1 - \mu_2\|$.
57 Summary
- Mean embeddings of distributions are popular in various applications.
- Various estimators of the kernel mean are available.
- We provide a theoretical justification for using these estimators, particularly the empirical estimator: the empirical estimator of the mean embedding is minimax rate optimal, with rate $n^{-1/2}$.
58 Thank You
University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1998 Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ Lawrence D. Brown University
More informationDivide and Conquer Kernel Ridge Regression. A Distributed Algorithm with Minimax Optimal Rates
: A Distributed Algorithm with Minimax Optimal Rates Yuchen Zhang, John C. Duchi, Martin Wainwright (UC Berkeley;http://arxiv.org/pdf/1305.509; Apr 9, 014) Gatsby Unit, Tea Talk June 10, 014 Outline Motivation.
More informationPersistent homology and nonparametric regression
Cleveland State University March 10, 2009, BIRS: Data Analysis using Computational Topology and Geometric Statistics joint work with Gunnar Carlsson (Stanford), Moo Chung (Wisconsin Madison), Peter Kim
More informationAdaptive HMC via the Infinite Exponential Family
Adaptive HMC via the Infinite Exponential Family Arthur Gretton Gatsby Unit, CSML, University College London RegML, 2017 Arthur Gretton (Gatsby Unit, UCL) Adaptive HMC via the Infinite Exponential Family
More informationLecture 21: Minimax Theory
Lecture : Minimax Theory Akshay Krishnamurthy akshay@cs.umass.edu November 8, 07 Recap In the first part of the course, we spent the majority of our time studying risk minimization. We found many ways
More informationDistribution Regression with Minimax-Optimal Guarantee
Distribution Regression with Minimax-Optimal Guarantee (Gatsby Unit, UCL) Joint work with Bharath K. Sriperumbudur (Department of Statistics, PSU), Barnabás Póczos (ML Department, CMU), Arthur Gretton
More informationApproximate Kernel Methods
Lecture 3 Approximate Kernel Methods Bharath K. Sriperumbudur Department of Statistics, Pennsylvania State University Machine Learning Summer School Tübingen, 207 Outline Motivating example Ridge regression
More informationLecture 9: October 25, Lower bounds for minimax rates via multiple hypotheses
Information and Coding Theory Autumn 07 Lecturer: Madhur Tulsiani Lecture 9: October 5, 07 Lower bounds for minimax rates via multiple hypotheses In this lecture, we extend the ideas from the previous
More information21.1 Lower bounds on minimax risk for functional estimation
ECE598: Information-theoretic methods in high-dimensional statistics Spring 016 Lecture 1: Functional estimation & testing Lecturer: Yihong Wu Scribe: Ashok Vardhan, Apr 14, 016 In this chapter, we will
More informationMinimax-optimal distribution regression
Zoltán Szabó (Gatsby Unit, UCL) Joint work with Bharath K. Sriperumbudur (Department of Statistics, PSU), Barnabás Póczos (ML Department, CMU), Arthur Gretton (Gatsby Unit, UCL) ISNPS, Avignon June 12,
More information10-704: Information Processing and Learning Fall Lecture 21: Nov 14. sup
0-704: Information Processing and Learning Fall 206 Lecturer: Aarti Singh Lecture 2: Nov 4 Note: hese notes are based on scribed notes from Spring5 offering of this course LaeX template courtesy of UC
More informationLocal Asymptotic Normality in Quantum Statistics
Local Asymptotic Normality in Quantum Statistics M!d!lin Gu"! School of Mathematical Sciences University of Nottingham Richard Gill (Leiden) Jonas Kahn (Paris XI) Bas Janssens (Utrecht) Anna Jencova (Bratislava)
More informationKernel Mean Estimation via Spectral Filtering
Kernel Mean Estimation via Spectral Filtering Krikamol Muandet MPI-IS, Tübingen krikamol@tue.mpg.de Bharath Sriperumbudur Dept. of Statistics, PSU bks8@psu.edu Bernhard Schölkopf MPI-IS, Tübingen bs@tue.mpg.de
More informationA Linear-Time Kernel Goodness-of-Fit Test
A Linear-Time Kernel Goodness-of-Fit Test Wittawat Jitkrittum 1; Wenkai Xu 1 Zoltán Szabó 2 NIPS 2017 Best paper! Kenji Fukumizu 3 Arthur Gretton 1 wittawatj@gmail.com 1 Gatsby Unit, University College
More informationStatistical Measures of Uncertainty in Inverse Problems
Statistical Measures of Uncertainty in Inverse Problems Workshop on Uncertainty in Inverse Problems Institute for Mathematics and Its Applications Minneapolis, MN 19-26 April 2002 P.B. Stark Department
More informationPart III. A Decision-Theoretic Approach and Bayesian testing
Part III A Decision-Theoretic Approach and Bayesian testing 1 Chapter 10 Bayesian Inference as a Decision Problem The decision-theoretic framework starts with the following situation. We would like to
More informationAsymptotic efficiency of simple decisions for the compound decision problem
Asymptotic efficiency of simple decisions for the compound decision problem Eitan Greenshtein and Ya acov Ritov Department of Statistical Sciences Duke University Durham, NC 27708-0251, USA e-mail: eitan.greenshtein@gmail.com
More informationBig Hypothesis Testing with Kernel Embeddings
Big Hypothesis Testing with Kernel Embeddings Dino Sejdinovic Department of Statistics University of Oxford 9 January 2015 UCL Workshop on the Theory of Big Data D. Sejdinovic (Statistics, Oxford) Big
More information1 Exercises for lecture 1
1 Exercises for lecture 1 Exercise 1 a) Show that if F is symmetric with respect to µ, and E( X )
More information1 Hypothesis Testing and Model Selection
A Short Course on Bayesian Inference (based on An Introduction to Bayesian Analysis: Theory and Methods by Ghosh, Delampady and Samanta) Module 6: From Chapter 6 of GDS 1 Hypothesis Testing and Model Selection
More informationTensor Product Kernels: Characteristic Property, Universality
Tensor Product Kernels: Characteristic Property, Universality Zolta n Szabo CMAP, E cole Polytechnique Joint work with: Bharath K. Sriperumbudur Hangzhou International Conference on Frontiers of Data Science
More informationInverse Statistical Learning
Inverse Statistical Learning Minimax theory, adaptation and algorithm avec (par ordre d apparition) C. Marteau, M. Chichignoud, C. Brunet and S. Souchet Dijon, le 15 janvier 2014 Inverse Statistical Learning
More informationBayesian Interpretations of Regularization
Bayesian Interpretations of Regularization Charlie Frogner 9.50 Class 15 April 1, 009 The Plan Regularized least squares maps {(x i, y i )} n i=1 to a function that minimizes the regularized loss: f S
More informationDistribution Regression
Zoltán Szabó (École Polytechnique) Joint work with Bharath K. Sriperumbudur (Department of Statistics, PSU), Barnabás Póczos (ML Department, CMU), Arthur Gretton (Gatsby Unit, UCL) Dagstuhl Seminar 16481
More informationLecture 20 May 18, Empirical Bayes Interpretation [Efron & Morris 1973]
Stats 300C: Theory of Statistics Spring 2018 Lecture 20 May 18, 2018 Prof. Emmanuel Candes Scribe: Will Fithian and E. Candes 1 Outline 1. Stein s Phenomenon 2. Empirical Bayes Interpretation of James-Stein
More informationThe Minimum Message Length Principle for Inductive Inference
The Principle for Inductive Inference Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population Health University of Melbourne University of Helsinki, August 25,
More informationSTAT 830 Decision Theory and Bayesian Methods
STAT 830 Decision Theory and Bayesian Methods Example: Decide between 4 modes of transportation to work: B = Ride my bike. C = Take the car. T = Use public transit. H = Stay home. Costs depend on weather:
More informationLecture 8: Information Theory and Statistics
Lecture 8: Information Theory and Statistics Part II: Hypothesis Testing and Estimation I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 22, 2015
More information5.2 Expounding on the Admissibility of Shrinkage Estimators
STAT 383C: Statistical Modeling I Fall 2015 Lecture 5 September 15 Lecturer: Purnamrita Sarkar Scribe: Ryan O Donnell Disclaimer: These scribe notes have been slightly proofread and may have typos etc
More informationSession 2B: Some basic simulation methods
Session 2B: Some basic simulation methods John Geweke Bayesian Econometrics and its Applications August 14, 2012 ohn Geweke Bayesian Econometrics and its Applications Session 2B: Some () basic simulation
More informationModel Selection and Geometry
Model Selection and Geometry Pascal Massart Université Paris-Sud, Orsay Leipzig, February Purpose of the talk! Concentration of measure plays a fundamental role in the theory of model selection! Model
More informationStein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm
Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm Qiang Liu and Dilin Wang NIPS 2016 Discussion by Yunchen Pu March 17, 2017 March 17, 2017 1 / 8 Introduction Let x R d
More informationKernel Methods. Lecture 4: Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton
Kernel Methods Lecture 4: Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton Alexander J. Smola Statistical Machine Learning Program Canberra,
More informationDiscussion of Regularization of Wavelets Approximations by A. Antoniadis and J. Fan
Discussion of Regularization of Wavelets Approximations by A. Antoniadis and J. Fan T. Tony Cai Department of Statistics The Wharton School University of Pennsylvania Professors Antoniadis and Fan are
More informationLeast squares under convex constraint
Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption
More informationKernel methods for comparing distributions, measuring dependence
Kernel methods for comparing distributions, measuring dependence Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Principal component analysis Given a set of M centered observations
More informationGeneralization theory
Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1
More informationShrinkage Estimation in High Dimensions
Shrinkage Estimation in High Dimensions Pavan Srinath and Ramji Venkataramanan University of Cambridge ITA 206 / 20 The Estimation Problem θ R n is a vector of parameters, to be estimated from an observation
More informationThe geometry of Gaussian processes and Bayesian optimization. Contal CMLA, ENS Cachan
The geometry of Gaussian processes and Bayesian optimization. Contal CMLA, ENS Cachan Background: Global Optimization and Gaussian Processes The Geometry of Gaussian Processes and the Chaining Trick Algorithm
More informationIntroduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf
1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf 2013-14 We know that X ~ B(n,p), but we do not know p. We get a random sample
More informationSliced Inverse Regression
Sliced Inverse Regression Ge Zhao gzz13@psu.edu Department of Statistics The Pennsylvania State University Outline Background of Sliced Inverse Regression (SIR) Dimension Reduction Definition of SIR Inversed
More informationGaussian processes for inference in stochastic differential equations
Gaussian processes for inference in stochastic differential equations Manfred Opper, AI group, TU Berlin November 6, 2017 Manfred Opper, AI group, TU Berlin (TU Berlin) inference in SDE November 6, 2017
More informationTopic 12 Overview of Estimation
Topic 12 Overview of Estimation Classical Statistics 1 / 9 Outline Introduction Parameter Estimation Classical Statistics Densities and Likelihoods 2 / 9 Introduction In the simplest possible terms, the
More informationMachine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io
Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem
More informationLecture 13: Subsampling vs Bootstrap. Dimitris N. Politis, Joseph P. Romano, Michael Wolf
Lecture 13: 2011 Bootstrap ) R n x n, θ P)) = τ n ˆθn θ P) Example: ˆθn = X n, τ n = n, θ = EX = µ P) ˆθ = min X n, τ n = n, θ P) = sup{x : F x) 0} ) Define: J n P), the distribution of τ n ˆθ n θ P) under
More informationBIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation
BIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation Yujin Chung November 29th, 2016 Fall 2016 Yujin Chung Lec13: MLE Fall 2016 1/24 Previous Parametric tests Mean comparisons (normality assumption)
More informationMultivariate Topological Data Analysis
Cleveland State University November 20, 2008, Duke University joint work with Gunnar Carlsson (Stanford), Peter Kim and Zhiming Luo (Guelph), and Moo Chung (Wisconsin Madison) Framework Ideal Truth (Parameter)
More informationarxiv: v4 [stat.me] 27 Nov 2017
CLASSIFICATION OF LOCAL FIELD POTENTIALS USING GAUSSIAN SEQUENCE MODEL Taposh Banerjee John Choi Bijan Pesaran Demba Ba and Vahid Tarokh School of Engineering and Applied Sciences, Harvard University Center
More informationFast learning rates for plug-in classifiers under the margin condition
Fast learning rates for plug-in classifiers under the margin condition Jean-Yves Audibert 1 Alexandre B. Tsybakov 2 1 Certis ParisTech - Ecole des Ponts, France 2 LPMA Université Pierre et Marie Curie,
More informationA Very Brief Summary of Statistical Inference, and Examples
A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2008 Prof. Gesine Reinert 1 Data x = x 1, x 2,..., x n, realisations of random variables X 1, X 2,..., X n with distribution (model)
More informationStatistical Approaches to Learning and Discovery. Week 4: Decision Theory and Risk Minimization. February 3, 2003
Statistical Approaches to Learning and Discovery Week 4: Decision Theory and Risk Minimization February 3, 2003 Recall From Last Time Bayesian expected loss is ρ(π, a) = E π [L(θ, a)] = L(θ, a) df π (θ)
More informationLecture 17: Density Estimation Lecturer: Yihong Wu Scribe: Jiaqi Mu, Mar 31, 2016 [Ed. Apr 1]
ECE598: Information-theoretic methods in high-dimensional statistics Spring 06 Lecture 7: Density Estimation Lecturer: Yihong Wu Scribe: Jiaqi Mu, Mar 3, 06 [Ed. Apr ] In last lecture, we studied the minimax
More informationlarge number of i.i.d. observations from P. For concreteness, suppose
1 Subsampling Suppose X i, i = 1,..., n is an i.i.d. sequence of random variables with distribution P. Let θ(p ) be some real-valued parameter of interest, and let ˆθ n = ˆθ n (X 1,..., X n ) be some estimate
More informationA spectral clustering algorithm based on Gram operators
A spectral clustering algorithm based on Gram operators Ilaria Giulini De partement de Mathe matiques et Applications ENS, Paris Joint work with Olivier Catoni 1 july 2015 Clustering task of grouping
More informationLecture 14 October 13
STAT 383C: Statistical Modeling I Fall 2015 Lecture 14 October 13 Lecturer: Purnamrita Sarkar Scribe: Some one Disclaimer: These scribe notes have been slightly proofread and may have typos etc. Note:
More informationNonparametric Inference In Functional Data
Nonparametric Inference In Functional Data Zuofeng Shang Purdue University Joint work with Guang Cheng from Purdue Univ. An Example Consider the functional linear model: Y = α + where 1 0 X(t)β(t)dt +
More informationGAUSSIAN PROCESS REGRESSION
GAUSSIAN PROCESS REGRESSION CSE 515T Spring 2015 1. BACKGROUND The kernel trick again... The Kernel Trick Consider again the linear regression model: y(x) = φ(x) w + ε, with prior p(w) = N (w; 0, Σ). The
More information