Minimax Estimation of Kernel Mean Embeddings

1 Minimax Estimation of Kernel Mean Embeddings. Bharath K. Sriperumbudur, Department of Statistics, Pennsylvania State University. Gatsby Computational Neuroscience Unit, May 4, 2016.

2 Collaborators Dr. Ilya Tolstikhin: Max Planck Institute for Intelligent Systems, Tübingen. Dr. Krikamol Muandet: Max Planck Institute for Intelligent Systems, Tübingen.

3 Kernel Mean Embedding (KME) Let $k : X \times X \to \mathbb{R}$ be a positive definite kernel. Kernel trick: $y \mapsto k(\cdot, y) = \varphi(y)$. Equivalently, $\delta_y \mapsto \int_X k(\cdot, x)\, d\delta_y(x)$. Generalization: $P \mapsto \int_X k(\cdot, x)\, dP(x) =: \mu_P$, the kernel mean embedding.
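As an illustration of the definition (a minimal sketch added by the editor, not from the slides; the Gaussian kernel, the bandwidth `eta`, and the sampling distribution are arbitrary choices), the empirical embedding $\hat\mu_P = \frac{1}{n}\sum_i k(\cdot, X_i)$ can be represented as a function that evaluates the kernel average at any query point:

```python
import numpy as np

def gaussian_kernel(X, t, eta=1.0):
    """k(t, x) = exp(-||t - x||^2 / (2 * eta^2)), evaluated for every row x of X."""
    return np.exp(-np.sum((X - t) ** 2, axis=-1) / (2 * eta ** 2))

def empirical_kme(X, eta=1.0):
    """Return the function t -> (1/n) * sum_i k(t, X_i), the empirical mean embedding."""
    return lambda t: np.mean(gaussian_kernel(X, t, eta))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))            # X_1, ..., X_n drawn i.i.d. from some P
mu_hat = empirical_kme(X, eta=1.0)
print(mu_hat(np.zeros(2)))               # value of the embedding at the point t = 0
```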

4 Properties KME is a generalization of the characteristic function, $k(\cdot, x) = e^{\sqrt{-1}\,\langle \cdot,\, x\rangle}$, $x \in \mathbb{R}^d$, and the moment generating function, $k(\cdot, x) = e^{\langle \cdot,\, x\rangle}$, $x \in \mathbb{R}^d$, to arbitrary $X$. In general, many $P$ can yield the same KME: for $k(x, y) = \langle x, y\rangle$, the map $P \mapsto \mu_P$ only records the mean of $P$. Characteristic kernels ensure that no two different $P$ can have the same KME, i.e., $P \mapsto \int_X k(\cdot, x)\, dP(x)$ is one-to-one. Examples: Gaussian, Matérn, ... (infinite dimensional RKHS).

5 Application: Two-Sample Problem Given random samples $\{X_1, \ldots, X_m\}$ i.i.d. $P$ and $\{Y_1, \ldots, Y_n\}$ i.i.d. $Q$, determine: $P = Q$ or $P \ne Q$? Let $\gamma(P, Q)$ be a distance metric between $P$ and $Q$; then $H_0 : P = Q$ is equivalent to $H_0 : \gamma(P, Q) = 0$, and $H_1 : P \ne Q$ to $H_1 : \gamma(P, Q) > 0$. Test: say $H_0$ if $\hat\gamma(\{X_i\}_{i=1}^m, \{Y_j\}_{j=1}^n) < \varepsilon$; otherwise say $H_1$. Idea: use $\gamma(P, Q) = \left\| \int k(\cdot, x)\, dP(x) - \int k(\cdot, x)\, dQ(x) \right\|_{\mathcal{H}_k}$ with $k$ characteristic.
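A minimal sketch of the resulting test statistic (editor's illustration; a Gaussian kernel and the biased V-statistic are assumed, and the threshold $\varepsilon$ would in practice be calibrated, e.g., by permutation): $\hat\gamma$ is the maximum mean discrepancy (MMD) between the two empirical embeddings.

```python
import numpy as np

def sq_dists(A, B):
    """Pairwise squared Euclidean distances between the rows of A and B."""
    return np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T

def mmd(X, Y, eta=1.0):
    """Biased (V-statistic) estimate of gamma(P, Q) = ||mu_P - mu_Q||_{H_k}, Gaussian kernel."""
    k = lambda D: np.exp(-D / (2 * eta ** 2))
    v = k(sq_dists(X, X)).mean() + k(sq_dists(Y, Y)).mean() - 2 * k(sq_dists(X, Y)).mean()
    return np.sqrt(max(v, 0.0))

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(200, 1))   # sample from P
Y = rng.normal(0.5, 1.0, size=(200, 1))   # sample from Q
print(mmd(X, Y))                          # say H1 if this exceeds a calibrated threshold eps
```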

6 More Applications Testing for independence (Gretton et al., 2008) Conditional independence tests (Fukumizu et al., 2008) Feature selection (Song et al., 2012) Distribution regression (Szabó et al., 2015) Causal inference (Lopez-Paz et al., 2015) Mixture density estimation (Sriperumbudur, 2011),...

7 Estimators of KME In applications, $P$ is unknown and only samples $\{X_i\}_{i=1}^n$ from it are available. A popular estimator of the KME that has been employed in all these applications is the empirical estimator $\hat\mu_P = \frac{1}{n}\sum_{i=1}^n k(\cdot, X_i)$.
Theorem (Smola et al., 2007; Gretton et al., 2012; Lopez-Paz et al., 2015). Suppose $\sup_{x \in X} k(x, x) \le C < \infty$ with $k$ continuous. Then for any $\tau > 0$,
$P^n\left(\left\{(X_i)_{i=1}^n : \|\hat\mu_P - \mu_P\|_{\mathcal{H}_k} \le \sqrt{\tfrac{C}{n}} + \sqrt{\tfrac{2C\tau}{n}}\right\}\right) \ge 1 - e^{-\tau}$.
Alternatively, $E\|\hat\mu_P - \mu_P\|_{\mathcal{H}_k} \le \frac{C}{\sqrt{n}}$ for some $C > 0$.
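To see the $n^{-1/2}$ rate of the theorem numerically (editor's sketch, assuming $P = N(0, \sigma^2 I)$ and a Gaussian kernel so that $\mu_P$ and $\|\mu_P\|_{\mathcal{H}_k}$ have closed forms; all parameter values are illustrative):

```python
import numpy as np

def rkhs_error(X, eta, sigma):
    """Exact ||mu_hat_P - mu_P||_{H_k} for P = N(0, sigma^2 I_d) and a Gaussian kernel."""
    n, d = X.shape
    D = np.sum(X ** 2, 1)[:, None] + np.sum(X ** 2, 1)[None, :] - 2 * X @ X.T
    norm_hat_sq = np.exp(-D / (2 * eta ** 2)).mean()                  # ||mu_hat_P||^2
    r1 = eta ** 2 / (eta ** 2 + sigma ** 2)
    r2 = eta ** 2 / (eta ** 2 + 2 * sigma ** 2)                       # r2^(d/2) = ||mu_P||^2
    mu_at_X = r1 ** (d / 2) * np.exp(-np.sum(X ** 2, 1) / (2 * (eta ** 2 + sigma ** 2)))
    return np.sqrt(max(norm_hat_sq - 2 * mu_at_X.mean() + r2 ** (d / 2), 0.0))

rng = np.random.default_rng(2)
for n in [100, 400, 1600]:
    errs = [rkhs_error(rng.normal(0, 1, (n, 2)), eta=1.0, sigma=1.0) for _ in range(20)]
    print(n, float(np.mean(errs)))        # the error roughly halves whenever n quadruples
```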

8 Shrinkage Estimator Given $(X_i)_{i=1}^n$ i.i.d. $N(\mu, \sigma^2 I)$, suppose we are interested in estimating $\mu \in \mathbb{R}^d$. Maximum likelihood estimator: $\hat\mu = \frac{1}{n}\sum_{i=1}^n X_i$, which is the empirical estimator. James and Stein (1961) constructed an estimator $\check\mu$ such that for $d \ge 3$, $E\|\check\mu - \mu\|^2 \le E\|\hat\mu - \mu\|^2$ for all $\mu \in \mathbb{R}^d$, with strict inequality for at least one $\mu$. Kernel setting: based on this motivation, Muandet et al. (2015) proposed a shrinkage estimator $\check\mu_P$ of $\mu_P$ and showed that $E\|\check\mu_P - \mu_P\|^2_{\mathcal{H}_k} < E\|\hat\mu_P - \mu_P\|^2_{\mathcal{H}_k} + O_p(n^{-3/2})$ as $n \to \infty$, and $E\|\check\mu_P - \mu_P\|_{\mathcal{H}_k} \le C n^{-1/2}$ for some $C > 0$.
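For intuition about the shrinkage idea, here is a sketch of the classical James–Stein estimator itself (editor's illustration: the positive-part version, shrinking toward the origin, with $\sigma$ assumed known); this is the finite-dimensional motivation, not the kernel shrinkage estimator of Muandet et al. (2015).

```python
import numpy as np

def james_stein(X, sigma):
    """Positive-part James-Stein estimate of mu from X_i ~ N(mu, sigma^2 I_d), d >= 3."""
    n, d = X.shape
    xbar = X.mean(axis=0)                                  # MLE / empirical estimator
    shrink = max(0.0, 1.0 - (d - 2) * sigma ** 2 / (n * np.sum(xbar ** 2)))
    return shrink * xbar

rng = np.random.default_rng(3)
mu, sigma, n, d = np.full(10, 0.2), 1.0, 5, 10
mle_risk, js_risk = [], []
for _ in range(2000):
    X = rng.normal(mu, sigma, size=(n, d))
    mle_risk.append(np.sum((X.mean(0) - mu) ** 2))
    js_risk.append(np.sum((james_stein(X, sigma) - mu) ** 2))
print(float(np.mean(mle_risk)), float(np.mean(js_risk)))   # JS risk is smaller for d >= 3
```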

11 Main Message Question: Can we do better using some other estimators? Answer: for a large class of kernels, the answer is NO. We can do better in terms of constant factors (Muandet et al., 2015), but not in terms of rates w.r.t. the sample size $n$ or the dimensionality $d$ (if $X = \mathbb{R}^d$). Tool: minimax theory.

12 Estimation Theory: Setup Given: a class of distributions $\mathcal{P}$ on a sample space $X$; a mapping $\theta : \mathcal{P} \to \Theta$, $P \mapsto \theta(P)$. Goal: estimate $\theta(P)$ based on i.i.d. observations $(X_i)_{i=1}^n$ drawn from the unknown distribution $P$. Examples: $\mathcal{P} = \{N(\theta, \sigma^2) : \theta \in \mathbb{R}\}$ with known variance: $\theta(P) = \int x\, dP(x)$; $\mathcal{P}$ = the set of all distributions: $\theta(P) = \int k(\cdot, x)\, dP(x)$. Estimator: $\hat\theta(X_1, \ldots, X_n)$.

16 Minimax Risk How good is the estimator $\hat\theta$? Define a distance $\rho : \Theta \times \Theta \to \mathbb{R}$ to measure the error of $\hat\theta$ for the parameter $\theta$. The average performance of $\hat\theta$ is measured by the risk $R(\hat\theta; P) = E\left[\rho(\hat\theta, \theta(P))\right]$. Obviously, we would want an estimator that has the smallest risk for every $P$: not achievable! Global view: minimize the average risk (Bayesian view) or the maximum risk $\sup_{P \in \mathcal{P}} E\left[\rho(\hat\theta, \theta(P))\right]$. $\hat\theta$ is called a minimax estimator if $\sup_{P \in \mathcal{P}} E\left[\rho(\hat\theta, \theta(P))\right] = \inf_{\hat\theta} \sup_{P \in \mathcal{P}} E\left[\rho(\hat\theta, \theta(P))\right] =: M_n(\theta(\mathcal{P}))$, the minimax risk.
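The risk is just an expectation, so for any fixed $P$ it can be approximated by Monte Carlo. A tiny sketch (editor's illustration with $P = N(\theta, 1)$, $\theta(P)$ the mean, $\rho$ the absolute error, and the sample mean as $\hat\theta$):

```python
import numpy as np

def risk(estimator, draw_sample, theta_true, reps=5000, seed=0):
    """Monte-Carlo estimate of R(theta_hat; P) = E[ rho(theta_hat, theta(P)) ], rho = abs."""
    rng = np.random.default_rng(seed)
    errors = [abs(estimator(draw_sample(rng)) - theta_true) for _ in range(reps)]
    return float(np.mean(errors))

# P = N(0.3, 1), theta(P) = 0.3, estimator = sample mean of n = 100 observations
print(risk(np.mean, lambda rng: rng.normal(0.3, 1.0, size=100), theta_true=0.3))
```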

21 Minimax Estimator Statistical decision theory has two goals: find the minimax risk $M_n(\theta(\mathcal{P}))$, and find the minimax estimator that achieves this risk. Except in simple cases, finding both the minimax risk and the minimax estimator is usually very hard. So we settle for an estimator that achieves the minimax rate: $\sup_{P \in \mathcal{P}} E\left[\rho(\hat\theta, \theta(P))\right] \asymp M_n(\theta(\mathcal{P}))$, where $a_n \asymp b_n$ means that $a_n/b_n$ and $b_n/a_n$ are bounded.

23 Minimax Estimator Suppose we have an estimator $\hat\theta$ such that $\sup_{P \in \mathcal{P}} E\left[\rho(\hat\theta, \theta(P))\right] \le C\psi_n$ for some $C > 0$ and $\psi_n \to 0$ as $n \to \infty$. If $M_n(\theta(\mathcal{P})) \ge c\psi_n$ for some $c > 0$, then $\hat\theta$ is minimax $\psi_n$-rate optimal. Our problem: $\theta(P) = \mu_P = \int k(\cdot, x)\, dP(x)$ and $\rho = \|\cdot - \cdot\|_{\mathcal{H}_k}$. We have $\sup_{P \in \mathcal{P}} E\left[\rho(\hat\theta, \theta(P))\right] \le \frac{C}{\sqrt{n}}$ for $\hat\theta$ the empirical estimator, the shrinkage estimator, and the kernel-density-based estimator. What is $M_n(\mu(\mathcal{P}))$?

25 From Estimation to Testing Key idea: reduce the estimation problem to a testing problem and bound $M_n(\theta(\mathcal{P}))$ in terms of the probability of error in testing problems. Setup: let $\{P_v\}_{v \in V} \subset \mathcal{P}$ where $V = \{1, \ldots, M\}$. The family induces a collection of parameters $\{\theta(P_v)\}_{v \in V}$. Choose $\{P_v\}_{v \in V}$ such that $\rho(\theta(P_v), \theta(P_{v'})) \ge 2\delta$ for all $v \ne v'$. Suppose we observe $(X_i)_{i=1}^n$ drawn from the $n$-fold product distribution $P_v^n$ for some $v \in V$, and construct $\hat\theta(X_1, \ldots, X_n)$. Testing problem: based on $(X_i)_{i=1}^n$, test which of the $M$ hypotheses is true.

28 From Estimation to Testing For a measurable mapping $\Psi : X^n \to V$, the error probability is defined as $\max_{v \in V} P_v^n(\Psi(X_1, \ldots, X_n) \ne v)$. Minimum distance test: $\Psi = \arg\min_{v \in V} \rho(\hat\theta, \theta(P_v))$. Since the $\theta(P_v)$ are $2\delta$-separated, $\rho(\hat\theta, \theta(P_v)) < \delta \Rightarrow \Psi = v$, equivalently $\Psi \ne v \Rightarrow \rho(\hat\theta, \theta(P_v)) \ge \delta$, and hence $P_v^n\big(\rho(\hat\theta, \theta(P_v)) \ge \delta\big) \ge P_v^n(\Psi \ne v)$.
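The minimum distance test is one line of code. A sketch (editor's illustration; the candidate parameters and the Euclidean $\rho$ are placeholders for whatever $\theta(P_v)$ and $\rho$ are in a given problem):

```python
import numpy as np

def min_distance_test(theta_hat, candidates):
    """Psi = argmin_v rho(theta_hat, theta(P_v)), with rho the Euclidean norm here."""
    return int(np.argmin([np.linalg.norm(theta_hat - c) for c in candidates]))

candidates = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(min_distance_test(np.array([0.1, -0.05]), candidates))   # returns index 0
```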

32 From Estimation to Testing Putting these together (the first inequality is Markov's inequality):
$M_n(\theta(\mathcal{P})) = \inf_{\hat\theta} \sup_{P \in \mathcal{P}} E\left[\rho(\hat\theta, \theta(P))\right] \ge \delta \inf_{\hat\theta} \sup_{P \in \mathcal{P}} P^n\big(\rho(\hat\theta, \theta(P)) \ge \delta\big) \ge \delta \inf_{\hat\theta} \max_{v \in V} P_v^n\big(\rho(\hat\theta, \theta(P_v)) \ge \delta\big) \ge \delta \inf_{\Psi} \max_{v \in V} P_v^n(\Psi \ne v)$,
where the last infimum-maximum is the minimax probability of error.

37 Minimax Probability of Error Suppose $M = 2$, i.e., $V = \{1, 2\}$. Then $\inf_{\Psi} \max_{v \in V} P_v^n(\Psi \ne v) \ge \frac{1}{2} \inf_{\Psi} \left[P_1^n(\Psi \ne 1) + P_2^n(\Psi \ne 2)\right]$. The minimizer is the likelihood ratio test, and so $\inf_{\Psi} \max_{v \in V} P_v^n(\Psi \ne v) \ge \frac{1}{2} \int \min(dP_1^n, dP_2^n) = \frac{1 - \|P_1^n - P_2^n\|_{TV}}{2}$. Therefore $M_n(\theta(\mathcal{P})) \ge \frac{\delta}{2}\left(1 - \|P_1^n - P_2^n\|_{TV}\right)$. Recipe (Le Cam, 1973): pick $P_1$ and $P_2$ in $\mathcal{P}$ such that $\|P_1^n - P_2^n\|_{TV} \le \frac{1}{2}$ and $\rho(\theta(P_1), \theta(P_2)) \ge 2\delta$. General theme: the minimax risk is related to the distance between distributions.
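When $P_1$ and $P_2$ are supported on the same two points, the likelihood ratio of $P_1^n$ against $P_2^n$ depends on the sample only through the count of one support point, so $\|P_1^n - P_2^n\|_{TV}$ equals the total variation distance between two Binomial distributions and can be computed exactly. A sketch (editor's illustration; the choice $(p_1 - p_2)^2 = 1/(9n)$ anticipates a later slide):

```python
import numpy as np
from scipy.stats import binom

def tv_product_two_point(n, p1, p2):
    """||P1^n - P2^n||_TV when P_i = p_i*delta_y + (1 - p_i)*delta_z on the same {y, z}."""
    k = np.arange(n + 1)
    return 0.5 * np.abs(binom.pmf(k, n, p1) - binom.pmf(k, n, p2)).sum()

for n in [10, 100, 1000, 10000]:
    p2 = 0.5
    p1 = p2 + 1.0 / (3.0 * np.sqrt(n))          # (p1 - p2)^2 = 1/(9n)
    print(n, tv_product_two_point(n, p1, p2))   # stays bounded away from 1 as n grows
```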

42 Le Cam's Method Theorem. Suppose there exist $P_1, P_2 \in \mathcal{P}$ such that $\rho(\theta(P_1), \theta(P_2)) \ge 2\delta > 0$ and $KL(P_1^n \| P_2^n) \le \alpha < \infty$. Then
$M_n(\theta(\mathcal{P})) \ge \delta \max\left(\frac{e^{-\alpha}}{4}, \frac{1 - \sqrt{\alpha/2}}{2}\right)$.
Strategy: choose $\delta$ and guess two elements $P_1$ and $P_2$ so that the conditions are satisfied with $\alpha$ independent of $n$.
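The theorem turns any valid pair $(\delta, \alpha)$ into a numerical lower bound. A sketch that simply evaluates it (editor's illustration; the particular $\delta \propto n^{-1/2}$ and the constant $\alpha$ below are placeholders standing in for what a two-point construction would deliver):

```python
import numpy as np

def le_cam_lower_bound(delta, alpha):
    """M_n >= delta * max(exp(-alpha) / 4, (1 - sqrt(alpha / 2)) / 2)."""
    return delta * max(np.exp(-alpha) / 4.0, (1.0 - np.sqrt(alpha / 2.0)) / 2.0)

for n in [100, 1000, 10000]:
    delta, alpha = 0.5 / np.sqrt(n), 4.0 / 9.0   # alpha independent of n, delta ~ n^{-1/2}
    print(n, le_cam_lower_bound(delta, alpha))   # the bound itself decays like n^{-1/2}
```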

43 Main Results

44 Gaussian Kernel Let $k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\eta^2}\right)$, $\eta > 0$. Choose $P_1 = p_1\delta_y + (1 - p_1)\delta_z$ and $P_2 = p_2\delta_y + (1 - p_2)\delta_z$, where $y, z \in \mathbb{R}^d$, $p_1 > 0$ and $p_2 > 0$. Then
$\rho^2(\theta(P_1), \theta(P_2)) = \|\mu_{P_1} - \mu_{P_2}\|^2_{\mathcal{H}_k} = 2(p_1 - p_2)^2\left(1 - \exp\left(-\frac{\|y - z\|^2}{2\eta^2}\right)\right) \ge (p_1 - p_2)^2\,\frac{\|y - z\|^2}{2\eta^2}$ if $\|y - z\|^2 \le 2\eta^2$, and
$KL(P_1^n \| P_2^n) \le \frac{n(p_1 - p_2)^2}{p_2(1 - p_2)}$.
Choose $p_2 = \frac{1}{2}$ and $p_1$ such that $(p_1 - p_2)^2 = \frac{1}{9n}$, and $y, z$ such that $\frac{\|y - z\|^2}{2\eta^2} \ge \beta > 0$; this gives $2\delta = \sqrt{\frac{\beta}{9n}}$.
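A quick numerical check of this construction (editor's sketch; $\eta = 1$ and $\|y - z\|^2 = 2\eta^2$, i.e. $\beta = 1$, are illustrative choices): with $p_2 = 1/2$ and $(p_1 - p_2)^2 = 1/(9n)$, the separation $\rho$ shrinks like $n^{-1/2}$ while $KL(P_1^n \| P_2^n)$ stays bounded, exactly what Le Cam's method needs.

```python
import numpy as np

def rho(p1, p2, sqdist, eta):
    """||mu_P1 - mu_P2||_{H_k} for P_i = p_i*delta_y + (1 - p_i)*delta_z, Gaussian kernel."""
    return np.sqrt(2 * (p1 - p2) ** 2 * (1 - np.exp(-sqdist / (2 * eta ** 2))))

def kl_n(p1, p2, n):
    """KL(P1^n || P2^n) = n * KL(Bernoulli(p1) || Bernoulli(p2)) for the two-point mixtures."""
    return n * (p1 * np.log(p1 / p2) + (1 - p1) * np.log((1 - p1) / (1 - p2)))

eta = 1.0
sqdist = 2 * eta ** 2                        # ||y - z||^2, so beta = ||y-z||^2 / (2 eta^2) = 1
for n in [100, 1000, 10000]:
    p2 = 0.5
    p1 = p2 + 1.0 / (3.0 * np.sqrt(n))       # (p1 - p2)^2 = 1/(9n)
    print(n, rho(p1, p2, sqdist, eta), kl_n(p1, p2, n))
```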

49 Gaussian Kernel In words: if $\mathcal{P}$ is the set of all discrete distributions, then $M_n(\mu(\mathcal{P})) \ge \frac{1}{12}\sqrt{\frac{\beta}{n}}$. For any estimator $\hat\theta$, there always exists a discrete distribution $P$ such that $\mu_P$ cannot be estimated at a rate faster than $n^{-1/2}$. Is such a result true if $\mathcal{P}$ is a class of distributions with smooth densities?

50 Gaussian Kernel Let $k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\eta^2}\right)$, $\eta > 0$. Choose $P_1 = N(\mu_1, \sigma^2 I)$ and $P_2 = N(\mu_2, \sigma^2 I)$, where $\mu_1, \mu_2 \in \mathbb{R}^d$ and $\sigma > 0$. Then
$\rho^2(\theta(P_1), \theta(P_2)) = 2\left(\frac{2\eta^2}{2\eta^2 + 4\sigma^2}\right)^{d/2}\left(1 - \exp\left(-\frac{\|\mu_1 - \mu_2\|^2}{2\eta^2 + 4\sigma^2}\right)\right) \ge \left(\frac{2\eta^2}{2\eta^2 + 4\sigma^2}\right)^{d/2}\frac{\|\mu_1 - \mu_2\|^2}{2\eta^2 + 4\sigma^2}$ if $\|\mu_1 - \mu_2\|^2 \le 2\eta^2 + 4\sigma^2$, and
$KL(P_1^n \| P_2^n) = \frac{n\|\mu_1 - \mu_2\|^2}{2\sigma^2}$.
Choose $\mu_1$ and $\mu_2$ such that $\|\mu_1 - \mu_2\|^2 = \frac{2\sigma^2\alpha}{n}$, and $\sigma^2 = \frac{\eta^2}{2d}$; this gives $\delta = \frac{C}{\sqrt{n}}$.
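The closed-form squared RKHS distance on this slide is easy to sanity-check against a Monte-Carlo MMD estimate between samples from the two Gaussians (editor's sketch; all parameter values are illustrative):

```python
import numpy as np

def rho_sq_closed_form(mu1, mu2, sigma, eta):
    """||mu_{N(mu1, s^2 I)} - mu_{N(mu2, s^2 I)}||^2_{H_k} for a Gaussian kernel of width eta."""
    d = mu1.size
    ratio = (2 * eta ** 2 / (2 * eta ** 2 + 4 * sigma ** 2)) ** (d / 2)
    return 2 * ratio * (1 - np.exp(-np.sum((mu1 - mu2) ** 2) / (2 * eta ** 2 + 4 * sigma ** 2)))

def rho_sq_monte_carlo(mu1, mu2, sigma, eta, m=2000, seed=0):
    """Plug-in (V-statistic) MMD^2 between samples of size m from the two Gaussians."""
    rng = np.random.default_rng(seed)
    X = rng.normal(mu1, sigma, size=(m, mu1.size))
    Y = rng.normal(mu2, sigma, size=(m, mu2.size))
    sq = lambda A, B: np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    k = lambda D: np.exp(-D / (2 * eta ** 2))
    return k(sq(X, X)).mean() + k(sq(Y, Y)).mean() - 2 * k(sq(X, Y)).mean()

mu1, mu2 = np.zeros(3), 0.3 * np.ones(3)
print(rho_sq_closed_form(mu1, mu2, sigma=0.5, eta=1.0),
      rho_sq_monte_carlo(mu1, mu2, sigma=0.5, eta=1.0))   # the two values should agree closely
```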

55 General Result Theorem. Suppose $\mathcal{P}$ is the set of all discrete distributions on $\mathbb{R}^d$. Let $k$ be shift-invariant, i.e., $k(x, y) = \psi(x - y)$ with $\psi \in C_b(\mathbb{R}^d)$ and characteristic. Assume there exist $x_0 \in \mathbb{R}^d$ and $\beta > 0$ such that $\psi(0) - \psi(x_0) \ge \beta$. Then
$M_n(\mu(\mathcal{P})) \ge \frac{1}{24}\sqrt{\frac{2\beta}{n}}$.

56 General Result Theorem. Suppose $\mathcal{P}$ is the set of all distributions with infinitely differentiable densities on $\mathbb{R}^d$. Let $k$ be shift-invariant, i.e., $k(x, y) = \psi(x - y)$ with $\psi \in C_b(\mathbb{R}^d)$ and characteristic. Then there exist constants $c_\psi, \epsilon_\psi > 0$ depending only on $\psi$ such that for any $n \ge \frac{1}{\epsilon_\psi}$:
$M_n(\mu(\mathcal{P})) \ge \frac{1}{8}\sqrt{\frac{c_\psi}{2n}}$.
Idea: exactly the same as for the Gaussian kernel. The crucial work is in showing that there exist constants $\epsilon_{\psi,\sigma^2}$ and $c_{\psi,\sigma^2}$ such that if $\|\mu_1 - \mu_2\|^2 \le \epsilon_{\psi,\sigma^2}$, then $\|\mu(N(\mu_1, \sigma^2 I)) - \mu(N(\mu_2, \sigma^2 I))\|_{\mathcal{H}_k} \ge c_{\psi,\sigma^2}\,\|\mu_1 - \mu_2\|$.

57 Summary Mean embedding of distributions is popular in various applications. Various estimators of the kernel mean are available. We provide a theoretical justification for using these estimators, particularly the empirical estimator. The empirical estimator of the mean embedding is minimax rate optimal, with rate $n^{-1/2}$.

58 Thank You
