Minimax Estimation of Kernel Mean Embeddings

1 Minimax Estimation of Kernel Mean Embeddings. Bharath K. Sriperumbudur, Department of Statistics, Pennsylvania State University. Gatsby Computational Neuroscience Unit, May 4, 2016.

2 Collaborators Dr. Ilya Tolstikhin: Max Planck Institute for Intelligent Systems, Tübingen. Dr. Krikamol Muandet: Max Planck Institute for Intelligent Systems, Tübingen.

3 Kernel Mean Embedding (KME) Let $k : X \times X \to \mathbb{R}$ be a positive definite kernel. Kernel trick: $y \mapsto k(\cdot, y) = \varphi(y)$. Equivalently, $\delta_y \mapsto \int_X k(\cdot, x)\, d\delta_y(x)$. Generalization: $P \mapsto \int_X k(\cdot, x)\, dP(x) =: \mu_P$, the kernel mean embedding.
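As an illustration of the definition (a minimal sketch added by the editor, not from the slides; the Gaussian kernel, the bandwidth `eta`, and the sampling distribution are arbitrary choices), the empirical embedding $\hat\mu_P = \frac{1}{n}\sum_i k(\cdot, X_i)$ can be represented as a function that evaluates the kernel average at any query point:

```python
import numpy as np

def gaussian_kernel(X, t, eta=1.0):
    """k(t, x) = exp(-||t - x||^2 / (2 * eta^2)), evaluated for every row x of X."""
    return np.exp(-np.sum((X - t) ** 2, axis=-1) / (2 * eta ** 2))

def empirical_kme(X, eta=1.0):
    """Return the function t -> (1/n) * sum_i k(t, X_i), the empirical mean embedding."""
    return lambda t: np.mean(gaussian_kernel(X, t, eta))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))            # X_1, ..., X_n drawn i.i.d. from some P
mu_hat = empirical_kme(X, eta=1.0)
print(mu_hat(np.zeros(2)))               # value of the embedding at the point t = 0
```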

4 Properties KME is a generalization of the characteristic function, $k(\cdot, x) = e^{\sqrt{-1}\,\langle \cdot,\, x\rangle}$, $x \in \mathbb{R}^d$, and the moment generating function, $k(\cdot, x) = e^{\langle \cdot,\, x\rangle}$, $x \in \mathbb{R}^d$, to arbitrary $X$. In general, many $P$ can yield the same KME: for $k(x, y) = \langle x, y\rangle$, the map $P \mapsto \mu_P$ only records the mean of $P$. Characteristic kernels ensure that no two different $P$ can have the same KME, i.e., $P \mapsto \int_X k(\cdot, x)\, dP(x)$ is one-to-one. Examples: Gaussian, Matérn, ... (infinite dimensional RKHS).

5 Application: Two-Sample Problem Given random samples $\{X_1, \ldots, X_m\}$ i.i.d. $P$ and $\{Y_1, \ldots, Y_n\}$ i.i.d. $Q$, determine: $P = Q$ or $P \ne Q$? Let $\gamma(P, Q)$ be a distance metric between $P$ and $Q$; then $H_0 : P = Q$ is equivalent to $H_0 : \gamma(P, Q) = 0$, and $H_1 : P \ne Q$ to $H_1 : \gamma(P, Q) > 0$. Test: say $H_0$ if $\hat\gamma(\{X_i\}_{i=1}^m, \{Y_j\}_{j=1}^n) < \varepsilon$; otherwise say $H_1$. Idea: use $\gamma(P, Q) = \left\| \int k(\cdot, x)\, dP(x) - \int k(\cdot, x)\, dQ(x) \right\|_{\mathcal{H}_k}$ with $k$ characteristic.
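A minimal sketch of the resulting test statistic (editor's illustration; a Gaussian kernel and the biased V-statistic are assumed, and the threshold $\varepsilon$ would in practice be calibrated, e.g., by permutation): $\hat\gamma$ is the maximum mean discrepancy (MMD) between the two empirical embeddings.

```python
import numpy as np

def sq_dists(A, B):
    """Pairwise squared Euclidean distances between the rows of A and B."""
    return np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T

def mmd(X, Y, eta=1.0):
    """Biased (V-statistic) estimate of gamma(P, Q) = ||mu_P - mu_Q||_{H_k}, Gaussian kernel."""
    k = lambda D: np.exp(-D / (2 * eta ** 2))
    v = k(sq_dists(X, X)).mean() + k(sq_dists(Y, Y)).mean() - 2 * k(sq_dists(X, Y)).mean()
    return np.sqrt(max(v, 0.0))

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(200, 1))   # sample from P
Y = rng.normal(0.5, 1.0, size=(200, 1))   # sample from Q
print(mmd(X, Y))                          # say H1 if this exceeds a calibrated threshold eps
```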

6 More Applications Testing for independence (Gretton et al., 2008) Conditional independence tests (Fukumizu et al., 2008) Feature selection (Song et al., 2012) Distribution regression (Szabó et al., 2015) Causal inference (Lopez-Paz et al., 2015) Mixture density estimation (Sriperumbudur, 2011),...

7 Estimators of KME In applications, $P$ is unknown and only samples $\{X_i\}_{i=1}^n$ from it are available. A popular estimator of the KME that has been employed in all these applications is the empirical estimator $\hat\mu_P = \frac{1}{n}\sum_{i=1}^n k(\cdot, X_i)$.
Theorem (Smola et al., 2007; Gretton et al., 2012; Lopez-Paz et al., 2015). Suppose $\sup_{x \in X} k(x, x) \le C < \infty$ with $k$ continuous. Then for any $\tau > 0$,
$P^n\left(\left\{(X_i)_{i=1}^n : \|\hat\mu_P - \mu_P\|_{\mathcal{H}_k} \le \sqrt{\tfrac{C}{n}} + \sqrt{\tfrac{2C\tau}{n}}\right\}\right) \ge 1 - e^{-\tau}$.
Alternatively, $E\|\hat\mu_P - \mu_P\|_{\mathcal{H}_k} \le \frac{C}{\sqrt{n}}$ for some $C > 0$.
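To see the $n^{-1/2}$ rate of the theorem numerically (editor's sketch, assuming $P = N(0, \sigma^2 I)$ and a Gaussian kernel so that $\mu_P$ and $\|\mu_P\|_{\mathcal{H}_k}$ have closed forms; all parameter values are illustrative):

```python
import numpy as np

def rkhs_error(X, eta, sigma):
    """Exact ||mu_hat_P - mu_P||_{H_k} for P = N(0, sigma^2 I_d) and a Gaussian kernel."""
    n, d = X.shape
    D = np.sum(X ** 2, 1)[:, None] + np.sum(X ** 2, 1)[None, :] - 2 * X @ X.T
    norm_hat_sq = np.exp(-D / (2 * eta ** 2)).mean()                  # ||mu_hat_P||^2
    r1 = eta ** 2 / (eta ** 2 + sigma ** 2)
    r2 = eta ** 2 / (eta ** 2 + 2 * sigma ** 2)                       # r2^(d/2) = ||mu_P||^2
    mu_at_X = r1 ** (d / 2) * np.exp(-np.sum(X ** 2, 1) / (2 * (eta ** 2 + sigma ** 2)))
    return np.sqrt(max(norm_hat_sq - 2 * mu_at_X.mean() + r2 ** (d / 2), 0.0))

rng = np.random.default_rng(2)
for n in [100, 400, 1600]:
    errs = [rkhs_error(rng.normal(0, 1, (n, 2)), eta=1.0, sigma=1.0) for _ in range(20)]
    print(n, float(np.mean(errs)))        # the error roughly halves whenever n quadruples
```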

8 Shrinkage Estimator Given $(X_i)_{i=1}^n$ i.i.d. $N(\mu, \sigma^2 I)$, suppose we are interested in estimating $\mu \in \mathbb{R}^d$. Maximum likelihood estimator: $\hat\mu = \frac{1}{n}\sum_{i=1}^n X_i$, which is the empirical estimator. James and Stein (1961) constructed an estimator $\check\mu$ such that for $d \ge 3$, $E\|\check\mu - \mu\|^2 \le E\|\hat\mu - \mu\|^2$ for all $\mu \in \mathbb{R}^d$, with strict inequality for at least one $\mu$. Kernel setting: based on this motivation, Muandet et al. (2015) proposed a shrinkage estimator $\check\mu_P$ of $\mu_P$ and showed that $E\|\check\mu_P - \mu_P\|^2_{\mathcal{H}_k} < E\|\hat\mu_P - \mu_P\|^2_{\mathcal{H}_k} + O_p(n^{-3/2})$ as $n \to \infty$, and $E\|\check\mu_P - \mu_P\|_{\mathcal{H}_k} \le C n^{-1/2}$ for some $C > 0$.
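For intuition about the shrinkage idea, here is a sketch of the classical James–Stein estimator itself (editor's illustration: the positive-part version, shrinking toward the origin, with $\sigma$ assumed known); this is the finite-dimensional motivation, not the kernel shrinkage estimator of Muandet et al. (2015).

```python
import numpy as np

def james_stein(X, sigma):
    """Positive-part James-Stein estimate of mu from X_i ~ N(mu, sigma^2 I_d), d >= 3."""
    n, d = X.shape
    xbar = X.mean(axis=0)                                  # MLE / empirical estimator
    shrink = max(0.0, 1.0 - (d - 2) * sigma ** 2 / (n * np.sum(xbar ** 2)))
    return shrink * xbar

rng = np.random.default_rng(3)
mu, sigma, n, d = np.full(10, 0.2), 1.0, 5, 10
mle_risk, js_risk = [], []
for _ in range(2000):
    X = rng.normal(mu, sigma, size=(n, d))
    mle_risk.append(np.sum((X.mean(0) - mu) ** 2))
    js_risk.append(np.sum((james_stein(X, sigma) - mu) ** 2))
print(float(np.mean(mle_risk)), float(np.mean(js_risk)))   # JS risk is smaller for d >= 3
```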

11 Main Message Question: Can we do better using some other estimators? Answer: for a large class of kernels, the answer is NO. We can do better in terms of constant factors (Muandet et al., 2015), but not in terms of rates w.r.t. the sample size $n$ or the dimensionality $d$ (if $X = \mathbb{R}^d$). Tool: minimax theory.

12 Estimation Theory: Setup Given: a class of distributions $\mathcal{P}$ on a sample space $X$; a mapping $\theta : \mathcal{P} \to \Theta$, $P \mapsto \theta(P)$. Goal: estimate $\theta(P)$ based on i.i.d. observations $(X_i)_{i=1}^n$ drawn from the unknown distribution $P$. Examples: $\mathcal{P} = \{N(\theta, \sigma^2) : \theta \in \mathbb{R}\}$ with known variance: $\theta(P) = \int x\, dP(x)$; $\mathcal{P}$ = the set of all distributions: $\theta(P) = \int k(\cdot, x)\, dP(x)$. Estimator: $\hat\theta(X_1, \ldots, X_n)$.

16 Minimax Risk How good is the estimator $\hat\theta$? Define a distance $\rho : \Theta \times \Theta \to \mathbb{R}$ to measure the error of $\hat\theta$ for the parameter $\theta$. The average performance of $\hat\theta$ is measured by the risk $R(\hat\theta; P) = E\left[\rho(\hat\theta, \theta(P))\right]$. Obviously, we would want an estimator that has the smallest risk for every $P$: not achievable! Global view: minimize the average risk (Bayesian view) or the maximum risk $\sup_{P \in \mathcal{P}} E\left[\rho(\hat\theta, \theta(P))\right]$. $\hat\theta$ is called a minimax estimator if $\sup_{P \in \mathcal{P}} E\left[\rho(\hat\theta, \theta(P))\right] = \inf_{\hat\theta} \sup_{P \in \mathcal{P}} E\left[\rho(\hat\theta, \theta(P))\right] =: M_n(\theta(\mathcal{P}))$, the minimax risk.
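The risk is just an expectation, so for any fixed $P$ it can be approximated by Monte Carlo. A tiny sketch (editor's illustration with $P = N(\theta, 1)$, $\theta(P)$ the mean, $\rho$ the absolute error, and the sample mean as $\hat\theta$):

```python
import numpy as np

def risk(estimator, draw_sample, theta_true, reps=5000, seed=0):
    """Monte-Carlo estimate of R(theta_hat; P) = E[ rho(theta_hat, theta(P)) ], rho = abs."""
    rng = np.random.default_rng(seed)
    errors = [abs(estimator(draw_sample(rng)) - theta_true) for _ in range(reps)]
    return float(np.mean(errors))

# P = N(0.3, 1), theta(P) = 0.3, estimator = sample mean of n = 100 observations
print(risk(np.mean, lambda rng: rng.normal(0.3, 1.0, size=100), theta_true=0.3))
```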

21 Minimax Estimator Statistical decision theory has two goals: find the minimax risk $M_n(\theta(\mathcal{P}))$, and find the minimax estimator that achieves this risk. Except in simple cases, finding both the minimax risk and the minimax estimator is usually very hard. So we settle for an estimator that achieves the minimax rate: $\sup_{P \in \mathcal{P}} E\left[\rho(\hat\theta, \theta(P))\right] \asymp M_n(\theta(\mathcal{P}))$, where $a_n \asymp b_n$ means that $a_n/b_n$ and $b_n/a_n$ are bounded.

23 Minimax Estimator Suppose we have an estimator $\hat\theta$ such that $\sup_{P \in \mathcal{P}} E\left[\rho(\hat\theta, \theta(P))\right] \le C\psi_n$ for some $C > 0$ and $\psi_n \to 0$ as $n \to \infty$. If $M_n(\theta(\mathcal{P})) \ge c\psi_n$ for some $c > 0$, then $\hat\theta$ is minimax $\psi_n$-rate optimal. Our problem: $\theta(P) = \mu_P = \int k(\cdot, x)\, dP(x)$ and $\rho = \|\cdot - \cdot\|_{\mathcal{H}_k}$. We have $\sup_{P \in \mathcal{P}} E\left[\rho(\hat\theta, \theta(P))\right] \le \frac{C}{\sqrt{n}}$ for $\hat\theta$ the empirical estimator, the shrinkage estimator, and the kernel-density-based estimator. What is $M_n(\mu(\mathcal{P}))$?

25 From Estimation to Testing Key idea: reduce the estimation problem to a testing problem and bound $M_n(\theta(\mathcal{P}))$ in terms of the probability of error in testing problems. Setup: let $\{P_v\}_{v \in V} \subset \mathcal{P}$ where $V = \{1, \ldots, M\}$. The family induces a collection of parameters $\{\theta(P_v)\}_{v \in V}$. Choose $\{P_v\}_{v \in V}$ such that $\rho(\theta(P_v), \theta(P_{v'})) \ge 2\delta$ for all $v \ne v'$. Suppose we observe $(X_i)_{i=1}^n$ drawn from the $n$-fold product distribution $P_v^n$ for some $v \in V$, and construct $\hat\theta(X_1, \ldots, X_n)$. Testing problem: based on $(X_i)_{i=1}^n$, test which of the $M$ hypotheses is true.

28 From Estimation to Testing For a measurable mapping $\Psi : X^n \to V$, the error probability is defined as $\max_{v \in V} P_v^n(\Psi(X_1, \ldots, X_n) \ne v)$. Minimum distance test: $\Psi = \arg\min_{v \in V} \rho(\hat\theta, \theta(P_v))$. Since the $\theta(P_v)$ are $2\delta$-separated, $\rho(\hat\theta, \theta(P_v)) < \delta \Rightarrow \Psi = v$, equivalently $\Psi \ne v \Rightarrow \rho(\hat\theta, \theta(P_v)) \ge \delta$, and hence $P_v^n\big(\rho(\hat\theta, \theta(P_v)) \ge \delta\big) \ge P_v^n(\Psi \ne v)$.
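The minimum distance test is one line of code. A sketch (editor's illustration; the candidate parameters and the Euclidean $\rho$ are placeholders for whatever $\theta(P_v)$ and $\rho$ are in a given problem):

```python
import numpy as np

def min_distance_test(theta_hat, candidates):
    """Psi = argmin_v rho(theta_hat, theta(P_v)), with rho the Euclidean norm here."""
    return int(np.argmin([np.linalg.norm(theta_hat - c) for c in candidates]))

candidates = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(min_distance_test(np.array([0.1, -0.05]), candidates))   # returns index 0
```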

32 From Estimation to Testing Putting these together (the first inequality is Markov's inequality):
$M_n(\theta(\mathcal{P})) = \inf_{\hat\theta} \sup_{P \in \mathcal{P}} E\left[\rho(\hat\theta, \theta(P))\right] \ge \delta \inf_{\hat\theta} \sup_{P \in \mathcal{P}} P^n\big(\rho(\hat\theta, \theta(P)) \ge \delta\big) \ge \delta \inf_{\hat\theta} \max_{v \in V} P_v^n\big(\rho(\hat\theta, \theta(P_v)) \ge \delta\big) \ge \delta \inf_{\Psi} \max_{v \in V} P_v^n(\Psi \ne v)$,
where the last infimum-maximum is the minimax probability of error.

37 Minimax Probability of Error Suppose $M = 2$, i.e., $V = \{1, 2\}$. Then $\inf_{\Psi} \max_{v \in V} P_v^n(\Psi \ne v) \ge \frac{1}{2} \inf_{\Psi} \left[P_1^n(\Psi \ne 1) + P_2^n(\Psi \ne 2)\right]$. The minimizer is the likelihood ratio test, and so $\inf_{\Psi} \max_{v \in V} P_v^n(\Psi \ne v) \ge \frac{1}{2} \int \min(dP_1^n, dP_2^n) = \frac{1 - \|P_1^n - P_2^n\|_{TV}}{2}$. Therefore $M_n(\theta(\mathcal{P})) \ge \frac{\delta}{2}\left(1 - \|P_1^n - P_2^n\|_{TV}\right)$. Recipe (Le Cam, 1973): pick $P_1$ and $P_2$ in $\mathcal{P}$ such that $\|P_1^n - P_2^n\|_{TV} \le \frac{1}{2}$ and $\rho(\theta(P_1), \theta(P_2)) \ge 2\delta$. General theme: the minimax risk is related to the distance between distributions.
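When $P_1$ and $P_2$ are supported on the same two points, the likelihood ratio of $P_1^n$ against $P_2^n$ depends on the sample only through the count of one support point, so $\|P_1^n - P_2^n\|_{TV}$ equals the total variation distance between two Binomial distributions and can be computed exactly. A sketch (editor's illustration; the choice $(p_1 - p_2)^2 = 1/(9n)$ anticipates a later slide):

```python
import numpy as np
from scipy.stats import binom

def tv_product_two_point(n, p1, p2):
    """||P1^n - P2^n||_TV when P_i = p_i*delta_y + (1 - p_i)*delta_z on the same {y, z}."""
    k = np.arange(n + 1)
    return 0.5 * np.abs(binom.pmf(k, n, p1) - binom.pmf(k, n, p2)).sum()

for n in [10, 100, 1000, 10000]:
    p2 = 0.5
    p1 = p2 + 1.0 / (3.0 * np.sqrt(n))          # (p1 - p2)^2 = 1/(9n)
    print(n, tv_product_two_point(n, p1, p2))   # stays bounded away from 1 as n grows
```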

42 Le Cam's Method Theorem. Suppose there exist $P_1, P_2 \in \mathcal{P}$ such that $\rho(\theta(P_1), \theta(P_2)) \ge 2\delta > 0$ and $KL(P_1^n \| P_2^n) \le \alpha < \infty$. Then
$M_n(\theta(\mathcal{P})) \ge \delta \max\left(\frac{e^{-\alpha}}{4}, \frac{1 - \sqrt{\alpha/2}}{2}\right)$.
Strategy: choose $\delta$ and guess two elements $P_1$ and $P_2$ so that the conditions are satisfied with $\alpha$ independent of $n$.
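The theorem turns any valid pair $(\delta, \alpha)$ into a numerical lower bound. A sketch that simply evaluates it (editor's illustration; the particular $\delta \propto n^{-1/2}$ and the constant $\alpha$ below are placeholders standing in for what a two-point construction would deliver):

```python
import numpy as np

def le_cam_lower_bound(delta, alpha):
    """M_n >= delta * max(exp(-alpha) / 4, (1 - sqrt(alpha / 2)) / 2)."""
    return delta * max(np.exp(-alpha) / 4.0, (1.0 - np.sqrt(alpha / 2.0)) / 2.0)

for n in [100, 1000, 10000]:
    delta, alpha = 0.5 / np.sqrt(n), 4.0 / 9.0   # alpha independent of n, delta ~ n^{-1/2}
    print(n, le_cam_lower_bound(delta, alpha))   # the bound itself decays like n^{-1/2}
```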

43 Main Results

44 Gaussian Kernel Let $k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\eta^2}\right)$, $\eta > 0$. Choose $P_1 = p_1\delta_y + (1 - p_1)\delta_z$ and $P_2 = p_2\delta_y + (1 - p_2)\delta_z$, where $y, z \in \mathbb{R}^d$, $p_1 > 0$ and $p_2 > 0$. Then
$\rho^2(\theta(P_1), \theta(P_2)) = \|\mu_{P_1} - \mu_{P_2}\|^2_{\mathcal{H}_k} = 2(p_1 - p_2)^2\left(1 - \exp\left(-\frac{\|y - z\|^2}{2\eta^2}\right)\right) \ge (p_1 - p_2)^2\,\frac{\|y - z\|^2}{2\eta^2}$ if $\|y - z\|^2 \le 2\eta^2$, and
$KL(P_1^n \| P_2^n) \le \frac{n(p_1 - p_2)^2}{p_2(1 - p_2)}$.
Choose $p_2 = \frac{1}{2}$ and $p_1$ such that $(p_1 - p_2)^2 = \frac{1}{9n}$, and $y, z$ such that $\frac{\|y - z\|^2}{2\eta^2} \ge \beta > 0$; this gives $2\delta = \sqrt{\frac{\beta}{9n}}$.
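A quick numerical check of this construction (editor's sketch; $\eta = 1$ and $\|y - z\|^2 = 2\eta^2$, i.e. $\beta = 1$, are illustrative choices): with $p_2 = 1/2$ and $(p_1 - p_2)^2 = 1/(9n)$, the separation $\rho$ shrinks like $n^{-1/2}$ while $KL(P_1^n \| P_2^n)$ stays bounded, exactly what Le Cam's method needs.

```python
import numpy as np

def rho(p1, p2, sqdist, eta):
    """||mu_P1 - mu_P2||_{H_k} for P_i = p_i*delta_y + (1 - p_i)*delta_z, Gaussian kernel."""
    return np.sqrt(2 * (p1 - p2) ** 2 * (1 - np.exp(-sqdist / (2 * eta ** 2))))

def kl_n(p1, p2, n):
    """KL(P1^n || P2^n) = n * KL(Bernoulli(p1) || Bernoulli(p2)) for the two-point mixtures."""
    return n * (p1 * np.log(p1 / p2) + (1 - p1) * np.log((1 - p1) / (1 - p2)))

eta = 1.0
sqdist = 2 * eta ** 2                        # ||y - z||^2, so beta = ||y-z||^2 / (2 eta^2) = 1
for n in [100, 1000, 10000]:
    p2 = 0.5
    p1 = p2 + 1.0 / (3.0 * np.sqrt(n))       # (p1 - p2)^2 = 1/(9n)
    print(n, rho(p1, p2, sqdist, eta), kl_n(p1, p2, n))
```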

49 Gaussian Kernel In words: if $\mathcal{P}$ is the set of all discrete distributions, then $M_n(\mu(\mathcal{P})) \ge \frac{1}{12}\sqrt{\frac{\beta}{n}}$. For any estimator $\hat\theta$, there always exists a discrete distribution $P$ such that $\mu_P$ cannot be estimated at a rate faster than $n^{-1/2}$. Is such a result true if $\mathcal{P}$ is a class of distributions with smooth densities?

50 Gaussian Kernel Let $k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\eta^2}\right)$, $\eta > 0$. Choose $P_1 = N(\mu_1, \sigma^2 I)$ and $P_2 = N(\mu_2, \sigma^2 I)$, where $\mu_1, \mu_2 \in \mathbb{R}^d$ and $\sigma > 0$. Then
$\rho^2(\theta(P_1), \theta(P_2)) = 2\left(\frac{2\eta^2}{2\eta^2 + 4\sigma^2}\right)^{d/2}\left(1 - \exp\left(-\frac{\|\mu_1 - \mu_2\|^2}{2\eta^2 + 4\sigma^2}\right)\right) \ge \left(\frac{2\eta^2}{2\eta^2 + 4\sigma^2}\right)^{d/2}\frac{\|\mu_1 - \mu_2\|^2}{2\eta^2 + 4\sigma^2}$ if $\|\mu_1 - \mu_2\|^2 \le 2\eta^2 + 4\sigma^2$, and
$KL(P_1^n \| P_2^n) = \frac{n\|\mu_1 - \mu_2\|^2}{2\sigma^2}$.
Choose $\mu_1$ and $\mu_2$ such that $\|\mu_1 - \mu_2\|^2 = \frac{2\sigma^2\alpha}{n}$, and $\sigma^2 = \frac{\eta^2}{2d}$; this gives $\delta = \frac{C}{\sqrt{n}}$.
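The closed-form squared RKHS distance on this slide is easy to sanity-check against a Monte-Carlo MMD estimate between samples from the two Gaussians (editor's sketch; all parameter values are illustrative):

```python
import numpy as np

def rho_sq_closed_form(mu1, mu2, sigma, eta):
    """||mu_{N(mu1, s^2 I)} - mu_{N(mu2, s^2 I)}||^2_{H_k} for a Gaussian kernel of width eta."""
    d = mu1.size
    ratio = (2 * eta ** 2 / (2 * eta ** 2 + 4 * sigma ** 2)) ** (d / 2)
    return 2 * ratio * (1 - np.exp(-np.sum((mu1 - mu2) ** 2) / (2 * eta ** 2 + 4 * sigma ** 2)))

def rho_sq_monte_carlo(mu1, mu2, sigma, eta, m=2000, seed=0):
    """Plug-in (V-statistic) MMD^2 between samples of size m from the two Gaussians."""
    rng = np.random.default_rng(seed)
    X = rng.normal(mu1, sigma, size=(m, mu1.size))
    Y = rng.normal(mu2, sigma, size=(m, mu2.size))
    sq = lambda A, B: np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    k = lambda D: np.exp(-D / (2 * eta ** 2))
    return k(sq(X, X)).mean() + k(sq(Y, Y)).mean() - 2 * k(sq(X, Y)).mean()

mu1, mu2 = np.zeros(3), 0.3 * np.ones(3)
print(rho_sq_closed_form(mu1, mu2, sigma=0.5, eta=1.0),
      rho_sq_monte_carlo(mu1, mu2, sigma=0.5, eta=1.0))   # the two values should agree closely
```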

55 General Result Theorem. Suppose $\mathcal{P}$ is the set of all discrete distributions on $\mathbb{R}^d$. Let $k$ be shift-invariant, i.e., $k(x, y) = \psi(x - y)$ with $\psi \in C_b(\mathbb{R}^d)$ and characteristic. Assume there exist $x_0 \in \mathbb{R}^d$ and $\beta > 0$ such that $\psi(0) - \psi(x_0) \ge \beta$. Then
$M_n(\mu(\mathcal{P})) \ge \frac{1}{24}\sqrt{\frac{2\beta}{n}}$.

56 General Result Theorem. Suppose $\mathcal{P}$ is the set of all distributions with infinitely differentiable densities on $\mathbb{R}^d$. Let $k$ be shift-invariant, i.e., $k(x, y) = \psi(x - y)$ with $\psi \in C_b(\mathbb{R}^d)$ and characteristic. Then there exist constants $c_\psi, \epsilon_\psi > 0$ depending only on $\psi$ such that for any $n \ge \frac{1}{\epsilon_\psi}$:
$M_n(\mu(\mathcal{P})) \ge \frac{1}{8}\sqrt{\frac{c_\psi}{2n}}$.
Idea: exactly the same as for the Gaussian kernel. The crucial work is in showing that there exist constants $\epsilon_{\psi,\sigma^2}$ and $c_{\psi,\sigma^2}$ such that if $\|\mu_1 - \mu_2\|^2 \le \epsilon_{\psi,\sigma^2}$, then $\|\mu(N(\mu_1, \sigma^2 I)) - \mu(N(\mu_2, \sigma^2 I))\|_{\mathcal{H}_k} \ge c_{\psi,\sigma^2}\,\|\mu_1 - \mu_2\|$.

57 Summary Mean embedding of distributions is popular in various applications. Various estimators of the kernel mean are available. We provide a theoretical justification for using these estimators, particularly the empirical estimator. The empirical estimator of the mean embedding is minimax rate optimal, with rate $n^{-1/2}$.

58 Thank You
