Surrogate loss functions, divergences and decentralized detection

Surrogate loss functions, divergences and decentralized detection
XuanLong Nguyen
Department of Electrical Engineering and Computer Science, U.C. Berkeley
Advisors: Michael Jordan & Martin Wainwright

Talk outline
- nonparametric decentralized detection algorithm: use of surrogate loss functions and marginalized kernels; use of convex analysis
- study of surrogate loss functions and divergence functionals: correspondence of losses and divergences
- M-estimator of divergences (e.g., Kullback-Leibler divergence)

Decentralized decision-making problem: learning both classifier and experiment
- covariate vector X and hypothesis (label) Y = ±1
- we do not have access directly to X in order to determine Y
- learn jointly the mapping (Q, γ):  X --Q--> Z --γ--> Y
- roles of the experiment Q:
  - data collection constraints (e.g., decentralization)
  - data transmission constraints
  - choice of variates (feature selection)
  - dimensionality reduction scheme

A decentralized detection system
[Diagram: sensors -> communication channel -> fusion center -> decision rule]
- Decentralized setting: communication constraints between sensors and fusion center (e.g., bit constraints)
- Goal: design decision rules for the sensors and the fusion center
- Criterion: minimize the probability of incorrect detection

Concrete example: wireless sensor network
[Figure: light source above a field of sensor motes]
Set-up:
- wireless network of tiny sensor motes, each equipped with light/humidity/temperature sensing capabilities
- measurement of signal strength ([0, 1024] in magnitude, i.e., 10 bits)
Goal: is there a forest fire in a certain region?

Related work
- Classical work on classification/detection:
  - completely centralized
  - no consideration of communication-theoretic infrastructure
- Decentralized detection in signal processing (e.g., Tsitsiklis, 1993):
  - joint distribution assumed to be known
  - locally-optimal rules under conditional independence assumptions (i.e., Naive Bayes)

Overview of our approach
- Treat as a nonparametric estimation (learning) problem under constraints from a distributed system
- Use kernel methods and convex surrogate loss functions
- Use tools from convex optimization to derive an efficient algorithm

Problem set-up
[Diagram: label $Y \in \{\pm 1\}$; sensor signals $X^1,\ldots,X^S$; local rules $\gamma^1,\ldots,\gamma^S$ producing messages $Z^1,\ldots,Z^S$; fusion rule $\gamma(Z^1,\ldots,Z^S)$]
- $X \in \{1,\ldots,M\}^S$, $Z \in \{1,\ldots,L\}^S$, with $L \ll M$
Problem: Given training data $(x_i, y_i)_{i=1}^n$, find the decision rules $(\gamma^1,\ldots,\gamma^S;\gamma)$ so as to minimize the detection error probability $P(Y \neq \gamma(Z^1,\ldots,Z^S))$.

Kernel methods for classification
- Classification: learn γ(z) that predicts the label y
- $K(z, z')$ is a symmetric positive semidefinite kernel function
- there is a feature space H in which K acts as an inner product, i.e., $K(z, z') = \langle \Psi(z), \Psi(z') \rangle$
- a kernel-based algorithm finds a linear function in H, i.e.,
    $\gamma(z) = \langle w, \Psi(z) \rangle = \sum_{i=1}^n \alpha_i K(z_i, z)$
Advantages:
- kernel function classes are sufficiently rich for many applications
- optimizing over kernel function classes is computationally efficient
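
To make the kernel expansion concrete, here is a minimal sketch of a decision function of the form $\gamma(z) = \sum_i \alpha_i K(z_i, z)$ with a Gaussian kernel; the coefficients alpha below are made up for illustration and would in practice come from the regularized φ-risk minimization described next.

    import numpy as np

    def gaussian_kernel(z1, z2, sigma=1.0):
        # K(z, z') = exp(-||z - z'||^2 / (2 sigma^2)), a symmetric PSD kernel
        diff = np.asarray(z1, float) - np.asarray(z2, float)
        return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

    def decision_function(z, train_z, alpha, sigma=1.0):
        # gamma(z) = sum_i alpha_i K(z_i, z); its sign is the predicted label
        return sum(a * gaussian_kernel(zi, z, sigma) for a, zi in zip(alpha, train_z))

    # toy usage: alpha is illustrative, not fit to data
    train_z = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]
    alpha = [0.5, -0.5]
    print(np.sign(decision_function(np.array([0.2, 0.9]), train_z, alpha)))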

Convex surrogate loss functions φ for the 0-1 loss
[Figure: the 0-1, hinge, logistic and exponential losses plotted as functions of the margin value]
Minimizing the (regularized) empirical φ-risk $\hat{E}\phi(Y\gamma(Z))$:
    $\min_{\gamma \in H} \sum_{i=1}^n \phi(y_i \gamma(z_i)) + \frac{\lambda}{2}\|\gamma\|^2$
- $(z_i, y_i)_{i=1}^n$ are training data in $Z \times \{\pm 1\}$
- φ is a convex loss function (surrogate to the non-convex 0-1 loss)
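
A small sketch of the surrogate losses in the figure and of the regularized empirical φ-risk objective (function names and the toy data are illustrative):

    import numpy as np

    # convex surrogates phi(m) of the 0-1 loss, as functions of the margin m = y * gamma(z)
    def hinge(m):        return np.maximum(0.0, 1.0 - m)
    def logistic(m):     return np.log1p(np.exp(-m))
    def exponential(m):  return np.exp(-m)
    def zero_one(m):     return (m <= 0).astype(float)

    def regularized_phi_risk(gamma_values, y, phi, lam, gamma_norm_sq):
        # sum_i phi(y_i * gamma(z_i)) + (lambda / 2) * ||gamma||^2
        margins = y * gamma_values
        return np.sum(phi(margins)) + 0.5 * lam * gamma_norm_sq

    # toy usage with made-up predictions gamma(z_i) and labels y_i
    y = np.array([+1, -1, +1])
    gamma_values = np.array([0.8, 0.3, -0.2])
    print(regularized_phi_risk(gamma_values, y, hinge, lam=0.1, gamma_norm_sq=1.0))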

Stochastic decision rules at each sensor
[Diagram: each sensor t maps its signal $X^t$ to a message $Z^t$ via $Q^t(Z^t \mid X^t)$; the fusion center applies γ(z); $K_z(z, z') = \langle \Psi(z), \Psi(z') \rangle$, $\gamma(z) = \langle w, \Psi(z) \rangle$]
- Approximate deterministic sensor decisions by stochastic rules Q(Z | X)
- Sensors do not communicate directly ⟹ factorization: $Q(Z \mid X) = \prod_{t=1}^S Q^t(Z^t \mid X^t)$, with Q denoting the collection $\{Q^t\}$
- The overall decision rule is represented by $\gamma(z) = \langle w, \Psi(z) \rangle$

High-level strategy: joint optimization
- Minimize over (Q, γ) an empirical version of $E\phi(Y\gamma(Z))$
- Joint minimization:
  - fix Q, optimize over γ: a simple convex problem
  - fix γ, perform a gradient update for Q, sensor by sensor

Approximating the empirical φ-risk
The regularized empirical φ-risk $\hat{E}\phi(Y\gamma(Z))$ has the form:
    $G_0 = \sum_{z} \sum_{i=1}^n \phi(y_i \gamma(z))\, Q(z \mid x_i) + \frac{\lambda}{2}\|w\|^2$
- Challenge: even evaluating $G_0$ at a single point is intractable; it requires summing over $L^S$ possible values of z
- Idea: approximate $G_0$ by another objective function G
  - G is the φ-risk of a marginalized feature space
  - $G_0 = G$ for deterministic Q

Marginalizing over the feature space
[Figure: the quantized Z space with $\gamma(z) = \langle w, \Psi(z)\rangle$ and the original X space with $f_Q(x) = \langle w, \Psi_Q(x)\rangle$, linked by the stochastic rule Q(z | x)]
The stochastic decision rule Q(z | x):
- maps between X and Z
- induces a marginalized feature map $\Psi_Q$ from the base map Ψ (or a marginalized kernel $K_Q$ from the base kernel K)

Marginalized feature space $\{\Psi_Q(x)\}$
Define a new feature space $\Psi_Q(x)$ and a linear function over $\Psi_Q(x)$:
    $\Psi_Q(x) = \sum_z Q(z \mid x)\,\Psi(z)$   (marginalization over z)
    $f_Q(x) = \langle w, \Psi_Q(x) \rangle$
The alternative objective function G is the φ-risk for $f_Q$:
    $G = \sum_{i=1}^n \phi(y_i f_Q(x_i)) + \frac{\lambda}{2}\|w\|^2$
$\Psi_Q(x)$ induces a marginalized kernel over X:
    $K_Q(x, x') := \langle \Psi_Q(x), \Psi_Q(x') \rangle = \sum_{z, z'} Q(z \mid x)\, Q(z' \mid x')\, K_z(z, z')$
The marginalization is taken over the message z conditioned on the sensor signal x.

Marginalized kernels
- Have been used to derive kernel functions from generative models (e.g., Tsuda, 2002)
- The marginalized kernel $K_Q(x, x')$ is defined as:
    $K_Q(x, x') := \sum_{z, z'} \underbrace{Q(z \mid x)\, Q(z' \mid x')}_{\text{factorized distributions}}\; \underbrace{K_z(z, z')}_{\text{base kernel}}$
- If $K_z(z, z')$ decomposes into smaller components of z and z', then $K_Q(x, x')$ can be computed efficiently (in polynomial time)
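
As an illustration of the polynomial-time computation, here is a small sketch assuming a particular decomposable base kernel (my choice, not necessarily the talk's): the count kernel $K_z(z, z') = \sum_t 1[z^t = z'^t]$, with each sensor's rule given as a conditional distribution over its L message values.

    import numpy as np

    def marginalized_count_kernel(Qx, Qxp):
        """K_Q(x, x') = sum_t sum_l Q^t(l | x^t) Q^t(l | x'^t).

        Qx, Qxp: arrays of shape (S, L); row t is the conditional distribution
        Q^t(. | x^t) of sensor t's message given its observed signal.
        This equals sum_{z,z'} Q(z|x) Q(z'|x') K_z(z,z') for the count kernel
        K_z(z,z') = sum_t 1[z^t = z'^t], without enumerating the L^S messages.
        """
        return float(np.sum(Qx * Qxp))

    # toy usage: S = 2 sensors, L = 3 message levels, made-up conditionals
    Qx  = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1]])
    Qxp = np.array([[0.3, 0.3, 0.4],
                    [0.2, 0.6, 0.2]])
    print(marginalized_count_kernel(Qx, Qxp))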

Centralized and decentralized functions
- The centralized decision function is obtained by minimizing the φ-risk:
    $f_Q(x) = \langle w, \Psi_Q(x) \rangle$ — $f_Q$ has direct access to the sensor signal x
- The optimal w also defines the decentralized decision function:
    $\gamma(z) = \langle w, \Psi(z) \rangle$ — γ has access only to the quantized version z
- The decentralized γ behaves on average like the centralized $f_Q$:
    $f_Q(x) = E[\gamma(Z) \mid x]$

Optimization algorithm
Goal: solve the problem
    $\inf_{w;\,Q} G(w; Q) := \sum_i \phi\Big(y_i \Big\langle w, \sum_z Q(z \mid x_i)\Psi(z) \Big\rangle\Big) + \frac{\lambda}{2}\|w\|^2$
- Finding the optimal weight vector: G is convex in w with Q fixed; solve the dual problem (a quadratic convex program) to obtain the optimal w(Q)
- Finding the optimal decision rules: G is convex in $Q^t$ with w and all other $\{Q^r, r \neq t\}$ fixed; efficient computation of a subgradient of G at (w(Q), Q)
- Overall: efficient joint minimization by blockwise coordinate descent
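
The alternating loop can be sketched as below. This is only a schematic under simplifying assumptions of my own: a logistic surrogate, the componentwise count base kernel (so $\Psi_Q(x)$ is the stack of the per-sensor conditionals $Q^t(\cdot \mid x^t)$), a softmax parametrization of each $Q^t$ in place of explicit simplex constraints, and generic numerical minimization instead of the dual QP and per-sensor subgradient steps.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import softmax

    def features(theta, x):
        # Psi_Q(x_i): stack of the per-sensor conditionals Q^t(. | x_i^t), one row per example
        Q = softmax(theta, axis=2)                     # (S, M, L); rows sum to 1 over messages
        return np.concatenate([Q[t, x[:, t], :] for t in range(theta.shape[0])], axis=1)

    def objective(w, theta, x, y, lam):
        # G(w; Q) with the logistic surrogate: sum_i phi(y_i <w, Psi_Q(x_i)>) + (lam/2)||w||^2
        margins = y * (features(theta, x) @ w)
        return np.sum(np.logaddexp(0.0, -margins)) + 0.5 * lam * np.dot(w, w)

    def fit(x, y, M, L, lam=0.1, outer_iters=5, seed=0):
        rng = np.random.default_rng(seed)
        S = x.shape[1]
        w = np.zeros(S * L)
        theta = rng.normal(size=(S, M, L))             # logits of the sensor rules Q^t
        for _ in range(outer_iters):
            # (1) Q fixed: the problem is convex in w (solved here by a generic optimizer)
            w = minimize(lambda v: objective(v, theta, x, y, lam), w).x
            # (2) w fixed: update each sensor's rule Q^t in turn, holding the others fixed
            for t in range(S):
                def obj_t(th_t, t=t):
                    th = theta.copy()
                    th[t] = th_t.reshape(M, L)
                    return objective(w, th, x, y, lam)
                theta[t] = minimize(obj_t, theta[t].ravel()).x.reshape(M, L)
        return w, softmax(theta, axis=2)

    # toy usage: S = 3 sensors, M = 4 signal levels, L = 2 message levels, made-up labels
    rng = np.random.default_rng(1)
    x = rng.integers(0, 4, size=(40, 3))
    y = np.where(x.sum(axis=1) > 4, 1, -1)
    w, Q = fit(x, y, M=4, L=2)
    print(np.round(Q[0], 2))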

Simulated sensor networks
[Diagrams of three generative models over (Y, X^1, ..., X^10): a Naive Bayes net, a chain-structured network, and a spatially-dependent network]

Kernel quantization vs. decentralized LRT
[Figure: four scatter plots of test error, decentralized LRT vs. kernel quantization (KQ), for Naive Bayes networks, chain-structured (1st order), chain-structured (2nd order), and fully connected networks]

Wireless network with tiny Berkeley motes
[Figure: light source above a 5 × 5 grid of sensor motes]
- 5 × 5 = 25 tiny sensor motes, each equipped with a light receiver
- the light signal strength requires 10 bits ([0, 1024] in magnitude)
- perform classification with respect to different regions
- each problem has 25 training positions and 81 test positions
(Data collection courtesy of Bruno Sinopoli)

Classification with Mica sensor motes
[Figure: test error on 25 classification problem instances for the centralized SVM (10-bit signal), the centralized NB classifier, decentralized KQ (1 bit), and decentralized KQ (2 bits)]

Outline
- nonparametric decentralized detection algorithm: use of surrogate loss functions and marginalized kernels
- study of surrogate loss functions and divergence functionals: correspondence of losses and divergences
- M-estimator of divergences (e.g., Kullback-Leibler divergence)

Consistency question
- recall that our decentralized algorithm essentially solves
    $\min_{\gamma, Q} \hat{E}\phi(Y\gamma(Z))$
- does this also imply optimality in the sense of the 0-1 loss $P(Y \neq \gamma(Z))$?
Answers:
- the hinge loss yields consistent estimates
- all losses corresponding to the variational distance yield consistency, and we can identify all of them
- the exponential loss and the logistic loss do not
- the gist lies in the correspondence between loss functions and divergence functionals

Divergence between two distributions
The f-divergence between two densities µ and π is given by
    $I_f(\mu, \pi) := \int \pi(z)\, f\Big(\frac{\mu(z)}{\pi(z)}\Big)\, d\nu$,
where $f : [0, +\infty) \to \mathbb{R} \cup \{+\infty\}$ is a continuous convex function.
- Kullback-Leibler divergence: $f(u) = u \log u$,
    $I_f(\mu, \pi) = \sum_z \mu(z) \log \frac{\mu(z)}{\pi(z)}$
- variational distance: $f(u) = |u - 1|$,
    $I_f(\mu, \pi) = \sum_z |\mu(z) - \pi(z)|$
- Hellinger distance: $f(u) = \frac{1}{2}(\sqrt{u} - 1)^2$,
    $I_f(\mu, \pi) = \frac{1}{2}\sum_{z \in Z} \big(\sqrt{\mu(z)} - \sqrt{\pi(z)}\big)^2$
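
For discrete measures these three divergences follow directly from the definition; a minimal sketch (assuming strictly positive π, with made-up distributions):

    import numpy as np

    def f_divergence(mu, pi, f):
        # I_f(mu, pi) = sum_z pi(z) * f(mu(z) / pi(z)) for discrete distributions
        mu, pi = np.asarray(mu, float), np.asarray(pi, float)
        return float(np.sum(pi * f(mu / pi)))

    kl          = lambda u: u * np.log(u)               # f(u) = u log u
    variational = lambda u: np.abs(u - 1.0)             # f(u) = |u - 1|
    hellinger   = lambda u: 0.5 * (np.sqrt(u) - 1)**2   # f(u) = (1/2)(sqrt(u) - 1)^2

    # toy usage with two made-up distributions on 3 points
    mu = [0.5, 0.3, 0.2]
    pi = [0.2, 0.3, 0.5]
    for name, f in [("KL", kl), ("variational", variational), ("Hellinger", hellinger)]:
        print(name, round(f_divergence(mu, pi, f), 4))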

Surrogate loss and f-divergence
The map Q induces measures on Z:
    $\mu(z) := P(Y = 1, z)$;   $\pi(z) := P(Y = -1, z)$
Theorem: Fixing Q, the optimal risk for each φ-loss is an f-divergence for some convex f, and vice versa:
    $R_\phi(Q) = I_f(\mu, \pi)$, where $R_\phi(Q) := \min_\gamma E\phi(Y\gamma(Z))$
[Diagram: a correspondence between the class of loss functions {φ1, φ2, φ3, ...} and the class of f-divergences {f1, f2, f3, ...}]

Unrolling divergences by convex duality
Legendre-Fenchel convex duality: $f(u) = \sup_{v \in \mathbb{R}} \{uv - f^*(v)\}$, where $f^*$ is the convex conjugate of f.
    $I_f(\mu, \pi) = \int \pi f\Big(\frac{\mu}{\pi}\Big)\, d\nu$
    $= \int \pi \sup_{\gamma} \big(\gamma\mu/\pi - f^*(\gamma)\big)\, d\nu$
    $= \sup_{\gamma} \int \big(\gamma\mu - f^*(\gamma)\pi\big)\, d\nu$
    $= -\inf_{\gamma} \int \big(f^*(\gamma)\pi - \gamma\mu\big)\, d\nu$
The last quantity can be viewed as a risk functional with respect to the loss functions $f^*(\gamma)$ and $-\gamma$.
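
As a concrete instance of this unrolling (worked out here for the KL divergence): $f(u) = u\log u$ has conjugate $f^*(v) = e^{v-1}$, and the substitution $\gamma = \log g + 1$ gives, when µ is a probability measure,
$$I_f(\mu,\pi) = \sup_{\gamma} \int \gamma\, d\mu - \int e^{\gamma-1}\, d\pi = \sup_{g>0} \int \log g\, d\mu - \int g\, d\pi + 1,$$
with the supremum attained at the likelihood ratio $g = d\mu/d\pi$; this is exactly the variational form used for estimation later in the talk.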

Examples
- 0-1 loss: $R_{\mathrm{bayes}}(Q) = \frac{1}{2} - \frac{1}{2}\sum_{z \in Z}|\mu(z) - \pi(z)|$  ↔  variational distance
- hinge loss: $R_{\mathrm{hinge}}(Q) = 2 R_{\mathrm{bayes}}(Q)$  ↔  variational distance
- exponential loss: $R_{\mathrm{exp}}(Q) = 1 - \sum_{z \in Z}\big(\mu(z)^{1/2} - \pi(z)^{1/2}\big)^2$  ↔  Hellinger distance
- logistic loss: $R_{\mathrm{log}}(Q) = \log 2 - KL\big(\mu \,\|\, \tfrac{\mu+\pi}{2}\big) - KL\big(\pi \,\|\, \tfrac{\mu+\pi}{2}\big)$  ↔  capacitory discrimination distance
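
The first three identities can be checked numerically on a toy joint distribution; a small sketch (the measures are made up, and the inner minimization over γ(z) is done on a grid rather than in closed form):

    import numpy as np

    def optimal_phi_risk(mu, pi, phi, grid=np.linspace(-10, 10, 20001)):
        # R_phi(Q) = sum_z min_alpha [ mu(z) phi(alpha) + pi(z) phi(-alpha) ],
        # with the pointwise minimization over alpha approximated on a grid.
        return sum(np.min(m * phi(grid) + p * phi(-grid)) for m, p in zip(mu, pi))

    zero_one = lambda a: (a <= 0).astype(float)
    hinge    = lambda a: np.maximum(0.0, 1.0 - a)
    exp_loss = lambda a: np.exp(-a)

    # toy joint measures mu(z) = P(Y=1, z), pi(z) = P(Y=-1, z); together they sum to 1
    mu = np.array([0.30, 0.10, 0.15])
    pi = np.array([0.05, 0.25, 0.15])

    V = np.sum(np.abs(mu - pi))                      # variational distance
    H = np.sum((np.sqrt(mu) - np.sqrt(pi)) ** 2)     # sum_z (sqrt(mu) - sqrt(pi))^2

    print(optimal_phi_risk(mu, pi, zero_one), 0.5 - 0.5 * V)        # R_bayes
    print(optimal_phi_risk(mu, pi, hinge),    2 * (0.5 - 0.5 * V))  # R_hinge = 2 R_bayes
    print(optimal_phi_risk(mu, pi, exp_loss), 1.0 - H)              # R_exp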

Examples
[Figure: equivalent surrogate losses corresponding to the Hellinger distance (left) and the variational distance (right), e.g., g = exp(u - 1), g = u, g = u^2, plotted against the margin value]
- the part of the loss after 0 is a fixed map of the part before 0!

Comparison of loss functions
[Diagram: loss functions φ1, φ2, φ3 mapped to the f-divergences induced by f1, f2, f3]
Consider two loss functions φ1 and φ2, corresponding to the f-divergences induced by f1 and f2.
φ1 and φ2 are universally equivalent, denoted $\phi_1 \approx_U \phi_2$ (or, equivalently, $f_1 \approx_U f_2$), if for any P(X, Y) and quantization rules $Q_A, Q_B$, there holds:
    $R_{\phi_1}(Q_A) \le R_{\phi_1}(Q_B) \iff R_{\phi_2}(Q_A) \le R_{\phi_2}(Q_B)$.

An equivalence theorem
Theorem: $\phi_1 \approx_U \phi_2$ (or, equivalently, $f_1 \approx_U f_2$) if and only if
    $f_1(u) = c f_2(u) + a u + b$
for constants $a, b \in \mathbb{R}$ and $c > 0$.
In particular, the surrogate losses universally equivalent to the 0-1 loss are those whose induced f-divergence has the form:
    $f(u) = c \min\{u, 1\} + a u + b$
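
A short calculation behind the easy direction of the theorem (not spelled out on the slide): adding an affine term and scaling by $c > 0$ cannot change how an f-divergence orders quantizers, since
$$I_{cf + au + b}(\mu,\pi) = \int \pi\Big(c\, f\big(\tfrac{\mu}{\pi}\big) + a\,\tfrac{\mu}{\pi} + b\Big)\, d\nu = c\, I_f(\mu,\pi) + a\, P(Y=1) + b\, P(Y=-1),$$
and the added terms do not depend on the quantization rule Q.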

Empirical risk minimization procedure
- let φ be a convex surrogate equivalent to the 0-1 loss
- $(C_n, D_n)$ is a sequence of increasing function classes for (γ, Q):
    $(C_1, D_1) \subseteq (C_2, D_2) \subseteq \ldots \subseteq (\Gamma, \mathcal{Q})$
- our procedure learns:
    $(\gamma_n^*, Q_n^*) := \operatorname{argmin}_{(\gamma, Q) \in (C_n, D_n)} \hat{E}\phi(Y\gamma(Z))$
- let $R_{\mathrm{bayes}} := \inf_{(\gamma, Q) \in (\Gamma, \mathcal{Q})} P(Y \neq \gamma(Z))$ denote the optimal Bayes error
- our procedure is consistent if $R_{\mathrm{bayes}}(\gamma_n^*, Q_n^*) - R_{\mathrm{bayes}} \to 0$

Consistency result
Theorem: If
- $\bigcup_{n=1}^\infty (C_n, D_n)$ is dense in the space of pairs of classifier and quantizer $(\gamma, Q) \in (\Gamma, \mathcal{Q})$
- the sequence $(C_n, D_n)$ increases in size sufficiently slowly
then our procedure is consistent, i.e.,
    $\lim_{n \to \infty} R_{\mathrm{bayes}}(\gamma_n^*, Q_n^*) - R_{\mathrm{bayes}} = 0$ in probability.
The proof exploits:
- the equivalence of the φ-loss and the 0-1 loss
- a decomposition of the φ-risk into approximation error and estimation error

Outline
- nonparametric decentralized detection algorithm: use of surrogate loss functions and marginalized kernels
- study of surrogate loss functions and divergence functionals: correspondence of losses and divergences
- M-estimator of divergences (e.g., Kullback-Leibler divergence)

Estimating the divergence and the likelihood ratio
- given i.i.d. samples $\{x_1,\ldots,x_n\} \sim \mathbb{Q}$ and $\{y_1,\ldots,y_n\} \sim \mathbb{P}$
- we want to estimate two quantities:
  - the KL divergence functional $D_K(\mathbb{P}, \mathbb{Q}) = \int p_0 \log\frac{p_0}{q_0}\, d\mu$
  - the likelihood ratio function $g_0(\cdot) = p_0(\cdot)/q_0(\cdot)$

Variational characterization
- recall the correspondence: $\min_\gamma E\phi(Y\gamma(Z)) = I_f(\mu, \pi)$
- an f-divergence can be estimated by minimizing over some associated φ-risk functional
- for the Kullback-Leibler divergence:
    $D_K(\mathbb{P}, \mathbb{Q}) = \sup_{g > 0} \int \log g\, d\mathbb{P} - \int g\, d\mathbb{Q} + 1$
- furthermore, the supremum is attained at $g = p_0/q_0$

M-estimation procedure
- let G be a class of functions $\mathcal{X} \to \mathbb{R}_+$
- $\int \cdot\, d\mathbb{P}_n$ and $\int \cdot\, d\mathbb{Q}_n$ denote expectations under the empirical measures $\mathbb{P}_n$ and $\mathbb{Q}_n$, respectively
- our estimator has the following form:
    $\hat{D}_K = \sup_{g \in \mathcal{G}} \int \log g\, d\mathbb{P}_n - \int g\, d\mathbb{Q}_n + 1$
- the supremum is attained at $\hat{g}_n$, which estimates the likelihood ratio $p_0/q_0$
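
A minimal sketch of this M-estimator with a simple parametric class in place of the penalized RKHS classes used below: log g is taken linear in a fixed (hypothetical) feature map and the empirical objective is maximized directly.

    import numpy as np
    from scipy.optimize import minimize

    def features(x):
        # hypothetical feature map for log g; (1, x) suffices here because the
        # true log-ratio in the toy example below is affine in x
        x = np.atleast_1d(x)
        return np.stack([np.ones_like(x), x], axis=-1)

    def kl_estimate(xs_q, ys_p):
        # D_K_hat = sup_g (1/n) sum_j log g(y_j) - (1/n) sum_i g(x_i) + 1,
        # over g(x) = exp(theta . features(x)) > 0
        Fp, Fq = features(ys_p), features(xs_q)
        def neg_objective(theta):
            return -(np.mean(Fp @ theta) - np.mean(np.exp(Fq @ theta)) + 1.0)
        theta = minimize(neg_objective, np.zeros(Fp.shape[1])).x
        return -neg_objective(theta), theta

    # toy usage: P = N(1, 1), Q = N(0, 1); compare to the true KL(P, Q) = 0.5
    rng = np.random.default_rng(0)
    ys_p = rng.normal(1.0, 1.0, size=5000)   # samples from P
    xs_q = rng.normal(0.0, 1.0, size=5000)   # samples from Q
    d_hat, _ = kl_estimate(xs_q, ys_p)
    print(round(d_hat, 3))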

Convex empirical risk with penalty
- in practice, control the size of the function class G by using a penalty
- let I(g) be a measure of complexity for g
- decompose G as follows: $\mathcal{G} = \bigcup_{M \ge 1} \mathcal{G}_M$, where $\mathcal{G}_M$ is restricted to g for which $I(g) \le M$
- the estimation procedure involves solving:
    $\hat{g}_n = \operatorname{argmin}_{g \in \mathcal{G}} \int g\, d\mathbb{Q}_n - \int \log g\, d\mathbb{P}_n + \frac{\lambda_n}{2} I^2(g)$

Convergence analysis
- for the KL divergence estimate, we study $|\hat{D}_K - D_K(\mathbb{P}, \mathbb{Q})|$
- for the likelihood ratio estimate, we use the Hellinger distance
    $h_Q^2(g, g_0) := \frac{1}{2} \int \big(g^{1/2} - g_0^{1/2}\big)^2\, d\mathbb{Q}$

Assumptions for the convergence analysis
- the true likelihood ratio $g_0$ is bounded from below by some positive constant: $g_0 \ge \eta_0 > 0$.
  Note: we don't assume that G is bounded away from 0 (not yet)!
- the uniform norm of $\mathcal{G}_M$ is Lipschitz with respect to the penalty measure I(g): for any $M \ge 1$,
    $\sup_{g \in \mathcal{G}_M} \|g\|_\infty \le cM$
- on the entropy of G: for some $0 < \gamma < 2$,
    $H^B_\delta(\mathcal{G}_M, L_2(\mathbb{Q})) = O(M/\delta)^\gamma$

Convergence rates
- when $\lambda_n$ vanishes sufficiently slowly:
    $\lambda_n^{-1} = O_P\big(n^{2/(2+\gamma)}\big)(1 + I(g_0))$
- then under $\mathbb{P}$:
    $h_Q(g_0, \hat{g}_n) = O_P(\lambda_n^{1/2})(1 + I(g_0))$
    $I(\hat{g}_n) = O_P(1 + I(g_0))$
- if G is bounded away from 0:
    $|D_K - \hat{D}_K| = O_P(\lambda_n^{1/2})(1 + I(g_0))$

G is an RKHS function class
- $\{x_i\} \sim \mathbb{Q}$, $\{y_j\} \sim \mathbb{P}$
- G is an RKHS with Mercer kernel $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$; $I(g) = \|g\|_{\mathcal{H}}$
    $\hat{g}_n = \operatorname{argmin}_{g \in \mathcal{G}} \int g\, d\mathbb{Q}_n - \int \log g\, d\mathbb{P}_n + \frac{\lambda_n}{2} \|g\|_{\mathcal{H}}^2$
- Convex dual formulation:
    $\hat{\alpha} := \operatorname{argmax}_\alpha\; \frac{1}{n} \sum_{j=1}^n \log \alpha_j - \frac{1}{2\lambda_n} \Big\| \sum_{j=1}^n \alpha_j \Phi(y_j) - \frac{1}{n} \sum_{i=1}^n \Phi(x_i) \Big\|^2$
    $\hat{D}_K(\mathbb{P}, \mathbb{Q}) := \frac{1}{n} \sum_{j=1}^n \log \hat{\alpha}_j - \log n$

log G is an RKHS function class
- $\{x_i\} \sim \mathbb{Q}$, $\{y_j\} \sim \mathbb{P}$
- log G is an RKHS with Mercer kernel $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$; $I(g) = \|\log g\|_{\mathcal{H}}$
    $\hat{g}_n = \operatorname{argmin}_{g \in \mathcal{G}} \int g\, d\mathbb{Q}_n - \int \log g\, d\mathbb{P}_n + \frac{\lambda_n}{2} \|\log g\|_{\mathcal{H}}^2$
- Convex dual formulation:
    $\hat{\alpha} := \operatorname{argmax}_\alpha\; \frac{1}{n} \sum_{i=1}^n \Big( -\alpha_i \log \alpha_i + \alpha_i \log \frac{n}{e} \Big) - \frac{1}{2\lambda_n} \Big\| \sum_{i=1}^n \alpha_i \Phi(x_i) - \frac{1}{n} \sum_{j=1}^n \Phi(y_j) \Big\|^2$
    $\hat{D}_K(\mathbb{P}, \mathbb{Q}) := 1 + \sum_{i=1}^n \Big( \hat{\alpha}_i \log \hat{\alpha}_i + \hat{\alpha}_i \log \frac{n}{e} \Big)$

[Figure: estimated KL divergence vs. sample size for KL(Beta(1,2), Unif[0,1]), KL(1/2 N_t(0,1) + 1/2 N_t(1,1), Unif[-5,5]), KL(N_t(0,1), N_t(4,2)), and KL(N_t(4,2), N_t(0,1)); the estimators M1 and M2 (various σ, λ) are compared against WKV (various s), with the true divergence values marked]

[Figure: estimated KL divergence vs. sample size for KL(N_t(0,I_2), N_t(1,I_2)), KL(N_t(0,I_2), Unif[-3,3]^2), KL(N_t(0,I_3), N_t(1,I_3)), and KL(N_t(0,I_3), Unif[-3,3]^3); the estimators M1 and M2 are compared against WKV, with the true divergence values marked]

Conclusion
- nonparametric decentralized detection algorithm: use of surrogate loss functions and marginalized kernels
- study of surrogate loss functions and divergence functionals: correspondence of losses and divergences
- M-estimator of divergences (e.g., Kullback-Leibler divergence)