Surrogate loss functions, divergences and decentralized detection XuanLong Nguyen Department of Electrical Engineering and Computer Science U.C. Berkeley Advisors: Michael Jordan & Martin Wainwright 1
Talk outline
- nonparametric decentralized detection algorithm
  - use of surrogate loss functions and marginalized kernels
  - use of convex analysis
- study of surrogate loss functions and divergence functionals
  - correspondence of losses and divergences
  - M-estimator of divergences (e.g., Kullback-Leibler divergence)
2
Decentralized decision-making problem: learning both classifier and experiment
- covariate vector X and hypothesis (label) Y = ±1
- we do not have direct access to X in order to determine Y
- learn jointly the mapping (Q, γ): X → Q → Z → γ → Y
- roles of the experiment Q, due to:
  - data collection constraints (e.g., decentralization)
  - data transmission constraints
  - choice of variates (feature selection)
  - dimensionality reduction scheme
3-a
A decentralized detection system: sensors → communication channel → fusion center → decision rule
- Decentralized setting: communication constraints between sensors and fusion center (e.g., bit constraints)
- Goal: design decision rules for the sensors and the fusion center
- Criterion: minimize the probability of incorrect detection
4
Concrete example: wireless sensor network
(Figure: light source over a field of sensors)
- Set-up: wireless network of tiny sensor motes, each equipped with light/humidity/temperature sensing capabilities
- measurement of signal strength ([0, 1024] in magnitude, i.e., 10 bits)
- Goal: is there a forest fire in a certain region?
5
Related work
- Classical work on classification/detection:
  - completely centralized
  - no consideration of communication-theoretic infrastructure
- Decentralized detection in signal processing (e.g., Tsitsiklis, 1993):
  - joint distribution assumed to be known
  - locally-optimal rules under conditional independence assumptions (i.e., naive Bayes)
6-a
Overview of our approach
- Treat as a nonparametric estimation (learning) problem under constraints from a distributed system
- Use kernel methods and convex surrogate loss functions
- Use tools from convex optimization to derive an efficient algorithm
7
Problem set-up
(Figure: Y ∈ {±1}; sensor signals X¹,...,X^S; local rules γ¹,...,γ^S produce messages Z¹,...,Z^S; fusion rule γ(Z¹,...,Z^S))
  X ∈ {1,...,M}^S,  Z ∈ {1,...,L}^S,  with L ≪ M
Problem: given training data (x_i, y_i), i = 1,...,n, find the decision rules (γ¹,...,γ^S; γ) so as to minimize the detection error probability:
  P(Y ≠ γ(Z¹,...,Z^S)).
8
Kernel methods for classification
- Classification: learn γ(z) that predicts the label y
- K(z, z′) is a symmetric positive semidefinite kernel function
- feature space H in which K acts as an inner product, i.e., K(z, z′) = ⟨Ψ(z), Ψ(z′)⟩
- a kernel-based algorithm finds a linear function in H, i.e.,
  γ(z) = ⟨w, Ψ(z)⟩ = Σ_{i=1}^n α_i K(z_i, z)
- Advantages:
  - kernel function classes are sufficiently rich for many applications
  - optimizing over kernel function classes is computationally efficient
9
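A minimal sketch of the kernel expansion above. The Gaussian (RBF) kernel, the training points, and the coefficients are illustrative assumptions, not from the talk; in practice the α_i come from solving the training problem.

```python
import numpy as np

def rbf_kernel(z, zp, sigma=1.0):
    """Symmetric positive semidefinite Gaussian kernel K(z, z')."""
    return np.exp(-np.sum((np.asarray(z) - np.asarray(zp)) ** 2) / (2 * sigma**2))

def decision_function(alphas, train_z, z, kernel=rbf_kernel):
    """gamma(z) = <w, Psi(z)> written through the kernel expansion sum_i alpha_i K(z_i, z)."""
    return sum(a * kernel(zi, z) for a, zi in zip(alphas, train_z))

train_z = [np.array([0.0]), np.array([1.0]), np.array([2.0])]
alphas = [1.0, -0.5, 0.25]               # would be learned from data in practice
score = decision_function(alphas, train_z, np.array([0.5]))
label = 1 if score >= 0 else -1
```

The kernel trick is visible here: the feature map Ψ never appears explicitly, only inner products K(z, z′).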
Convex surrogate loss functions φ for the 0-1 loss
(Figure: surrogate loss vs. margin value for the zero-one, hinge, logistic, and exponential losses)
minimize the (regularized) empirical φ-risk Êφ(Y γ(Z)):
  min_{γ∈H} Σ_{i=1}^n φ(y_i γ(z_i)) + (λ/2) ‖γ‖²
- (z_i, y_i), i = 1,...,n, are training data in Z × {±1}
- φ is a convex loss function (surrogate for the non-convex 0-1 loss)
10
Stochastic decision rules at each sensor
(Figure: Y; sensor signals X¹,...,X^S pass through channels Q(Z|X) to messages Z¹,...,Z^S; fusion rule γ(z))
- approximate deterministic sensor decisions by stochastic rules Q(Z|X)
- sensors do not communicate directly ⇒ factorization: Q(Z|X) = Π_{t=1}^S Q^t(Z^t|X^t)
- the overall decision rule is represented by γ(z) = ⟨w, Ψ(z)⟩, where K_z(z, z′) = ⟨Ψ(z), Ψ(z′)⟩
11
High-level strategy: joint optimization
- Minimize over (Q, γ) an empirical version of Eφ(Y γ(Z))
- Joint minimization:
  - fix Q, optimize over γ: a simple convex problem
  - fix γ, perform a gradient update for Q, sensor by sensor
12
Approximating the empirical φ-risk
The regularized empirical φ-risk Êφ(Y γ(Z)) has the form:
  G₀ = Σ_z Σ_{i=1}^n φ(y_i γ(z)) Q(z|x_i) + (λ/2) ‖w‖²
- Challenge: even evaluating G₀ at a single point is intractable — it requires summing over L^S possible values for z
- Idea: approximate G₀ by another objective function G
  - G is the φ-risk of a marginalized feature space
  - G₀ = G for deterministic Q
13
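A numerical check of the relation between G₀ and its surrogate G on toy data (alphabet, rules, and γ values are assumed for illustration; the common regularizer (λ/2)‖w‖² is omitted since it is identical in both). Because φ is convex, Jensen's inequality gives G ≤ G₀, with equality when every Q(·|x) is deterministic.

```python
import numpy as np

phi = lambda t: np.log1p(np.exp(-t))        # logistic surrogate
Z = [0, 1, 2]                                # message alphabet
gamma = np.array([1.2, -0.7, 0.3])           # gamma(z) for each z

def G0(Q, xs, ys):
    # sum_z sum_i phi(y_i * gamma(z)) Q(z | x_i)
    return sum(Q[x][z] * phi(y * gamma[z]) for x, y in zip(xs, ys) for z in Z)

def G(Q, xs, ys):
    # sum_i phi(y_i * f_Q(x_i)),  f_Q(x) = sum_z Q(z|x) gamma(z)
    return sum(phi(y * sum(Q[x][z] * gamma[z] for z in Z)) for x, y in zip(xs, ys))

xs, ys = [0, 1, 0, 1], [1, -1, -1, 1]
Q_stoch = {0: [0.2, 0.5, 0.3], 1: [0.6, 0.1, 0.3]}   # stochastic rule
Q_det   = {0: [0.0, 1.0, 0.0], 1: [1.0, 0.0, 0.0]}   # deterministic rule
assert G(Q_stoch, xs, ys) <= G0(Q_stoch, xs, ys)      # Jensen
assert abs(G(Q_det, xs, ys) - G0(Q_det, xs, ys)) < 1e-12
```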
Marginalizing over the feature space
(Figure: the rule Q(z|x) maps the original space X to the quantized space Z; γ(z) = ⟨w, Ψ(z)⟩ acts on Z, while f_Q(x) = ⟨w, Ψ_Q(x)⟩ acts on X)
Stochastic decision rule Q(z|x):
- maps between X and Z
- induces a marginalized feature map Ψ_Q from the base map Ψ (or a marginalized kernel K_Q from the base kernel K)
14
Marginalized feature space {Ψ_Q(x)}
Define a new feature map Ψ_Q(x) and a linear function over it:
  Ψ_Q(x) = Σ_z Q(z|x) Ψ(z)   (marginalization over z)
  f_Q(x) = ⟨w, Ψ_Q(x)⟩
The alternative objective function G is the φ-risk for f_Q:
  G = Σ_{i=1}^n φ(y_i f_Q(x_i)) + (λ/2) ‖w‖²
Ψ_Q(x) induces a marginalized kernel over X:
  K_Q(x, x′) := ⟨Ψ_Q(x), Ψ_Q(x′)⟩ = Σ_{z,z′} Q(z|x) Q(z′|x′) K_z(z, z′)
Marginalization is taken over the message z conditioned on the sensor signal x
15-c
Marginalized kernels
- have been used to derive kernel functions from generative models (e.g., Tsuda, 2002)
- the marginalized kernel K_Q(x, x′) is defined as:
  K_Q(x, x′) := Σ_{z,z′} Q(z|x) Q(z′|x′) K_z(z, z′)
  (factorized distributions × base kernel)
- if K_z(z, z′) decomposes into smaller components of z and z′, then K_Q(x, x′) can be computed efficiently (in polynomial time)
16
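A sketch of the efficiency claim for one assumed instance of a decomposable base kernel, K_z(z, z′) = Σ_t k_t(z_t, z′_t), with a factorized rule Q. The double sum over all L^S message pairs then collapses to a sum over sensors and single coordinates; the per-coordinate kernel, alphabet sizes, and rules below are illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
S, L = 3, 2                                  # sensors, message alphabet size

def random_rule():
    # Q[t] is an L-vector: sensor t's rule Q^t(z_t | x_t) for one fixed input
    q = rng.random((S, L))
    return q / q.sum(axis=1, keepdims=True)

Qx, Qxp = random_rule(), random_rule()       # rules evaluated at inputs x and x'

k_t = lambda a, b: 1.0 if a == b else 0.2    # assumed per-coordinate base kernel
def K_base(z, zp):                           # K_z decomposes as a sum over sensors
    return sum(k_t(z[t], zp[t]) for t in range(S))

def K_Q_brute():                             # exponential: L^S x L^S terms
    total = 0.0
    for z in itertools.product(range(L), repeat=S):
        for zp in itertools.product(range(L), repeat=S):
            pz = np.prod([Qx[t, z[t]] for t in range(S)])
            pzp = np.prod([Qxp[t, zp[t]] for t in range(S)])
            total += pz * pzp * K_base(z, zp)
    return total

def K_Q_fast():                              # polynomial: S * L^2 terms
    return sum(Qx[t, a] * Qxp[t, b] * k_t(a, b)
               for t in range(S) for a in range(L) for b in range(L))

assert abs(K_Q_brute() - K_Q_fast()) < 1e-10
```

The collapse works because each Q^t row sums to 1, so the coordinates not appearing in k_t marginalize out.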
Centralized and decentralized functions
- the centralized decision function is obtained by minimizing the φ-risk:
  f_Q(x) = ⟨w, Ψ_Q(x)⟩ — f_Q has direct access to the sensor signal x
- the optimal w also defines the decentralized decision function:
  γ(z) = ⟨w, Ψ(z)⟩ — γ has access only to the quantized version z
- the decentralized γ behaves on average like the centralized f_Q:
  f_Q(x) = E[γ(Z) | x]
17-b
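The averaging identity follows from linearity of the inner product: ⟨w, Ψ_Q(x)⟩ = Σ_z Q(z|x) ⟨w, Ψ(z)⟩ = E[γ(Z) | x]. A quick numerical check (random toy features and rule assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
L = 4
Psi = rng.standard_normal((L, 3))       # feature vector Psi(z) for each message z
w = rng.standard_normal(3)
q = rng.random(L); q /= q.sum()         # Q(z | x) for one fixed input x

gamma = Psi @ w                          # gamma(z) = <w, Psi(z)>, decentralized
f_Q = w @ (q @ Psi)                      # <w, Psi_Q(x)> with Psi_Q(x) = sum_z Q(z|x) Psi(z)
assert abs(f_Q - q @ gamma) < 1e-12      # equals E[gamma(Z) | x]
```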
Optimization algorithm
Goal: solve the problem
  inf_{w;Q} G(w; Q) := Σ_i φ(y_i ⟨w, Σ_z Q(z|x_i) Ψ(z)⟩) + (λ/2) ‖w‖²
- Finding the optimal weight vector: G is convex in w with Q fixed
  - solve the dual problem (a convex quadratic program) to obtain the optimal w(Q)
- Finding the optimal decision rules: G is convex in Q^t with w and all other {Q^r, r ≠ t} fixed
  - efficient computation of a subgradient of G at the optimal (w(Q), Q)
Overall: efficient joint minimization by blockwise coordinate descent
18
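A toy sketch of the blockwise scheme (assumed setup, much smaller than the talk's: one binary sensor message, two input values, logistic loss, gradient steps instead of the dual QP). The w-block solves the convex subproblem for fixed Q by gradient descent; the Q-block takes a projected gradient step with w fixed.

```python
import numpy as np

phi  = lambda t: np.log1p(np.exp(-t))         # logistic surrogate
dphi = lambda t: -1.0 / (1.0 + np.exp(t))     # phi'(t)

xs = np.array([0, 0, 1, 1])                   # two input values
ys = np.array([1, 1, -1, -1])
lam = 0.1
q = np.array([0.9, 0.1])                      # q[x] = Q(z=1 | x); asymmetric start
w = np.zeros(2)                               # gamma(z) = w[z] for z in {0, 1}

def fQ(w, q):                                 # f_Q(x_i) = (1 - q_x) w[0] + q_x w[1]
    qi = q[xs]
    return (1 - qi) * w[0] + qi * w[1]

def G(w, q):
    return phi(ys * fQ(w, q)).sum() + 0.5 * lam * w @ w

for it in range(200):
    # block 1: gradient steps in w (convex subproblem, Q fixed)
    for _ in range(20):
        m = ys * fQ(w, q)
        g = dphi(m) * ys
        grad_w = np.array([(g * (1 - q[xs])).sum(), (g * q[xs]).sum()]) + lam * w
        w -= 0.1 * grad_w
    # block 2: projected gradient step in q (w fixed), sensor by sensor
    m = ys * fQ(w, q)
    g = dphi(m) * ys * (w[1] - w[0])
    grad_q = np.array([g[xs == 0].sum(), g[xs == 1].sum()])
    q = np.clip(q - 0.1 * grad_q, 0.0, 1.0)   # keep q a valid probability
```

On this separable toy data the objective drops well below its initial value 4 log 2.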
Simulated sensor networks
(Figure: three graphical models over Y and X¹,...,X¹⁰ — a naive Bayes net, a chain-structured network, and a spatially-dependent network)
19
Kernel quantization vs. decentralized LRT
(Figure: four scatter plots of test error, decentralized LRT vs. kernel quantization (KQ), for naive Bayes networks, chain-structured networks of 1st and 2nd order, and fully connected networks)
20
Wireless network with tiny Berkeley motes
(Figure: light source over a field of sensors)
- 5 × 5 = 25 tiny sensor motes, each equipped with a light receiver
- light signal strength requires 10 bits ([0, 1024] in magnitude)
- perform classification with respect to different regions
- each problem has 25 training positions, 81 test positions
(Data collection courtesy Bruno Sinopoli)
21
(Figure: classification with Mica sensor motes — test error across 25 classification problem instances for a centralized SVM (10-bit signal), a centralized naive Bayes classifier, and decentralized KQ with 1-bit and 2-bit messages)
22
Outline
- nonparametric decentralized detection algorithm
  - use of surrogate loss functions and marginalized kernels
- study of surrogate loss functions and divergence functionals
  - correspondence of losses and divergences
  - M-estimator of divergences (e.g., Kullback-Leibler divergence)
23
Consistency question
- recall that our decentralized algorithm essentially solves min_{γ,Q} Êφ(Y γ(Z))
- does this also imply optimality in the sense of the 0-1 loss P(Y ≠ γ(Z))?
- answers:
  - hinge loss yields consistent estimates
  - all losses corresponding to the variational distance yield consistency, and we can identify all of them
  - exponential loss and logistic loss do not
- the gist lies in the correspondence between loss functions and divergence functionals
24-a
Divergence between two distributions
The f-divergence between two densities µ and π is given by
  I_f(µ, π) := ∫ π(z) f(µ(z)/π(z)) dν,
where f : [0, +∞) → R ∪ {+∞} is a continuous convex function
- Kullback-Leibler divergence: f(u) = u log u,
  I_f(µ, π) = Σ_z µ(z) log(µ(z)/π(z))
- variational distance: f(u) = |u − 1|,
  I_f(µ, π) = Σ_z |µ(z) − π(z)|
- Hellinger distance: f(u) = ½(√u − 1)²,
  I_f(µ, π) = ½ Σ_{z∈Z} (√µ(z) − √π(z))²
25-a
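The three examples can be checked directly against the generic definition on discrete µ, π (toy distributions assumed): for each f, π f(µ/π) reduces term-by-term to the familiar closed form.

```python
import numpy as np

def I_f(f, mu, pi):
    """Generic f-divergence sum_z pi(z) f(mu(z)/pi(z)) for discrete measures."""
    mu, pi = np.asarray(mu, float), np.asarray(pi, float)
    return float(np.sum(pi * f(mu / pi)))

mu = np.array([0.2, 0.5, 0.3])
pi = np.array([0.4, 0.4, 0.2])

kl        = I_f(lambda u: u * np.log(u), mu, pi)
variation = I_f(lambda u: np.abs(u - 1), mu, pi)
hellinger = I_f(lambda u: 0.5 * (np.sqrt(u) - 1) ** 2, mu, pi)

assert np.isclose(kl, np.sum(mu * np.log(mu / pi)))
assert np.isclose(variation, np.sum(np.abs(mu - pi)))
assert np.isclose(hellinger, 0.5 * np.sum((np.sqrt(mu) - np.sqrt(pi)) ** 2))
```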
Surrogate loss and f-divergence
The map Q induces measures on Z: µ(z) := P(Y = 1, z); π(z) := P(Y = −1, z)
Theorem: fixing Q, the optimal risk for each φ-loss is (the negative of) an f-divergence for some convex f, and vice versa:
  R_φ(Q) = −I_f(µ, π), where R_φ(Q) := min_γ Eφ(Y γ(Z))
(Figure: the correspondence between the class of loss functions {φ₁, φ₂, φ₃} and the class of f-divergences {f₁, f₂, f₃})
26
Unrolling divergences by convex duality
Legendre-Fenchel convex duality: f(u) = sup_{v∈R} {uv − f*(v)}, where f* is the convex conjugate of f
  I_f(µ, π) = ∫ π f(µ/π) dν
            = ∫ π sup_γ (γµ/π − f*(γ)) dν
            = sup_γ ∫ (γµ − f*(γ)π) dν
            = −inf_γ ∫ (f*(γ)π − γµ) dν
The last quantity can be viewed as (minus) an optimal risk functional, with loss f*(γ) weighted by π and loss −γ weighted by µ
27-d
Examples
- 0-1 loss: R_bayes(Q) = ½ − ½ Σ_{z∈Z} |µ(z) − π(z)| — variational distance
- hinge loss: R_hinge(Q) = 2 R_bayes(Q) — variational distance
- exponential loss: R_exp(Q) = 1 − Σ_{z∈Z} (µ(z)^{1/2} − π(z)^{1/2})² — Hellinger distance
- logistic loss: R_log(Q) = log 2 − KL(µ ‖ (µ+π)/2) − KL(π ‖ (µ+π)/2) — capacitory discrimination distance
28
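A numerical check of the first two rows on an assumed toy pair (µ, π), where µ(z) = P(Y = 1, z) and π(z) = P(Y = −1, z) so that µ + π sums to 1. The per-z optimal hinge risk is found by brute force over γ(z).

```python
import numpy as np

mu = np.array([0.15, 0.30, 0.05])
pi = np.array([0.25, 0.05, 0.20])
assert np.isclose(mu.sum() + pi.sum(), 1.0)   # joint measures over (Y, z)

grid = np.linspace(-3, 3, 6001)               # brute-force search over gamma(z)
hinge = lambda t: np.maximum(0.0, 1.0 - t)

# optimal hinge risk: per z, minimize mu(z) hinge(gamma) + pi(z) hinge(-gamma)
R_hinge = sum(np.min(m * hinge(grid) + p * hinge(-grid)) for m, p in zip(mu, pi))
# optimal 0-1 risk: per z, predict the likelier label, erring with min(mu, pi)
R_bayes = np.minimum(mu, pi).sum()

assert np.isclose(R_hinge, 2 * R_bayes, atol=1e-6)
assert np.isclose(R_bayes, 0.5 - 0.5 * np.abs(mu - pi).sum())
```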
Examples
(Figure: equivalent surrogate losses corresponding to the Hellinger distance (left, with g = exp(u − 1), g = u, g = u²) and the variational distance (right, with g = e^u − 1, g = u, g = u²), plotted against the margin value)
the part of each loss after margin 0 is a fixed map of the part before 0!
29
Comparison of loss functions
(Figure: the correspondence between the class of loss functions {φ₁, φ₂, φ₃} and the class of f-divergences {f₁, f₂, f₃})
Two loss functions φ₁ and φ₂ correspond to the f-divergences induced by f₁ and f₂.
φ₁ and φ₂ are universally equivalent, denoted φ₁ ≈_u φ₂ (or, equivalently, f₁ ≈_u f₂), if for any P(X, Y) and quantization rules Q_A, Q_B, there holds:
  R_φ₁(Q_A) ≤ R_φ₁(Q_B) ⟺ R_φ₂(Q_A) ≤ R_φ₂(Q_B)
30
An equivalence theorem
Theorem: φ₁ ≈_u φ₂ (or, equivalently, f₁ ≈_u f₂) if and only if
  f₁(u) = c f₂(u) + au + b
for constants a, b ∈ R and c > 0
- in particular, the surrogate losses universally equivalent to the 0-1 loss are those whose induced f-divergence has the form:
  f(u) = −c min{u, 1} + au + b
31
Empirical risk minimization procedure
- let φ be a convex surrogate equivalent to the 0-1 loss
- (C_n, D_n) is a sequence of increasing function classes for (γ, Q):
  (C₁, D₁) ⊆ (C₂, D₂) ⊆ ... ⊆ (Γ, Q)
- our procedure learns:
  (γ*_n, Q*_n) := argmin_{(γ,Q)∈(C_n,D_n)} Êφ(Y γ(Z))
- let R_bayes := inf_{(γ,Q)∈(Γ,Q)} P(Y ≠ γ(Z)) — the optimal Bayes error
- our procedure is consistent if R_bayes(γ*_n, Q*_n) − R_bayes → 0
32
Consistency result
Theorem: if
- ∪_{n=1}^∞ (C_n, D_n) is dense in the space of pairs of classifier and quantizer (γ, Q) ∈ (Γ, Q), and
- the sequence (C_n, D_n) increases in size sufficiently slowly,
then our procedure is consistent, i.e.,
  lim_n R_bayes(γ*_n, Q*_n) − R_bayes = 0 in probability.
- the proof exploits the equivalence of the φ-loss and the 0-1 loss
- and a decomposition of the φ-risk into approximation error and estimation error
33
Outline
- nonparametric decentralized detection algorithm
  - use of surrogate loss functions and marginalized kernels
- study of surrogate loss functions and divergence functionals
  - correspondence of losses and divergences
  - M-estimator of divergences (e.g., Kullback-Leibler divergence)
34
Estimating the divergence and the likelihood ratio
- given i.i.d. samples {x₁,...,x_n} ~ Q and {y₁,...,y_n} ~ P
- want to estimate two quantities:
  - the KL divergence functional D_K(P, Q) = ∫ p₀ log(p₀/q₀) dµ
  - the likelihood ratio function g₀(·) = p₀(·)/q₀(·)
35
Variational characterization
- recall the correspondence: min_γ Eφ(Y γ(Z)) = −I_f(µ, π)
- an f-divergence can thus be estimated by minimizing an associated φ-risk functional
- for the Kullback-Leibler divergence:
  D_K(P, Q) = sup_{g>0} ∫ log g dP − ∫ g dQ + 1
- furthermore, the supremum is attained at g = p₀/q₀
36-a
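A numerical check of the variational formula on assumed discrete P, Q: the objective evaluated at g = p/q equals the KL divergence, and perturbing g away from the ratio can only decrease it (for each z, p log g − q g is concave in g with maximizer p/q).

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

def objective(g):
    # int log g dP - int g dQ + 1, for a candidate ratio g > 0
    return np.sum(p * np.log(g)) - np.sum(q * g) + 1.0

kl = np.sum(p * np.log(p / q))
assert np.isclose(objective(p / q), kl)       # attained at g = p/q

rng = np.random.default_rng(4)
for _ in range(100):                          # random positive perturbations
    g = (p / q) * np.exp(0.5 * rng.standard_normal(3))
    assert objective(g) <= kl + 1e-12
```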
M-estimation procedure
- let G be a class of functions X → R₊
- dP_n and dQ_n denote expectation under the empirical measures P_n and Q_n, respectively
- our estimator has the form:
  D̂_K = sup_{g∈G} ∫ log g dP_n − ∫ g dQ_n + 1
- the supremum is attained at ĝ_n, which estimates the likelihood ratio p₀/q₀
37
Convex empirical risk with penalty
- in practice, control the size of the function class G using a penalty
- let I(g) be a measure of complexity for g
- decompose G as G = ∪_{M≥1} G_M, where G_M is restricted to g for which I(g) ≤ M
- the estimation procedure involves solving:
  ĝ_n = argmin_{g∈G} ∫ g dQ_n − ∫ log g dP_n + (λ_n/2) I²(g)
38
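A sketch of the penalized procedure on a toy problem where the true ratio is log-linear (the parametric class, penalty, and optimizer here are assumptions for illustration — the talk uses an RKHS class). With P = N(1,1) and Q = N(0,1), p₀/q₀ = exp(x − ½), so the class g_θ(x) = exp(θ₀ + θ₁x) with I(g) = ‖θ‖ is realizable; the objective is convex in θ and is minimized by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4000
xp = rng.normal(1.0, 1.0, n)                 # samples from P = N(1, 1)
xq = rng.normal(0.0, 1.0, n)                 # samples from Q = N(0, 1)
Fp = np.stack([np.ones(n), xp], axis=1)      # features (1, x) on the P-sample
Fq = np.stack([np.ones(n), xq], axis=1)

theta = np.zeros(2)
lam = 1.0 / n                                # slowly vanishing penalty weight
for _ in range(3000):
    gq = np.exp(Fq @ theta)                  # g_theta on the Q-sample
    # gradient of  int g dQ_n - int log g dP_n + (lam/2)||theta||^2
    grad = (Fq * gq[:, None]).mean(axis=0) - Fp.mean(axis=0) + lam * theta
    theta -= 0.05 * grad

# plug the fitted ratio into the variational objective to estimate D_K
D_hat = (Fp @ theta).mean() - np.exp(Fq @ theta).mean() + 1.0
# true value: KL(N(1,1), N(0,1)) = 1/2
```

The fitted θ should be near (−½, 1), the parameters of the true log-ratio, and D_hat near 0.5 up to sampling error.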
Convergence analysis
- for KL divergence estimation, we study |D̂_K − D_K(P, Q)|
- for likelihood ratio estimation, we use the Hellinger distance
  h²_Q(g, g₀) := ½ ∫ (g^{1/2} − g₀^{1/2})² dQ
39
Assumptions for the convergence analysis
- the true likelihood ratio g₀ is bounded from below by a positive constant: g₀ ≥ η₀ > 0
  (note: we don't assume that G is bounded away from 0 — not yet!)
- the uniform norm on G_M is Lipschitz with respect to the penalty measure I(g): for any M ≥ 1,
  sup_{g∈G_M} ‖g‖_∞ ≤ cM
- on the entropy of G: for some 0 < γ < 2,
  H^B_δ(G_M, L₂(Q)) = O(M/δ)^γ
40
Convergence rates
- when λ_n vanishes sufficiently slowly:
  λ_n⁻¹ = O_P(n^{2/(2+γ)}) (1 + I(g₀)),
  then under P:
  h_Q(g₀, ĝ_n) = O_P(λ_n^{1/2}) (1 + I(g₀))
  I(ĝ_n) = O_P(1 + I(g₀))
- if G is bounded away from 0:
  |D_K − D̂_K| = O_P(λ_n^{1/2}) (1 + I(g₀))
41-a
G is an RKHS function class
- {x_i} ~ Q, {y_j} ~ P
- G is an RKHS with Mercer kernel k(x, x′) = ⟨Φ(x), Φ(x′)⟩; take I(g) = ‖g‖_H:
  ĝ_n = argmin_{g∈G} ∫ g dQ_n − ∫ log g dP_n + (λ_n/2) ‖g‖²_H
- Convex dual formulation:
  α* := argmax_{α>0} (1/n) Σ_{j=1}^n log α_j − (1/(2λ_n)) ‖(1/n) Σ_{j=1}^n α_j Φ(y_j) − (1/n) Σ_{i=1}^n Φ(x_i)‖²
- the estimate D̂_K(P, Q) is obtained by plugging the ĝ_n recovered from α* into the variational objective ∫ log g dP_n − ∫ g dQ_n + 1
42-a
log G is an RKHS function class
- {x_i} ~ Q, {y_j} ~ P
- log G is an RKHS with Mercer kernel k(x, x′) = ⟨Φ(x), Φ(x′)⟩; take I(g) = ‖log g‖_H:
  ĝ_n = argmin_{g∈G} ∫ g dQ_n − ∫ log g dP_n + (λ_n/2) ‖log g‖²_H
- Convex dual formulation:
  α* := argmax_{α>0} −Σ_{i=1}^n (α_i log α_i + α_i log(n/e)) − (1/(2λ_n)) ‖Σ_{i=1}^n α_i Φ(x_i) − (1/n) Σ_{j=1}^n Φ(y_j)‖²
- the estimate D̂_K(P, Q) is again obtained by plugging the ĝ_n recovered from α* into the variational objective
43
(Figure: KL estimates vs. sample size for four univariate problems — KL(Beta(1,2), Unif[0,1]) (truth 0.1931), KL(½N_t(0,1) + ½N_t(1,1), Unif[−5,5]) (truth 0.4146), KL(N_t(0,1), N_t(4,2)) (truth 1.9492), and KL(N_t(4,2), N_t(0,1)) (truth 4.7201) — comparing estimators M1 and M2 with the WKV method for various settings of σ, λ, and s)
44
(Figure: KL estimates vs. sample size for four multivariate problems — KL(N_t(0,I₂), N_t(1,I₂)) (truth 0.9593), KL(N_t(0,I₂), Unif[−3,3]²) (truth 0.7777), KL(N_t(0,I₃), N_t(1,I₃)) (truth 1.4390), and KL(N_t(0,I₃), Unif[−3,3]³) (truth 1.1666) — comparing M1, M2, and WKV)
45
Conclusion
- nonparametric decentralized detection algorithm
  - use of surrogate loss functions and marginalized kernels
- study of surrogate loss functions and divergence functionals
  - correspondence of losses and divergences
  - M-estimator of divergences (e.g., Kullback-Leibler divergence)
46