Decentralized decision making with spatially distributed data


Decentralized decision making with spatially distributed data
XuanLong Nguyen, Department of Statistics, University of Michigan
Acknowledgement: Michael Jordan, Martin Wainwright, Ram Rajagopal, Pravin Varaiya


Decentralized systems and spatial data
Many applications and systems involve collecting and transmitting large volumes of data through a distributed network (sensor signals, image streams, network system logs, etc.)
Two interacting and conflicting forces:
- statistical inference and learning, arising from spatial dependence
- decentralized communication and computation
An extensive literature deals with each of these two aspects separately. We are interested in decentralized learning and decision-making methods for spatially distributed data: computation/communication efficiency vs. statistical efficiency

Example 1: sensor network for detection
[Figure: light source and sensor field]
Set-up: wireless network of tiny sensor motes, each equipped with light/humidity/temperature sensing capabilities; measurement of signal strength ([0, 1024] in magnitude, or 10 bits)
Common goal: is the light source inside the green region or not?

Example 2: sensor network for traffic monitoring
Multiple goals: different sensors measure different locations

Two types of set-ups
- aggregation of data to make a good decision toward a common goal: all sensors collect measurements of the same phenomenon and report their messages to a fusion center
- completely distributed network of sensors, each making separate decisions for its own goal: different sensors have statistically dependent measurements about one or more phenomena of interest

Talk outline
Set-up 1: decentralized detection (classification) problem
- algorithmic and modeling ideas (marginalized kernels, convex optimization)
- statistical properties (use of surrogate loss and f-divergence)
Set-up 2: completely distributed decision-making for multiple sensors
- algorithmic ideas (message-passing in graphical models)
- statistical tools (from sequential analysis)

A decentralized detection system
[Figure: sensors → communication channel → fusion center → decision rule]
Decentralized setting: communication constraints between sensors and fusion center (e.g., bit constraints)
Goal: design decision rules for sensors and fusion center
Criterion: minimize probability of incorrect detection

Problem set-up
[Figure: label Y ∈ {±1}; sensor signals X_1, X_2, X_3, ..., X_S; quantizers Q_1, Q_2, Q_3, ..., Q_S; messages Z_1, Z_2, Z_3, ..., Z_S; fusion rule γ(Z_1, ..., Z_S)]
X ∈ {1, ..., M}^S, Z ∈ {1, ..., L}^S, with L ≪ M
Problem: given training data (x_i, y_i), i = 1, ..., n, find the decision rules (Q_1, ..., Q_S; γ) so as to minimize the detection error probability P(Y ≠ γ(Z_1, ..., Z_S)).

Decentralized detection
General set-up: data are (X, Y) pairs, assumed iid for simplicity, where Y ∈ {0, 1}
- given X, let Z = Q(X) denote the covariate vector, where Q ∈ Q, with Q some set of random mappings, namely quantizers
- a family {γ(·)}, where γ is a discriminant decision function lying in some (nonparametric) family Γ
Problem: find a decision (Q, γ) that minimizes the probability of error P(Y ≠ γ(Z))
Many problems have a similar formulation:
- decentralized compression and detection
- feature selection, dimensionality reduction
- the problem of sensor placement

Perspectives
Signal processing literature:
- everything is assumed known except for Q; the problem is to find Q subject to network system constraints
- maximization of an f-divergence (e.g., Hellinger distance, Chernoff distance)
- basically a heuristic literature from a statistical perspective (plug-in estimation), with supporting arguments from asymptotics
Statistical learning literature:
- decision-theoretic flavor
- Q is assumed known and the problem is to find γ
- this is done via minimization of a convex surrogate loss (e.g., boosting, logistic regression, support vector machine)

Overview of our approach
- Treat as a nonparametric joint learning problem: estimate both Q and γ, subject to constraints from a distributed system
- Use kernel methods and convex surrogate loss functions; tools from convex optimization to derive an efficient algorithm
- Exploit a correspondence between surrogate losses and divergence functionals to obtain consistency of the learning procedure

Kernel methods for classification
Classification: learn γ(z) that predicts label y
- K(z, z′) is a symmetric positive semidefinite kernel function, a natural choice of basis function for spatially distributed data
- feature space H in which K acts as an inner product, i.e., K(z, z′) = ⟨Ψ(z), Ψ(z′)⟩
A kernel-based algorithm finds a linear function in H, i.e., γ(z) = ⟨w, Ψ(z)⟩
Advantages:
- optimizing over kernel function classes is computationally efficient
- the solution γ is represented in terms of kernels only: γ(z) = Σ_{i=1}^n α_i K(z_i, z)
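The kernel representation of the decision function can be sketched in a few lines. This is an illustrative toy (not the talk's implementation): an RBF kernel stands in for the base kernel, and the coefficients `alpha`, which would normally come from minimizing a surrogate loss, are hand-set.

```python
import numpy as np

# Sketch: a kernel decision function gamma(z) = sum_i alpha_i * K(z_i, z).
# The RBF kernel and the hand-set coefficients below are illustrative
# assumptions; in the talk, alpha would be learned from training data.

def rbf_kernel(z1, z2, bandwidth=1.0):
    """Symmetric positive semidefinite kernel K(z, z')."""
    z1, z2 = np.atleast_2d(z1), np.atleast_2d(z2)
    sq = ((z1[:, None, :] - z2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * bandwidth ** 2))

def gamma(z, train_z, alpha, bandwidth=1.0):
    """Decision function represented in terms of kernels only."""
    return rbf_kernel(np.atleast_2d(z), train_z, bandwidth) @ alpha

# toy usage: two training messages with coefficients of opposite sign
train_z = np.array([[0.0], [1.0]])
alpha = np.array([1.0, -1.0])
score = gamma([[0.0]], train_z, alpha)  # positive near the class-(+1) example
```

The sign of `score` would serve as the predicted label, exactly as with the linear function ⟨w, Ψ(z)⟩ in feature space.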

Convex surrogate loss functions
[Figure: zero-one, hinge, logistic, and exponential losses φ(α) plotted against the margin value α]
Minimize the (regularized) empirical φ-risk Êφ(Y γ(Z)):
min_{γ ∈ H} Σ_{i=1}^n φ(y_i γ(z_i)) + (λ/2) ‖γ‖²
- (z_i, y_i), i = 1, ..., n, are training data in Z × {±1}
- φ is a convex loss function (an upper bound of the 0-1 loss)
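The upper-bound property from the figure is easy to check numerically. A small sketch (the base-2 scaling of the logistic loss is a standard convention that makes it pass through 1 at margin 0, so it dominates the 0-1 loss):

```python
import math

# Sketch: the hinge, logistic, and exponential surrogates are convex
# upper bounds of the 0-1 loss, as functions of the margin a = y * gamma(z).

def zero_one(a):    return 1.0 if a < 0 else 0.0
def hinge(a):       return max(0.0, 1.0 - a)
def logistic(a):    return math.log(1.0 + math.exp(-a)) / math.log(2.0)
def exponential(a): return math.exp(-a)

for a in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    assert hinge(a) >= zero_one(a)
    assert logistic(a) >= zero_one(a)
    assert exponential(a) >= zero_one(a)
```

All three surrogates agree with the 0-1 loss in penalizing negative margins, but being convex they make the empirical risk minimization tractable.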

Stochastic decision rules at each sensor
[Figure: label Y; sensor signals X_1, ..., X_S quantized by Q(Z | X) into messages Z_1, ..., Z_S; fusion rule γ(Z), with K_z(z, z′) = ⟨Ψ(z), Ψ(z′)⟩ and γ(z) = ⟨w, Ψ(z)⟩]
- Approximate deterministic sensor decisions by stochastic rules Q(Z | X)
- Sensors do not communicate directly ⇒ factorization: Q(Z | X) = Π_{t=1}^S Q^t(Z^t | X^t)
- The overall decision rule is represented by γ(z) = ⟨w, Ψ(z)⟩

High-level strategy: joint optimization
Minimize over (Q, γ) an empirical version of Eφ(Y γ(Z))
Joint minimization:
- fix Q, optimize over γ: a simple convex problem
- fix γ, perform a gradient update for Q, sensor by sensor

High-level strategy: space of stochastic quantization rules Q
[Figure: the space Q of stochastic rules, containing the set Q_0 of deterministic rules]
- Q is the convex hull of the set of deterministic rules
- the optimal decision rule Q_0 is deterministic
- optimizing over deterministic rules is NP-hard

High-level strategy: alternative objective function
[Figure: objective functions over the space Q. Original function: inf_γ Eφ(Y γ(Z)); alternative function: inf_γ Eφ(Y f_Q(X)); the two agree at extreme (deterministic) points such as Q_0]

Approximating the empirical φ-risk
The regularized empirical φ-risk Êφ(Y γ(Z)) has the form:
G_0 = Σ_{i=1}^n Σ_z φ(y_i γ(z)) Q(z | x_i) + (λ/2) ‖w‖²
Challenge: even evaluating G_0 at a single point is intractable; it requires summing over L^S possible values of z
Idea: approximate G_0 by another objective function G, with G_0 = G for deterministic Q

Marginalizing over feature space
[Figure: original X space with f_Q(x) = ⟨w, Ψ_Q(x)⟩, mapped by Q(z | x) to the quantized Z space with γ(z) = ⟨w, Ψ(z)⟩]
The stochastic decision rule Q(z | x) maps between X and Z, and induces a marginalized feature map Ψ_Q from the base map Ψ (or a marginalized kernel K_Q from the base kernel K)




Marginalized feature space {Ψ_Q(x)}
Define a new feature space Ψ_Q(x) and a linear function over Ψ_Q(x):
Ψ_Q(x) = Σ_z Q(z | x) Ψ(z)  (marginalization over z)
f_Q(x) = ⟨w, Ψ_Q(x)⟩
The alternative objective function G is the φ-risk for f_Q:
G = Σ_{i=1}^n φ(y_i f_Q(x_i)) + (λ/2) ‖w‖²
Ψ_Q(x) induces a marginalized kernel over X:
K_Q(x, x′) := ⟨Ψ_Q(x), Ψ_Q(x′)⟩ = Σ_{z,z′} Q(z | x) Q(z′ | x′) K_z(z, z′)
Marginalization is taken over the message z conditioned on the sensor signal x

Marginalized kernels
- have been used to derive kernel functions from generative models (e.g., Tsuda, 2002)
- the marginalized kernel K_Q(x, x′) is defined as
K_Q(x, x′) := Σ_{z,z′} Q(z | x) Q(z′ | x′) K_z(z, z′)
with factorized distributions Q(z | x) Q(z′ | x′) and base kernel K_z(z, z′)
- if K_z(z, z′) can be decomposed into smaller components of z and z′, then K_Q(x, x′) can be computed efficiently (in polynomial time)
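The efficiency claim can be illustrated concretely. A minimal sketch, under two assumptions not fixed by the slides (binary messages per sensor, and a base kernel that decomposes as a sum of per-sensor kernels): with factorized Q, the double sum over all L^S message configurations collapses into one cheap sum per sensor.

```python
import numpy as np

# Sketch: with Q(z|x) = prod_t Q^t(z^t|x^t) and a decomposable base kernel
# K(z, z') = sum_t k_t(z^t, z'^t), the marginalized kernel
#   K_Q(x, x') = sum_{z,z'} Q(z|x) Q(z'|x') K(z, z')
# reduces to per-sensor sums, avoiding enumeration of L^S messages.

def marginalized_kernel(qx, qx2, base_k):
    """qx[t, l] = Q^t(z^t = l | x^t); base_k[t][l, l2] = k_t(l, l2)."""
    total = 0.0
    for t in range(qx.shape[0]):  # one quadratic-in-L sum per sensor
        total += qx[t] @ base_k[t] @ qx2[t]
    return float(total)

S, L = 3, 2
rng = np.random.default_rng(0)
qx = rng.dirichlet(np.ones(L), size=S)   # conditional message probabilities for x
qx2 = rng.dirichlet(np.ones(L), size=S)  # ... and for x'
base_k = [np.eye(L) for _ in range(S)]   # trivial per-sensor base kernels

val = marginalized_kernel(qx, qx2, base_k)
```

The reduction holds because, for each term k_t, all coordinates other than t marginalize to 1 under the product distributions.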



Centralized and decentralized functions
Centralized decision function obtained by minimizing the φ-risk: f_Q(x) = ⟨w, Ψ_Q(x)⟩
- f_Q has direct access to the sensor signal x
The optimal w also defines a decentralized decision function: γ(z) = ⟨w, Ψ(z)⟩
- γ has access only to the quantized version z
The decentralized γ behaves on average like the centralized f_Q: f_Q(x) = E[γ(Z) | x]

Optimization algorithm
Goal: solve the problem
inf_{w; Q} G(w; Q) := Σ_i φ(y_i ⟨w, Σ_z Q(z | x_i) Ψ(z)⟩) + (λ/2) ‖w‖²
Finding the optimal weight vector:
- G is convex in w with Q fixed
- solve the dual problem (a quadratically-constrained convex program) to obtain the optimal w(Q)
Finding optimal decision rules:
- G is convex in Q^t with w and all other {Q^r, r ≠ t} fixed
- efficient computation of the subgradient of G at the optimal (w(Q), Q)
Overall: efficient joint minimization by blockwise coordinate descent
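The alternating structure can be illustrated on a toy problem. This is only a schematic sketch, not the talk's dual-QP algorithm: one scalar sensor, a hypothetical one-parameter quantizer Q(z = 1 | x) = sigmoid(θx), a logistic surrogate, and plain gradient steps (via numerical differentiation, to keep the sketch short) alternated between the two blocks.

```python
import numpy as np

# Sketch of blockwise coordinate descent: alternate (1) gradient steps in w
# with the quantizer parameter theta fixed (a convex step), and (2) a
# gradient step in theta with w fixed. The parametrization of Q is an
# illustrative assumption, not the paper's.

def logistic_loss(margin):
    return np.log1p(np.exp(-margin))

def objective(w, theta, X, y, lam=0.1):
    q1 = 1.0 / (1.0 + np.exp(-theta * X))   # P(z = 1 | x), hypothetical quantizer
    psi = np.stack([1 - q1, q1], axis=1)    # marginalized feature map Psi_Q(x)
    f = psi @ w                             # f_Q(x) = <w, Psi_Q(x)>
    return logistic_loss(y * f).sum() + 0.5 * lam * w @ w

def block_coordinate_descent(X, y, steps=200, lr=0.1, eps=1e-5):
    w, theta = np.zeros(2), 0.5
    for _ in range(steps):
        # block 1: update w (numerical gradient, theta fixed)
        g = np.array([(objective(w + eps * e, theta, X, y)
                       - objective(w - eps * e, theta, X, y)) / (2 * eps)
                      for e in np.eye(2)])
        w = w - lr * g
        # block 2: update theta (w fixed)
        gt = (objective(w, theta + eps, X, y)
              - objective(w, theta - eps, X, y)) / (2 * eps)
        theta = theta - lr * gt
    return w, theta

X = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w, theta = block_coordinate_descent(X, y)
```

Each w-update is a descent step on a convex subproblem, mirroring the "fix Q, optimize γ" step; the θ-update mirrors the sensor-by-sensor gradient update of Q.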

Wireless network with Mica motes
[Figure: light source and sensor field]
- 5 × 5 = 25 tiny sensor motes, each equipped with a light receiver
- light signal strength requires 10 bits ([0, 1024] in magnitude)
- perform classification with respect to different regions, subject to bit constraints
- each problem has 25 training positions, 81 test positions

Wireless sensor network data (light signal)
[Figure: signal strength received by base sensors No. 1, 6, and 15 as a function of distance, and the mapping from pairwise sensor distance (inches) to signal strength]

Location estimation result
[Figure: localization with the third signal kernel over a 9 × 9 grid of test positions]
Localization error: Mean = 3.53, Median = 2.63, Std = 3.50
Compare to a well-known range-based method: (6.99, 5.28, 5.79)

Location estimation result (existing method)
[Figure: ranging localization algorithm over the same 9 × 9 grid of test positions]
Localization error: Mean = 6.99, Median = 5.28, Std = 5.79
Compare to our kernel-based learning method: (3.53, 2.63, 3.50)

Classification with Mica sensor motes
[Figure: test error across 25 classification problem instances for the centralized SVM (10-bit signal), the centralized naive Bayes classifier, and decentralized KQ with 1-bit and 2-bit messages]

Simulated sensor networks
[Figure: three graphical models over Y and X_1, ..., X_10: a naive Bayes net, a chain-structured network, and a spatially-dependent network]

Joint estimation method vs. decentralized LRT
[Figure: test error of the decentralized LRT vs. kernel quantization (KQ) on naive Bayes networks, chain-structured networks (1st and 2nd order), and fully connected networks]

Talk outline
decentralized detection (classification) problem
- algorithmic and modeling ideas (marginalized kernels, convex optimization)
- statistical properties (use of surrogate loss and f-divergence)
completely distributed decision making for multiple sensors
- algorithmic ideas (message-passing in graphical models)
- statistical tools (from sequential analysis)

Statistical properties of surrogate losses
- recall that our algorithm essentially solves min_{γ,Q} Eφ(Y, γ(Z))
- does this also imply optimality in the sense of the 0-1 loss?
- the answer lies in the correspondence between loss functions and divergence functionals

Intuitions about loss functions and divergences
- loss functions quantify our decision rules: the sensor messages, and the classifier at the fusion center
- divergences quantify the distance (separation) between two probability distributions (populations of data)
- the best sensor messages and classifier are the ones that best separate the two populations of data (corresponding to the two class labels Y = ±1)
- thus, loss functions and divergences are duals of one another: minimizing a loss function is equivalent to maximizing an associated divergence


f-divergence (Ali-Silvey distance)
The f-divergence between two densities μ and π is given by
I_f(μ, π) := ∫ π(z) f(μ(z)/π(z)) dν,
where f : [0, +∞) → R ∪ {+∞} is a continuous convex function
- Kullback-Leibler divergence: f(u) = u log u; I_f(μ, π) = Σ_z μ(z) log(μ(z)/π(z))
- variational distance: f(u) = |u − 1|; I_f(μ, π) := Σ_z |μ(z) − π(z)|
- Hellinger distance: f(u) = (1/2)(√u − 1)²; I_f(μ, π) := (1/2) Σ_z (√μ(z) − √π(z))²
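For discrete distributions, the three examples fall out of one generic routine. A minimal sketch:

```python
import math

# Sketch: computing I_f(mu, pi) = sum_z pi(z) * f(mu(z)/pi(z)) for discrete
# distributions, with the three generators listed above.

def f_divergence(mu, pi, f):
    return sum(p * f(m / p) for m, p in zip(mu, pi))

kl          = lambda u: u * math.log(u)
variational = lambda u: abs(u - 1.0)
hellinger   = lambda u: 0.5 * (math.sqrt(u) - 1.0) ** 2

mu = [0.5, 0.3, 0.2]
pi = [0.2, 0.3, 0.5]

d_kl  = f_divergence(mu, pi, kl)           # equals sum mu log(mu/pi)
d_var = f_divergence(mu, pi, variational)  # equals sum |mu - pi|
d_hel = f_divergence(mu, pi, hellinger)    # equals 0.5 sum (sqrt(mu)-sqrt(pi))^2
```

Each generator recovers its familiar closed form once f(μ/π) is weighted by π and summed.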


Surrogate loss and f-divergence
[Figure: correspondence between the class of loss functions φ_1, φ_2, φ_3, ... and the class of f-divergences f_1, f_2, f_3, ...]
Measures on Z associated with Y = 1 and Y = −1:
μ(z) := P(Y = 1, z), π(z) := P(Y = −1, z)
Fixing Q, define the optimal risk for each φ-loss by optimizing over the discriminant decision function γ:
R_φ(Q) := min_γ Eφ(Y, γ(Z))


Link between φ-losses and f-divergences
Theorem (Nguyen et al., 2009):
(a) For any surrogate loss φ, there is an f-divergence for some lower-semicontinuous convex f such that R_φ(Q) = −I_f(μ, π). In addition, if φ is continuous and satisfies a (weak) regularity condition, f has to satisfy a number of conditions A.
(b) Conversely, if a convex f satisfies conditions A, there exists a convex surrogate loss φ that induces the corresponding f-divergence.
- the correspondence stems from a convex duality relationship
- we can construct all surrogate loss functions φ that induce a given f-divergence
- φ is parametrized using the conjugate dual of f

Examples of surrogate losses for a given f-divergence
[Figure: three families of surrogate losses plotted against the margin value]
- Left: corresponding to the Hellinger distance, including φ(α) = exp(−α) (used in the boosting algorithm)
- Middle: corresponding to the variational distance, including φ(α) = (1 − α)_+ (used in the support vector machine) and the 0-1 loss φ(α) = 1(α < 0)
- Right: corresponding to the symmetric KL divergence, including φ(α) = e^{−α} − α − 1


A theory of equivalent surrogate loss functions
- two loss functions φ_1 and φ_2, corresponding to f-divergences induced by f_1 and f_2
- φ_1 and φ_2 are universally equivalent if for any P(X, Y) and mapping rules Q_A, Q_B, there holds:
R_{φ_1}(Q_A) ≤ R_{φ_1}(Q_B) ⇔ R_{φ_2}(Q_A) ≤ R_{φ_2}(Q_B)
Theorem 3: φ_1 and φ_2 are universally equivalent if and only if, for constants a, b ∈ R and c > 0,
f_1(u) = c f_2(u) + a u + b
- this result extends a theorem of Blackwell's, which is concerned only with f-divergences and the 0-1 loss, not with surrogate loss functions
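One direction of Theorem 3 is easy to see numerically: for probability vectors, Σ_z π(z)·(μ(z)/π(z)) = 1 and Σ_z π(z) = 1, so an affine shift of the generator changes every divergence value by the same constant, I_{cf + au + b}(μ, π) = c·I_f(μ, π) + a + b, leaving all comparisons between quantizers unchanged. A small sketch checking this identity on random distributions:

```python
import math, random

# Sketch: affinely related generators f2 = c*f1 + a*u + b produce
# divergences that differ by the constant a + b (after scaling by c),
# so they order any pair of quantizers identically.

def f_divergence(mu, pi, f):
    return sum(p * f(m / p) for m, p in zip(mu, pi))

f1 = lambda u: u * math.log(u)          # KL generator
a, b, c = 2.0, -1.0, 3.0
f2 = lambda u: c * f1(u) + a * u + b    # affinely related generator

rng = random.Random(0)
def random_dist(k=4):
    w = [rng.random() + 0.01 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

for _ in range(100):
    mu, pi = random_dist(), random_dist()
    lhs = f_divergence(mu, pi, f2)
    rhs = c * f_divergence(mu, pi, f1) + a + b
    assert abs(lhs - rhs) < 1e-9
```

The harder content of the theorem is the converse: only such affine relationships can preserve the ordering universally.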

Empirical risk minimization procedure
- let φ be a convex surrogate equivalent to the 0-1 loss
- (C_n, D_n) is a sequence of increasing function classes for (γ, Q)
- given i.i.d. data pairs (X_i, Y_i), i = 1, ..., n, our procedure learns:
(γ*_n, Q*_n) := argmin_{(γ,Q) ∈ (C_n, D_n)} Êφ(Y γ(Z))
- let R_bayes := inf_{(γ,Q) ∈ (Γ,Q)} P(Y ≠ γ(Z)) denote the optimal Bayes error
- our procedure is Bayes-consistent if R_bayes(γ*_n, Q*_n) − R_bayes → 0

Bayes consistency
Theorem: If
- ∪_{n=1}^∞ (C_n, D_n) is dense in the space of pairs of decision rules (γ, Q)
- the sequence (C_n, D_n) increases in size sufficiently slowly
then our procedure is consistent, i.e.,
lim_n R_bayes(γ*_n, Q*_n) − R_bayes = 0 in probability.
- the proof exploits the developed equivalence of the φ-loss and the 0-1 loss
- decomposition of the φ-risk into approximation error and estimation error

Brief summary
[Figure: sensors → communication channel → fusion center → decision rule]
Joint estimation: over the space of sensor messages and over the space of classifiers at the fusion center, subject to communication constraints
Challenges:
- the space of sensor messages is large, requiring a better understanding of optimal messages
- evaluation of the risk function is hard, requiring approximation methods
- the underlying problem is non-convex, requiring clever convexification

Other formulations of aggregation in decentralized systems
- moving from binary decisions to multi-category decisions (ongoing work)
- accounting for the sequential aspect of data (Nguyen et al., 2008)

Talk outline
Set-up 1: decentralized detection (classification) problem
- algorithmic and modeling ideas (marginalized kernels, convex optimization)
- statistical properties (use of surrogate loss and f-divergence)
Set-up 2: completely distributed decision-making for multiple sensors
- algorithmic ideas (message-passing in graphical models)
- statistical tools (from sequential analysis)

Failure detection for multiple sensors
Traffic-measuring sensors placed along the freeway network (Northern California)

Mean days to failure
- as many as 40% of sensors fail on a given day
- separating sensor failure from events of interest is difficult
- a multiple change point detection problem

Set-up and underlying assumptions
- m sensors labeled by U = {u_1, ..., u_m}
- each sensor u receives a sequence of data X_t(u) for t = 1, 2, ...
- neighboring and functioning sensors have correlated measurements
- a failed sensor's measurements are not correlated with its neighbors'
- each sensor u fails at time λ_u ~ π_u; the λ_u are a priori independent random variables
- the correlation statistic S_n(u, v) satisfies:
S_n(u, v) ~ f_0(· | u, v) iid for n < min(λ_u, λ_v), and S_n(u, v) ~ f_1(· | u, v) iid otherwise

Distribution of correlation with neighbors
[Figure: Left: a working sensor. Right: after failure]

Graphical model of change points
[Figure: Left: dependency graph of sensors. Right: graphical model of the random variables]

Detection rules are localized stopping rules
- the detection rule for u, denoted ν_u, is a stopping time, and depends on the measurements of u and its neighbors
- ν_u is a prediction of the true λ_u
- more precisely, for any t > 0: {ν_u ≤ t} ∈ σ({S_n(u, u′), u′ ∈ N(u), n ≤ t})

Performance metrics
- false alarm rate: PFA(ν_u) = P(ν_u < λ_u)
- expected failure detection delay: D(ν_u) = E[ν_u − λ_u | ν_u ≥ λ_u]
- problem formulation: min_{ν_u} D(ν_u) such that PFA(ν_u) ≤ α

Single change point detection (Shiryaev, 1978; Tartakovsky & Veeravalli, 2005)
The optimal rule is to threshold the posterior of λ_u given the data X:
ν_u(X) = inf{n : Λ_n > B_α}, where
Λ_n = P(λ_u ≤ n | X_1, ..., X_n) / P(λ_u > n | X_1, ..., X_n), and B_α = (1 − α)/α.
This rule satisfies:
PFA(ν_u(X)) ≤ α
D(ν_u(X)) ~ |log α| / (q_1(X) + d) as α → 0,
where q_1(X) = KL(f_1(X) ‖ f_0(X)), and d is the exponent of the geometric prior on the change point λ_u
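The Shiryaev rule admits a compact simulation. A minimal sketch, with illustrative assumptions not fixed by the slide (unit-variance Gaussian pre/post-change densities): with a geometric prior λ ~ Geom(ρ), the posterior odds Λ_n satisfy the standard recursion Λ_n = (Λ_{n−1} + ρ) · L(X_n) / (1 − ρ), where L = f_1/f_0 is the likelihood ratio.

```python
import random, math

# Sketch of the Shiryaev stopping rule: track the posterior odds
#   Lambda_n = P(lambda <= n | X_1..X_n) / P(lambda > n | X_1..X_n)
# via Lambda_n = (Lambda_{n-1} + rho) * L(X_n) / (1 - rho), and stop when
# Lambda_n > B_alpha = (1 - alpha) / alpha. Gaussian densities are an
# illustrative choice.

def likelihood_ratio(x, mu0=0.0, mu1=2.0):
    # f1(x)/f0(x) for N(mu1, 1) vs N(mu0, 1)
    return math.exp(mu1 * x - mu1 ** 2 / 2 - (mu0 * x - mu0 ** 2 / 2))

def shiryaev_stopping_time(xs, rho, alpha):
    B = (1.0 - alpha) / alpha
    odds = 0.0
    for n, x in enumerate(xs, start=1):
        odds = (odds + rho) * likelihood_ratio(x) / (1.0 - rho)
        if odds > B:
            return n
    return len(xs)

rng = random.Random(1)
rho, alpha, horizon, trials = 0.05, 0.01, 2000, 200
false_alarms, delays = 0, []
for _ in range(trials):
    lam = 1 + int(math.log(1 - rng.random()) / math.log(1 - rho))  # geometric prior
    xs = [rng.gauss(0.0 if t < lam else 2.0, 1.0) for t in range(1, horizon + 1)]
    nu = shiryaev_stopping_time(xs, rho, alpha)
    if nu < lam:
        false_alarms += 1
    else:
        delays.append(nu - lam)

pfa = false_alarms / trials
avg_delay = sum(delays) / max(len(delays), 1)
```

In this setting q_1 = KL(N(2,1) ‖ N(0,1)) = 2 nats and d = −log(1 − ρ), so the asymptotic delay |log α|/(q_1 + d) is only a few samples, which the simulation reflects.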


Two sensors case: a naive extension
[Figure: change points λ_u and λ_v; sensor u observes X and the shared link Z; sensor v observes Y and Z]
Idea: condition on X_1, ..., X_n and Z_1, ..., Z_n to compute the decision rule for u: ν_u(X, Z) measurable with respect to σ({X, Z}_1^n).
Theorem: This approach does not help, i.e., there is no improvement in asymptotic delay time over the single change point approach:
lim_{α→0} D(ν_u(X, Z)) = lim_{α→0} D(ν_u(X)).
- to predict λ_u, one needs to also use the information given by Y

Localized stopping time with message exchange
Main idea: u should use the information given by the shared link Z only if its neighbor v is also functioning
By combining with the information given by Z, the delay time is reduced:
D(ν_u(X)) ~ |log α| / (q_1(X) + d) is strictly greater than D(ν_u(X, Z)) ~ |log α| / (q_1(X) + q_1(Z) + d).

Localized stopping time with message exchange
Main idea: u should use the information given by the shared link Z only if its neighbor v is also functioning
- but u never knows for sure whether v works or fails, so...
- u should use the information given by the shared link Z only if u thinks its neighbor v is also functioning
- u thinks neighbor v is functioning if v thinks so too, using the information given by Z as well as Y
[Figure: sensors u and v exchanging a message over the shared link Z]


The protocol, continued
- each sensor uses all links (variables) from sensors that have not yet declared failure
- if a sensor v raises a flag to declare that it has failed, then v broadcasts this information to its neighbor(s), who promptly drop v from their lists of neighbors
Formally, for two sensors:
- stopping rule for u, using only X: ν_u(X)
- stopping rule for u, using both X and Z: ν_u(X, Z)
- similarly, for sensor v: ν_v(Y) and ν_v(Y, Z)
- then the overall rule for u is:
ν̃_u(X, Y, Z) = ν_u(X, Z) I(ν_u(X, Z) ≤ ν_v(Y, Z)) + max(ν_u(X), ν_v(Y, Z)) I(ν_u(X, Z) > ν_v(Y, Z)).
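The combination rule above is a simple case analysis. A small sketch with hypothetical stopping-time values (in the actual protocol each value would come from a Shiryaev-type rule on the indicated data streams):

```python
# Sketch of the two-sensor combination rule:
#   nu_tilde_u = nu_u(X, Z)               if nu_u(X, Z) <= nu_v(Y, Z)
#              = max(nu_u(X), nu_v(Y, Z)) otherwise

def combined_stopping_time(nu_u_xz, nu_u_x, nu_v_yz):
    if nu_u_xz <= nu_v_yz:
        # u stops first: v still looked functional, so keep the faster
        # rule that exploits the shared link Z
        return nu_u_xz
    # v declared failure first: fall back to u's own data X, but never
    # stop before v's declaration
    return max(nu_u_x, nu_v_yz)

print(combined_stopping_time(15, 14, 12))  # v declares at 12 -> max(14, 12) = 14
```

The fallback branch is what removes v's now-uninformative link Z from u's decision, matching the "drop v from the list of neighbors" step.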



Performance bounds: theorem (Rajagopal et al., 2008)
The detection delay for u satisfies, for some constant δ_α ∈ (0, 1):
D(ν̃_u) ≤ D(ν_u(X, Z))(1 − δ_α) + D(ν_u(X)) δ_α
- δ_α = probability that u's neighbor declares failure before u
- for sufficiently small α, there holds: D(ν̃_u) < D(ν_u(X))
The false alarm rate for u satisfies:
PFA(ν̃_u) < 2α + ξ(ν̃_u)
- ξ(ν̃_u) is termed the confusion probability: the probability that u thinks v has not failed while, in fact, v already has:
ξ(ν̃_u) = P(ν̃_u ≤ ν_v, λ_v ≤ ν̃_u ≤ λ_u)
- under certain conditions, ξ(ν̃_u) = O(α)

Two-sensor network: effects of message passing
[Figure: detection delay time vs. the ratio of informations q_1(X)/q_1(Z). Left: evaluated by simulations. Right: predicted by our theory]

Number of sensors vs. detection delay time
[Figure: fully connected network. Left: α = .1. Right: α = 10⁻⁴ (theory predicts well!)]

False alarm rates
[Figure: fully connected network. Left: selected false alarm rate vs. actual rate. Right: number of sensors vs. actual rate]

Effects of network topology (and spatial dependency)
[Figure: grid network. Left: number of sensors vs. actual false alarm rate. Right: number of sensors vs. actual detection delay]

Summary
- aggregation of data to make a good decision toward the same goal: how to jointly learn local messages and a global detection decision, subject to the distributed constraints of the system?
- decision-making with multiple and spatially dependent goals: how to devise efficient message-passing schemes that utilize statistical dependency?
- tools from convex optimization, statistical modeling, and asymptotic analysis