Adaptive Multi-Modal Sensing of General Concealed Targets. Lawrence Carin, Balaji Krishnapuram, David Williams, Xuejun Liao and Ya Xue. Department of Electrical & Computer Engineering, Duke University, Durham, NC 2778-29. lcarin@ee.duke.edu
Outline Review of semi-supervised statistical classifiers and graph-based prior Extension of semi-supervised classifiers to a multi-sensor setting: Bayesian co-training Active multi-sensor sensing - Selection of those members of the unlabeled set for which acquisition of labels would be most informative - Selection of those members of the unlabeled set for which new sensors should be deployed and new data acquired Concept drift: Adaptive sensing when the statistics of the unlabeled data change or drift from those of the original labeled data Future work
Nomenclature
Labeled data: set of N_L feature vectors x_n for which the associated label is known, denoted l_n ∈ {0,1} for the binary case, thereby yielding the set D_L = {x_n, l_n}, n = 1, ..., N_L
Unlabeled data: the set of N_U feature vectors for which the associated labels are unknown, yielding the set D_U = {x_n}, n = N_L + 1, ..., N_L + N_U. This is the data to be classified.
Supervised algorithm: classifier that is designed using D_L and tested on D_U
Semi-supervised algorithm: classifier that is designed using D_L and D_U; used to estimate the labels of D_U
Motivation: FOPEN (legend: target / clutter)
We typically have far more unlabeled data than labeled examples (N_U >> N_L)
Seek labels for the most-informative feature vectors in D_U
Typically, classification is performed on isolated unlabeled examples, one at a time
We wish to classify members of D_U using all information from D_U and D_L simultaneously
Motivation: SAR-Based Mine Detection
Tremendous amount of unlabeled data, very limited labeled data
Classification is far easier when placed in the context of the entire image, vis-à-vis image chips
Motivation: UXO Detection
Tremendous amount of unlabeled data, very limited labeled data
Classification is far easier when placed in the context of the entire image, vis-à-vis image chips
Logistic Link Function Design and a Data-Dependent Prior
Assume labels l_n ∈ {0,1}, and define the kernel-based function
y(x; w) = Σ_{b=1}^{N} w_b K(x, x_b) + w_0 = Φ^T(x) w
The probability that x is associated with l_n = 1 is expressed as
p(l_n = 1 | x, w) = σ[y(x; w)] = exp[Φ^T(x) w] / {1 + exp[Φ^T(x) w]}
For the N_L labeled examples, we wish to maximize the log-likelihood
l(w) = Σ_{n=1}^{N_L} log p(l_n | x_n, w)
subject to a prior p(w | D_L, D_U) on the weights, with the prior dependent on all data (labeled and unlabeled), via a graph-based construct
The classifier weights w_MAP are therefore set at
w_MAP = argmax_w [ l(w; D_L) + log p(w | D_L, D_U) ]
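The kernel logistic link above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the RBF kernel width and the basis set are assumptions, and the bias weight w_0 is handled by prepending a constant 1 to Φ(x).

```python
import numpy as np

# Illustrative sketch of the kernel logistic link p(l=1|x,w) = sigma(Phi^T(x) w).
# The RBF kernel and basis points are assumptions for demonstration only.
def rbf(x, b, sigma=1.0):
    return np.exp(-np.sum((x - b) ** 2) / sigma**2)

def phi(x, basis):
    """Phi(x) = [1, K(x, b_1), ..., K(x, b_N)]^T (bias term prepended)."""
    return np.array([1.0] + [rbf(x, b) for b in basis])

def p_label_one(x, w, basis):
    """p(l=1 | x, w) = exp(Phi^T w) / (1 + exp(Phi^T w))."""
    y = phi(x, basis) @ w
    return np.exp(y) / (1.0 + np.exp(y))

basis = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
w = np.zeros(3)  # with zero weights the classifier is maximally uncertain
print(p_label_one(np.array([0.5, 0.5]), w, basis))  # -> 0.5
```

With all-zero weights y(x; w) = 0, so the logistic link returns exactly 0.5, the "no information" output.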
Graph-Based Design of Data-Dependent Prior
Use a kernel k(x_i, x_j) to define the similarity of x_i and x_j. Note: this kernel is used to define the graph, and need not be the same as that employed in the classifier.
W_ij = k(x_i, x_j) is large when the two vectors are similar, e.g., the radial basis function W_ij = k(x_i, x_j) = exp[-||x_i - x_j||^2 / σ^2]
Let the vectors x_i constitute the nodes of the graph and let f(x_i) = Φ^T(x_i) w be a function on the graph
We seek to minimize the energy function En(f) = (1/2) Σ_{i,j} W_ij [f(x_i) - f(x_j)]^2
Large W_ij ⇒ f(x_i) ≈ f(x_j)
Defining f = [f(x_1), f(x_2), ..., f(x_{N_L + N_U})]^T, it is easy to show that En(f) = f^T Δ f, where Δ is the combinatorial Laplacian Δ = D - W, the matrix W is defined by W_ij, and D is a diagonal matrix whose ith element is d_i = Σ_j W_ij
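The identity En(f) = f^T Δ f can be checked numerically. The sketch below (illustrative data, RBF width σ = 1) builds W, the combinatorial Laplacian Δ = D - W, and verifies that the quadratic form reproduces the pairwise energy.

```python
import numpy as np

# Sketch of the graph-based energy; the points and kernel width are
# illustrative, not from the original data.
def rbf_weights(X, sigma=1.0):
    """Affinity matrix W_ij = exp(-||x_i - x_j||^2 / sigma^2)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma**2)

def laplacian(W):
    """Combinatorial Laplacian Delta = D - W, with D_ii = sum_j W_ij."""
    return np.diag(W.sum(axis=1)) - W

def energy(f, W):
    """En(f) = (1/2) sum_ij W_ij (f_i - f_j)^2."""
    return 0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2)

X = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0]])  # two close points, one far
W = rbf_weights(X)
L = laplacian(W)
f = np.array([1.0, 0.9, -1.0])
# The quadratic form f^T Delta f reproduces the pairwise energy exactly.
assert np.isclose(f @ L @ f, energy(f, W))
```

The close pair (large W_ij) is assigned similar function values, so the energy stays small; pushing their values apart would raise f^T Δ f.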
Graph-Based Design of Data-Dependent Prior - 2
Using En(f) = f^T Δ f, finding the f that minimizes En(f) corresponds to a MAP estimate of f from the Gaussian random field density function
p(f) = Z_β^{-1} exp[-β En(f)] = Z_β^{-1} exp[-β f^T Δ f]
This gives us a prior on f, which is defined through f(x_i) = Φ^T(x_i) w, and therefore f = A w with
A = [Φ(x_1), Φ(x_2), ..., Φ(x_{N_U + N_L})]^T
We therefore have a prior on our model weights w, also represented by a Gaussian random field:
p(w | x_1, x_2, ..., x_{N_U + N_L}) = N(0, (β Ã)^{-1}) with Ã = A^T Δ A
We now have a prior on the weights w applied to the labeled data D_L, accounting for the inter-relationships between all data D_L and D_U
The MAP solution for w, w_MAP = argmax_w [ l(w; D_L) + log p(w | D_L, D_U) ], is solved for via an EM algorithm
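The change of variables f = A w turns the field prior into a Gaussian prior on the weights with precision β Ã = β A^T Δ A. A small sketch (random illustrative A and W; the ridge term is an assumption added only to keep the precision invertible when Ã is rank-deficient):

```python
import numpy as np

# Sketch: with f = A w, exp(-beta f^T Delta f) becomes a zero-mean Gaussian
# prior on w with precision beta * A^T Delta A. All matrices are illustrative.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))                 # rows are Phi(x_i)^T over all data
W = np.abs(rng.normal(size=(6, 6)))
W = 0.5 * (W + W.T)                         # symmetric nonnegative affinities
Delta = np.diag(W.sum(axis=1)) - W          # combinatorial Laplacian (PSD)
beta = 2.0
A_tilde = A.T @ Delta @ A                   # weight-space precision (up to beta)
prec = beta * A_tilde + 1e-6 * np.eye(3)    # small ridge: assumption for invertibility
cov = np.linalg.inv(prec)                   # prior covariance of w
```

Because Δ is positive semi-definite, so is Ã; the prior therefore penalizes weight vectors whose induced f varies rapidly across the graph, without preferring any particular sign of f.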
Graph-Based Design of Data-Dependent Prior: Intuition
The Gaussian field prior on f essentially prefers functions that vary smoothly across the graph, as opposed to functions that vary rapidly
In our case, we prefer to have the posterior probability of belonging to class "+1" vary smoothly across the neighboring vertices of the graph
[Figure: decision surface based on labeled data (supervised) vs. decision surface based on labeled & unlabeled data (semi-supervised)]
UXO Sensing: JPG-V
Outline Review of semi-supervised statistical classifiers and graph-based prior Extension of semi-supervised classifiers to a multi-sensor setting: Bayesian co-training Active multi-sensor sensing - Selection of those members of the unlabeled set for which acquisition of labels would be most informative - Selection of those members of the unlabeled set for which new sensors should be deployed and new data acquired Concept drift: Adaptive sensing when the statistics of the unlabeled data change or drift from those of the original labeled data Future work
Extension of Graph to Multiple Sensors
[Figure: graph for feature vectors from Sensor One; graph for feature vectors from Sensor Two; items for which features are available from both sensors]
Assume M sensors are available, and S_n represents the subset of sensors deployed for item n
For item n we have features {x_n^{(m)}, m ∈ S_n}
Build a graph-based prior for the feature vectors from each of the individual sensor types
How do we connect features from multiple sensors when available for a given item n?
To simplify the subsequent discussion, we assume that we have only two sensors
Bayesian Co-Training
[Figure: graph for feature vectors from Sensor One; graph for feature vectors from Sensor Two; items for which features are available from both sensors]
To connect multiple feature vectors (graph nodes) for a given item, we impose a statistical prior favoring that the multiple feature vectors yield decision statistics that agree
Let f_1(x_n^{(1)}) = Φ_1^T(x_n^{(1)}) w_1 represent the decision function for the nth item, with data from Sensor One; f_2(x_n^{(2)}) = Φ_2^T(x_n^{(2)}) w_2 is defined similarly for Sensor Two
Let D_B represent those items for which data are available from both sensors; we seek parameters w_1 and w_2 that satisfy
min_{w_1,w_2} Σ_{n∈D_B} {σ[f_1(x_n^{(1)})] - σ[f_2(x_n^{(2)})]}^2 ≈ min_{w_1,w_2} Σ_{n∈D_B} [f_1(x_n^{(1)}) - f_2(x_n^{(2)})]^2 = min_{w_1,w_2} w^T C w
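The last equality, writing the disagreement penalty as a quadratic form in the stacked weights w = [w_1; w_2], can be verified directly. The feature matrices below are illustrative stand-ins for Φ_1, Φ_2 evaluated on the co-observed items D_B.

```python
import numpy as np

# Sketch: the co-training penalty over co-observed items equals w^T C w,
# where w stacks both sensors' weights. Phi1/Phi2 are illustrative.
rng = np.random.default_rng(0)
Phi1 = rng.normal(size=(5, 3))    # Sensor One features, one row per item in D_B
Phi2 = rng.normal(size=(5, 4))    # Sensor Two features for the same items
w1, w2 = rng.normal(size=3), rng.normal(size=4)

direct = np.sum((Phi1 @ w1 - Phi2 @ w2) ** 2)   # sum_n [f1 - f2]^2

G = np.hstack([Phi1, -Phi2])      # each row is [Phi1(x_n)^T, -Phi2(x_n)^T]
C = G.T @ G                       # co-training matrix
w = np.concatenate([w1, w2])
assert np.isclose(w @ C @ w, direct)
```

Since C = G^T G is positive semi-definite, the agreement penalty is a valid (improper) Gaussian log-prior on the stacked weights, which is what the next slide exploits.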
Multi-Sensor Graph-Based Prior
The condition min_{w_1,w_2} w^T C w may be expressed in terms of a Gaussian random field prior, the likelihood of which we wish to maximize
The cumulative graph-based multi-sensor prior on the model weights is
-log p(w_1, w_2 | λ_B, λ_1, λ_2) = -log p(w | λ_B, λ_1, λ_2) = λ_B w^T C w + λ_1 w_1^T Ã_1 w_1 + λ_2 w_2^T Ã_2 w_2 + K
The hyper-parameters λ_B, λ_1, λ_2 control the relative importance of the terms: the co-training prior based on multiple views of the same item, the smoothness prior within the Sensor One weights, and the smoothness prior within the Sensor Two weights
A Gamma hyper-prior is used for p(λ_B, λ_1, λ_2)
Total Likelihood to be Maximized
p(w | D_L, D_U) ∝ ∏_{n=1}^{N_L} {σ[Φ^T(x_n) w]}^{l_n} {1 - σ[Φ^T(x_n) w]}^{1-l_n} × ∫ p(w_1, w_2 | λ_B, λ_1, λ_2) p(λ_B, λ_1, λ_2) dλ
The first factor is driven by the labeled data, from Sensor One, Sensor Two, or both; the second is the graph-based prior based on the labeled and unlabeled data from Sensors One and Two
We solve for the weights in a maximum-likelihood sense, via an efficient EM algorithm with λ_B, λ_1, λ_2 serving as the hidden variables
Once the weights w_ML are so determined, the probability that example x is associated with label l_n is expressed as
p(l_n | x, w_ML) = {σ[Φ^T(x) w_ML]}^{l_n} {1 - σ[Φ^T(x) w_ML]}^{1-l_n}
Features of Bayesian Semi-Supervised Co-Training
Almost all previous fusion algorithms have assumed that all sensors are applied to each item of interest
Using Bayesian co-training, a subset of sensors may be deployed on any given item
Placed within a semi-supervised setting, whereby context and changing statistics are accounted for by utilizing the unlabeled data
[Figure: Sensor 1 / Sensor 2 graphs over labeled and unlabeled data]
Semi-Supervised Multi-Sensor Processing Example Results: WAAMD Hyperspectral & SAR Data
[Figure: Sensor 1 / Sensor 2 graphs over labeled and unlabeled data]
Hyperspectral and X-band SAR data, collected by NVESD at Yuma Proving Ground, in several different environments
Simple feature extraction performed on the hyperspectral & SAR data
Labeled examples selected randomly; classification performance presented for the remaining unlabeled examples
[Results: N_L = 386, N_U = 469]
[Results: N_L = 66, N_U = 477]
Discussion
We have demonstrated integration of AHI (hyperspectral) and Veridian (SAR) data, and improved performance with the semi-supervised classifier, when N_U >> N_L
Question: In this example, is the SAR boosting the hyperspectral performance, vice-versa, or both?
We have two coupled graphs, one each for the SAR and hyperspectral data, linked via the co-training prior
We can use the sub-classifier associated with each of these individual graphs to examine performance with or without the other sensor (e.g., SAR alone vis-à-vis the SAR classifier performance when also using information from the hyperspectral sensor)
Hyper-Spectral Alone vis-à-vis Hyper-Spectral Using SAR Information
[ROC curves, AHI #B62R2344952946rFlw & Veridian #B3R48rFvv: probability of detection vs. probability of false alarm. Test on unlabeled AHI data; 0.2/0.2/0.8 of 1/2/Overlap labeled (L/U=57/35); 99 testing data including 73 mines. Curves: Supervised (AHI only), Semi-supervised (AHI only), Semi-supervised (AHI and SAR)]
SAR Alone vis-à-vis SAR Using Hyper-Spectral Information
[ROC curves, AHI #B62R2344952946rFlw & Veridian #B3R48rFvv: probability of detection vs. probability of false alarm. Test on unlabeled SAR data; 0.2/0.2/0.8 of 1/2/Overlap labeled (L/U=57/35); 28 testing data including mines. Curves: Supervised (SAR only), Semi-supervised (SAR only), Semi-supervised (AHI and SAR)]
Outline Review of semi-supervised statistical classifiers and graph-based prior Extension of semi-supervised classifiers to a multi-sensor setting: Bayesian co-training Active multi-sensor sensing - Selection of those members of the unlabeled set for which acquisition of labels would be most informative - Selection of those members of the unlabeled set for which new sensors should be deployed and new data acquired Concept drift: Adaptive sensing when the statistics of the unlabeled data change or drift from those of the original labeled data Future work
Active Learning: Adaptive Multi-Modality Sensing
[Figure: Sensor 1 / Sensor 2 graphs over labeled and unlabeled data]
Q1: Which of the unlabeled data (from Sensor 1, Sensor 2, or both) would be most informative if the associated label could be determined (via personnel or an auxiliary sensor)?
Q2: For those examples for which only one sensor was deployed, which would be most informative if the other sensor were deployed to fill in the missing data?
A: Theory of optimal experiments
Type 1 Active Learning: Labeling Unlabeled Data
The graph-based prior does not change with the addition of new labeled examples
We assume that the hyper-parameters λ_B, λ_1, and λ_2 do not change with the addition of one new labeled example
The statistics of the model weights are approximated (Laplace approximation) as
p(w | D_L, D_U) ≈ N(w | ŵ, H^{-1})
where the precision matrix (Hessian) is expressed as H = -∇^2 [log p(w | D_L, D_U)]
To within an additive constant, the entropy of the Gaussian is -(1/2) log |H|
Type 1 Active Learning: Labeling Unlabeled Data
Expected decrease in entropy on w when the label is acquired for x_*:
I(w; l_*) = H(w) - E{H(w | l_*)} = (1/2) log[1 + p_*(1 - p_*) x_*^T H^{-1} x_*], with p_* = p(l_* = 1 | x_*, ŵ)
The term x_*^T H^{-1} x_* reflects the error bars of the logistic-regression model with regard to sample x_*
Where do we acquire labels?
- Those x_* for which the classifier is least certain, i.e., p(l_* = 1 | x_*, ŵ) ≈ 0.5
- Those x_* for which the logistic-regression model has the largest error bars
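The selection rule above is just "score every candidate, pick the argmax." A minimal sketch, in which the Hessian inverse and candidate features are illustrative placeholders rather than quantities from the actual classifier:

```python
import numpy as np

# Sketch of Type-1 query selection: score each unlabeled candidate by the
# expected entropy decrease I(w; l*) and pick the argmax. H_inv and the
# candidates are illustrative stand-ins.
def info_gain(x, p, H_inv):
    """I(w; l*) = 0.5 * log(1 + p(1-p) x^T H^{-1} x)."""
    return 0.5 * np.log1p(p * (1.0 - p) * (x @ H_inv @ x))

H_inv = np.eye(2)                       # stand-in for the Laplace covariance
candidates = [np.array([0.1, 0.1]), np.array([2.0, 2.0])]
probs = [0.5, 0.5]                      # both maximally uncertain: error bars decide
scores = [info_gain(x, p, H_inv) for x, p in zip(candidates, probs)]
best = int(np.argmax(scores))           # the large-error-bar candidate wins
```

With both candidates equally uncertain (p = 0.5), the formula reduces to preferring the candidate with the larger x^T H^{-1} x, i.e., the one in the least-explored direction of feature space.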
Type 2 Active Learning: Deploying Sensors to Fill In Missing Data
To simplify the discussion, assume we have two sensors, S_1 and S_2
Let x_i^{(1)} be the feature vector measured by S_1 for the ith item (target/non-target), with x_i^{(2)} defined similarly
Using a Laplace approximation, we have
p(w | D_L, D_U) ≈ N(w | ŵ, Δ_{L∪U}^{-1})
where the precision combines the graphs from the individual sensors, the co-training graph, and the labeled data from sensors S_1 and S_2
Type 2 Active Learning
Δ_{L∪U} = λ_1 Δ_1 + λ_2 Δ_2 + λ_B Δ_B + Σ_{i=1}^{L_1} σ(w_1^T x_i^{(1)}) σ(-w_1^T x_i^{(1)}) x_i^{(1)} x_i^{(1)T} + Σ_{i=1}^{L_2} σ(w_2^T x_i^{(2)}) σ(-w_2^T x_i^{(2)}) x_i^{(2)} x_i^{(2)T}
We only deploy sensors to add data to the unlabeled set, and therefore only λ_1 Δ_1 + λ_2 Δ_2 + λ_B Δ_B changes with the addition of the new data (i.e., we improve the quality of the graph-based prior)
We desire the expected change in the determinant of Δ_{L∪U}, but to make this computationally tractable we actually compute E{Δ_{L∪U}}
Use Gaussian-mixture models, based on all data, to estimate the needed density functions p(x^{(1)} | x^{(2)}) and p(x^{(2)} | x^{(1)})
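The expectation E{Δ_{L∪U}} requires estimating the unobserved sensor's feature from the observed one. As a simplified sketch, a single joint Gaussian stands in for the Gaussian-mixture model of the slides, and the closed-form conditional mean E[x^{(2)} | x^{(1)}] supplies the plug-in fill-in value; all data below are synthetic.

```python
import numpy as np

# Sketch of the "expected graph update" idea: before deploying Sensor 2 on an
# item seen only by Sensor 1, estimate the missing feature from a density fit.
# A single joint Gaussian is an illustrative stand-in for the GMM.
rng = np.random.default_rng(1)
pairs = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=500)
x1, x2 = pairs[:, 0], pairs[:, 1]

S = np.cov(pairs.T)                     # fitted joint covariance
def expected_x2(v1):
    """Conditional mean E[x2 | x1 = v1] = mu2 + (S12/S11)(v1 - mu1)."""
    return x2.mean() + (S[0, 1] / S[0, 0]) * (v1 - x1.mean())

v1 = 1.5                                # Sensor-1 measurement of the candidate item
x2_hat = expected_x2(v1)                # plug-in estimate of the missing Sensor-2 feature
# x2_hat would then enter the graph/co-training terms of Delta_{L+U},
# scoring how much this fill-in is expected to sharpen the prior.
```

The true model has regression slope 0.8, so x2_hat lands near 0.8 × 1.5 = 1.2; with the mixture model of the slides the same conditional-mean computation is done per mixture component and averaged.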
Active Selection of Labeled Examples from Unlabeled Data
Deployment of Sensor A to Fill In Missing Data from Sensor B
Consider WAAMD data, 2 potential fill-ins
[Figure: Sensor A and Sensor B coverage; regions where data could potentially be filled in]
[ROC curves, AHI #B62R23439272535rFlw & Veridian #B3R48rFvv: Pd vs. Pfa. 97 data missing each sensor; labeled data. Curves: Active querying 65, Random querying 65]
[ROC curves, AHI #B62R23439272535rFlw & Veridian #B3R48rFvv: Pd vs. Pfa. 97 data missing each sensor; labeled data. Curves: Active querying 97, Random querying 97]
[ROC curves, AHI #B62R23439272535rFlw & Veridian #B3R48rFvv: Pd vs. Pfa. 97 data missing each sensor; labeled data. Curves: Active querying 29, Random querying 29]
Outline Review of semi-supervised statistical classifiers and graph-based prior Extension of semi-supervised classifiers to a multi-sensor setting: Bayesian co-training Active multi-sensor sensing - Selection of those members of the unlabeled set for which acquisition of labels would be most informative - Selection of those members of the unlabeled set for which new sensors should be deployed and new data acquired Concept drift: Adaptive sensing when the statistics of the unlabeled data change or drift from those of the original labeled data Future work
What is Concept Drift?
[Figure: Sensor 1 / Sensor 2 graphs over labeled and unlabeled data]
A fundamental assumption in statistical learning algorithms is that all data are characterized by the same underlying statistics. For M sensors and label l_n: p(x^{(1)}, x^{(2)}, ..., x^{(M)} | l_n)
In sensing problems, background and/or sensor conditions may change, and therefore changes may manifest in the underlying statistics of the labeled and unlabeled data: concept drift
Can we design algorithms that adapt as the underlying concepts change, such that we can still utilize all available data?
Concept Drift
Assume we have unlabeled data from the environment of interest, E:
D_U = {x_n^{(m)}, m ∈ S_n}, n = 1, ..., N_U
for which we seek to estimate the unknown associated labels l_n
In addition, assume we have labeled data from a related but different environment, Ê:
D̂_L = {(x̂_n^{(m)}, l̂_n), m ∈ S_n}, n = N_U + 1, ..., N_U + N_L
where (x̂_n^{(m)}, l̂_n) are data and the associated label from the previous environment (sensor m)
[Figure: Feature 1 vs. Feature 2 scatter plots for environments Ê and E]
Concept Drift
[Figure: Feature 1 vs. Feature 2 scatter plots for environments Ê and E]
Problem: we have labeled landmine/FOPEN/underground-structure data from one environment, which we'd like to apply to a new but related environment
Define the probabilities p(l_n | l̂_n, D_U, D̂_L), for which we impose a Dirichlet conjugate prior; this allows us to incorporate prior knowledge for p(l | l̂)
In the subsequent discussion we consider a single sensor (M = 1) and binary labels (l ∈ {0,1}) to simplify the discussion
Concept Drift 2
[Figure: Feature 1 vs. Feature 2 scatter plots for environments Ê and E]
log p(w | D_U, D̂_L) = Σ_{n=1}^{N_L} { l̂_n log[σ(w^T x̂_n) μ_n + (1 - σ(w^T x̂_n))(1 - μ_n)] + (1 - l̂_n) log[σ(w^T x̂_n) ν_n + (1 - σ(w^T x̂_n))(1 - ν_n)] } - λ w^T Ã w + K
where μ_n = p(l_n = 1 | l̂_n = 1, D_U, D̂_L) and ν_n = p(l_n = 1 | l̂_n = 0, D_U, D̂_L), and λ w^T Ã w is the graph-based smoothness prior on the unlabeled data
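Each labeled term above mixes the new-environment classifier output with the drift probabilities. A minimal sketch of one such term, with illustrative values for μ_n, ν_n, and the classifier score:

```python
import numpy as np

# Sketch of one labeled term in the concept-drift log-likelihood: the old
# label l_hat is explained through the new-environment classifier output
# s = sigma(w^T x_hat) and the drift probabilities mu, nu. Values illustrative.
def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def drift_term(l_hat, s, mu, nu):
    """Contribution of one old-environment example to the log-likelihood."""
    if l_hat == 1:
        return np.log(s * mu + (1.0 - s) * (1.0 - mu))
    return np.log(s * nu + (1.0 - s) * (1.0 - nu))

s = sigma(2.0)   # new-environment p(l=1 | x_hat, w)
# With no drift (mu = 1, nu = 0) the term reduces to the usual logistic likelihood:
assert np.isclose(drift_term(1, s, mu=1.0, nu=0.0), np.log(s))
assert np.isclose(drift_term(0, s, mu=1.0, nu=0.0), np.log(1.0 - s))
```

The no-drift check is a useful sanity test: when old and new labels agree deterministically, the concept-drift likelihood collapses to ordinary supervised logistic regression, so the extra machinery only matters when μ_n, ν_n move away from their extremes.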
Concept Drift 3
[Figure: Feature 1 vs. Feature 2 scatter plots for environments Ê and E]
Weights are determined as before using an EM algorithm, with the parameters μ_n and ν_n playing the role of hidden variables (Dirichlet prior employed)
Can again do active learning: of the unlabeled examples, which would be most informative if the label could be acquired?
Solved as before within a Laplace approximation (Hessian, etc.)
Illustrative Toy Example
[Figure: logit-select active. Original labeled data; concept drift. Only the initial data are labeled.]
Illustrative Toy Example - 1
[Figure: logit-select active, iteration 1. First active labeling & classifier refinement.]
Illustrative Toy Example - 2
[Figure: logit-select active, iteration 2. Second active labeling & classifier refinement.]
Illustrative Toy Example - 3
[Figure: logit-select active, iteration 4. Fourth active labeling & classifier refinement.]
Example on Real Data: UXO Detection
Tremendous amount of unlabeled data, very limited labeled data
Classification is far easier when placed in the context of the entire image, vis-à-vis image chips
Results on JPG-V and Badlands Data
EMI and magnetometer sensors
Labeled data: JPG-V (6 UXO + 88 clutter)
Unlabeled data: Badlands (57 UXO + 435 clutter)
Kernel: direct kernel (weights applied to the feature components)
UXO features = [log(m_p), log(m_z), depth]; each feature is normalized to zero mean and unit variance
Five items actively selected for labeling from the Badlands data
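The feature preprocessing described on this slide is easy to sketch: take logs of the polarizability magnitudes and standardize each column. The raw values below are made up for illustration; m_p and m_z stand for the slide's magnetic-polarizability features.

```python
import numpy as np

# Sketch of the slide's UXO feature construction: [log(m_p), log(m_z), depth],
# each column standardized to zero mean and unit variance. Raw values invented.
raw = np.array([[2.0e-3, 1.0e-3, 0.3],
                [5.0e-2, 2.0e-2, 1.1],
                [8.0e-3, 4.0e-3, 0.6]])   # columns: m_p, m_z, depth
feats = np.column_stack([np.log(raw[:, 0]),
                         np.log(raw[:, 1]),
                         raw[:, 2]])
feats = (feats - feats.mean(axis=0)) / feats.std(axis=0)
# Each normalized column now has zero mean and unit variance.
```

Standardizing matters here because the direct kernel applies weights to the feature components: without it, the depth feature (order 1 m) and the raw polarizabilities (orders of magnitude smaller) would be weighted on incomparable scales.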
Account for Drift in Statistics
[ROC curves: probability of UXO detection vs. number of excavations; test data includes the actively labeled data. Curves: train classifier on labeled JPG-V plus five actively selected items from Badlands (logit-push-active, C=1e-5, 5 primary data; logit-active, 5 primary data) vs. train classifier on labeled JPG-V data only (logit)]
Future Work
The active learning for labeling and acquisition of new data ("fill in") is thus far myopic; we believe this can be extended to non-myopic active learning
We now have multiple actions one may take: (i) perform labeling of unlabeled data, or (ii) deploy a given sensor to fill in missing data on a given target. These will now be integrated into a general sensor and HUMINT management structure, accounting for deployment costs
We need to extend the multi-sensor fill-in active learning to the concept-drift framework
We have performed ML (EM algorithm) estimation of the model parameters; we are now employing ensemble/variational techniques to extract the full posterior on the model parameters (initial work discussed during the Workshop; slides available)
Deploy algorithms on actual hardware (robots), in collaboration with Quantum Magnetics, Inc.