Analysis Techniques: Multivariate Methods
Harrison B. Prosper
NEPPSR 2007
Outline
- Introduction
- Signal/Background Discrimination
- Fisher Discriminant
- Support Vector Machines
- Naïve Bayes
- Bayesian Neural Networks
- Decision Trees
Introduction
Most interesting data are intrinsically multivariate: x = (x1, x2, ..., xd). Example: single top quark data at the Tevatron, for the channels pp → tb and pp → tqb, are of high dimension d.
Example: DØ 1995 Top Quark Discovery
pp → tt → lepton + jets
[Figure: Aplanarity versus H_T (GeV) distributions for data, tt MC, multijet background, and W + 4 jets MC.]
Introduction
Points to note:
- Intuition based on analysis in one dimension often fails badly in spaces of high dimension.
- Non-linear systems are qualitatively different from linear ones.
- One should distinguish the problem to be solved, which generally falls within a broad category of similar problems, from the algorithm used to solve it.
Signal/Background Discrimination
Signal density: p(x, S) = p(x|S) p(S)
Background density: p(x, B) = p(x|B) p(B)
[Figure: the two densities in x-space, separated by a decision boundary y(x) = 0.]
Goal: minimize the misclassification rate.
Signal/Background Discrimination
Signal/background discrimination is optimal, that is, the error rate is minimized, when done using the Bayes discriminant

    r = p(x|S) p(S) / [ p(x|B) p(B) ]

or a function thereof, such as the probability p(S|x) of the signal S given x:

    p(S|x) = p(x|S) p(S) / [ p(x|S) p(S) + p(x|B) p(B) ] = r / (1 + r)
Signal/Background Discrimination
In practice, it is sufficient to use the discriminant

    D(x) = p(x|S) / [ p(x|S) + p(x|B) ]

because the relationship between P(S|x) and D(x) is one-to-one:

    P(S|x) = p(S) D(x) / [ p(S) D(x) + p(B) (1 − D(x)) ]
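To make these formulas concrete, here is a minimal Python sketch (not from the slides) that evaluates r, D(x), and P(S|x) for toy 1-D Gaussian likelihoods; the function name, priors, and toy densities are illustrative assumptions.

```python
# A minimal sketch (not from the slides): the Bayes discriminant r, the
# discriminant D(x), and the posterior P(S|x), evaluated for illustrative
# 1-D Gaussian likelihoods. Function and variable names are assumptions.
import numpy as np

def bayes_quantities(p_x_S, p_x_B, prior_S=0.5, prior_B=0.5):
    """p_x_S, p_x_B: likelihoods p(x|S), p(x|B) evaluated at the same points x."""
    r = (p_x_S * prior_S) / (p_x_B * prior_B)   # Bayes discriminant
    D = p_x_S / (p_x_S + p_x_B)                 # D(x): priors dropped
    P_S = r / (1.0 + r)                         # posterior probability of S
    return r, D, P_S

# Toy example: unit-width Gaussians centred at +1 (signal) and -1 (background)
x = np.linspace(-3.0, 3.0, 7)
p_x_S = np.exp(-0.5 * (x - 1.0)**2) / np.sqrt(2.0 * np.pi)
p_x_B = np.exp(-0.5 * (x + 1.0)**2) / np.sqrt(2.0 * np.pi)
r, D, P_S = bayes_quantities(p_x_S, p_x_B)
print(np.round(D, 3))   # rises from ~0 to ~1 as x moves from background towards signal
```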
- Fisher Discriminant
- Support Vector Machines
- Naïve Bayes
- Bayesian Neural Networks
- Decision Trees
Fisher Discriminant

    r = p(x|S) p(S) / [ p(x|B) p(B) ]

Take p(x|·) to be Gaussian, use y = ln r, and drop the constant:

    y = ln [ g(x; μ_S, Σ_S) / g(x; μ_B, Σ_B) ]

[Figure: a hyperplane w·x + b = 0 separating the regions w·x + b > 0 and w·x + b < 0, with normal vector w.]
Exercise 9: Show that y is linear in x for d-dimensional Gaussians with equal covariance matrices.
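As an aside not in the original slides, here is a minimal Python sketch of this construction for the equal-covariance case of Exercise 9: w = Σ⁻¹(μ_S − μ_B), with the remaining terms absorbed into the offset b. All names and the toy numbers are illustrative assumptions.

```python
# A minimal sketch of the Fisher (linear) discriminant for two Gaussian classes
# with a shared covariance matrix: y(x) = w.x + b is then the log likelihood
# ratio, including the prior term ln[p(S)/p(B)]. Names are illustrative.
import numpy as np

def fisher_discriminant(mu_S, mu_B, cov, prior_S=0.5, prior_B=0.5):
    cov_inv = np.linalg.inv(cov)
    w = cov_inv @ (mu_S - mu_B)
    b = -0.5 * (mu_S @ cov_inv @ mu_S - mu_B @ cov_inv @ mu_B) + np.log(prior_S / prior_B)
    return w, b

mu_S = np.array([1.0, 0.5])
mu_B = np.array([-1.0, -0.5])
cov  = np.array([[1.0, 0.3], [0.3, 2.0]])
w, b = fisher_discriminant(mu_S, mu_B, cov)
x = np.array([0.2, 0.1])
print("y =", w @ x + b)   # y > 0 favours S, y < 0 favours B
```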
Support Vector Machines
This is a powerful, and relatively new, generalization of the Fisher discriminant (Boser, Guyon and Vapnik, 1992).
Basic idea: data that are non-separable in d dimensions have a higher probability of being separable if mapped into a space of higher dimension H,

    h: R^d → R^H

Use a hyperplane to partition the data in that space:

    f(x) = w·h(x) + b
Support Vector Machines
Consider separable signal and background data, and suppose that:
- the green plane is given by w·x + b = 0
- the red plane is given by w·x + b = +1
- the blue plane is given by w·x + b = −1
Subtracting the blue equation from the red one, for a point x_red on the red plane and x_blue on the blue plane, gives w·(x_red − x_blue) = 2; normalizing by the length of w,

    ŵ·(x_red − x_blue) = 2/‖w‖
Support Vector Machines
The quantity m = ŵ·(x_red − x_blue) = 2/‖w‖, the distance between the red and blue planes, is called the margin. The best separation occurs when the margin is as large as possible. The plane that lies midway between the red and blue planes is called the optimal separating hyperplane.
Note: because m = 2/‖w‖, maximizing the margin is equivalent to minimizing ‖w‖.
Support Vector Machines
It is convenient to label the red dots y = +1 and the blue dots y = −1. For separable data the task is to minimize ‖w‖ subject to the constraints

    y_i (w·x_i + b) ≥ 1,   i = 1, ..., N

that is, to minimize

    L(w, b, α) = (1/2)‖w‖² − Σ_{i=1}^{N} α_i [ y_i (w·x_i + b) − 1 ]

where the α_i ≥ 0 are Lagrange multipliers, one for each constraint.
Support Vector Machines
When L(w, b, α) is minimized with respect to w and b, the Lagrangian L(w, b, α) can be transformed to its dual form

    E(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j (x_i · x_j)

In general, of course, data are not separable and the constraints have to be weakened to

    y_i (w·x_i + b) ≥ 1 − ξ_i

by introducing so-called slack variables ξ_i ≥ 0.
Support Vector Machines
Once the minimum has been found, the only non-zero coefficients α_i are those corresponding to points on the red and blue planes: that is, to the support vectors.
Support Vector Machines
We work, however, not in the space {x} but in the higher-dimensional space {h(x)} to which {x} is mapped. Each vector in {h(x)} is of the form h(x) = (h_1(x), h_2(x), ..., h_H(x)), and we can write

    E(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j h(x_i)·h(x_j)

Important: the scalar-product structure allows the use of kernels

    K(x_i, x_j) = h(x_i)·h(x_j)

to perform both the mapping and the scalar product simultaneously and efficiently, even if the space is of infinite dimension!
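As a hedged illustration, not part of the original lecture, the sketch below uses scikit-learn's SVC, which solves the soft-margin dual problem above; the RBF kernel plays the role of K(x_i, x_j), and the toy data set and parameter values are assumptions.

```python
# A short sketch using scikit-learn's SVC, which solves the soft-margin dual
# problem above; the RBF kernel stands in for K(x_i, x_j) = h(x_i).h(x_j).
# The toy data set is purely illustrative.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_sig = rng.normal(loc=+1.0, scale=1.0, size=(200, 2))   # "signal" points, y = +1
X_bkg = rng.normal(loc=-1.0, scale=1.0, size=(200, 2))   # "background" points, y = -1
X = np.vstack([X_sig, X_bkg])
y = np.hstack([np.ones(200), -np.ones(200)])

# C controls the penalty on the slack variables xi_i; kernel="rbf" corresponds
# to an implicitly infinite-dimensional feature space h(x).
clf = SVC(C=1.0, kernel="rbf", gamma="scale").fit(X, y)
print("number of support vectors per class:", clf.n_support_)
print("prediction for x = (0.5, 0.5):", clf.predict([[0.5, 0.5]]))
```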
Example: Mapping from R² to R³

    h: (x1, x2) → (z1, z2, z3) = (x1², √2 x1 x2, x2²)

so that

    h(x)·h(y) = (x1 y1 + x2 y2)² = (x·y)² = k(x, y)

Here we are mapping from 2-D x-space to 3-D z-space.
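A quick numerical check of this example (a sketch, assuming the standard quadratic-kernel mapping with the √2 factor): the explicit mapping followed by a scalar product gives the same number as the kernel (x·y)², with no mapping needed.

```python
# Numerical check of the mapping above: h(x) = (x1^2, sqrt(2) x1 x2, x2^2)
# gives h(x).h(y) = (x.y)^2, so the kernel replaces the explicit mapping.
import numpy as np

def h(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([0.3, -1.2])
y = np.array([2.0,  0.7])
print(h(x) @ h(y))    # explicit mapping, then scalar product in z-space
print((x @ y)**2)     # kernel evaluated in x-space: same number
```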
Naïve Bayes
Each density p(x|·) is approximated by

    p̂(x|·) = Π_{i=1}^{d} q(x_i|·)

where q(x_i|·) is the projection of the d-dimensional density p(x|·) onto axis i; that is, the q(·) are 1-D histograms or, better still, 1-D KDEs:

    q(x_i|·) = ∫ p(x|·) Π_{j ≠ i} dx_j
Naïve Bayes
The naïve Bayes estimate of D(x) is then given by

    D̂(x) = p̂(x|S) / [ p̂(x|S) + p̂(x|B) ]

In spite of its name, this method often works surprisingly well.
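A minimal Python sketch of this recipe, assuming simple fixed binning: the 1-D histograms stand in for the projections q(x_i|·), and all names and the toy data are illustrative assumptions rather than code from the lecture.

```python
# A minimal sketch of the naive Bayes discriminant built from 1-D histograms.
import numpy as np

def fit_marginals(X, bins, ranges):
    """Return one normalized 1-D histogram (density) per variable."""
    hists = []
    for i in range(X.shape[1]):
        h, edges = np.histogram(X[:, i], bins=bins, range=ranges[i], density=True)
        hists.append((h, edges))
    return hists

def density(x, hists):
    """p_hat(x) = product of the 1-D marginal densities q(x_i)."""
    p = 1.0
    for xi, (h, edges) in zip(x, hists):
        j = np.clip(np.searchsorted(edges, xi) - 1, 0, len(h) - 1)
        p *= h[j]
    return p

def naive_bayes_D(x, hists_S, hists_B):
    pS, pB = density(x, hists_S), density(x, hists_B)
    return pS / (pS + pB) if (pS + pB) > 0 else 0.5

# Toy usage with two correlated-free Gaussian variables per class
rng = np.random.default_rng(1)
X_S = rng.normal(+1, 1, size=(1000, 2))
X_B = rng.normal(-1, 1, size=(1000, 2))
ranges = [(-5, 5), (-5, 5)]
hists_S = fit_marginals(X_S, 20, ranges)
hists_B = fit_marginals(X_B, 20, ranges)
print(naive_bayes_D(np.array([0.5, 0.2]), hists_S, hists_B))
```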
Bayesian Neural Networks
We define a Bayesian neural network by the average

    f(x|D) = ∫ f(x, w) p(w|D) dw,   with p(w|D) ∝ likelihood × prior

which is approximated by

    f(x|D) ≈ (1/K) Σ_{k=1}^{K} f(x, w_k)

where the points w_k are sampled from p(w|D).
Bayesian Neural Networks u, a H d f(, w) = b + v tanh a + u v, b j j ij i j= i= n(, w) n(,w) = + ep[ f A BNN is just an average over neural network functions n(,w) (, w)]
A Simple Example
- Signal: tqb (muon channel)
- Background: Wbb (muon channel)
- NN model: (1, 5, 1)
- MCMC: 500 tqb + Wbb events; use the last 20 networks in a Markov chain of 500.
[Figure: distribution of HT_AllJets_MinusBestJets (scaled) for Wbb and tqb.]
A Simple Example
- Dots: p(S|H_T) = H_S/(H_S + H_B), where H_S and H_B are 1-D histograms (see the sketch below).
- Curves: individual networks n(H_T, w_k).
- Black curve: the average ⟨ n(H_T, w) ⟩.
[Figure: the discriminant versus HT_AllJets_MinusBestJets.]
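A short sketch of how such dots can be built, as an illustration of the bookkeeping rather than code from the analysis: fill 1-D histograms of the discriminating variable for signal and background and take the bin-by-bin ratio H_S/(H_S + H_B). The binning and variable range below are assumptions.

```python
# Bin-by-bin estimate of p(S|H_T) from signal and background histograms.
import numpy as np

def binned_posterior(sig_values, bkg_values, bins=25, x_range=(0.0, 250.0)):
    H_S, edges = np.histogram(sig_values, bins=bins, range=x_range)
    H_B, _     = np.histogram(bkg_values, bins=bins, range=x_range)
    total = H_S + H_B
    p = np.where(total > 0, H_S / np.maximum(total, 1), np.nan)  # p(S|H_T) per bin
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, p
```

These binned values can then be overlaid on the averaged network output to check that the BNN reproduces the histogram-based posterior.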
Decision Trees
- An ellipse, called a node, represents a variable on which a cut is to be applied.
- A line segment represents a cut.
- A box, called a leaf, represents the conjunction of a sequence of cuts (an if-then-else rule).
[Figure: example decision tree from MiniBooNE (Byron Roe).]
Decision Trees
In a decision tree the feature space is partitioned recursively in accordance with some criterion. Each leaf is a bin associated with a constant value of the function f(x) being modeled. For classification the values might be −1 or +1.
[Figure: partition of the (Energy (GeV), PMT Hits) plane into leaves, each labeled f(x) = −1 or f(x) = +1 together with its signal and background counts.]
Decision Trees
At each node one examines every variable and chooses the one with which to partition the space. In this example, it was determined that it was better to partition with PMT Hits first. At the next node, it proved better to partition using Energy.
[Figure: the same partition, with the value of D(x) shown for each leaf.]
Practical Issues: Decision Trees
- Trees tend to be unstable: a small change in the data can result in radically different partitions.
- Trees are a piece-wise constant approximation to the function f(x). This is not too bad for classification, but it is a problem where smoothness is needed, for example, when trying to model a density. However, one can average over many trees (boosting, bagging, random forests); a sketch of this follows below.
- Trees, however, are fast to grow.
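As an illustration of averaging over many trees, not part of the original slides, the sketch below uses scikit-learn's gradient-boosted decision trees on toy data; all names, parameters, and the data set are assumptions.

```python
# A short sketch of a boosted-decision-tree estimate of D(x) on toy data.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(3)
X_sig = rng.normal(+1, 1, size=(1000, 2))   # toy stand-ins for (Energy, PMT Hits)
X_bkg = rng.normal(-1, 1, size=(1000, 2))
X = np.vstack([X_sig, X_bkg])
y = np.hstack([np.ones(1000), np.zeros(1000)])

# Each tree is a piece-wise constant partition of the feature space; boosting
# averages many shallow trees to obtain a smoother estimate of D(x).
bdt = GradientBoostingClassifier(n_estimators=200, max_depth=3,
                                 learning_rate=0.1).fit(X, y)
print(bdt.predict_proba([[0.5, 0.5]])[:, 1])   # estimated signal probability
```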
Summary
There is, typically, much more information in the multivariate character of data than in their 1-D marginal densities (~1-D histograms). Therefore, it makes sense to analyze data using a truly multivariate method, of which many practical and powerful ones exist, usually with free software! Moreover, they can be used together with the powerful and general method of inference based on Bayes' theorem.
So first learn the mathematics, then... challenge the conservative old f***s!