Analysis Techniques Multivariate Methods

Analysis Techniques: Multivariate Methods
Harrison B. Prosper, NEPPSR 2007

Outline
- Introduction
- Signal/Background Discrimination
- Fisher Discriminant
- Support Vector Machines
- Naïve Bayes
- Bayesian Neural Networks
- Decision Trees

Introduction
Most interesting data are intrinsically multivariate: x = (x_1, x_2, ..., x_d). Example: single top quark data at the Tevatron, pp̄ → tb and pp̄ → tqb.

Example: the DØ 1995 top quark discovery, pp̄ → tt̄ → l + jets.
[Figure: Aplanarity versus H_T (GeV) for data, tt̄ MC, multijet, and W + 4 jets MC samples.]

Introduction
Points to note:
- Intuition based on analysis in one dimension often fails badly in spaces of high dimension.
- Non-linear systems are qualitatively different from linear ones.
- One should distinguish the problem to be solved, which generally falls within a broad category of similar problems, from the algorithm used to solve it.

Signal/Background Discrimination
Signal density: p(x, S) = p(x|S) p(S). Background density: p(x, B) = p(x|B) p(B).
[Figure: overlapping signal and background densities p(x), with decision boundaries y = 0.]
Goal: minimize the misclassification rate.

Signal/Background Discrimination
Signal/background discrimination is optimal, that is, the error rate is minimized, when done using the Bayes discriminant

r = p(x|S) p(S) / [p(x|B) p(B)]

or a function thereof, such as the probability p(S|x) of the signal S given x:

p(S|x) = r / (1 + r) = p(x|S) p(S) / [p(x|S) p(S) + p(x|B) p(B)]

Signal/Background Discrimination
In practice, it is sufficient to use the discriminant

D(x) = p(x|S) / [p(x|S) + p(x|B)]

because the relationship between P(S|x) and D(x) is one-to-one:

P(S|x) = 1 / { 1 + [p(B)/p(S)] [1/D(x) − 1] }
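A worked numerical example (the numbers are chosen here purely for illustration): suppose p(x|S) = 0.20, p(x|B) = 0.05, p(S) = 0.1, and p(B) = 0.9. Then D(x) = 0.20/(0.20 + 0.05) = 0.8, the Bayes discriminant is r = (0.20 × 0.1)/(0.05 × 0.9) ≈ 0.44, and both routes give the same posterior: P(S|x) = r/(1 + r) = 1/{1 + (0.9/0.1)(1/0.8 − 1)} ≈ 0.31.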

Methods:
- Fisher Discriminant
- Support Vector Machines
- Naïve Bayes
- Bayesian Neural Networks
- Decision Trees

Fisher Discriminant

r = p(x|S) p(S) / [p(x|B) p(B)]

Take p(x|·) to be a Gaussian g(x; μ, Σ), use y = ln r, and drop the constant terms; y then reduces to the linear form

y(x) = w·x + b

[Figure: the hyper-plane w·x + b = 0 separating the regions w·x + b > 0 and w·x + b < 0.]

Exercise 9: Show that y is linear in x for d-dimensional Gaussians with equal covariance matrices.
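As a companion to Exercise 9, here is a minimal numpy sketch (an illustration, not code from the lecture; the function name fisher_discriminant and the toy numbers are assumptions) of the resulting linear discriminant: for equal-covariance Gaussians, y(x) = w·x + b with w = Σ⁻¹(μ_S − μ_B).

```python
import numpy as np

def fisher_discriminant(mu_s, mu_b, cov, prior_s=0.5, prior_b=0.5):
    """Return (w, b) such that y(x) = w.x + b = ln r for equal-covariance Gaussians."""
    cov_inv = np.linalg.inv(cov)
    w = cov_inv @ (mu_s - mu_b)
    b = -0.5 * (mu_s @ cov_inv @ mu_s - mu_b @ cov_inv @ mu_b) + np.log(prior_s / prior_b)
    return w, b

# Toy 2-D example with an assumed common covariance matrix.
mu_s, mu_b = np.array([1.0, 0.5]), np.array([-1.0, -0.5])
cov = np.array([[1.0, 0.3], [0.3, 1.0]])
w, b = fisher_discriminant(mu_s, mu_b, cov)

x = np.array([0.2, 0.1])
print(w @ x + b)   # y(x) = ln r; y > 0 favours the signal hypothesis
```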

Support Vector Machines
This is a powerful, and relatively new, generalization of the Fisher discriminant (Boser, Guyon and Vapnik, 1992).

Basic idea: data that are non-separable in d dimensions have a higher probability of being separable if mapped into a space of higher dimension, h: R^d → R^H. Use a hyper-plane to partition the data:

f(x) = w·h(x) + b

Support Vector Machines
Consider separable signal and background data. Suppose that:
- the green plane is given by w·x + b = 0
- the red plane is given by w·x_1 + b = +1
- the blue plane is given by w·x_2 + b = −1

Subtracting the blue equation from the red gives w·(x_1 − x_2) = 2, and normalizing the vector w gives ŵ·(x_1 − x_2) = 2/|w|.

Support Vector Machines
The quantity m = ŵ·(x_1 − x_2), the distance between the red and blue planes, is called the margin. The best separation occurs when the margin is as large as possible. The plane that lies midway between the red and blue planes is called the optimal separating hyper-plane.

Note: because m = 2/|w|, maximizing the margin is equivalent to minimizing |w|.

Support Vector Machines
It is convenient to label the red dots y = +1 and the blue dots y = −1. For separable data the task is to minimize |w|² subject to the constraints y_i (w·x_i + b) ≥ 1, i = 1, ..., N. That is, to minimize

L(w, b, α) = (1/2)|w|² − Σ_{i=1}^{N} α_i [ y_i (w·x_i + b) − 1 ]

where the α_i ≥ 0 are Lagrange multipliers, one for each constraint.

Support Vector Machines
When L(w, b, α) is minimized with respect to w and b, the Lagrangian L(w, b, α) can be transformed to its dual form

E(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j (x_i · x_j)

In general, of course, data are not separable and the constraints have to be weakened to

y_i (w·x_i + b) ≥ 1 − ξ_i

by introducing so-called slack variables ξ_i.

Support Vector Machines
Once the minimum has been found, the only non-zero coefficients α_i are those corresponding to points on the red and blue planes: that is, to the support vectors.
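A minimal scikit-learn sketch (assuming scikit-learn is available; not part of the lecture, and the toy data are invented) that fits a linear SVM to separable data and reads off the support vectors and the margin 2/|w| discussed above.

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable signal (+1) and background (-1) points, for illustration only.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(+2.0, 0.5, size=(50, 2)),
               rng.normal(-2.0, 0.5, size=(50, 2))])
y = np.array([+1] * 50 + [-1] * 50)

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard-margin SVM

w = clf.coef_[0]
print("support vectors:", clf.support_vectors_)   # the points on the red/blue planes
print("margin 2/|w| =", 2.0 / np.linalg.norm(w))
```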

Support Vector Machines
We work, however, not in the space {x} but in the higher dimensional space {h(x)} to which {x} is mapped. Each vector in {h(x)} is of the form h(x) = [h_1(x), h_2(x), ..., h_H(x)], and we can write

E(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j h(x_i)·h(x_j)

Important: the scalar product structure allows the use of kernels K(x_i, x_j) = h(x_i)·h(x_j) to perform both the mapping and the scalar product simultaneously and efficiently, even if the space is of infinite dimension!

Example: mapping from R² to R³

h: (x_1, x_2) → (z_1, z_2, z_3) = (x_1², √2 x_1 x_2, x_2²)

h(x)·h(y) = x_1² y_1² + 2 x_1 x_2 y_1 y_2 + x_2² y_2² = (x·y)² = k(x, y)

Here we are mapping from 2-D x-space to 3-D z-space.
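To make the kernel trick on this slide concrete, here is a minimal numpy check (the function names h_map and poly2_kernel are illustrative, not from the lecture) that the explicit map to 3-D z-space and the kernel k(x, y) = (x·y)² give the same scalar product.

```python
import numpy as np

def h_map(x):
    """Explicit map from 2-D x-space to 3-D z-space."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

def poly2_kernel(x, y):
    """Kernel k(x, y) = (x . y)^2, equivalent to the explicit map."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

print(np.dot(h_map(x), h_map(y)))  # scalar product computed in z-space
print(poly2_kernel(x, y))          # the same value, computed directly in x-space
```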

Naïve Bayes
Each density p(x|·) is approximated by

p̂(x) = Π_{i=1}^{d} q(x_i)

where q(x_i) is the projection of the d-dimensional density p(x) onto axis i; that is, the q(x_i) are 1-D histograms or, better still, 1-D KDEs:

q(x_i) = ∫ p(x) Π_{j≠i} dx_j

Naïve Bayes
The naïve Bayes estimate of D(x) is then given by

D(x) = p̂(x|S) / [p̂(x|S) + p̂(x|B)]

In spite of its name, this method often works surprisingly well.
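As an illustration of the previous two slides (not code from the lecture; the function names, binning, and toy Gaussian samples are assumptions), here is a minimal numpy sketch that builds 1-D histograms for each variable and combines them into the naïve Bayes discriminant D(x).

```python
import numpy as np

def fit_marginals(data, bins=30):
    """Return a list of (hist, edges) pairs, one 1-D density estimate per feature."""
    marginals = []
    for i in range(data.shape[1]):
        hist, edges = np.histogram(data[:, i], bins=bins, density=True)
        marginals.append((hist, edges))
    return marginals

def density(x, marginals, eps=1e-12):
    """Naive Bayes density estimate: product of the 1-D marginal densities."""
    p = 1.0
    for xi, (hist, edges) in zip(x, marginals):
        idx = np.clip(np.searchsorted(edges, xi) - 1, 0, len(hist) - 1)
        p *= max(hist[idx], eps)
    return p

# Toy 2-D signal and background samples (assumed Gaussians, for illustration).
rng = np.random.default_rng(0)
sig = rng.normal(loc=+1.0, scale=1.0, size=(5000, 2))
bkg = rng.normal(loc=-1.0, scale=1.0, size=(5000, 2))

sig_marg, bkg_marg = fit_marginals(sig), fit_marginals(bkg)

x = np.array([0.5, 0.2])
ps, pb = density(x, sig_marg), density(x, bkg_marg)
print(ps / (ps + pb))   # naive Bayes estimate of D(x)
```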

Bayesian Neural Networks
We define a Bayesian neural network by the average

f(x|D) = ∫ f(x, w) p(w|D) dw

where the posterior density p(w|D) is proportional to the likelihood times the prior. The average is approximated by

f(x|D) ≈ (1/K) Σ_{k=1}^{K} f(x, w_k)

where the points w_k are sampled from p(w|D).

Bayesian Neural Networks
For a single-hidden-layer network with weights w = {u_ij, a_j, v_j, b},

f(x, w) = b + Σ_{j=1}^{H} v_j tanh( a_j + Σ_{i=1}^{d} u_ij x_i )

n(x, w) = 1 / (1 + exp[ −f(x, w) ])

A BNN is just an average over neural network functions n(x, w).
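A minimal numpy sketch (an assumption-laden illustration, not the lecture's code) of the network function above and the BNN average over sampled weight points; the actual sampling from p(w|D), e.g. by MCMC, is assumed to have been done elsewhere, and random weights stand in for it here.

```python
import numpy as np

def nn_output(x, u, a, v, b):
    """Single-hidden-layer network: n(x,w) = 1 / (1 + exp(-f(x,w)))."""
    f = b + np.dot(v, np.tanh(a + u @ x))   # u: (H, d), a and v: (H,), b: scalar
    return 1.0 / (1.0 + np.exp(-f))

def bnn_average(x, weight_samples):
    """Average the network output over K weight points sampled from p(w|D)."""
    return np.mean([nn_output(x, *wk) for wk in weight_samples])

# Random weight points standing in for MCMC samples (illustration only).
rng = np.random.default_rng(1)
d, H, K = 2, 5, 20
samples = [(rng.normal(size=(H, d)), rng.normal(size=H),
            rng.normal(size=H), rng.normal()) for _ in range(K)]
print(bnn_average(np.array([0.3, -1.2]), samples))
```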

A Simple Example
- Signal: tqb (muon channel)
- Background: Wbb (muon channel)
- NN model: (1, 5, 1)
- MCMC: 500 tqb + Wbb events; use the last 20 networks in a Markov chain of 500.
[Figure: distributions of HT_AllJets_MinusBestJets (scaled) for Wbb and tqb.]

A Simple Example
[Figure: networks versus HT_AllJets_MinusBestJets]
- Dots: p(S|H_T) = H_S / (H_S + H_B), where H_S and H_B are 1-D histograms.
- Curves: individual NNs n(H_T, w_k).
- Black curve: the average ⟨y(H_T, w)⟩.

Decision Trees
An ellipse, called a node, represents a variable on which a cut is to be applied. A line segment represents a cut. A box, called a leaf, represents the conjunction of a sequence of cuts (an if-then-else rule).
[Figure: decision tree from MiniBooNE, Byron Roe.]

Decision Trees
In a decision tree the feature space is partitioned recursively in accordance with some criterion. Each leaf is a bin associated with a constant value of the function f(x) being modeled. For classification the values might be −1 or +1.
[Figure: partition of the (Energy (GeV), PMT Hits) plane into leaves, each labeled f(x) = −1 or f(x) = +1 together with its signal (S) and background (B) counts.]

Decision Trees
At each node one examines every variable and chooses the one with which to partition the space. In this example, it was determined that it was better to partition with PMT Hits first. At the next node, it proved better to partition using Energy.
[Figure: the same partition of the (Energy (GeV), PMT Hits) plane, with each leaf now labeled by its purity D(x) (for example 0.47 and 0.98) and its S and B counts.]
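A minimal sketch of the node-splitting step described above (not the MiniBooNE code; the criterion, function names, and toy data are assumptions): for each variable, scan candidate cut values and keep the split that minimizes a weighted Gini impurity.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of +1/-1 labels."""
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels == 1)
    return 2.0 * p * (1.0 - p)

def best_split(X, y):
    """Examine every variable and every candidate cut; return the best (variable, cut, cost)."""
    n, d = X.shape
    best = (None, None, np.inf)
    for j in range(d):                      # loop over variables
        for cut in np.unique(X[:, j]):      # candidate cut values
            left, right = y[X[:, j] <= cut], y[X[:, j] > cut]
            cost = (len(left) * gini(left) + len(right) * gini(right)) / n
            if cost < best[2]:
                best = (j, cut, cost)
    return best

# Toy data: column 0 plays the role of "Energy", column 1 of "PMT Hits" (illustrative).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0.10, 60.0], [0.05, 20.0], size=(200, 2)),
               rng.normal([0.25, 140.0], [0.05, 20.0], size=(200, 2))])
y = np.array([-1] * 200 + [+1] * 200)
print(best_split(X, y))   # (variable index, cut value, impurity of the split)
```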

Decision Trees: Practical Issues
- Trees tend to be unstable: a small change in the data can result in radically different partitions.
- Trees are a piece-wise constant approximation to the function f(x). This is not too bad for classification, but it is a problem where smoothness is needed, for example, when trying to model a density. However, one can average over many trees (boosting, bagging, random forests), as sketched below.
- Trees, however, are fast to grow.
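A minimal scikit-learn sketch of averaging over many trees (assuming scikit-learn is available; the toy data and settings are illustrative, not from the lecture): both a bagging-style random forest and boosted trees give a smoother estimate of D(x) than a single tree.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Toy 2-D signal (label 1) and background (label 0) samples, for illustration only.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(+1.0, 1.0, size=(1000, 2)),
               rng.normal(-1.0, 1.0, size=(1000, 2))])
y = np.array([1] * 1000 + [0] * 1000)

forest = RandomForestClassifier(n_estimators=200).fit(X, y)     # bagging-style average
boost = GradientBoostingClassifier(n_estimators=200).fit(X, y)  # boosted trees

x = np.array([[0.3, -0.2]])
print(forest.predict_proba(x)[0, 1])  # smoother estimate of D(x) than a single tree
print(boost.predict_proba(x)[0, 1])
```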

Summary
There is, typically, much more information in the multivariate character of data than in their 1-D marginal densities (roughly, their 1-D histograms). Therefore, it makes sense to analyze data using a truly multivariate method, of which many practical and powerful ones exist, usually with free software! Moreover, they can be used together with the powerful and general method of inference based on Bayes' theorem. So first learn the mathematics, then... challenge the conservative old f***s.