Part 3 Introduction to statistical classification techniques

Part 3: Introduction to statistical classification techniques
Machine Learning, Part 3, March 2017, Fabio Roli

Preamble
Ø In Part 2 we have seen that, if we know the posterior probabilities P(ω_i | x), or the equivalent terms P(ω_i) and p(x | ω_i), and we know the loss matrix Λ, then minimum-risk theory allows us to design the optimal classifier (the one that minimizes the classification risk) for the task at hand.
Ø However, in practical cases we never know all this information.
Ø The only information that we usually have is a data set D (called the design or training data set):
D = [x_1, x_2, ..., x_n], with x_i = (x_i1, x_i2, ..., x_id), i = 1, ..., n, and each x_i belonging to one of the c classes (x_i ∈ ω_j, j = 1, 2, ..., c).
Note: in statistics, D is often called the sample of size n drawn from the distribution p(x). In pattern recognition the term sample is usually used for the single pattern x_i.
Ø Patterns x_i are drawn independently according to p(x_i | ω_j).
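
As a quick recap, here is a minimal sketch (in Python) of the minimum-risk decision rule mentioned above; the posterior probabilities and the loss matrix Λ below are made-up numbers for illustration only, not values from the slides:

```python
import numpy as np

# Toy posteriors and loss matrix (illustrative assumptions).
posteriors = np.array([0.7, 0.3])      # P(omega_1 | x), P(omega_2 | x)
loss = np.array([[0.0, 1.0],           # Lambda[i, j]: loss of deciding omega_(i+1)
                 [5.0, 0.0]])          # when the true class is omega_(j+1)

# Conditional risk of each decision: R(alpha_i | x) = sum_j Lambda[i, j] * P(omega_j | x)
risks = loss @ posteriors
decision = np.argmin(risks) + 1        # minimum-risk decision (1-based class label)
print(risks, "-> decide omega_%d" % decision)
```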

Classification techniques
Ø If we know the classes to which the patterns x_i of the design/training set belong, we speak of Supervised Classification.
Further information, beyond the data set D, that we may have:
Ø We may know the parametric model ("parametric form") of the distribution p(x | ω_i), so that we can use Parametric Techniques.
Ø If we know nothing about the distribution p(x | ω_i), we are obliged to use the so-called Non-Parametric Techniques.
Ø Parametric Techniques: we know the parametric form of the distribution p(x | ω_i); for example, we know that the distribution is Gaussian.
Ø Non-Parametric Techniques: we know nothing about the distribution, and we are not able to get any information with an unsupervised analysis.
Note that we are assuming that estimating the priors P(ω_i) is an easy problem, an assumption that is often, but not always, true.
Ø Here we are disregarding the costs of classification. The reason is that the choice of cost values is a problem-dependent issue; very little can be said in general about this choice.

Classification: Parametric Techniques
Ø We know, or we assume, a parametric form of the distributions p(x | ω_i).
Ø The main problem is then to estimate the parameters of the model (e.g., the mean value and the variance of the Gaussian model).
Ø We discuss these techniques in detail in Part 4.
Ø The estimation of the parameters is done using the data set D, or more often a subset of it (to avoid a problem called "over-fitting").
Ø How can we assume a good parametric model of the distributions p(x | ω_i)? In practical applications we have two possibilities:
we assume different parametric models, compute the parameters for each model, then compare the errors of the models and select the best one;
we use Unsupervised Classification Techniques (the basic concepts are presented later, in Part 9) to gain some knowledge of the parametric form of p(x | ω_i).
Unsupervised classification: using the data set D we try to gain some knowledge about p(x | ω_i) (e.g., we discover that it is made up of two clusters, i.e., it is the sum of two Gaussian distributions).
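
A minimal sketch of the parametric approach, assuming a univariate Gaussian model for p(x | ω_i); the design set below is a toy sample generated only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 1-D design set: 50 samples per class, generated only for illustration.
D = {1: rng.normal(0.0, 1.0, 50),      # class omega_1
     2: rng.normal(3.0, 1.5, 50)}      # class omega_2

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density with the given mean and variance."""
    return np.exp(-(x - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# Parameter estimation for each class (sample mean and variance over D).
params = {c: (samples.mean(), samples.var()) for c, samples in D.items()}
priors = {c: len(samples) / sum(len(s) for s in D.values()) for c, samples in D.items()}

# Classify a new pattern by maximizing P(omega_i) * p(x | omega_i).
x_new = 1.2
scores = {c: priors[c] * gaussian_pdf(x_new, *params[c]) for c in D}
print("decide omega_%d" % max(scores, key=scores.get))
```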

Classification: Non-Parametric Techniques
Ø We know nothing of the distribution p(x | ω_i), and we are not able to gain knowledge with an unsupervised analysis.
Ø We use techniques (Part 5) that allow us to estimate the densities p(x | ω_i), or the posterior probabilities P(ω_i | x), using the data set D.
Ø Non-parametric techniques aim to estimate the density functions p(x) directly.
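
As a preview of Part 5, here is a minimal sketch of one non-parametric technique, a Parzen-window (kernel) estimate of p(x) built directly from the data with no parametric assumption; the sample and the window width h are illustrative assumptions:

```python
import numpy as np

def parzen_density(x, data, h):
    """Estimate p(x) as the average of Gaussian kernels of width h centred on the data."""
    kernels = np.exp(-(x - data) ** 2 / (2.0 * h ** 2)) / (h * np.sqrt(2.0 * np.pi))
    return kernels.mean()

rng = np.random.default_rng(0)
D = rng.normal(0.0, 1.0, 200)           # toy 1-D sample drawn from an "unknown" p(x)
print(parzen_density(0.0, D, h=0.5))    # estimate of p(0); the true value is about 0.40
```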

Example of parametric techniques in biometrics
In biometric recognition, parametric techniques can be used to model the genuine and impostor distributions.
Parametric techniques sometimes provide lower performance than non-parametric techniques.

Linear discriminant functions
Ø In some cases it can be more effective to assume a parametric form of the discriminant functions g_i(x), i = 1, ..., c, instead of a parametric form of the p(x | ω_i) (we discuss this in Part 6).
Ø For example, we can assume a linear form of the discriminant functions g_i(x).
In some cases linear functions allow us to discriminate well between classes that would be difficult to model by computing the distributions p(x | ω_i).
It is worth noting that, in the end, what we want to do in many cases is just to classify, not to model the p(x | ω_i)!
Even if a linear discriminant function does not provide the optimal solution, the error rate can still be acceptable for the task at hand!
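
A minimal sketch of classification with linear discriminant functions; the weight vectors and biases below are made up for illustration, not learned from data (learning them is the subject of Part 6):

```python
import numpy as np

# Made-up weight vectors and biases for c = 2 classes in a d = 2 feature space.
W = np.array([[ 1.0, -0.5],    # w_1
              [-1.0,  0.5]])   # w_2
w0 = np.array([0.2, -0.2])     # bias terms w_10, w_20

def classify(x):
    g = W @ x + w0                 # g_i(x) = w_i . x + w_i0, for i = 1, 2
    return int(np.argmax(g)) + 1   # assign x to the class with the largest discriminant

print(classify(np.array([2.0, 1.0])))  # -> 1
```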

Design of a classifier: basic design cycle
We have just a design set D = [x_1, x_2, ..., x_n]. An unsupervised analysis of D is performed: do you know the form of p(x)?
Ø YES → Parametric techniques: split D into 3 sets (training, validation, and test set); use the training + validation sets to estimate the parameters.
Ø NO → Non-parametric techniques: split D into 3 sets (training, validation, and test set); use the validation set to estimate the parameters, and the training set to train the classifier.
In both cases, use the test set to estimate the error probability.
We will see later that non-parametric techniques have some parameters to be estimated as well!
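
A minimal sketch of the first step of the design cycle, splitting the design set D into training, validation and test sets; the 60/20/20 proportions are an arbitrary illustrative choice, not prescribed by the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(100, 3))            # toy design set: n = 100 patterns, d = 3 features

idx = rng.permutation(len(D))            # shuffle the patterns before splitting
n_train, n_val = int(0.6 * len(D)), int(0.2 * len(D))
train = D[idx[:n_train]]                 # used to train the classifier
val   = D[idx[n_train:n_train + n_val]]  # used to estimate parameters / hyper-parameters
test  = D[idx[n_train + n_val:]]         # used only to estimate the error probability
print(len(train), len(val), len(test))   # -> 60 20 20
```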

Some notable concepts: feature (re)scaling
Ø The features used to characterize patterns are usually linked to physical measurements which have different scales. Given the samples in D, feature scales can be very different (e.g., height in meters and weight in kg). This is due to non-homogeneous physical measurements or to the intrinsic scales of the different features.
Ø Solution: normalization, i.e., (re)scaling of the features. The normalization operation can be regarded as a function h_j applied to a feature: it takes as input the original feature value x_ij and outputs the rescaled (normalized) feature value x'_ij = h_j(x_ij), with h_j being the normalization function (j = 1, 2, ..., d).

Some normalization functions
Given D = [x_1, x_2, ..., x_n], with x_i = (x_i1, x_i2, ..., x_id), i = 1, ..., n, widely used normalization functions h_j are the following:
Ø Division by the maximum value (over D):
x'_ij = x_ij / x_j,max, where x_j,max = max_{k=1,...,n} x_kj
Ø Division by the maximum range:
x'_ij = (x_ij − x_j,min) / (x_j,max − x_j,min) ∈ [0, 1], where x_j,min = min_{k=1,...,n} x_kj
Ø Division by the standard deviation of feature j:
x'_ij = (x_ij − m_j) / σ_j, where m_j = E{x_kj} and σ_j² = E{(x_kj − m_j)²},
estimated on D as m̂_j = (1/n) Σ_{k=1..n} x_kj and σ̂_j² = (1/n) Σ_{k=1..n} (x_kj − m̂_j)².
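
A minimal sketch of the three normalization functions above, applied column-wise (i.e., feature by feature) to a toy design set; the values are made up:

```python
import numpy as np

# Toy design set D with n = 3 patterns and d = 2 features
# (e.g. height in meters and weight in kg).
D = np.array([[1.70, 65.0],
              [1.80, 80.0],
              [1.60, 50.0]])

by_max   = D / D.max(axis=0)                                       # division by the maximum value
by_range = (D - D.min(axis=0)) / (D.max(axis=0) - D.min(axis=0))   # rescaling to [0, 1]
by_std   = (D - D.mean(axis=0)) / D.std(axis=0)                    # division by the standard deviation
print(by_range)
```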

Remarks on normalization
The third normalization method (division by the standard deviation) is useful, for example, when the feature distribution is Gaussian: if the feature x_ij has a Gaussian distribution, the normalized feature x'_ij has a standard (zero-mean, unit-variance) Gaussian distribution.
Normalization must be done using all the patterns available in D, and for each feature separately.
Hereafter, we assume that all the features used have been properly normalized, and therefore we omit the prime in x'_ij.

Some notable concepts: separation of classes
Definition of a separated class: in a two-dimensional feature space (d = 2), a class is called separated if a curve (closed or open) exists such that all the samples of that class lie on the same side of the curve. In a d-dimensional feature space we have hyper-surfaces.
Two separated classes can be:
Ø linearly separable, if the curve that separates the two classes is a linear function (for d = 2, the curve is a straight line);
Ø non-linearly separable, if the separation needs non-linear curves.
Note that separation demands that two patterns belonging to different classes do not have the same feature values! So we are speaking of deterministic separation!
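
A minimal sketch of checking deterministic linear separation for d = 2; the two toy classes and the candidate separating line are illustrative assumptions:

```python
import numpy as np

# Two made-up classes in a d = 2 feature space and a candidate separating line
# w . x + b = 0 (here the line x1 + x2 = 4).
omega1 = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])
omega2 = np.array([[3.0, 3.0], [2.5, 4.0], [4.0, 2.5]])
w, b = np.array([1.0, 1.0]), -4.0

side1 = omega1 @ w + b     # signed side of the line for each sample of omega_1
side2 = omega2 @ w + b     # ... and of omega_2
print(bool(np.all(side1 < 0) and np.all(side2 > 0)))   # True -> linearly separable
```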

Notable concepts: multi-modal classes
Ø A data class is multimodal if it contains clusters of patterns which are linearly separable, or if its density function has several peaks.
[Figure: three example cases (a), (b), (c), each with two classes ω1 and ω2.]
(a) two linearly separable classes; (b) and (c) two classes that are not linearly separable; in (c) one of the two classes is bimodal.
In (a) and (c) statistical methods work well; case (b) is much more difficult.

A notable concept: geometrical complexity of classes
The characteristics of a class also depend on the geometrical features of the data distribution in the feature space. In particular, if classes have elongated distributions and/or overlap strongly, some techniques work poorly.
Example: it is difficult to discriminate samples in regions where the two classes overlap strongly. Each class in the figure has a privileged direction in the feature space, and the features have a very high correlation (conditional correlation given the class).

Correlation coefficient
The correlation between two features x_i and x_j can be measured by the correlation coefficient ρ_ij (i, j = 1, 2, ..., d). It is linked to the covariance σ_ij = E{(x_i − m_i)(x_j − m_j)} and to the feature variances σ_ii and σ_jj by:
ρ_ij = σ_ij / sqrt(σ_ii σ_jj)
If d is the number of features, [ρ_ij] is a square d × d matrix, with ρ_ij for i, j = 1, 2, ..., d and ρ_ii = 1 on the main diagonal (i = 1, 2, ..., d).
Features x_i and x_j are correlated if ρ_ij has a high value (e.g., > 0.8).
The analysis of correlation can be done for each class and for the whole data set.
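
A minimal sketch of the correlation analysis, computing the matrix [ρ_ij] from a toy design set in which one pair of features is strongly correlated; the data are generated only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy design set with d = 3 features: the second feature is strongly correlated
# with the first, the third is independent noise.
x1 = rng.normal(size=200)
D = np.column_stack([x1,
                     2.0 * x1 + rng.normal(scale=0.3, size=200),
                     rng.normal(size=200)])

cov = np.cov(D, rowvar=False)                                  # covariances sigma_ij
rho = cov / np.sqrt(np.outer(np.diag(cov), np.diag(cov)))      # rho_ij = sigma_ij / sqrt(sigma_ii * sigma_jj)
print(np.round(rho, 2))   # rho_ii = 1 on the diagonal; rho_12 is close to 1
```

(The same matrix could be obtained with np.corrcoef; building it explicitly from the covariances mirrors the formula on the slide.)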

Notable concepts: geometrical vs. probabilistic complexity
[Figure: a small square inside a much larger square, illustrating two very unbalanced classes.]
Probabilistic complexity: I must recognize one pattern out of one million! Two very unbalanced classes! The problem has simple geometrical features, but it is very hard!
[Figure: an example of geometrical complexity.]